    extract/msoffice-xml: Treat zero-length strings as unset properties · a1e766cd
    Sam Thursfield authored
    The MS Office extractor has been producing output like this, with
    empty strings for properties that have no value:
    
        <file:///home/sam/Downloads/spreadsheet.xls> nie:comment "" ;
          nie:contentLastModified "2016-06-13T14:19:50Z" ;
          nie:contentCreated "2016-05-14T10:17:05Z" ;
          nie:plainTextContent "..." ;
          nie:subject "" ;
          a nfo:PaginatedTextDocument ;
          nie:title "" .
    
    This breaks queries which use COALESCE to do things like this:
    
        SELECT COALESCE(?nie_title, ?filename) as ?title
    
    If ?nie_title is unset then ?title will be set to the contents of
    ?filename; but if ?nie_title is present and set to an empty string then
    ?title will be set to that empty string, which is not at all useful.
    
    The extractor will now ignore zero-length strings. Rather than
    using strlen() (which has to search to the end of the string)
    we just check if the first byte is 0.
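
    As a rough illustration of the check described above (a sketch only,
    not the code from this commit): string_is_unset() and coalesce_title()
    are hypothetical names invented for this example. The second helper
    mimics, in plain C, what COALESCE does once empty strings are no
    longer stored.

```c
#include <stddef.h>

/* Hypothetical helper, not taken from the Tracker sources.
 * Returns 1 when the string should be treated as an unset property.
 * Looking at the first byte is O(1); strlen() would have to walk to
 * the terminating NUL just to tell us the length is non-zero. */
static int
string_is_unset (const char *value)
{
	return value == NULL || value[0] == '\0';
}

/* Hypothetical stand-in for COALESCE(?nie_title, ?filename):
 * fall back to the filename when the title is unset or empty. */
static const char *
coalesce_title (const char *nie_title, const char *filename)
{
	return string_is_unset (nie_title) ? filename : nie_title;
}
```

    With these helpers, coalesce_title ("", "spreadsheet.xls") yields the
    filename, which is the behaviour the query author expected.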
    
    https://bugzilla.gnome.org/show_bug.cgi?id=788298
tracker-extract-msoffice-xml.c 30.1 KB