Tracker rejects ebooks where OPF file contains BOM
I've been having some issues with GNOME Books and tracked one of them down to Tracker missing metadata (GNOME/gnome-documents#18).
I've downloaded some ePub books from a Humble Book Bundle. Most of them are listed correctly, but others are missing titles and just show the file name. If I run Tracker Extract on a file then I get the following:
tracker extract Documents/ebooks/Novels/WH_theendtimes_book01_thereturnofnagash.epub
Could not get EPUB 'OEBPS/content.opf' file: Error on line 1 char 1: Document must begin with an element (e.g. <book>)
g_object_ref: assertion 'G_IS_OBJECT (object)' failed
g_object_unref: assertion 'G_IS_OBJECT (object)' failed
file:///home/ibboard/Documents/ebooks/Novels/WH_theendtimes_book01_thereturnofnagash.epub: No metadata or extractor modules found to handle this file
After a lot of poking, I found that that particular file's content.opf starts with a BOM (byte order mark). Tracker's ePub parsing appears to use a GMarkupParser, which doesn't appear to support a BOM (even though it's apparently valid but optional in general, except in pure ANSI XML).
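For anyone wanting to check whether their own files are affected, here is a rough sketch of how the BOM can be detected. It assumes the BOM is the UTF-8 one (bytes EF BB BF) and that the package file ends in .opf; the function name is made up for illustration and this is not part of Tracker.

```python
import zipfile

# The UTF-8 byte order mark; assumed here to be the BOM in question.
UTF8_BOM = b"\xef\xbb\xbf"


def opf_files_with_bom(epub_path):
    """Return the names of .opf members in an EPUB that start with a UTF-8 BOM.

    Hypothetical helper for diagnosis only.
    """
    flagged = []
    # An EPUB is a ZIP archive, so the OPF can be read without unpacking.
    with zipfile.ZipFile(epub_path) as epub:
        for name in epub.namelist():
            if name.lower().endswith(".opf"):
                with epub.open(name) as member:
                    # Only the first three bytes matter for this check.
                    if member.read(3) == UTF8_BOM:
                        flagged.append(name)
    return flagged
```

Calling opf_files_with_bom() on an ePub path returns a list of matching member names, which is empty when no OPF file starts with a BOM.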
I don't know whether this is actually an upstream GLib bug (GMarkupParser is part of GLib, not glibc) or whether Tracker can fix it. Failing to parse commercial ePubs gives the impression of a bug, even if it's due to strict handling of a standard.
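If it ends up being handled on the Tracker side, I'd guess the fix amounts to skipping a leading BOM before the buffer reaches the parser. Tracker is written in C, so the Python below only illustrates the idea under that assumption; the function name is hypothetical and it only handles the UTF-8 BOM.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM, or unchanged if there is none.

    Illustration only; not Tracker's or GLib's actual code.
    """
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data
```

After stripping, the first byte of the buffer is the "<" of the XML declaration or root element, which is what the "Document must begin with an element" error is complaining about.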
Version: tracker-2.1.4, openSUSE Tumbleweed
"Broken" content.opf file with BOM: content.opf