OCRFeeder crashes on save and export of Unicode ’ right single quotation mark
Submitted by skierpage Link to original bug (#766904)
Description
I installed the OCRFeeder package on 64-bit Kubuntu 16.04 and asked it to recognize a scanned PDF. The result looks great in the text properties pane, but when I save the project or export as ODT, OCRFeeder crashes. It seems to crash while processing apostrophes in the PDF that were correctly recognized as the Unicode character U+2019 (right single quotation mark), e.g. "settlors’ intention" (copied from the Text pane).
This sounds like bug 765847, so I cloned and built the latest git master. Running that build, I still get crashes when exporting as ODT, exporting as text, or saving the project (see the truncated tracebacks below). I am unable to save any output from this fine program, hence severity=critical. In two of the three cases, Python's error message shows that the input already contains the Unicode character u'\u2019' (the correct code point for ’), which suggests that something in the chain is attempting an unnecessary conversion on text that is already Unicode.
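One way such an error can arise in Python 2 is calling .decode('utf-8') on an object that is already unicode: the interpreter first encodes it to bytes with the default 'ascii' codec, and that implicit step fails on U+2019. Whether this is exactly what happens inside OCRFeeder is an assumption, not something the truncated tracebacks confirm. The sketch below emulates the two steps explicitly in Python 3 (where the implicit conversion no longer exists); the function name py2_style_decode is hypothetical.

```python
# Python 3 emulation of the implicit conversion Python 2 performs when
# .decode('utf-8') is called on text that is already Unicode.
# py2_style_decode is a hypothetical name used only for illustration.

def py2_style_decode(already_unicode):
    # Step 1 (implicit in Python 2): encode with the default 'ascii' codec.
    as_bytes = already_unicode.encode("ascii")
    # Step 2: decode the resulting bytes as UTF-8, as the caller requested.
    return as_bytes.decode("utf-8")

text = "settlors\u2019 intention"

try:
    py2_style_decode(text)
    error = None
except UnicodeEncodeError as exc:
    # The failure is an *encode* error even though the caller asked to decode.
    error = exc

print(type(error).__name__, error.encoding, error.reason)
```

Pure-ASCII text passes through both steps unchanged, which may be why a problem of this kind only shows up once the OCR engine returns a non-ASCII character.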
Expected results: OCRFeeder handles the strings returned by the OCR library in all cases (I believe it is using libtesseract3, Version: 3.04.01-4, on my system).
I can provide the full tracebacks if you want, and can try to create a PDF of a snippet of the original scan.
File > Export... as ODT crashes:
Traceback (most recent call last):
  File "/home/skierpage/programs/lib/python2.7/site-packages/ocrfeeder/studio/studioBuilder.py", line 303, in exportDialog
  ...
  File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 636: ordinal not in range(128)
File > Export... as "Texto simples" [sic, apparently an untranslated string] crashes:
Traceback (most recent call last):
  File "/home/skierpage/programs/lib/python2.7/site-packages/ocrfeeder/studio/studioBuilder.py", line 303, in exportDialog
  ...
  File "/home/skierpage/programs/lib/python2.7/site-packages/ocrfeeder/feeder/documentGeneration.py", line 361, in addText
    self.text += unicode(newText, 'utf-8')
TypeError: decoding Unicode is not supported
File > Save As... crashes:
Traceback (most recent call last):
  File "/home/skierpage/programs/lib/python2.7/site-packages/ocrfeeder/studio/studioBuilder.py", line 364, in saveProjectAs
    self.saveProject()
  ...
  File "/home/skierpage/programs/lib/python2.7/site-packages/ocrfeeder/studio/project.py", line 78, in convertToXml
    text = unicode(str(item), 'utf-8')
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 636: ordinal not in range(128)
Version: git master
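The failing lines in the tracebacks, unicode(newText, 'utf-8') and unicode(str(item), 'utf-8'), both assume their input is a byte string. A possible defensive pattern, sketched here and not taken from OCRFeeder's code, is to decode only when the value really is bytes and to pass existing text through untouched. The helper name to_text is hypothetical, and the sketch is written for Python 3, where bytes and str correspond to Python 2's str and unicode.

```python
# Hypothetical helper (not part of OCRFeeder): normalize a value to text,
# decoding only when it is actually a byte string.

def to_text(value, encoding="utf-8"):
    if isinstance(value, bytes):
        # Raw bytes, e.g. read from a file or returned by a C library.
        return value.decode(encoding)
    if isinstance(value, str):
        # Already text: decoding again is what triggers the TypeError above.
        return value
    # Any other object (numbers, etc.): fall back to its string form.
    return str(value)

recognized = "settlors\u2019 intention"

# Text passes through unchanged; UTF-8 bytes are decoded to the same text.
same_text = to_text(recognized)
from_bytes = to_text(recognized.encode("utf-8"))
print(same_text == from_bytes)
```

With a helper like this, call sites such as addText or convertToXml would not need to know whether the OCR engine handed back bytes or text, under the assumption that any bytes are UTF-8 encoded.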