Searchable PDF contains garbage instead of recognized text
Submitted by gre..@..il.com
Link to original bug (#672891)
Description
- Download the attached .tif (5 MB!).
- "Add image" in ocrfeeder.
- Recognize the Russian text with tesseract (use "$IMAGE $FILE > /dev/null 2> /dev/null -l rus; cat $FILE.txt; rm $FILE $FILE.txt" as engine arguments).
- Choose Export > PDF in the menu, then choose "Searchable PDF" and save the file.
- Open the resulting PDF in Adobe Reader, select the text with the mouse, and copy-paste it into a text editor. You should see something like
s ssssssss sssss sssssssssss ssssss s ssssssss sssssssss sssssss ssss ssss, sssssssssss ssssssssssss sssss sssssss s sssssss sssssss ssssss ssssssss sssss ss ss ssssssssss sssss sssssss ssss sssss sssss ssssss sssss ssssss sssssss, s ssssss ss sss sss sss sssss ssssss s ssssssssssssss ssssssssss sss sssssssss sss ssssssssss ssssssssssssss ssssssss, sssssss sssssss sssssssssss ssssssssss ssssssss ssssssss sssssssss sss s sssss sssssss s ssssssssss sssss sssss sss s sssssssss ssssss ssss sssssss sssss sssss sssssss ssssss sssssssssssss sssssss sssssss s ssss ssssssssssss sssss ss s ssssss sssss ssssss sssssss ssssss sss s.‚ sssssssss sssssssss ssss ssss sssss ssssss, sssssssssss s sssssss sssssssss ssssssssssssss sssssssss
instead of the Russian text.
Saving to ODT works fine; the recognized text is not lost.
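To check the symptom without eyeballing the pasted text, something like the sketch below could be used. It is not part of ocrfeeder; the function name `cyrillic_ratio` and both sample strings are made up for illustration. It only measures how many of the letters in a string are Cyrillic, so text copied from a correct searchable PDF of a Russian page should score close to 1.0, while the "sssss" garbage scores 0.0.

```python
import unicodedata


def cyrillic_ratio(text: str) -> float:
    """Return the fraction of alphabetic characters that are Cyrillic."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    # unicodedata.name() returns e.g. "CYRILLIC SMALL LETTER ER" for Cyrillic letters.
    cyrillic = sum(1 for ch in letters if "CYRILLIC" in unicodedata.name(ch, ""))
    return cyrillic / len(letters)


# Hypothetical samples: the first mimics the garbage pasted from the PDF,
# the second is an arbitrary Russian phrase standing in for correct output.
pdf_paste = "s ssssssss sssss sssssssssss ssssss"
expected_like = "распознанный текст"

print(cyrillic_ratio(pdf_paste))      # 0.0
print(cyrillic_ratio(expected_like))  # 1.0
```

Running the same check on text copied from the ODT export and from the PDF export of the same page would show whether the recognized text is lost only in the PDF path.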
Version: git master