Work on hocr instead of txt from OCR engines
Submitted by Thomas Koch
Link to original bug (#675884)
Description
This issue is already on the roadmap and in the old Google Code issue tracker.
One advantage of hOCR is that a later export to PDF or DjVu could place the OCR'd text at the right position in the image, so that text can be copied from the resulting file.
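To illustrate what positioned text means here, the following is a minimal sketch of reading word bounding boxes out of hOCR, assuming the engine emits `ocrx_word` spans whose `title` attribute carries `bbox x0 y0 x1 y1` as the hOCR format describes. The names `HocrWordParser` and `extract_words` are hypothetical and not part of this project; a real implementation would also need to handle nested spans and the other hOCR element classes.

```python
import re
from html.parser import HTMLParser

# hOCR stores geometry in the title attribute, e.g. "bbox 10 20 60 40; x_wconf 93".
BBOX_RE = re.compile(r"bbox\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)")


class HocrWordParser(HTMLParser):
    """Hypothetical helper: collects (text, (x0, y0, x1, y1)) per ocrx_word span."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None   # bounding box of the word currently being read
        self._text = []     # text fragments of the word currently being read

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if "ocrx_word" in classes:
            match = BBOX_RE.search(attrs.get("title") or "")
            if match:
                self._bbox = tuple(int(g) for g in match.groups())
                self._text = []

    def handle_data(self, data):
        if self._bbox is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        # Assumption: word spans do not contain nested spans.
        if tag == "span" and self._bbox is not None:
            self.words.append(("".join(self._text).strip(), self._bbox))
            self._bbox = None


def extract_words(hocr):
    """Return a list of (word, bbox) tuples found in an hOCR string."""
    parser = HocrWordParser()
    parser.feed(hocr)
    parser.close()
    return parser.words
```

With coordinates like these, a PDF or DjVu exporter could lay an invisible text layer over the scanned image at the matching positions.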
Note that tesseract[1] and cuneiform[2] may produce HTML files containing invalid UTF-8 or control characters. gscan2pdf and ocrodjvu[3] have already worked around this bug.
[1] http://code.google.com/p/tesseract-ocr/issues/detail?id=690
[2] https://bugs.launchpad.net/cuneiform-linux/+bug/585418
[3] https://bitbucket.org/jwilk/ocrodjvu/changeset/997dabca28a2
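One possible way to guard against such output before parsing is sketched below. This is not how gscan2pdf or ocrodjvu handle it; it is only an assumption about what a workaround could look like. The function name `sanitize_hocr` is hypothetical. It replaces undecodable bytes with U+FFFD and drops the control characters that XML 1.0 does not allow.

```python
import re

# Control characters below U+0020 other than tab, line feed and carriage return.
_CONTROL_RE = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def sanitize_hocr(raw):
    """Decode raw OCR engine output (bytes) into text that is safe to parse.

    Invalid UTF-8 sequences become U+FFFD instead of raising an error,
    and disallowed control characters are removed.
    """
    text = raw.decode("utf-8", errors="replace")
    return _CONTROL_RE.sub("", text)
```

Replacing rather than dropping invalid bytes keeps a visible marker where the engine produced garbage, which may help when reviewing the recognised text.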
Version: git master