Work on hOCR instead of plain-text output from OCR engines

Submitted by Thomas Koch

Description

This issue is already on the roadmap and in the old Google Code issue tracker.

One advantage of hOCR is that a later export to PDF or DjVu could place the OCR'd text at the correct position in the image, so that text can be selected and copied from the resulting file.
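To illustrate where the position information comes from: hOCR stores a bounding box in the title attribute of each element, e.g. title='bbox 10 2 40 18'. Below is a rough Python sketch of pulling word boxes out of such a file using only the standard library. The names WordBoxParser and word_boxes are made up for this example, and it assumes word elements are spans with class ocrx_word that contain no nested spans. It is not meant as the way this project should implement it.

```python
import re
from html.parser import HTMLParser

# hOCR keeps coordinates in the title attribute: "bbox x0 y0 x1 y1".
BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class WordBoxParser(HTMLParser):
    """Collect (text, (x0, y0, x1, y1)) for every ocrx_word span.

    Hypothetical helper: assumes word spans do not contain nested spans.
    """

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None   # bbox of the word span currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        match = BBOX_RE.search(attrs.get("title") or "")
        if "ocrx_word" in classes and match:
            self._bbox = tuple(int(n) for n in match.groups())
            self._text = []

    def handle_data(self, data):
        if self._bbox is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._bbox is not None:
            self.words.append(("".join(self._text).strip(), self._bbox))
            self._bbox = None


def word_boxes(hocr):
    """Return a list of (word, bbox) pairs found in an hOCR string."""
    parser = WordBoxParser()
    parser.feed(hocr)
    parser.close()
    return parser.words
```

An exporter could then draw each word as invisible text at its box over the page image, which is what would make the text in the PDF or DjVu selectable.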

Note that tesseract[1] and cuneiform[2] may produce HTML files containing invalid UTF-8 or control characters. gscan2pdf and ocrodjvu[3] have already worked around this bug.

[1] http://code.google.com/p/tesseract-ocr/issues/detail?id=690
[2] https://bugs.launchpad.net/cuneiform-linux/+bug/585418
[3] https://bitbucket.org/jwilk/ocrodjvu/changeset/997dabca28a2
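
A possible workaround for the invalid output mentioned above, as a minimal Python sketch: decode the engine's bytes leniently and drop control characters before handing the text to an HTML/XML parser. The function name sanitize_hocr is hypothetical, and this is only my assumption of a reasonable approach, not necessarily what gscan2pdf or ocrodjvu actually do.

```python
import re

# Control characters that XML 1.0 does not allow:
# everything below U+0020 except tab (\x09), LF (\x0a) and CR (\x0d).
_CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def sanitize_hocr(raw: bytes) -> str:
    """Return hOCR text that is safe to feed to a parser.

    Invalid UTF-8 sequences become U+FFFD (the replacement character)
    and disallowed control characters are removed.
    """
    text = raw.decode("utf-8", errors="replace")
    return _CONTROL_CHARS.sub("", text)
```

Replacing bad bytes rather than failing keeps the rest of the page usable; the cost is that a damaged word shows up with a replacement character in it.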

Version: git master