Optical Character Recognition and/or Image Processing Support
Hi all,
Description
We all come across a document now and then that is skewed and not searchable for text, and this is where Optical Character Recognition comes in handy. Typically we get such documents from scans or old academic or professional papers, books and manuals.
The Problem
Larger documents that are not searchable are quite difficult to digest. In an ideal world, Papers would not have to deal with OCR. Alas, it happens often that I get sent a document without anything - not even metadata bookmarks - which I wish to read as soon as possible.
Expectations
I open Papers and find an option {1} to let
- image processing, if needed, and
- OCR
run in the background while I can already read papers. As soon as image processing and OCR are done, Papers tells me and lets me preview the processed version, which I may wish to keep or discard {2}.
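The background flow described above could look roughly like the following. This is only a sketch in Python, not a proposal for how Papers should implement it. `process_document`, `start_ocr_job`, `on_done` and the `.ocr` file naming are hypothetical placeholders, and the actual image processing and OCR are stubbed out.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from pathlib import Path
from typing import Callable


def process_document(source: Path, language: str) -> Path:
    """Hypothetical stand-in for image processing followed by OCR.

    A real implementation would write the processed copy to a separate
    file and leave the original untouched until the user decides.
    """
    return source.with_name(f"{source.stem}.ocr{source.suffix}")


def start_ocr_job(
    executor: ThreadPoolExecutor,
    source: Path,
    language: str,
    on_done: Callable[[Path], None],
    worker: Callable[[Path, str], Path] = process_document,
) -> Future:
    """Run the job in the background and notify the caller when it is ready."""
    future = executor.submit(worker, source, language)

    def _notify(done: Future) -> None:
        # Only offer a preview if the job was not cancelled and did not fail.
        if done.cancelled() or done.exception() is not None:
            return
        on_done(done.result())

    future.add_done_callback(_notify)
    return future
```

Note that the callback may run on the worker thread, so a GUI application would presumably hand the result over to its main loop before showing the preview and the keep or discard choice.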
Languages
- The language of the document has to be provided by the user when choosing to let Papers do OCR
- An English language model should be preinstalled
- Upon installation of Papers, the model for the system language should be installed
- Papers would have to install and load the relevant model for other languages, if needed
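The language rules above can be sketched as a small lookup. This assumes Tesseract-style model names (three-letter ISO 639-2 codes such as `eng` or `deu`). The mapping table is a tiny illustrative subset, and `model_for_locale` and `resolve_model` are hypothetical helper names, not an existing API.

```python
# Illustrative subset only: Tesseract names its models with ISO 639-2 codes.
LOCALE_TO_MODEL = {"en": "eng", "de": "deu", "fr": "fra", "es": "spa", "it": "ita"}
DEFAULT_MODEL = "eng"  # the preinstalled English model


def model_for_locale(locale_name: str) -> str:
    """Map a locale such as 'de_DE.UTF-8' to a model name, else English."""
    language = locale_name.replace("-", "_").split("_")[0].split(".")[0].lower()
    return LOCALE_TO_MODEL.get(language, DEFAULT_MODEL)


def resolve_model(requested: str, installed: set[str]) -> tuple[str, bool]:
    """Return (model, needs_install) for the language the user chose."""
    return requested, requested not in installed
```

For example, `model_for_locale("de_DE.UTF-8")` gives `"deu"`, and `resolve_model("fra", {"eng"})` reports that the French model still has to be installed before the job can start.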
Caveat
- {0} Depending on the quality of the document, image processing may be required anyway to get decent OCR results.
- {1} The user should be warned if an OCR layer is already present
- {2} The image processing alters the document and may make things worse, so a proper "Save or Discard" dialog is warranted here, see #63
- {3} In view of underpowered devices, the image processing and OCR job may take up resources and time. To mitigate:
- limit the resources
- keep on working in the background
- choose efficient tools, optimized for non-handwritten text documents
- choose sufficiently small models
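To make the caveats above a bit more concrete, here is a sketch that only assembles an ocrmypdf command line with a single worker, deskewing, and pages with an existing text layer skipped. It assumes the ocrmypdf options `--language`, `--jobs`, `--deskew` and `--skip-text`, and the `OMP_THREAD_LIMIT` environment variable that Tesseract honours. These may differ between versions, so please check them against the installed tools. The function names are hypothetical.

```python
import os
from pathlib import Path


def build_ocrmypdf_command(
    source: Path,
    target: Path,
    language: str = "eng",
    jobs: int = 1,
    deskew: bool = True,
    skip_existing_text: bool = True,
) -> list[str]:
    """Assemble an ocrmypdf call that writes to a separate output file."""
    command = ["ocrmypdf", "--language", language, "--jobs", str(jobs)]
    if deskew:
        command.append("--deskew")
    if skip_existing_text:
        # Leave pages that already carry a text layer untouched.
        command.append("--skip-text")
    return command + [str(source), str(target)]


def low_priority_environment(threads: int = 1) -> dict[str, str]:
    """Copy the environment and cap Tesseract's OpenMP threads."""
    env = dict(os.environ)
    env["OMP_THREAD_LIMIT"] = str(threads)
    return env
```

The command and environment could then be passed to `subprocess.run(command, env=env)` from a background worker. Because the result goes to a separate `target` file, the original stays intact until the user chooses to save or discard.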
The Tools
Classical relevant tools are
- Tesseract - a classical OCR engine which, since version 4, is also neural-network-based
- Leptonica - an image processing library
Related Projects
free as in free speech are
- paperwork - a GNOME-y document managing app
- unpaper - a post-processing tool for scanned sheets of paper, useful for cleaning up pages before OCR
- ocrmypdf - command line all-in one solution for PDFs
- pdfsandwich - the good ol' reliable thing from many years ago
- easypdf - Ready-to-use OCR with 80+ supported languages
- ocrs - a recent Rust library and CLI tool for extracting text from images
This clearly needs design, but I gather that @bertob could be interested.