Optical Character Recognition and/or Image Processing Support

Hi all,

Description

We all come across a document now and then that is skewed and not text-searchable, and this is where Optical Character Recognition (OCR) comes in handy. Typically we get such documents from scans, or from old academic or professional papers, books and manuals.

The Problem

Larger documents that are not searchable are quite difficult to digest. In an ideal world, Papers would not have to deal with OCR. Alas, it happens often that I am sent a document without anything, not even metadata bookmarks, which I wish to read as soon as possible.

Expectations

I open Papers and find an option {1} to let

- image processing, if needed, and
- OCR

run in the background while I can already read the document. As soon as image processing and OCR are finished, Papers tells me and lets me preview the processed version, which I may wish to keep or discard {2}.
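To make the keep-or-discard idea a bit more concrete, here is a minimal Python sketch. It assumes the processing step writes to a temporary file next to the original, so the original is only touched once the user decides to keep the result. The function names (`process_to_preview`, `keep`, `discard`) and the `process` callable are hypothetical and only for illustration; this is not how Papers works today.

```python
import os
import tempfile
from pathlib import Path
from typing import Callable


def process_to_preview(source: Path, process: Callable[[Path, Path], None]) -> Path:
    """Run process(source, preview) and return the path of the preview file.

    The original is never modified here. The preview is created in the same
    directory so that a later rename stays on one filesystem.
    """
    fd, tmp_name = tempfile.mkstemp(suffix=source.suffix, dir=source.parent)
    os.close(fd)
    preview = Path(tmp_name)
    try:
        process(source, preview)
    except Exception:
        preview.unlink(missing_ok=True)  # do not leave half-written previews behind
        raise
    return preview


def keep(source: Path, preview: Path) -> None:
    """User chose "keep": replace the original with the processed version."""
    os.replace(preview, source)


def discard(preview: Path) -> None:
    """User chose "discard": drop the preview, the original stays untouched."""
    preview.unlink(missing_ok=True)
```

The point of the design is that a bad image-processing result costs nothing: until `keep` is called, the document on disk is exactly what the user was sent.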

Languages

The language of the document has to be provided by the user when choosing to let Papers do OCR
An English language model should be preinstalled
Upon installation of Papers, the model for the system language should be installed as well
If needed, Papers would have to install and load the relevant model for other languages
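The language rules above could be sketched roughly as follows. This assumes Tesseract-style three-letter model names (`eng`, `deu`, ...); the mapping table is a tiny illustrative subset, and all function names are hypothetical.

```python
from typing import Optional, Set, Tuple

# Assumption: small illustrative mapping from two-letter language codes to
# the three-letter names Tesseract uses for its language models.
TESSERACT_CODES = {"en": "eng", "de": "deu", "fr": "fra", "es": "spa", "it": "ita"}


def language_from_locale(locale_name: Optional[str], default: str = "en") -> str:
    """Extract the language from a locale such as "de_DE.UTF-8".

    Falls back to `default` for unset, "C"/"POSIX" or unmapped locales.
    """
    if not locale_name:
        return default
    lang = locale_name.split(".")[0].split("_")[0].split("-")[0].lower()
    return lang if lang in TESSERACT_CODES else default


def baseline_models(locale_name: Optional[str]) -> Set[str]:
    """Models to have after installation: English plus the system language."""
    return {"eng", TESSERACT_CODES[language_from_locale(locale_name)]}


def model_for_document(document_lang: str, installed: Set[str]) -> Tuple[str, bool]:
    """Return (model, needs_install) for the language the user picked."""
    try:
        model = TESSERACT_CODES[document_lang]
    except KeyError:
        raise ValueError(f"no known OCR model for language {document_lang!r}") from None
    return model, model not in installed
```

For example, on a French system `baseline_models("fr_FR.UTF-8")` gives the English and French models, and a user picking German for a document would trigger an install of `deu` first.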

Caveat

{0} Depending on the quality of the document, image processing may be required anyway to get decent OCR results.
{1} The user should be warned if an OCR layer is already present.
{2} Image processing alters the document and may make things worse, so a proper Save or Discard dialog is warranted here, see #63.
{3} On underpowered devices, the image processing and OCR job may take up considerable resources and time. To mitigate this:
- limit the resources
- keep on working in the background
- choose efficient tools, optimized for non-handwritten text documents
- choose sufficiently small models
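As a rough illustration of the mitigation points in {3}, the following Python sketch starts an OCR job at low priority, with a thread limit, in a background worker. It assumes `ocrmypdf` as the backend purely as an example; Papers might well use something else. `nice`, `--jobs`, `--deskew`, `--skip-text` and `-l` are existing options, while `build_ocr_job` and `run_in_background` are hypothetical helper names. `--skip-text` leaves pages that already carry a text layer alone, which touches on {1} but does not replace a warning to the user.

```python
import os
import subprocess
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Dict, List, Tuple

# A single worker, so that at most one OCR job runs at any time.
_executor = ThreadPoolExecutor(max_workers=1)


def build_ocr_job(src: str, dst: str, lang: str, threads: int = 1) -> Tuple[List[str], Dict[str, str]]:
    """Return (argv, env) for a low-priority, thread-limited OCR run."""
    argv = [
        "nice", "-n", "19",        # lowest scheduling priority
        "ocrmypdf",
        "--jobs", str(threads),    # limit parallel workers
        "--deskew",                # straighten skewed pages before OCR
        "--skip-text",             # skip pages that already have text
        "-l", lang,                # Tesseract language, e.g. "eng"
        src, dst,
    ]
    # Assumption: the OCR engine is built with OpenMP and honours this limit.
    env = dict(os.environ, OMP_THREAD_LIMIT=str(threads))
    return argv, env


def run_in_background(argv: List[str], env: Dict[str, str]) -> Future:
    """Start the command without blocking; the Future yields a CompletedProcess."""
    return _executor.submit(subprocess.run, argv, env=env, check=False)
```

A caller would do something like `future = run_in_background(*build_ocr_job("in.pdf", "out.pdf", "eng"))` and attach a callback with `future.add_done_callback(...)` to notify the user once the job is done.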

The Tools

The classic tools relevant here are

Tesseract - a classic OCR engine which, since version 4, also includes a neural-network-based engine
Leptonica - an image processing library

Related Projects

Related projects that are free as in free speech:

paperwork - a GNOME-y document management app
unpaper - a post-processing tool for scanned sheets of paper; it cleans up and deskews pages but does not perform OCR itself
ocrmypdf - an all-in-one command-line solution for PDFs
pdfsandwich - the good ol' reliable thing from many years ago
EasyOCR - ready-to-use OCR with 80+ supported languages
ocrs - a recent Rust library and CLI tool for extracting text from images

This clearly needs design work, but I gather that @bertob could be interested.

Edited Feb 28, 2024 by Martin Mayer