PDF indexing can be expensive; limits are required
Indexing a directory containing many PDFs can lead to sustained high CPU usage.
The fundamental issue is that the PDF format is designed for printing, not for storing text. In some cases the text is stored as plain strings inside the file, which is quick to extract. In other cases, individual glyphs are positioned on the page and the Poppler library has to group them back into words using an expensive algorithm.
Examples of difficult PDFs:
- https://github.com/GerHobbelt/pdfminer.six/blob/5114acdda61205009221ce4ebf2c68c144fc4ee5/samples/nonfree/i1040nr.pdf - 0.6 seconds to decode
- http://ca.mouser.com/catalog/catalogcad/646/dload/pdf/MOUSER.pdf (linked from evince#190) - 3.6 seconds to decode
This was tested with Poppler 22.08.
Many example PDFs are available at https://github.com/pdf-association/pdf-corpora
Part of #108