pyocr with latest Tesseract fails with pyocr.error.TesseractError: "Error, unknown command line argument '-psm'\n"
Created by: ddddavidmartin
Good day,
I'm using pyocr through Paperless on an Ubuntu setup. Tesseract comes from the tesseract-ocr PPA [0], and with the latest version [1] pyocr throws an error.
[0]
cat /etc/apt/sources.list.d/alex-p-ubuntu-tesseract-ocr-artful.list
deb http://ppa.launchpad.net/alex-p/tesseract-ocr/ubuntu artful main
[1]
tesseract --version
tesseract 4.0.0-beta.1-302-g3aa9
leptonica-1.76.0
libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.2) : libpng 1.6.34 : libtiff 4.0.8 : zlib 1.2.11 : libwebp 0.6.0 : libopenjp2 2.3.0
Traceback:
littlebig@littlebig:~/Dev/paperless$ python3 /home/littlebig/Dev/paperless/src/manage.py document_consumer
Starting document consumer at /home/littlebig/paperless_consumption_dir with inotify
Parsers available: RasterisedDocumentParser
Consuming /home/littlebig/paperless_consumption_dir/BRW90CDB68D60F5_000798.pdf
Processing sheet #1: /tmp/paperless/paperless-b5bgnwtm/convert-0000.pnm -> /tmp/paperless/paperless-b5bgnwtm/convert-0000.unpaper.pnm
[pgm_pipe @ 0x55cbcbdfb980] Stream #0: not enough frames to estimate rate; consider increasing probesize
[image2 @ 0x55cbcbe00140] Using AVStream.codec to pass codec parameters to muxers is deprecated, use AVStream.codecpar instead.
[image2 @ 0x55cbcbe00140] Encoder did not produce proper pts, making some up.
OCRing the document
Parsing for eng
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
File "/usr/lib/python3.6/multiprocessing/pool.py", line 119, in worker
result = (True, func(*args, **kwds))
File "/usr/lib/python3.6/multiprocessing/pool.py", line 44, in mapstar
return list(map(*args))
File "/home/littlebig/Dev/paperless/src/paperless_tesseract/parsers.py", line 290, in image_to_string
return ocr.image_to_string(f, lang=lang)
File "/home/littlebig/.local/lib/python3.6/site-packages/pyocr/tesseract.py", line 367, in image_to_string
raise TesseractError(status, errors)
pyocr.error.TesseractError: (1, b"Error, unknown command line argument '-psm'\n")
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/littlebig/Dev/paperless/src/manage.py", line 18, in <module>
execute_from_command_line(sys.argv)
File "/home/littlebig/.local/lib/python3.6/site-packages/django/core/management/__init__.py", line 364, in execute_from_command_line
utility.execute()
File "/home/littlebig/.local/lib/python3.6/site-packages/django/core/management/__init__.py", line 356, in execute
self.fetch_command(subcommand).run_from_argv(self.argv)
File "/home/littlebig/.local/lib/python3.6/site-packages/django/core/management/base.py", line 283, in run_from_argv
self.execute(*args, **cmd_options)
File "/home/littlebig/.local/lib/python3.6/site-packages/django/core/management/base.py", line 330, in execute
output = self.handle(*args, **options)
File "/home/littlebig/Dev/paperless/src/documents/management/commands/document_consumer.py", line 98, in handle
self.loop_inotify(mail_delta)
File "/home/littlebig/Dev/paperless/src/documents/management/commands/document_consumer.py", line 131, in loop_inotify
self.loop_step(mail_delta)
File "/home/littlebig/Dev/paperless/src/documents/management/commands/document_consumer.py", line 123, in loop_step
self.file_consumer.consume_new_files()
File "/home/littlebig/Dev/paperless/src/documents/consumer.py", line 107, in consume_new_files
if not self.try_consume_file(file):
File "/home/littlebig/Dev/paperless/src/documents/consumer.py", line 145, in try_consume_file
date = parsed_document.get_date()
File "/home/littlebig/Dev/paperless/src/paperless_tesseract/parsers.py", line 209, in get_date
text = self.get_text()
File "/home/littlebig/Dev/paperless/src/paperless_tesseract/parsers.py", line 80, in get_text
self._text = self._get_ocr(images)
File "/home/littlebig/Dev/paperless/src/paperless_tesseract/parsers.py", line 140, in _get_ocr
raw_text = self._ocr([imgs[middle]], self.DEFAULT_OCR_LANGUAGE)
File "/home/littlebig/Dev/paperless/src/paperless_tesseract/parsers.py", line 189, in _ocr
r = pool.map(image_to_string, itertools.product(imgs, [lang]))
File "/usr/lib/python3.6/multiprocessing/pool.py", line 266, in map
return self._map_async(func, iterable, mapstar, chunksize).get()
File "/usr/lib/python3.6/multiprocessing/pool.py", line 644, in get
raise self._value
pyocr.error.TesseractError: (1, b"Error, unknown command line argument '-psm'\n")
littlebig@littlebig:~/Dev/paperless$
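My guess at the cause, as a minimal sketch rather than pyocr's actual code: the error suggests this Tesseract build no longer accepts the single-dash `-psm` option, and I am assuming that 4.x builds expect `--psm` instead. The helper names below (`psm_flag`, `build_command`) are hypothetical and only illustrate choosing the flag from the `tesseract --version` output.

```python
import re


def psm_flag(version_output):
    """Pick the page-segmentation flag from `tesseract --version` output.

    Assumption (not confirmed by the traceback): Tesseract 4.x and later
    only accept "--psm", while older 3.x builds accept "-psm".
    """
    match = re.search(r"tesseract\s+(\d+)\.", version_output)
    if match is None:
        raise ValueError("could not find a Tesseract version number")
    major = int(match.group(1))
    return "--psm" if major >= 4 else "-psm"


def build_command(version_output, image_path, output_base, psm=3):
    """Build a tesseract argument list using the version-appropriate flag."""
    flag = psm_flag(version_output)
    return ["tesseract", image_path, output_base, flag, str(psm)]
```

With the version string from [1], `psm_flag` would select `--psm`, which is the spelling I would expect a fixed pyocr to pass.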
Has anyone else come across this? Thanks!