09/04/2018 - 0.5.3:
- Really fix tesseract 4.0 support (thanks to David Martin)
- Tests: switch from nose to pytest (thanks to Elliott Sales de Andrade)

25/07/2018 - 0.5.2:
- Fix tesseract 4.0 support: Use option '--psm' instead of '-psm'
- tesseract.detect_orientation(): Fix exception generation
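  The option change above can be sketched as follows. This is only an
  illustration: psm_flag() and build_command() are hypothetical helpers, and
  the exact version threshold (4.0) is an assumption, not necessarily what
  PyOCR checks internally.

```python
def psm_flag(version):
    """Return the page-segmentation-mode option for a Tesseract version tuple.

    Assumption: releases before 4.0 take '-psm', 4.0 and later take '--psm'.
    """
    return "--psm" if version >= (4, 0, 0) else "-psm"


def build_command(version, psm, input_path, output_base):
    """Build an argument list of the form: tesseract <input> <output> <options>."""
    return ["tesseract", input_path, output_base, psm_flag(version), str(psm)]
```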

01/03/2017 - 0.5.1:
- libtesseract/Windows: Add possible DLL names for libtesseract
- libtesseract: Keep track of library-loading errors in
  pyocr.libtesseract.lib_load_errors (useful for debugging)
- Build method has changed: now use "make install" instead of
  "python3 ./setup.py install"
- cosmetic: builders/WordHTMLParser: The message "OCR confidence not found"
  flooded the logs when working with old documents. It is now logged at
  debug level instead of info.

14/12/2017 - 0.5:
- Tesseract/Libtesseract + LineBoxBuilder: Add confidence scores to
  every word box and to hOCR files (thanks to Adriano Pagano)
- Tesseract 4 (shell): Add '--oem 0' to specify legacy model when doing
  orientation detection as orientation detection does not work yet with
  Tesseract 4 (thanks to Adriano Pagano)
- Libtesseract: Fix multi-language support
- Tesseract (shell) + Windows: Never let the cmd window appear
- Libtesseract: Implement image_to_pdf() (thanks to Marian Skrip)
- Libtesseract: Hide debug messages (thanks to Ashish Kulkarni)

13/05/2017 - 0.4.7:
- Tesseract 4.00.00alpha:
  - Version parsing: Ignore suffix (so '4.00.00alpha' == (4, 0, 0))
  - Libtesseract: Load libtesseract.so.4 instead of libtesseract.so.3 if
    available
- Support for Tesseract 3.05.00:
  - Builders: Split field 'tess_conf' into 'tess_flags' and 'tess_conf'
  - Libtesseract: If available, use TessBaseAPIDetectOrientationScript()
    instead of TessBaseAPIDetectOS
- Libtesseract: Workaround: Prevent a possible segfault in image_to_string()
  when the target language is not available
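  The suffix-tolerant version parsing described above can be illustrated
  with a minimal sketch. parse_version() is a hypothetical name and the
  regular expression is an assumption about the input format, not PyOCR's
  actual implementation.

```python
import re


def parse_version(text):
    """Parse a Tesseract version string, ignoring any trailing suffix.

    Accepts 'major.minor' or 'major.minor.update'; a missing update
    component is treated as 0.
    """
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", text.strip())
    if match is None:
        raise ValueError("unrecognized version string: %r" % text)
    major, minor, update = match.groups()
    return (int(major), int(minor), int(update or 0))
```

  With this sketch, parse_version("4.00.00alpha") compares equal to
  (4, 0, 0), so ordinary tuple comparison can be used for version checks.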

26/01/2017 - 0.4.6:
- hOCR outputs: Generate valid XHTML files

10/01/2017 - 0.4.5:
- Clean up exceptions raised when OCR fails:
  - Now, all tools raise only exceptions inheriting from
    pyocr.PyocrException
  - There is now one and only one TesseractError (shared between
    pyocr.libtesseract and pyocr.tesseract)
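  For callers, the practical effect is that one except clause covers every
  tool. The sketch below uses stand-in classes so it runs without PyOCR
  installed. The class bodies, run_ocr() and safe_ocr() are assumptions made
  for illustration; only the names PyocrException and TesseractError come
  from this entry.

```python
class PyocrException(Exception):
    """Stand-in for pyocr.PyocrException, the common base class."""


class TesseractError(PyocrException):
    """Stand-in for the single TesseractError shared by both modules."""

    def __init__(self, status, message):
        super().__init__(status, message)
        self.status = status
        self.message = message


def run_ocr(fail):
    """Hypothetical OCR call that fails on demand."""
    if fail:
        raise TesseractError(1, "hypothetical failure")
    return "recognized text"


def safe_ocr(fail):
    """Return the OCR text, or None if any PyOCR-derived error is raised."""
    try:
        return run_ocr(fail)
    except PyocrException:
        # Catching the base class also catches TesseractError.
        return None
```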

08/12/2016 - 0.4.4:
- Fix Python 2.7 support (broken import)

06/12/2016 - 0.4.3:
- (temporary) Use tesseract-sh by default instead of libtesseract. Some
  people have reported crashes with Paperwork+libtesseract. It needs more
  stress-testing
- DigitBuilder is now available in 'pyocr.builders' (can be used
  with libtesseract and cuneiform)
- New builder: DigitLineBoxBuilder
- Windows: Fix pyinstaller packaging support: env variable TESSDATA_PREFIX
  wasn't set correctly
- Windows: Tesseract-sh: Prevent CMD windows from appearing

05/10/2016 - 0.4.2:
* Tesseract: orientation detection: Ignore errors printed by Leptonica
  on stderr (thanks to TeisD)
* Tesseract: Fix support of dev builds (thanks to Fjup)
* Libtesseract: Fix support of dev builds (thanks to Jakub Semerák)
* Tesseract: Use '--list-langs' to get the available languages instead of
  looking for the data directory (thanks to Bernhard Liebl)

06/04/2016 - 0.4.1:
* Libtesseract: Disable it with Tesseract <= 3.03. It tends to segfault.
  Note: the segfault may not actually be related to libtesseract. It may be
  due to other things in Debian stable (jessie). Either way, Paperwork
  cannot work on Debian stable because of it, so the module is disabled
  just to be safe.


13/03/2013 - 0.4.0:
* New module: 'libtesseract'. Use the C API of Tesseract for OCR.
  This module is more efficient and cleaner than the old 'tesseract' module
  (no more fork + exec + sh, less image manipulation, etc).
  Note that with this module the images are just loaded and uncompressed
  by Pillow. With 'tesseract', they were loaded, uncompressed, re-compressed
  and saved by Pillow, then reloaded by Leptonica. So the results may
  vary slightly.
* Tesseract: Add support for Win32
* Tesseract: Fix orientation detection for version >= 3.04.01


0.3.1:
* tesseract.detect_orientation(): Use a temporary file instead of stdin
  to transmit the image to Tesseract. Tesseract 3.04 doesn't support
  stdin + "-psm 0" (regression?)
* tesseract.detect_orientation(): Improve output parsing reliability
* optim: Avoid unnecessary conversion to RGB and allow using image formats
  other than PNG
* TextBuilder + Cuneiform: add extra settings for Cuneiform
  (cuneiform_dotmatrix, cuneiform_fax=False, cuneiform_singlecolumn)


0.3.0:
* New API: pyocr.<tool>.can_detect_orientation() and
  pyocr.<tool>.detect_orientation()


0.2.4:
* Tesseract: add digit-only support
* Tesseract: add support for Tesseract subsets of layout analysis (-psm)


0.2.3:
* Strip the alpha channel from images before running the OCR. It's basically
  useless and can prevent the tool from working correctly.
* Make hOCR parsing more robust (handle extra data around box positions)
* Fix: Take into account that new versions of Tesseract use the file
  extension .hocr instead of .html
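  To illustrate what "extra data around box positions" means: an hOCR title
  attribute can carry other properties next to the bounding box, for example
  'bbox 10 20 110 45; x_wconf 93'. The sketch below searches for the bbox
  property instead of assuming it is the whole string. parse_title() and
  both regular expressions are hypothetical and are not PyOCR's parser.

```python
import re

# Search anywhere in the title so surrounding properties are tolerated.
BBOX_RE = re.compile(r"\bbbox\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)")
WCONF_RE = re.compile(r"\bx_wconf\s+(\d+)")


def parse_title(title):
    """Return (((x0, y0), (x1, y1)), confidence) from an hOCR title string.

    The confidence is None when no x_wconf property is present.
    """
    bbox = BBOX_RE.search(title)
    if bbox is None:
        raise ValueError("no bbox found in title: %r" % title)
    x0, y0, x1, y1 = (int(value) for value in bbox.groups())
    conf = WCONF_RE.search(title)
    return ((x0, y0), (x1, y1)), (int(conf.group(1)) if conf else None)
```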


0.2.2:
* Fix Python 3 support
* Add support for Tesseract on Heroku


0.2.1:
* Make it possible to use 'import pyocr' instead of 'from pyocr import pyocr'.
  'from pyocr import pyocr' still works but is obsolete.
* Fix dependency list: depends on Pillow (it's untested with PIL)
* Fix pyocr.VERSION


0.2.0:
* Python 3.x support


0.1.2:
* Tesseract: Fix version parsing
* Tesseract: Fix Tesseract 3.02.01's hOCR format support


0.1.1:
* hOCR: Parse lines as well as words
* tesseract.get_available_languages(): Fix Fedora support
* Fix UTF-8 support