    Add support for Tesseract version 3.05.00 · d3852693
    aszlig authored
    This is a bit more involved, because Tesseract 3.05.00 comes not only
    with improvements but also with a few quirks we need to deal with.
    
    The first quirk is that the order of the arguments to the `tesseract'
    command now matters: the list of configurations has to come at the end
    of the command line. So we add a new attribute tesseract_flags to the
    BaseBuilder class that contains a list of all the flags to pass to
    `tesseract'. The tesseract_configs attribute remains pretty much the
    same, but now only contains a list of configs instead of being mixed
    with flag arguments.
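
    As a rough sketch of the resulting ordering (the helper name
    build_tesseract_command and the example values are made up for
    illustration and are not the code from this commit):

```python
def build_tesseract_command(input_file, output_base, flags, configs):
    """Assemble a tesseract command line with the configs placed last.

    Hypothetical helper: `flags` plays the role of tesseract_flags and
    `configs` the role of tesseract_configs.
    """
    # Flags (e.g. "-l eng") go right after the input/output names;
    # config names (e.g. "hocr") must come at the very end.
    return ["tesseract", input_file, output_base] + list(flags) + list(configs)


cmd = build_tesseract_command(
    "input.png", "output", ["-l", "eng", "-psm", "6"], ["hocr"]
)
print(" ".join(cmd))
```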
    
    Another quirk has to do with Leptonica >= 1.74, which Tesseract 3.05.00
    now requires. Leptonica has special handling of files that reside in
    /tmp and assumes that they are internal temporary files of Leptonica.
    To deal with this, we now run Tesseract in a temporary directory that
    contains the input/output files and use the relative names of these
    files, because Leptonica only looks for path names beginning with
    /tmp.
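
    The idea can be sketched roughly as follows. The function name, the
    file names "input"/"output" and the injectable runner parameter are
    assumptions made for illustration, not the code from this commit:

```python
import os
import shutil
import subprocess
import tempfile


def run_tesseract_in_tmpdir(image_path, flags=(), configs=(),
                            runner=subprocess.run):
    """Run tesseract from inside a scratch directory (illustrative sketch).

    Only relative file names are passed on the command line, so no
    argument starts with /tmp even if the scratch directory lives there.
    """
    with tempfile.TemporaryDirectory() as workdir:
        # Copy the image into the scratch directory under a fixed name,
        # keeping the original extension.
        input_name = "input" + os.path.splitext(image_path)[1]
        shutil.copy(image_path, os.path.join(workdir, input_name))

        cmd = ["tesseract", input_name, "output", *flags, *configs]
        # cwd=workdir makes the relative names resolve inside workdir.
        runner(cmd, cwd=workdir, check=True)

        # Assumes plain text output, which is written to <output>.txt.
        with open(os.path.join(workdir, "output.txt"), encoding="utf-8") as fh:
            return fh.read()
```

    The runner argument exists only so that the sketch can be exercised
    without Tesseract being installed.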
    
    Fortunately, the last item we need to address is not really a quirk but
    an API change. In Tesseract 3.05.00 there is a new function called
    TessBaseAPIDetectOrientationScript(), which doesn't fill the OSResults
    object anymore but instead lets us pass the values we're interested in
    directly by reference. We need to use this new function because the old
    function TessBaseAPIDetectOS() now *always* returns false.
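
    To illustrate what passing values by reference looks like from Python
    with ctypes, here is a small sketch. It does not call Tesseract:
    fake_detect is a stand-in with a made-up, shortened two-parameter
    signature, and the real parameter list of
    TessBaseAPIDetectOrientationScript() is not reproduced here.

```python
import ctypes

# Stand-in prototype (assumption): int f(int *orientation, float *confidence)
PROTO = ctypes.CFUNCTYPE(
    ctypes.c_int,
    ctypes.POINTER(ctypes.c_int),
    ctypes.POINTER(ctypes.c_float),
)


def _fake_impl(orientation_ptr, confidence_ptr):
    # Write the results through the pointers, the way a C function with
    # out-parameters would.
    orientation_ptr[0] = 90
    confidence_ptr[0] = 7.5
    return 1  # non-zero means success in this stand-in


fake_detect = PROTO(_fake_impl)

# The caller owns the storage and hands over references to it.
orientation = ctypes.c_int(0)
confidence = ctypes.c_float(0.0)
ok = fake_detect(ctypes.byref(orientation), ctypes.byref(confidence))
print(bool(ok), orientation.value, confidence.value)
```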
    
    Ran the test suite successfully with Python 3.5 and both Tesseract
    3.04.01 and 3.05.00 except the following tests, which also didn't
    succeed prior to this commit:
    
     * cuneiform:TestTxt.test_basic
     * cuneiform:TestTxt.test_european
     * cuneiform:TestTxt.test_french
     * cuneiform:TestWordBox.test_basic
     * cuneiform:TestWordBox.test_european
     * cuneiform:TestWordBox.test_french
     * libtesseract:TestBasicDoc.test_basic
     * libtesseract:TestDigitLineBox.test_digits
     * libtesseract:TestLineBox.test_japanese
     * libtesseract:TestTxt.test_japanese
     * libtesseract:TestWordBox.test_japanese
     * tesseract:TestDigitLineBox.test_digits
     * tesseract:TestTxt.test_japanese
    
    The failure of these test cases is probably related to issue #52, but
    from looking at the failures it doesn't seem to be related to this
    change anyway.
    
    Signed-off-by: aszlig <aszlig@redmoonstudios.org>