Tesseract C API : Digits-only (DigitBuilder)
Created by: bennguvaye
Hello,
I'd like to use tesseract with a numerical input, but as it stands this is only possible with the tesseract command line tool and its DigitBuilder, since f36f2492.
However, this looks easy enough to implement with the C API too, with a new function in libtesseract/tesseract_raw.py:
    def set_numeric_only(handle):
        global g_libtesseract
        assert(g_libtesseract)
        g_libtesseract.TessBaseAPISetVariable(
            ctypes.c_void_p(handle),
            b"classify_bln_numeric_mode",
            b"1"
        )
The most conservative way would be to use it in a new builder subclass in libtesseract/__init__.py, in the same way as for tesseract.py.
But I think it might be better to move this to image_to_string, both in libtesseract/__init__.py and tesseract.py, as a new option, like it's done for choosing the language, since from what I understand builders are meant more for choosing the format of the output.
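To make the idea concrete, here is a rough sketch of how such an option might be applied on the libtesseract side. The helper name `apply_numeric_mode`, the `numeric_only` parameter, and the `FakeLib` stand-in are all hypothetical and not part of pyocr. Only `TessBaseAPISetVariable` and `classify_bln_numeric_mode` are taken from the function above. The library object is passed in explicitly so the sketch can run without tesseract installed.

```python
import ctypes


def apply_numeric_mode(lib, handle, numeric_only=False):
    """Hypothetical helper: switch the given API handle to digits-only mode.

    `lib` is the loaded libtesseract (or anything exposing
    TessBaseAPISetVariable), `handle` is the TessBaseAPI pointer value.
    Returns True if the variable was set, False if nothing was done.
    """
    if not numeric_only:
        return False
    lib.TessBaseAPISetVariable(
        ctypes.c_void_p(handle),
        b"classify_bln_numeric_mode",
        b"1",
    )
    return True


class FakeLib:
    """Stand-in for libtesseract that only records the variables set on it."""

    def __init__(self):
        self.calls = []

    def TessBaseAPISetVariable(self, handle, name, value):
        self.calls.append((name, value))
        return 1


# Demonstration with the stand-in; a real call would pass g_libtesseract.
fake = FakeLib()
apply_numeric_mode(fake, 1234, numeric_only=True)
print(fake.calls)
```

In an actual image_to_string, a call like this would presumably go after the API handle is initialised and before recognition runs, but I have not checked how that fits with the existing code.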
I am not too familiar with GitHub, ctypes, or pyocr, so sorry if I'm misunderstanding the code or doing something wrong.
Thank you for your work on this package, Regards
PS: It looks like the C API also offers ways of getting confidence scores for words, which might be interesting to expose through a Builder.