# PyOCR

PyOCR is an optical character recognition (OCR) tool wrapper for Python.
That is, it helps you use OCR tools from a Python program.

It has been tested only on GNU/Linux systems. It should also work on similar
systems (*BSD, etc.). It may or may not work on Windows, Mac OS X, etc.

PyOCR can be used as a wrapper for Google's
[Tesseract-OCR](http://code.google.com/p/tesseract-ocr/) or Cuneiform.
It can read all image types supported by
[Pillow](https://github.com/python-imaging/Pillow), including JPEG, PNG, GIF,
BMP, TIFF, and others. It also supports bounding box data.


## Supported OCR tools

* Libtesseract (C API)
* Tesseract (fork + exec)
* Cuneiform (fork + exec)

## Features

* Supports all the image formats supported by [Pillow](https://github.com/python-imaging/Pillow)
* As output, can provide a simple string or boxes (position + string for each word and line)
* Can focus on digits only (Tesseract only)
* Can save and reload boxes in hOCR format
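
Saving and reloading boxes could look roughly like the sketch below. It assumes the box builders expose `write_file(file_descriptor, boxes)` and `read_file(file_descriptor)` methods, which is not stated in this README. `save_boxes` and `load_boxes` are hypothetical helpers, and `builder` would be, for example, a `pyocr.builders.LineBoxBuilder()` instance.

```Python
import io


def save_boxes(builder, boxes, path):
    """Write 'boxes' to 'path' in the builder's own file format.

    Assumption: the builder has a write_file(file_descriptor, boxes) method.
    """
    with io.open(path, "w", encoding="utf-8") as file_descriptor:
        builder.write_file(file_descriptor, boxes)


def load_boxes(builder, path):
    """Read boxes back from 'path' using the same kind of builder.

    Assumption: the builder has a read_file(file_descriptor) method.
    """
    with io.open(path, "r", encoding="utf-8") as file_descriptor:
        return builder.read_file(file_descriptor)
```

Passing the builder in, rather than creating it inside the helpers, keeps the same builder type on both the write and the read side.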

## Limitations

* hOCR: Only a subset of the specification is supported. For instance, pages and paragraph positions are not stored.

## Usage

### Initialization

```Python
from PIL import Image
import sys

import pyocr
import pyocr.builders

tools = pyocr.get_available_tools()
if len(tools) == 0:
    print("No OCR tool found")
    sys.exit(1)
# The tools are returned in the recommended order of usage
tool = tools[0]
print("Will use tool '%s'" % (tool.get_name()))
# Ex: Will use tool 'libtesseract'

langs = tool.get_available_languages()
print("Available languages: %s" % ", ".join(langs))
lang = langs[0]
print("Will use lang '%s'" % (lang))
# Ex: Will use lang 'fra'
# Note that languages are NOT sorted in any way. Please refer
# to the system locale settings for the default language
# to use.
```
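
If you would rather pick a tool by name than take the first one, a small helper along these lines may do. This is only a sketch: `pick_tool` is a hypothetical helper, not part of PyOCR, and it relies solely on `get_name()` as used above.

```Python
def pick_tool(tools, preferred_names):
    """Return the first tool whose name is listed in 'preferred_names'.

    Names are tried in the order given. If none matches, the first
    available tool is returned; if 'tools' is empty, None is returned.
    """
    for name in preferred_names:
        for tool in tools:
            if tool.get_name() == name:
                return tool
    return tools[0] if tools else None
```

The exact strings returned by `get_name()` may differ between PyOCR versions, so print them once before hard-coding a preference list.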

### Image to text

```Python
txt = tool.image_to_string(
    Image.open('test.png'),
    lang=lang,
    builder=pyocr.builders.TextBuilder()
)

word_boxes = tool.image_to_string(
    Image.open('test.png'),
    lang="eng",
    builder=pyocr.builders.WordBoxBuilder()
)

line_and_word_boxes = tool.image_to_string(
    Image.open('test.png'),
    lang="fra",
    builder=pyocr.builders.LineBoxBuilder()
)

# Digits: only Tesseract (not 'libtesseract' yet!)
digits = tool.image_to_string(
    Image.open('test-digits.png'),
    lang=lang,
    builder=pyocr.tesseract.DigitBuilder()
)
```
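
The box builders return objects rather than plain strings. As a rough sketch, assuming each box exposes a `content` attribute (the recognized text) and a `position` attribute of the form `((x1, y1), (x2, y2))`, the results could be flattened as shown below. `describe_boxes` is a hypothetical helper; check the attribute names against your PyOCR version.

```Python
def describe_boxes(boxes):
    """Return one 'x1,y1 x2,y2: text' string per box.

    Assumption: each box has 'content' (text) and 'position'
    (((x1, y1), (x2, y2))) attributes. This is not checked here.
    """
    lines = []
    for box in boxes:
        (x1, y1), (x2, y2) = box.position
        lines.append("%d,%d %d,%d: %s" % (x1, y1, x2, y2, box.content))
    return lines
```

It would be called as `describe_boxes(word_boxes)` with the list obtained above.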

Argument 'lang' is optional. The default value depends on
the tool used.

Argument 'builder' is optional. The default value is
builders.TextBuilder().
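
Because the default language depends on the tool and the list of available languages is not sorted, it can help to choose the language explicitly from the system locale. The sketch below is one possible way to do so. `pick_language` and the `ISO_TO_OCR_LANG` table are hypothetical and deliberately incomplete; extend the table with the languages you have installed.

```Python
import locale

# Partial, illustrative mapping from two-letter ISO 639-1 codes to the
# three-letter codes used by Tesseract language packs.
ISO_TO_OCR_LANG = {
    "en": "eng",
    "fr": "fra",
    "de": "deu",
    "ja": "jpn",
}


def pick_language(available, locale_name=None, fallback="eng"):
    """Pick a language matching the locale, else 'fallback', else any.

    'available' is the list returned by tool.get_available_languages().
    Returns None if 'available' is empty.
    """
    if locale_name is None:
        # getlocale() may return (None, None) if no locale is configured
        locale_name = locale.getlocale()[0]
    if locale_name:
        wanted = ISO_TO_OCR_LANG.get(locale_name.split("_")[0].lower())
        if wanted in available:
            return wanted
    if fallback in available:
        return fallback
    return available[0] if available else None
```

For instance, `lang = pick_language(tool.get_available_languages())` could replace `langs[0]` in the initialization code.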


### Orientation detection

Currently only available with Tesseract or Libtesseract.

```Python
if tool.can_detect_orientation():
    orientation = tool.detect_orientation(
        Image.open('test.png'),
        lang='fra'
    )
    print("Orientation: {}".format(orientation))
# Ex: Orientation: {
#   'angle': 90,
#   'confidence': 123.4,
# }
```

Angles are given in degrees (range: [0, 360)). The exact possible
values depend on the tool used. Tesseract only returns
0, 90, 180, or 270.

Confidence is a score arbitrarily defined by the tool. It MAY not
be returned.

detect_orientation() MAY raise an exception if there is no text
detected in the image.
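
Since the confidence may be missing and the call may raise, a defensive wrapper can keep the calling code simple. The following is a minimal sketch: `detect_orientation_safely` is a hypothetical helper, and it catches a broad `Exception` only because the exact exception type is not specified here.

```Python
def detect_orientation_safely(tool, image, lang):
    """Return (angle, confidence), or None if no orientation is available.

    'confidence' is None when the tool does not report one.
    """
    if not tool.can_detect_orientation():
        return None
    try:
        orientation = tool.detect_orientation(image, lang=lang)
    except Exception:
        # Broad on purpose: narrow this once you know what your tool raises
        # when no text is detected.
        return None
    return orientation["angle"], orientation.get("confidence")
```

A caller can then treat `None` as "leave the image as it is" and only rotate when an angle is returned.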


## Dependencies

* PyOCR requires Python 2.7 or later. Python 3 is supported.
* You will need [Pillow](https://github.com/python-imaging/Pillow)
  or Python Imaging Library (PIL). Under Debian/Ubuntu, PIL is in
  the package "python-imaging".
* Install an OCR tool:
  * [libtesseract](http://code.google.com/p/tesseract-ocr/)
    ('libtesseract3' + 'tesseract-ocr-<lang>' in Debian).
  * or [tesseract-ocr](http://code.google.com/p/tesseract-ocr/)
    ('tesseract-ocr' + 'tesseract-ocr-<lang>' in Debian).
    You must be able to invoke the tesseract command as "tesseract".
    PyOCR is tested with Tesseract >= 3.01 only.
  * or cuneiform
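
The requirement that the command-line tools be reachable by name can be checked before running any OCR. The sketch below uses `shutil.which`, which needs Python 3.3 or later. `find_ocr_commands` is a hypothetical helper, and the command name "cuneiform" is an assumption; only "tesseract" is named above.

```Python
import shutil


def find_ocr_commands(names=("tesseract", "cuneiform")):
    """Map each command name to its full path, or None if not on PATH."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for name, path in find_ocr_commands().items():
        print("%s: %s" % (name, path or "not found"))
```

This only checks the fork + exec tools; it says nothing about whether the libtesseract shared library can be loaded.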


## Installation

    $ sudo python ./setup.py install


## Tests

    $ python ./run_tests.py

Tests are made to be run with the latest versions of Tesseract and Cuneiform.
The first tests verify that you're using the expected version.

To run the Tesseract tests, you will need the following language data files:
- English (tesseract-ocr-eng)
- French (tesseract-ocr-fra)
- Japanese (tesseract-ocr-jpn)


## OCR on natural scenes

If you want to run OCR on natural scenes (photos, etc.), you will have to filter
the image first. Many algorithms can do that. One of those
that gives the best results is [Stroke Width Transform](https://github.com/jflesch/libpillowfight#stroke-width-transformation).


## Copyright

PyOCR is released under the GPL v3+.

tesseract.py:

* Copyright (c) Samuel Hoffstaetter, 2009
* Copyright (c) Jerome Flesch, 2011-2016

Other files:

* Copyright (c) Jerome Flesch, 2011-2016

https://github.com/jflesch/pyocr