    workdir|rootdir = ~/papers

In the work directory, you have folders, one per document.

The folder names are (usually) the scan/import date of the document:
YYYYMMDD\_hhmm\_ss[\_<idx>]. The suffix 'idx' is optional and is just
a number added in case of name collision.

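The naming scheme can be sketched as a small shell function. This is only an illustration: the helper name `next_doc_dir` is hypothetical, and the assumption that the collision index starts at 1 and counts up is not stated anywhere above.

```shell
# Hypothetical helper: print an unused document folder name for a work directory.
# Usage: next_doc_dir WORKDIR [STAMP]
next_doc_dir() {
    local workdir="$1"
    # STAMP defaults to the current time in YYYYMMDD_hhmm_ss form.
    local stamp="${2:-$(date +%Y%m%d_%H%M_%S)}"
    local name="$stamp"
    local idx=0
    # Assumption: on collision, append _1, _2, ... until the name is free.
    while [ -e "$workdir/$name" ]; do
        idx=$((idx + 1))
        name="${stamp}_${idx}"
    done
    printf '%s\n' "$name"
}
```

For instance, `mkdir ~/papers/"$(next_doc_dir ~/papers)"` would create a folder named after the current date and time.
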
In every folder you have:

* For image documents:
  * paper.<X>.jpg : A page in JPG format (X starts at 1)
  * paper.<X>.words : A [hOCR](https://docs.google.com/document/d/1QQnIQtvdAC_8n92-LhwPcjtAUFwBlzE8EWnKAxlgVf0/preview) file, containing all the words found on the page by the OCR.
  * paper.<X>.thumb.jpg (optional) : A thumbnail version of the page (faster to load)
  * labels (optional) : a text file containing the labels applied to this document
  * extra.txt (optional) : extra keywords added by the user
* For PDF documents:
  * doc.pdf : the document
  * labels (optional) : a text file containing the labels applied to this document
  * extra.txt (optional) : extra keywords added by the user
  * paper.<X>.words (optional) : A [hOCR](https://docs.google.com/document/d/1QQnIQtvdAC_8n92-LhwPcjtAUFwBlzE8EWnKAxlgVf0/preview) file, containing all the words found on the page by the OCR. Some PDFs contain garbage instead of the real text, so running the OCR on them can sometimes be useful.

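The layout above is enough to tell the two kinds of document apart from a script. The following is a minimal sketch; the function names `doc_type` and `count_pages` are hypothetical, and it assumes that page numbers run from 1 without gaps.

```shell
# Hypothetical helper: report whether a document folder holds a PDF or scanned images.
doc_type() {
    local dir="$1"
    if [ -f "$dir/doc.pdf" ]; then
        echo pdf
    elif [ -f "$dir/paper.1.jpg" ]; then
        echo image
    else
        echo unknown
    fi
}

# Hypothetical helper: count the pages of an image document.
# Assumption: pages are numbered 1, 2, 3, ... so counting stops at the first missing file.
count_pages() {
    local dir="$1"
    local n=0
    while [ -f "$dir/paper.$((n + 1)).jpg" ]; do
        n=$((n + 1))
    done
    echo "$n"
}
```

Thumbnails (paper.<X>.thumb.jpg) are never counted, because only the exact name paper.<X>.jpg is tested.
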
With Tesseract, the hOCR file can be obtained with the following command:

    tesseract paper.<X>.jpg paper.<X> -l <lang> hocr && mv paper.<X>.html paper.<X>.words

For example:

    tesseract paper.1.jpg paper.1 -l fra hocr && mv paper.1.html paper.1.words

Depending on the Tesseract version, the output file may be named paper.<X>.hocr instead of paper.<X>.html; adjust the mv command accordingly.

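To OCR every page of a document in one go, the command can be wrapped in a loop. This is a sketch under assumptions: `ocr_folder` is a hypothetical helper name, and the function accepts either a .hocr or a .html output file because the extension depends on the installed Tesseract version.

```shell
# Hypothetical helper: run the OCR on every page image of one document folder.
# Usage: ocr_folder DIR LANG
ocr_folder() {
    local dir="$1"
    local lang="$2"
    local jpg base
    for jpg in "$dir"/paper.*.jpg; do
        [ -e "$jpg" ] || continue                     # the glob matched nothing
        case "$jpg" in *.thumb.jpg) continue ;; esac  # skip thumbnails
        base="${jpg%.jpg}"
        tesseract "$jpg" "$base" -l "$lang" hocr || return 1
        # Rename whichever output file this Tesseract version produced.
        if [ -f "$base.hocr" ]; then
            mv "$base.hocr" "$base.words"
        elif [ -f "$base.html" ]; then
            mv "$base.html" "$base.words"
        fi
    done
}
```

A call such as `ocr_folder ~/papers/20130505_1518_00 fra` would then produce one paper.<X>.words file per page.
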
Here is an example of a work directory organisation:

    $ find ~/papers
    /home/jflesch/papers
    /home/jflesch/papers/20130505_1518_00
    /home/jflesch/papers/20130505_1518_00/paper.1.jpg
    /home/jflesch/papers/20130505_1518_00/paper.1.thumb.jpg
    /home/jflesch/papers/20130505_1518_00/paper.1.words
    /home/jflesch/papers/20130505_1518_00/paper.2.jpg
    /home/jflesch/papers/20130505_1518_00/paper.2.thumb.jpg
    /home/jflesch/papers/20130505_1518_00/paper.2.words
    /home/jflesch/papers/20130505_1518_00/paper.3.jpg
    /home/jflesch/papers/20130505_1518_00/paper.3.thumb.jpg
    /home/jflesch/papers/20130505_1518_00/paper.3.words
    /home/jflesch/papers/20130505_1518_00/labels
    /home/jflesch/papers/20110726_0000_01
    /home/jflesch/papers/20110726_0000_01/paper.1.jpg
    /home/jflesch/papers/20110726_0000_01/paper.1.thumb.jpg
    /home/jflesch/papers/20110726_0000_01/paper.1.words
    /home/jflesch/papers/20110726_0000_01/paper.2.jpg
    /home/jflesch/papers/20110726_0000_01/paper.2.thumb.jpg
    /home/jflesch/papers/20110726_0000_01/paper.2.words
    /home/jflesch/papers/20110726_0000_01/extra.txt
    /home/jflesch/papers/20130106_1309_44
    /home/jflesch/papers/20130106_1309_44/doc.pdf
    /home/jflesch/papers/20130106_1309_44/paper.1.words
    /home/jflesch/papers/20130106_1309_44/paper.2.words
    /home/jflesch/papers/20130106_1309_44/labels
    /home/jflesch/papers/20130106_1309_44/extra.txt

Here is an example of the content of a labels file:

```
facture,#0000b1588c61
logement,#f6b6ffff0000
```

Each line is always [label],[color]. A given label should always have the same color.

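Reading such a file from a script could look like the sketch below. The helper name `read_labels` is hypothetical, and splitting on the last comma assumes that a color never contains a comma.

```shell
# Hypothetical helper: print each label of a labels file with its color, tab-separated.
# Usage: read_labels FILE
read_labels() {
    local file="$1"
    local line label color
    # The "|| [ -n ... ]" part also handles a last line without a trailing newline.
    while IFS= read -r line || [ -n "$line" ]; do
        [ -n "$line" ] || continue   # ignore empty lines
        label="${line%,*}"           # everything before the last comma
        color="${line##*,}"          # everything after the last comma
        printf '%s\t%s\n' "$label" "$color"
    done < "$file"
}
```

For the example file above, `read_labels labels` prints one line per label, with the label name in the first column and its color in the second.
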