Commit df7607f1 authored by Jerome Flesch

Tesseract C-API: Output differs from Tesseract SH, but there is no obvious...

Tesseract C-API: Output differs from Tesseract SH, but there is no obvious reason. The most likely cause is that we use PIL to read the images instead of Leptonica (liblept).
So we simply use a separate set of expected output files.
Signed-off-by: Jerome Flesch <jflesch@gmail.com>
parent 1447d5fa
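One way to see where the two backends disagree is to diff the expected output of the shell wrapper against that of the C-API. The following is a minimal sketch using only the standard library; `diff_outputs` is a hypothetical helper, not part of pyocr, and the strings in the usage note are placeholders rather than real OCR output.

```python
import difflib


def diff_outputs(sh_text, capi_text):
    """Return unified-diff lines between the two backends' text output.

    An empty list means both backends produced the same text.
    """
    return list(difflib.unified_diff(
        sh_text.splitlines(),
        capi_text.splitlines(),
        fromfile="tests/tesseract",        # output of the shell wrapper
        tofile="tests/tesseract_capi",     # output of the C-API binding
        lineterm="",
    ))
```

Calling `diff_outputs(open(a).read(), open(b).read())` on a matching pair of files from `tests/tesseract/` and `tests/tesseract_capi/` would list only the lines on which the backends differ.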
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
"http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<head>
<meta http-equiv="content-type" content="text/html; charset=utf-8" />
</head>
<body>
<span class="ocr_line" title="bbox 105 66 823 113"> <span class="ocrx_word" title="bbox 105 66 178 97">The</span> <span class="ocrx_word" title="bbox 205 67 347 106">(quick)</span> <span class="ocrx_word" title="bbox 376 69 528 109">[brown]</span> <span class="ocrx_word" title="bbox 559 71 663 110">{fox}</span> <span class="ocrx_word" title="bbox 687 73 823 113">jumps!</span></span><br/>
<span class="ocr_line" title="bbox 104 115 887 165"> <span class="ocrx_word" title="bbox 104 115 199 147">Over</span> <span class="ocrx_word" title="bbox 224 117 283 148">the</span> <span class="ocrx_word" title="bbox 310 117 533 155">$43,456.78</span> <span class="ocrx_word" title="bbox 561 121 696 162">&lt;lazy&gt;</span> <span class="ocrx_word" title="bbox 722 123 791 154">#90</span> <span class="ocrx_word" title="bbox 818 125 887 165">dog</span></span><br/>
<span class="ocr_line" title="bbox 103 165 835 206"> <span class="ocrx_word" title="bbox 103 165 134 196">&amp;</span> <span class="ocrx_word" title="bbox 160 166 396 206">duck/goose,</span> <span class="ocrx_word" title="bbox 424 178 463 201">as</span> <span class="ocrx_word" title="bbox 493 171 614 203">12.5%</span> <span class="ocrx_word" title="bbox 638 172 680 204">of</span> <span class="ocrx_word" title="bbox 700 174 835 206">E-mail</span></span><br/>
<span class="ocr_line" title="bbox 103 215 911 264"> <span class="ocrx_word" title="bbox 103 215 194 247">from</span> <span class="ocrx_word" title="bbox 220 219 716 260">aspammer@website.com</span> <span class="ocrx_word" title="bbox 742 223 773 255">is</span> <span class="ocrx_word" title="bbox 799 233 911 264">spam.</span></span><br/>
<span class="ocr_line" title="bbox 102 266 877 314"> <span class="ocrx_word" title="bbox 102 266 173 297">Der</span> <span class="ocrx_word" title="bbox 198 267 406 302">,,schnelle”</span> <span class="ocrx_word" title="bbox 433 269 568 302">braune</span> <span class="ocrx_word" title="bbox 594 272 709 304">Fuchs</span> <span class="ocrx_word" title="bbox 735 274 877 314">springt</span></span><br/>
<span class="ocr_line" title="bbox 102 315 918 357"> <span class="ocrx_word" title="bbox 102 315 187 347">fiber</span> <span class="ocrx_word" title="bbox 212 317 280 348">den</span> <span class="ocrx_word" title="bbox 306 318 430 350">faulen</span> <span class="ocrx_word" title="bbox 456 320 572 352">Hund.</span> <span class="ocrx_word" title="bbox 601 322 648 354">Le</span> <span class="ocrx_word" title="bbox 674 324 803 356">renard</span> <span class="ocrx_word" title="bbox 827 325 918 357">brun</span></span><br/>
<span class="ocr_line" title="bbox 101 366 833 409"> <span class="ocrx_word" title="bbox 101 366 274 405">«rapide»</span> <span class="ocrx_word" title="bbox 302 373 403 400">saute</span> <span class="ocrx_word" title="bbox 428 371 641 409">par-dessus</span> <span class="ocrx_word" title="bbox 667 372 700 404">le</span> <span class="ocrx_word" title="bbox 725 374 833 406">chien</span></span><br/>
<span class="ocr_line" title="bbox 100 419 859 464"> <span class="ocrx_word" title="bbox 100 424 308 454">paresseux.</span> <span class="ocrx_word" title="bbox 337 419 384 450">La</span> <span class="ocrx_word" title="bbox 409 420 516 459">volpe</span> <span class="ocrx_word" title="bbox 543 430 707 455">marrone</span> <span class="ocrx_word" title="bbox 733 424 859 464">rapida</span></span><br/>
<span class="ocr_line" title="bbox 100 466 834 511"> <span class="ocrx_word" title="bbox 100 466 192 497">salta</span> <span class="ocrx_word" title="bbox 219 475 324 507">sopra</span> <span class="ocrx_word" title="bbox 351 468 376 499">i]</span> <span class="ocrx_word" title="bbox 403 478 491 501">cane</span> <span class="ocrx_word" title="bbox 517 471 633 511">pigro.</span> <span class="ocrx_word" title="bbox 662 473 703 504">El</span> <span class="ocrx_word" title="bbox 729 482 834 506">zorro</span></span><br/>
<span class="ocr_line" title="bbox 99 516 833 563"> <span class="ocrx_word" title="bbox 99 516 242 548">marrén</span> <span class="ocrx_word" title="bbox 268 517 395 557">répido</span> <span class="ocrx_word" title="bbox 421 520 513 552">salta</span> <span class="ocrx_word" title="bbox 540 521 644 554">sobre</span> <span class="ocrx_word" title="bbox 669 523 702 554">el</span> <span class="ocrx_word" title="bbox 728 532 833 563">perro</span></span><br/>
<span class="ocr_line" title="bbox 98 568 829 613"> <span class="ocrx_word" title="bbox 98 574 284 604">perezoso.</span> <span class="ocrx_word" title="bbox 313 568 342 598">A</span> <span class="ocrx_word" title="bbox 369 578 497 609">raposa</span> <span class="ocrx_word" title="bbox 523 579 677 604">marrom</span> <span class="ocrx_word" title="bbox 703 573 829 613">répida</span></span><br/>
<span class="ocr_line" title="bbox 98 616 710 661"> <span class="ocrx_word" title="bbox 98 616 190 647">salta</span> <span class="ocrx_word" title="bbox 217 617 320 649">sobre</span> <span class="ocrx_word" title="bbox 346 627 366 650">0</span> <span class="ocrx_word" title="bbox 391 621 456 651">C50</span> <span class="ocrx_word" title="bbox 481 621 710 661">preguieoso.</span></span><br/>
</body>
The (quick) [brown] {fox} jumps!
Over the $43,456.78 <lazy> #90 dog
& duck/goose, as 12.5% of E-mail
from aspammer@website.com is spam.
Der ,,schnelle” braune Fuchs springt
fiber den faulen Hund. Le renard brun
«rapide» saute par-dessus le chien
paresseux. La volpe marrone rapida
salta sopra i] cane pigro. El zorro
marrén répido salta sobre el perro
perezoso. A raposa marrom répida
salta sobre 0 C50 preguieoso.
\ No newline at end of file
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
"http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<head>
<meta http-equiv="content-type" content="text/html; charset=utf-8" />
</head>
<body>
<span class="ocrx_word" title="bbox 105 66 178 97">The</span><br/>
<span class="ocrx_word" title="bbox 205 67 347 106">(quick)</span><br/>
<span class="ocrx_word" title="bbox 376 69 528 109">[brown]</span><br/>
<span class="ocrx_word" title="bbox 559 71 663 110">{fox}</span><br/>
<span class="ocrx_word" title="bbox 687 73 823 113">jumps!</span><br/>
<span class="ocrx_word" title="bbox 104 115 199 147">Over</span><br/>
<span class="ocrx_word" title="bbox 224 117 283 148">the</span><br/>
<span class="ocrx_word" title="bbox 310 117 533 155">$43,456.78</span><br/>
<span class="ocrx_word" title="bbox 561 121 696 162">&lt;lazy&gt;</span><br/>
<span class="ocrx_word" title="bbox 722 123 791 154">#90</span><br/>
<span class="ocrx_word" title="bbox 818 125 887 165">dog</span><br/>
<span class="ocrx_word" title="bbox 103 165 134 196">&amp;</span><br/>
<span class="ocrx_word" title="bbox 160 166 396 206">duck/goose,</span><br/>
<span class="ocrx_word" title="bbox 424 178 463 201">as</span><br/>
<span class="ocrx_word" title="bbox 493 171 614 203">12.5%</span><br/>
<span class="ocrx_word" title="bbox 638 172 680 204">of</span><br/>
<span class="ocrx_word" title="bbox 700 174 835 206">E-mail</span><br/>
<span class="ocrx_word" title="bbox 103 215 194 247">from</span><br/>
<span class="ocrx_word" title="bbox 220 219 716 260">aspammer@website.com</span><br/>
<span class="ocrx_word" title="bbox 742 223 773 255">is</span><br/>
<span class="ocrx_word" title="bbox 799 233 911 264">spam.</span><br/>
<span class="ocrx_word" title="bbox 102 266 173 297">Der</span><br/>
<span class="ocrx_word" title="bbox 198 267 406 302">,,schnelle”</span><br/>
<span class="ocrx_word" title="bbox 433 269 568 302">braune</span><br/>
<span class="ocrx_word" title="bbox 594 272 709 304">Fuchs</span><br/>
<span class="ocrx_word" title="bbox 735 274 877 314">springt</span><br/>
<span class="ocrx_word" title="bbox 102 315 187 347">fiber</span><br/>
<span class="ocrx_word" title="bbox 212 317 280 348">den</span><br/>
<span class="ocrx_word" title="bbox 306 318 430 350">faulen</span><br/>
<span class="ocrx_word" title="bbox 456 320 572 352">Hund.</span><br/>
<span class="ocrx_word" title="bbox 601 322 648 354">Le</span><br/>
<span class="ocrx_word" title="bbox 674 324 803 356">renard</span><br/>
<span class="ocrx_word" title="bbox 827 325 918 357">brun</span><br/>
<span class="ocrx_word" title="bbox 101 366 274 405">«rapide»</span><br/>
<span class="ocrx_word" title="bbox 302 373 403 400">saute</span><br/>
<span class="ocrx_word" title="bbox 428 371 641 409">par-dessus</span><br/>
<span class="ocrx_word" title="bbox 667 372 700 404">le</span><br/>
<span class="ocrx_word" title="bbox 725 374 833 406">chien</span><br/>
<span class="ocrx_word" title="bbox 100 424 308 454">paresseux.</span><br/>
<span class="ocrx_word" title="bbox 337 419 384 450">La</span><br/>
<span class="ocrx_word" title="bbox 409 420 516 459">volpe</span><br/>
<span class="ocrx_word" title="bbox 543 430 707 455">marrone</span><br/>
<span class="ocrx_word" title="bbox 733 424 859 464">rapida</span><br/>
<span class="ocrx_word" title="bbox 100 466 192 497">salta</span><br/>
<span class="ocrx_word" title="bbox 219 475 324 507">sopra</span><br/>
<span class="ocrx_word" title="bbox 351 468 376 499">i]</span><br/>
<span class="ocrx_word" title="bbox 403 478 491 501">cane</span><br/>
<span class="ocrx_word" title="bbox 517 471 633 511">pigro.</span><br/>
<span class="ocrx_word" title="bbox 662 473 703 504">El</span><br/>
<span class="ocrx_word" title="bbox 729 482 834 506">zorro</span><br/>
<span class="ocrx_word" title="bbox 99 516 242 548">marrén</span><br/>
<span class="ocrx_word" title="bbox 268 517 395 557">répido</span><br/>
<span class="ocrx_word" title="bbox 421 520 513 552">salta</span><br/>
<span class="ocrx_word" title="bbox 540 521 644 554">sobre</span><br/>
<span class="ocrx_word" title="bbox 669 523 702 554">el</span><br/>
<span class="ocrx_word" title="bbox 728 532 833 563">perro</span><br/>
<span class="ocrx_word" title="bbox 98 574 284 604">perezoso.</span><br/>
<span class="ocrx_word" title="bbox 313 568 342 598">A</span><br/>
<span class="ocrx_word" title="bbox 369 578 497 609">raposa</span><br/>
<span class="ocrx_word" title="bbox 523 579 677 604">marrom</span><br/>
<span class="ocrx_word" title="bbox 703 573 829 613">répida</span><br/>
<span class="ocrx_word" title="bbox 98 616 190 647">salta</span><br/>
<span class="ocrx_word" title="bbox 217 617 320 649">sobre</span><br/>
<span class="ocrx_word" title="bbox 346 627 366 650">0</span><br/>
<span class="ocrx_word" title="bbox 391 621 456 651">C50</span><br/>
<span class="ocrx_word" title="bbox 481 621 710 661">preguieoso.</span><br/>
</body>
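The word-box file above stores one `ocrx_word` span per word, with the bounding box in the `title` attribute as `bbox x1 y1 x2 y2`. As a rough sketch of how such a file could be read back with the standard library alone (this is not pyocr's own reader; `WordBoxParser` and `parse_word_boxes` are hypothetical names):

```python
from html.parser import HTMLParser


class WordBoxParser(HTMLParser):
    """Collect (text, (x1, y1, x2, y2)) pairs from ocrx_word spans."""

    def __init__(self):
        super().__init__()  # character references such as &lt; are decoded
        self.boxes = []
        self._bbox = None   # bbox of the word span currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "span" and attrs.get("class") == "ocrx_word":
            parts = attrs.get("title", "").split()
            if len(parts) == 5 and parts[0] == "bbox":
                self._bbox = tuple(int(p) for p in parts[1:])
                self._text = []

    def handle_data(self, data):
        if self._bbox is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        # Only close a word span; enclosing ocr_line spans are ignored.
        if tag == "span" and self._bbox is not None:
            self.boxes.append(("".join(self._text), self._bbox))
            self._bbox = None


def parse_word_boxes(markup):
    parser = WordBoxParser()
    parser.feed(markup)
    parser.close()
    return parser.boxes
```

The same parser also works on the line-box files, because the outer `ocr_line` spans are skipped and only the nested word spans are collected.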
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
"http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<head>
<meta http-equiv="content-type" content="text/html; charset=utf-8" />
</head>
<body>
<span class="ocr_line" title="bbox 23 36 186 55"> <span class="ocrx_word" title="bbox 23 36 81 51">PhraSe</span> <span class="ocrx_word" title="bbox 87 41 108 51">en</span> <span class="ocrx_word" title="bbox 115 37 186 55">français.</span></span><br/>
<span class="ocr_line" title="bbox 21 57 174 78"> <span class="ocrx_word" title="bbox 21 58 63 78">Avec</span> <span class="ocrx_word" title="bbox 70 57 99 72">des</span> <span class="ocrx_word" title="bbox 106 60 174 73">accents.</span></span><br/>
<span class="ocr_line" title="bbox 23 78 112 96"> <span class="ocrx_word" title="bbox 23 78 112 96">Ephémère</span></span><br/>
</body>
PhraSe en français.
Avec des accents.
Ephémère
\ No newline at end of file
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
"http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<head>
<meta http-equiv="content-type" content="text/html; charset=utf-8" />
</head>
<body>
<span class="ocrx_word" title="bbox 23 36 81 51">PhraSe</span><br/>
<span class="ocrx_word" title="bbox 87 41 108 51">en</span><br/>
<span class="ocrx_word" title="bbox 115 37 186 55">français.</span><br/>
<span class="ocrx_word" title="bbox 21 58 63 78">Avec</span><br/>
<span class="ocrx_word" title="bbox 70 57 99 72">des</span><br/>
<span class="ocrx_word" title="bbox 106 60 174 73">accents.</span><br/>
<span class="ocrx_word" title="bbox 23 78 112 96">Ephémère</span><br/>
</body>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
"http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<head>
<meta http-equiv="content-type" content="text/html; charset=utf-8" />
</head>
<body>
<span class="ocr_line" title="bbox 34 23 1151 111"> <span class="ocrx_word" title="bbox 34 23 617 111">たいなかりつ</span> <span class="ocrx_word" title="bbox 666 23 1151 111">おれのよめ</span></span><br/>
<span class="ocr_line" title="bbox 34 171 220 195"> <span class="ocrx_word" title="bbox 34 171 220 195"> </span> <span class="ocrx_word" title="bbox 34 171 220 197"> </span> <span class="ocrx_word" title="bbox 154 172 159 253"> </span> <span class="ocrx_word" title="bbox 234 188 270 238">=</span></span><br/>
<span class="ocr_line" title="bbox 35 206 222 228"> <span class="ocrx_word" title="bbox 35 206 222 228"> </span> <span class="ocrx_word" title="bbox 70 177 81 257"> </span> <span class="ocrx_word" title="bbox 188 172 199 264"> </span> <span class="ocrx_word" title="bbox 270 171 281 264"> </span> <span class="ocrx_word" title="bbox 35 177 46 263"> </span> <span class="ocrx_word" title="bbox 106 177 117 263"> </span> <span class="ocrx_word" title="bbox 35 247 117 257"> </span> <span class="ocrx_word" title="bbox 381 171 392 264"> </span> <span class="ocrx_word" title="bbox 281 171 381 264"></span></span><br/>
<span class="ocr_line" title="bbox 392 171 757 264"> <span class="ocrx_word" title="bbox 392 171 757 264">童俺の嫁</span></span><br/>
<span class="ocr_line" title="bbox 27 309 973 366"> <span class="ocrx_word" title="bbox 27 311 147 364">abC</span> <span class="ocrx_word" title="bbox 172 310 319 364">ABC</span> <span class="ocrx_word" title="bbox 348 309 783 365">ぁいうぇおかき</span> <span class="ocrx_word" title="bbox 803 309 840 366"></span> <span class="ocrx_word" title="bbox 863 310 973 366">けこ</span></span><br/>
<span class="ocr_line" title="bbox 26 406 661 460"> <span class="ocrx_word" title="bbox 26 406 152 460">00ー</span> <span class="ocrx_word" title="bbox 173 406 661 460">ー2ー3B45S6789</span></span><br/>
<span class="ocr_line" title="bbox 263 513 723 613"> <span class="ocrx_word" title="bbox 263 516 348 612">F</span> <span class="ocrx_word" title="bbox 451 513 533 612">U</span> <span class="ocrx_word" title="bbox 639 513 723 613">N</span></span><br/>
<span class="ocr_line" title="bbox 316 655 723 694"> <span class="ocrx_word" title="bbox 316 655 723 694">ぁったりまぇじゃん!</span></span><br/>
<span class="ocr_line" title="bbox 616 716 723 749"> <span class="ocrx_word" title="bbox 616 716 723 749">rit$u</span></span><br/>
<span class="ocr_line" title="bbox 478 735 610 740"> <span class="ocrx_word" title="bbox 478 735 610 740"> </span></span><br/>
</body>
たいなかりつ おれのよめ
=
童俺の嫁
abC ABC ぁいうぇおかき く けこ
00ー ー2ー3B45S6789
F U N
ぁったりまぇじゃん!
rit$u
\ No newline at end of file
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
"http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<head>
<meta http-equiv="content-type" content="text/html; charset=utf-8" />
</head>
<body>
<span class="ocrx_word" title="bbox 34 23 617 111">たいなかりつ</span><br/>
<span class="ocrx_word" title="bbox 666 23 1151 111">おれのよめ</span><br/>
<span class="ocrx_word" title="bbox 34 171 220 195"> </span><br/>
<span class="ocrx_word" title="bbox 34 171 220 197"> </span><br/>
<span class="ocrx_word" title="bbox 154 172 159 253"> </span><br/>
<span class="ocrx_word" title="bbox 234 188 270 238">=</span><br/>
<span class="ocrx_word" title="bbox 35 206 222 228"> </span><br/>
<span class="ocrx_word" title="bbox 70 177 81 257"> </span><br/>
<span class="ocrx_word" title="bbox 188 172 199 264"> </span><br/>
<span class="ocrx_word" title="bbox 270 171 281 264"> </span><br/>
<span class="ocrx_word" title="bbox 35 177 46 263"> </span><br/>
<span class="ocrx_word" title="bbox 106 177 117 263"> </span><br/>
<span class="ocrx_word" title="bbox 35 247 117 257"> </span><br/>
<span class="ocrx_word" title="bbox 381 171 392 264"> </span><br/>
<span class="ocrx_word" title="bbox 281 171 381 264"></span><br/>
<span class="ocrx_word" title="bbox 392 171 757 264">童俺の嫁</span><br/>
<span class="ocrx_word" title="bbox 27 311 147 364">abC</span><br/>
<span class="ocrx_word" title="bbox 172 310 319 364">ABC</span><br/>
<span class="ocrx_word" title="bbox 348 309 783 365">ぁいうぇおかき</span><br/>
<span class="ocrx_word" title="bbox 803 309 840 366"></span><br/>
<span class="ocrx_word" title="bbox 863 310 973 366">けこ</span><br/>
<span class="ocrx_word" title="bbox 26 406 152 460">00ー</span><br/>
<span class="ocrx_word" title="bbox 173 406 661 460">ー2ー3B45S6789</span><br/>
<span class="ocrx_word" title="bbox 263 516 348 612">F</span><br/>
<span class="ocrx_word" title="bbox 451 513 533 612">U</span><br/>
<span class="ocrx_word" title="bbox 639 513 723 613">N</span><br/>
<span class="ocrx_word" title="bbox 316 655 723 694">ぁったりまぇじゃん!</span><br/>
<span class="ocrx_word" title="bbox 616 716 723 749">rit$u</span><br/>
<span class="ocrx_word" title="bbox 478 735 610 740"> </span><br/>
</body>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
"http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<head>
<meta http-equiv="content-type" content="text/html; charset=utf-8" />
</head>
<body>
<span class="ocr_line" title="bbox 36 92 580 122"> <span class="ocrx_word" title="bbox 36 92 96 116">This</span> <span class="ocrx_word" title="bbox 109 92 129 116">is</span> <span class="ocrx_word" title="bbox 141 98 156 116">a</span> <span class="ocrx_word" title="bbox 169 92 201 116">lot</span> <span class="ocrx_word" title="bbox 212 92 240 116">of</span> <span class="ocrx_word" title="bbox 251 92 282 116">12</span> <span class="ocrx_word" title="bbox 296 92 364 122">point</span> <span class="ocrx_word" title="bbox 374 93 427 116">text</span> <span class="ocrx_word" title="bbox 437 93 463 116">to</span> <span class="ocrx_word" title="bbox 474 93 526 116">test</span> <span class="ocrx_word" title="bbox 536 92 580 116">the</span></span><br/>
<span class="ocr_line" title="bbox 36 126 618 157"> <span class="ocrx_word" title="bbox 36 132 81 150">ocr</span> <span class="ocrx_word" title="bbox 91 126 160 150">code</span> <span class="ocrx_word" title="bbox 172 126 223 150">and</span> <span class="ocrx_word" title="bbox 236 132 286 150">see</span> <span class="ocrx_word" title="bbox 299 126 314 150">if</span> <span class="ocrx_word" title="bbox 325 126 339 150">it</span> <span class="ocrx_word" title="bbox 348 126 433 150">works</span> <span class="ocrx_word" title="bbox 445 132 478 150">on</span> <span class="ocrx_word" title="bbox 500 126 529 150">all</span> <span class="ocrx_word" title="bbox 541 127 618 157">types</span></span><br/>
<span class="ocr_line" title="bbox 36 160 223 184"> <span class="ocrx_word" title="bbox 36 160 64 184">of</span> <span class="ocrx_word" title="bbox 72 160 113 184">file</span> <span class="ocrx_word" title="bbox 123 160 223 184">format.</span></span><br/>
<span class="ocr_line" title="bbox 36 194 585 225"> <span class="ocrx_word" title="bbox 36 194 91 218">The</span> <span class="ocrx_word" title="bbox 102 194 177 224">quick</span> <span class="ocrx_word" title="bbox 189 194 274 218">brown</span> <span class="ocrx_word" title="bbox 287 194 339 225">dog</span> <span class="ocrx_word" title="bbox 348 194 456 225">jumped</span> <span class="ocrx_word" title="bbox 468 200 531 218">over</span> <span class="ocrx_word" title="bbox 540 194 585 218">the</span></span><br/>
<span class="ocr_line" title="bbox 37 228 585 259"> <span class="ocrx_word" title="bbox 37 228 92 259">lazy</span> <span class="ocrx_word" title="bbox 103 228 153 252">fox.</span> <span class="ocrx_word" title="bbox 165 228 220 252">The</span> <span class="ocrx_word" title="bbox 232 228 307 258">quick</span> <span class="ocrx_word" title="bbox 319 228 404 252">brown</span> <span class="ocrx_word" title="bbox 417 228 468 259">dog</span> <span class="ocrx_word" title="bbox 478 228 585 259">jumped</span></span><br/>
<span class="ocr_line" title="bbox 36 262 597 293"> <span class="ocrx_word" title="bbox 36 268 99 286">over</span> <span class="ocrx_word" title="bbox 109 262 153 286">the</span> <span class="ocrx_word" title="bbox 165 262 221 293">lazy</span> <span class="ocrx_word" title="bbox 231 262 281 286">fox.</span> <span class="ocrx_word" title="bbox 294 262 349 286">The</span> <span class="ocrx_word" title="bbox 360 262 435 292">quick</span> <span class="ocrx_word" title="bbox 447 262 532 286">brown</span> <span class="ocrx_word" title="bbox 545 262 597 293">dog</span></span><br/>
<span class="ocr_line" title="bbox 43 296 561 327"> <span class="ocrx_word" title="bbox 43 296 150 327">jumped</span> <span class="ocrx_word" title="bbox 162 302 226 320">over</span> <span class="ocrx_word" title="bbox 235 296 279 320">the</span> <span class="ocrx_word" title="bbox 292 296 347 327">lazy</span> <span class="ocrx_word" title="bbox 357 296 407 320">fox.</span> <span class="ocrx_word" title="bbox 420 296 475 320">The</span> <span class="ocrx_word" title="bbox 486 296 561 326">quick</span></span><br/>
<span class="ocr_line" title="bbox 37 330 561 361"> <span class="ocrx_word" title="bbox 37 330 122 354">brown</span> <span class="ocrx_word" title="bbox 135 330 187 361">dog</span> <span class="ocrx_word" title="bbox 196 330 304 361">jumped</span> <span class="ocrx_word" title="bbox 316 336 379 354">over</span> <span class="ocrx_word" title="bbox 388 330 433 354">the</span> <span class="ocrx_word" title="bbox 445 330 500 361">lazy</span> <span class="ocrx_word" title="bbox 511 330 561 354">fox.</span></span><br/>
</body>
This is a lot of 12 point text to test the
ocr code and see if it works on all types
of file format.
The quick brown dog jumped over the
lazy fox. The quick brown dog jumped
over the lazy fox. The quick brown dog
jumped over the lazy fox. The quick
brown dog jumped over the lazy fox.
\ No newline at end of file
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
"http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<head>
<meta http-equiv="content-type" content="text/html; charset=utf-8" />
</head>
<body>
<span class="ocrx_word" title="bbox 36 92 96 116">This</span><br/>
<span class="ocrx_word" title="bbox 109 92 129 116">is</span><br/>
<span class="ocrx_word" title="bbox 141 98 156 116">a</span><br/>
<span class="ocrx_word" title="bbox 169 92 201 116">lot</span><br/>
<span class="ocrx_word" title="bbox 212 92 240 116">of</span><br/>
<span class="ocrx_word" title="bbox 251 92 282 116">12</span><br/>
<span class="ocrx_word" title="bbox 296 92 364 122">point</span><br/>
<span class="ocrx_word" title="bbox 374 93 427 116">text</span><br/>
<span class="ocrx_word" title="bbox 437 93 463 116">to</span><br/>
<span class="ocrx_word" title="bbox 474 93 526 116">test</span><br/>
<span class="ocrx_word" title="bbox 536 92 580 116">the</span><br/>
<span class="ocrx_word" title="bbox 36 132 81 150">ocr</span><br/>
<span class="ocrx_word" title="bbox 91 126 160 150">code</span><br/>
<span class="ocrx_word" title="bbox 172 126 223 150">and</span><br/>
<span class="ocrx_word" title="bbox 236 132 286 150">see</span><br/>
<span class="ocrx_word" title="bbox 299 126 314 150">if</span><br/>
<span class="ocrx_word" title="bbox 325 126 339 150">it</span><br/>
<span class="ocrx_word" title="bbox 348 126 433 150">works</span><br/>
<span class="ocrx_word" title="bbox 445 132 478 150">on</span><br/>
<span class="ocrx_word" title="bbox 500 126 529 150">all</span><br/>
<span class="ocrx_word" title="bbox 541 127 618 157">types</span><br/>
<span class="ocrx_word" title="bbox 36 160 64 184">of</span><br/>
<span class="ocrx_word" title="bbox 72 160 113 184">file</span><br/>
<span class="ocrx_word" title="bbox 123 160 223 184">format.</span><br/>
<span class="ocrx_word" title="bbox 36 194 91 218">The</span><br/>
<span class="ocrx_word" title="bbox 102 194 177 224">quick</span><br/>
<span class="ocrx_word" title="bbox 189 194 274 218">brown</span><br/>
<span class="ocrx_word" title="bbox 287 194 339 225">dog</span><br/>
<span class="ocrx_word" title="bbox 348 194 456 225">jumped</span><br/>
<span class="ocrx_word" title="bbox 468 200 531 218">over</span><br/>
<span class="ocrx_word" title="bbox 540 194 585 218">the</span><br/>
<span class="ocrx_word" title="bbox 37 228 92 259">lazy</span><br/>
<span class="ocrx_word" title="bbox 103 228 153 252">fox.</span><br/>
<span class="ocrx_word" title="bbox 165 228 220 252">The</span><br/>
<span class="ocrx_word" title="bbox 232 228 307 258">quick</span><br/>
<span class="ocrx_word" title="bbox 319 228 404 252">brown</span><br/>
<span class="ocrx_word" title="bbox 417 228 468 259">dog</span><br/>
<span class="ocrx_word" title="bbox 478 228 585 259">jumped</span><br/>
<span class="ocrx_word" title="bbox 36 268 99 286">over</span><br/>
<span class="ocrx_word" title="bbox 109 262 153 286">the</span><br/>
<span class="ocrx_word" title="bbox 165 262 221 293">lazy</span><br/>
<span class="ocrx_word" title="bbox 231 262 281 286">fox.</span><br/>
<span class="ocrx_word" title="bbox 294 262 349 286">The</span><br/>
<span class="ocrx_word" title="bbox 360 262 435 292">quick</span><br/>
<span class="ocrx_word" title="bbox 447 262 532 286">brown</span><br/>
<span class="ocrx_word" title="bbox 545 262 597 293">dog</span><br/>
<span class="ocrx_word" title="bbox 43 296 150 327">jumped</span><br/>
<span class="ocrx_word" title="bbox 162 302 226 320">over</span><br/>
<span class="ocrx_word" title="bbox 235 296 279 320">the</span><br/>
<span class="ocrx_word" title="bbox 292 296 347 327">lazy</span><br/>
<span class="ocrx_word" title="bbox 357 296 407 320">fox.</span><br/>
<span class="ocrx_word" title="bbox 420 296 475 320">The</span><br/>
<span class="ocrx_word" title="bbox 486 296 561 326">quick</span><br/>
<span class="ocrx_word" title="bbox 37 330 122 354">brown</span><br/>
<span class="ocrx_word" title="bbox 135 330 187 361">dog</span><br/>
<span class="ocrx_word" title="bbox 196 330 304 361">jumped</span><br/>
<span class="ocrx_word" title="bbox 316 336 379 354">over</span><br/>
<span class="ocrx_word" title="bbox 388 330 433 354">the</span><br/>
<span class="ocrx_word" title="bbox 445 330 500 361">lazy</span><br/>
<span class="ocrx_word" title="bbox 511 330 561 354">fox.</span><br/>
</body>
@@ -60,7 +60,7 @@ class TestTxt(unittest.TestCase):
     def __test_txt(self, image_file, expected_output_file, lang='eng'):
         image_file = "tests/data/" + image_file
-        expected_output_file = "tests/tesseract/" + expected_output_file
+        expected_output_file = "tests/tesseract_capi/" + expected_output_file
         expected_output = ""
         with codecs.open(expected_output_file, 'r', encoding='utf-8') \
@@ -100,7 +100,7 @@ class TestWordBox(unittest.TestCase):
     def __test_txt(self, image_file, expected_box_file, lang='eng'):
         image_file = "tests/data/" + image_file
-        expected_box_file = "tests/tesseract/" + expected_box_file
+        expected_box_file = "tests/tesseract_capi/" + expected_box_file
         with codecs.open(expected_box_file, 'r', encoding='utf-8') \
                 as file_descriptor:
@@ -174,7 +174,7 @@ class TestLineBox(unittest.TestCase):
     def __test_txt(self, image_file, expected_box_file, lang='eng'):
         image_file = "tests/data/" + image_file
-        expected_box_file = "tests/tesseract/" + expected_box_file
+        expected_box_file = "tests/tesseract_capi/" + expected_box_file
         boxes = tesseract_capi.image_to_string(
             Image.open(image_file), lang=lang,
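The three hunks above apply the same one-line change: the C-API tests now read their expected files from `tests/tesseract_capi/` instead of `tests/tesseract/`. A minimal sketch of how that prefix could be kept in one place, assuming a hypothetical helper `expected_path` and mapping `EXPECTED_DIRS` (neither exists in pyocr; only the two directory names come from the diff):

```python
import posixpath

# Hypothetical mapping from backend name to its expected-output directory.
EXPECTED_DIRS = {
    "tesseract": "tests/tesseract",
    "tesseract_capi": "tests/tesseract_capi",
}


def expected_path(backend, filename):
    """Build the path of an expected-output file for the given backend."""
    return posixpath.join(EXPECTED_DIRS[backend], filename)
```

With such a helper, each test would call `expected_path("tesseract_capi", expected_box_file)` rather than concatenating the directory by hand.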
@@ -9,21 +9,55 @@ run_tess()
    lang="$1"
    shift
    echo "${img} --> ${out} (${lang} / $@)"
    lang_arg=""
    if [ -n "${lang}" ]; then
        lang_arg=-l
    fi
    echo tesseract ${img} ${out} ${lang_arg} ${lang} $@
    if ! tesseract ${img} ${out} ${lang_arg} ${lang} $@ > /dev/null 2>&1
    then
        echo "FAILED !"
    fi
}
run_tess_capi()
{
    img="$1"
    shift
    out="$1"
    shift
    lang="$1"
    shift
    builder="$1"
    shift
    echo "${img} --> ${out} (${lang} / ${builder})"
    lang_arg=""
    if [ -n "${lang}" ]; then
        lang_arg=-l
    fi
    cat << EOF | python3
from PIL import Image
from pyocr import tesseract_capi
from pyocr import builders
img = Image.open("${img}")
builder = builders.${builder}()
out = tesseract_capi.image_to_string(img, lang="${lang}", builder=builder)
with open("${out}", "w") as fd:
    builder.write_file(fd, out)
EOF
}
cd tests
echo "=== Tesseract sh ==="
run_tess data/test.png tesseract/test eng
run_tess data/test.png tesseract/test eng batch.nochop makebox
run_tess data/test.png tesseract/test eng hocr
@@ -49,3 +83,21 @@ run_tess data/test-japanese.jpg tesseract/test-japanese jpn batch.nochop makebox
run_tess data/test-japanese.jpg tesseract/test-japanese jpn hocr
mv tesseract/test-japanese.hocr tesseract/test-japanese.words
cp tesseract/test-japanese.words tesseract/test-japanese.lines
echo "=== Tesseract C-api ==="
run_tess_capi data/test.png tesseract_capi/test.txt eng TextBuilder
run_tess_capi data/test.png tesseract_capi/test.words eng WordBoxBuilder
run_tess_capi data/test.png tesseract_capi/test.lines eng LineBoxBuilder
run_tess_capi data/test-european.jpg tesseract_capi/test-european.txt eng TextBuilder
run_tess_capi data/test-european.jpg tesseract_capi/test-european.words eng WordBoxBuilder
run_tess_capi data/test-european.jpg tesseract_capi/test-european.lines eng LineBoxBuilder
run_tess_capi data/test-french.jpg tesseract_capi/test-french.txt fra TextBuilder
run_tess_capi data/test-french.jpg tesseract_capi/test-french.words fra WordBoxBuilder
run_tess_capi data/test-french.jpg tesseract_capi/test-french.lines fra LineBoxBuilder
run_tess_capi data/test-japanese.jpg tesseract_capi/test-japanese.txt jpn TextBuilder
run_tess_capi data/test-japanese.jpg tesseract_capi/test-japanese.words jpn WordBoxBuilder
run_tess_capi data/test-japanese.jpg tesseract_capi/test-japanese.lines jpn LineBoxBuilder
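The twelve `run_tess_capi` calls above are the product of four images and three builders. A minimal sketch of generating the same argument list in a loop; `IMAGES`, `BUILDERS`, and `capi_jobs` are hypothetical names, and the sketch only builds the arguments, it does not run any OCR:

```python
from itertools import product

# (image file, Tesseract language), as listed in the script above.
IMAGES = [
    ("test.png", "eng"),
    ("test-european.jpg", "eng"),
    ("test-french.jpg", "fra"),
    ("test-japanese.jpg", "jpn"),
]

# (output extension, pyocr builder class name).
BUILDERS = [
    ("txt", "TextBuilder"),
    ("words", "WordBoxBuilder"),
    ("lines", "LineBoxBuilder"),
]


def capi_jobs():
    """Return (image, output, lang, builder) tuples in script order."""
    jobs = []
    for (image, lang), (ext, builder) in product(IMAGES, BUILDERS):
        stem = image.rsplit(".", 1)[0]  # drop the image extension
        jobs.append(("data/" + image,
                     "tesseract_capi/%s.%s" % (stem, ext),
                     lang, builder))
    return jobs
```

Each tuple corresponds to the four positional arguments of one `run_tess_capi` line, so adding a test image or a builder would need only one new table entry.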