cuneiform team mailing list archive
Message #00502
[Bug 344790] Re: OCR quality drops
I tested 3 versions of cuneiform: the official free version (from the openocr.org site), the cuneiform branch, and the refactoring branch, using the bash script from https://bugs.launchpad.net/cuneiform-linux/+bug/344790/comments/5.
For the test I used 106 files of English text, without images or tables, from the 3b.tgz archive. If you want, I can share the list of files.
The results:
1.a. Official with dictionary
UNLV-ISRI OCR Accuracy Report Version 5.1
-----------------------------------------
231979 Characters
11151 Errors
95.19% Accuracy
461 Reject Characters
0 Suspect Markers
0 False Marks
0.20% Characters Marked
95.83% Accuracy After Correction
     Ins   Subst     Del  Errors
      91     239    1143    1473   Marked
    3052    3511    3115    9678   Unmarked
    3143    3750    4258   11151   Total
   Count  Missed  %Right
   35888     740   97.94   ASCII Spacing Characters
    8668     662   92.36   ASCII Special Symbols
    6038     904   85.03   ASCII Digits
   11546     551   95.23   ASCII Uppercase Letters
  169839    4036   97.62   ASCII Lowercase Letters
  231979    6893   97.03   Total
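The percentages in this report can be reproduced from its raw counts. The Python sketch below assumes (inferred from the numbers themselves, not from the tool's documentation) that Accuracy is (Characters - Errors) / Characters, that Accuracy After Correction counts only the unmarked errors, that Characters Marked is Reject Characters / Characters, and that %Right is (Count - Missed) / Count. The helper names `pct` and `accuracy` are illustrative only.

```python
def pct(part, whole):
    """Return part/whole as a percentage rounded to two decimals."""
    return round(100.0 * part / whole, 2)


def accuracy(characters, errors):
    """Share of characters that are not counted as errors, in percent."""
    return pct(characters - errors, characters)


# Raw counts copied from report 1.a (official version, with dictionary).
characters = 231979
total_errors = 11151
unmarked_errors = 9678
reject_characters = 461

print("Accuracy:", accuracy(characters, total_errors))
print("Accuracy After Correction:", accuracy(characters, unmarked_errors))
print("Characters Marked:", pct(reject_characters, characters))
# Per-class %Right, here for ASCII Spacing Characters (Count, Missed).
print("Spacing %Right:", accuracy(35888, 740))
```

The same two functions apply unchanged to the other five reports, since all of them share the 231979-character ground truth.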
1.b. Official without dictionary
UNLV-ISRI OCR Accuracy Report Version 5.1
-----------------------------------------
231979 Characters
10879 Errors
95.31% Accuracy
483 Reject Characters
0 Suspect Markers
0 False Marks
0.21% Characters Marked
95.89% Accuracy After Correction
     Ins   Subst     Del  Errors
     130     196    1022    1348   Marked
    3217    3304    3010    9531   Unmarked
    3347    3500    4032   10879   Total
   Count  Missed  %Right
   35888     751   97.91   ASCII Spacing Characters
    8668     662   92.36   ASCII Special Symbols
    6038     895   85.18   ASCII Digits
   11546     542   95.31   ASCII Uppercase Letters
  169839    3997   97.65   ASCII Lowercase Letters
  231979    6847   97.05   Total
2.a. refactoring with dictionary
UNLV-ISRI OCR Accuracy Report Version 5.1
-----------------------------------------
231979 Characters
13086 Errors
94.36% Accuracy
456 Reject Characters
0 Suspect Markers
0 False Marks
0.20% Characters Marked
95.03% Accuracy After Correction
     Ins   Subst     Del  Errors
      99     227    1242    1568   Marked
    4592    2830    4096   11518   Unmarked
    4691    3057    5338   13086   Total
   Count  Missed  %Right
   35888     944   97.37   ASCII Spacing Characters
    8668     733   91.54   ASCII Special Symbols
    6038     846   85.99   ASCII Digits
   11546     629   94.55   ASCII Uppercase Letters
  169839    4596   97.29   ASCII Lowercase Letters
  231979    7748   96.66   Total
2.b. refactoring without dictionary
UNLV-ISRI OCR Accuracy Report Version 5.1
-----------------------------------------
231979 Characters
13333 Errors
94.25% Accuracy
478 Reject Characters
0 Suspect Markers
0 False Marks
0.21% Characters Marked
94.97% Accuracy After Correction
     Ins   Subst     Del  Errors
     133     247    1293    1673   Marked
    4805    2930    3925   11660   Unmarked
    4938    3177    5218   13333   Total
   Count  Missed  %Right
   35888     994   97.23   ASCII Spacing Characters
    8668     747   91.38   ASCII Special Symbols
    6038     853   85.87   ASCII Digits
   11546     669   94.21   ASCII Uppercase Letters
  169839    4852   97.14   ASCII Lowercase Letters
  231979    8115   96.50   Total
3.a. cuneiform branch with dictionary
UNLV-ISRI OCR Accuracy Report Version 5.1
-----------------------------------------
231979 Characters
13837 Errors
94.04% Accuracy
451 Reject Characters
0 Suspect Markers
0 False Marks
0.19% Characters Marked
94.77% Accuracy After Correction
     Ins   Subst     Del  Errors
      94     216    1395    1705   Marked
    4445    3766    3921   12132   Unmarked
    4539    3982    5316   13837   Total
   Count  Missed  %Right
   35888    2483   93.08   ASCII Spacing Characters
    8668     965   88.87   ASCII Special Symbols
    6038     800   86.75   ASCII Digits
   11546     583   94.95   ASCII Uppercase Letters
  169839    3690   97.83   ASCII Lowercase Letters
  231979    8521   96.33   Total
3.b. cuneiform branch without dictionary
UNLV-ISRI OCR Accuracy Report Version 5.1
-----------------------------------------
231979 Characters
14426 Errors
93.78% Accuracy
451 Reject Characters
0 Suspect Markers
0 False Marks
0.19% Characters Marked
94.52% Accuracy After Correction
     Ins   Subst     Del  Errors
      94     216    1395    1705   Marked
    4739    3766    4216   12721   Unmarked
    4833    3982    5611   14426   Total
   Count  Missed  %Right
   35888    2525   92.96   ASCII Spacing Characters
    8668     969   88.82   ASCII Special Symbols
    6038     802   86.72   ASCII Digits
   11546     595   94.85   ASCII Uppercase Letters
  169839    3924   97.69   ASCII Lowercase Letters
  231979    8815   96.20   Total
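To compare the six runs side by side, the total error counts can be tabulated against a baseline. The Python sketch below uses the official version with dictionary as the baseline, which is an arbitrary choice made for illustration. The `summarize` helper and the run labels are hypothetical names, and the accuracy formula (Characters - Errors) / Characters is an assumption inferred from the reports. Only the error counts themselves come from the reports above.

```python
CHARACTERS = 231979  # size of the ground truth, identical in all six reports

# Total error counts copied from the six reports above.
errors = {
    "official, with dictionary": 11151,
    "official, without dictionary": 10879,
    "refactoring, with dictionary": 13086,
    "refactoring, without dictionary": 13333,
    "cuneiform branch, with dictionary": 13837,
    "cuneiform branch, without dictionary": 14426,
}


def summarize(errors, characters, baseline):
    """Return (name, errors, accuracy %, extra errors, relative change %) rows."""
    base = errors[baseline]
    rows = []
    for name, count in errors.items():
        acc = round(100.0 * (characters - count) / characters, 2)
        extra = count - base                      # absolute change in errors
        rel = round(100.0 * extra / base, 1)      # change relative to baseline
        rows.append((name, count, acc, extra, rel))
    return rows


rows = summarize(errors, CHARACTERS, "official, with dictionary")
for name, count, acc, extra, rel in rows:
    print(f"{name:38s} {count:6d} {acc:6.2f}% {extra:+6d} {rel:+6.1f}%")
```

Passing a different key as `baseline` re-expresses the same counts relative to another run, for example to measure the effect of the dictionary within a single branch.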
--
OCR quality drops
https://bugs.launchpad.net/bugs/344790
You received this bug notification because you are a member of Cuneiform
Linux, which is the registrant for Cuneiform for Linux.
Status in Linux port of Cuneiform: New
Bug description:
OCR quality drops during porting.
Look at the recognition result for stdj4.tif, line 7, smart text format
was (stdj4.txt.initial)
mli i f r nin. Ithas
is (stdj4.txt.puma)
m li i f r nin Ithas