cuneiform team mailing list archive

[Bug 344790] Re: OCR quality drops

 

I tested 3 versions of cuneiform: the official free version (from the openocr.org site), the cuneiform branch, and the refactoring branch, using the bash script from https://bugs.launchpad.net/cuneiform-linux/+bug/344790/comments/5.
For the test I used 106 files from the 3b.tgz archive that contain English text without images or tables. If you want, I can share the list of files.

The results:
1.a. Official with dictionary

UNLV-ISRI OCR Accuracy Report Version 5.1
-----------------------------------------
  231979   Characters                    
   11151   Errors                        
   95.19%  Accuracy                      

     461   Reject Characters
       0   Suspect Markers  
       0   False Marks      
    0.20%  Characters Marked
   95.83%  Accuracy After Correction

     Ins    Subst      Del   Errors
      91      239     1143     1473   Marked
    3052     3511     3115     9678   Unmarked
    3143     3750     4258    11151   Total   

   Count   Missed   %Right
   35888      740    97.94   ASCII Spacing Characters
    8668      662    92.36   ASCII Special Symbols   
    6038      904    85.03   ASCII Digits            
   11546      551    95.23   ASCII Uppercase Letters 
  169839     4036    97.62   ASCII Lowercase Letters 
  231979     6893    97.03   Total                   

1.b. Official without dictionary

UNLV-ISRI OCR Accuracy Report Version 5.1                
-----------------------------------------                
  231979   Characters                                    
   10879   Errors                                        
   95.31%  Accuracy                                      

     483   Reject Characters
       0   Suspect Markers  
       0   False Marks      
    0.21%  Characters Marked
   95.89%  Accuracy After Correction

     Ins    Subst      Del   Errors
     130      196     1022     1348   Marked
    3217     3304     3010     9531   Unmarked
    3347     3500     4032    10879   Total   

   Count   Missed   %Right
   35888      751    97.91   ASCII Spacing Characters
    8668      662    92.36   ASCII Special Symbols   
    6038      895    85.18   ASCII Digits            
   11546      542    95.31   ASCII Uppercase Letters 
  169839     3997    97.65   ASCII Lowercase Letters 
  231979     6847    97.05   Total     

2.a. refactoring with dictionary

UNLV-ISRI OCR Accuracy Report Version 5.1                           
-----------------------------------------                           
  231979   Characters                                               
   13086   Errors                                                   
   94.36%  Accuracy                                                 

     456   Reject Characters
       0   Suspect Markers  
       0   False Marks      
    0.20%  Characters Marked
   95.03%  Accuracy After Correction

     Ins    Subst      Del   Errors
      99      227     1242     1568   Marked
    4592     2830     4096    11518   Unmarked
    4691     3057     5338    13086   Total   

   Count   Missed   %Right
   35888      944    97.37   ASCII Spacing Characters
    8668      733    91.54   ASCII Special Symbols   
    6038      846    85.99   ASCII Digits            
   11546      629    94.55   ASCII Uppercase Letters 
  169839     4596    97.29   ASCII Lowercase Letters 
  231979     7748    96.66   Total    

2.b. refactoring without dictionary

UNLV-ISRI OCR Accuracy Report Version 5.1                             
-----------------------------------------                             
  231979   Characters                                                 
   13333   Errors                                                     
   94.25%  Accuracy                                                   

     478   Reject Characters
       0   Suspect Markers  
       0   False Marks      
    0.21%  Characters Marked
   94.97%  Accuracy After Correction

     Ins    Subst      Del   Errors
     133      247     1293     1673   Marked
    4805     2930     3925    11660   Unmarked
    4938     3177     5218    13333   Total   

   Count   Missed   %Right
   35888      994    97.23   ASCII Spacing Characters
    8668      747    91.38   ASCII Special Symbols   
    6038      853    85.87   ASCII Digits            
   11546      669    94.21   ASCII Uppercase Letters 
  169839     4852    97.14   ASCII Lowercase Letters 
  231979     8115    96.50   Total  

3.a. cuneiform branch with dictionary

UNLV-ISRI OCR Accuracy Report Version 5.1                                  
-----------------------------------------                                  
  231979   Characters                                                      
   13837   Errors                                                          
   94.04%  Accuracy                                                        

     451   Reject Characters
       0   Suspect Markers  
       0   False Marks      
    0.19%  Characters Marked
   94.77%  Accuracy After Correction

     Ins    Subst      Del   Errors
      94      216     1395     1705   Marked
    4445     3766     3921    12132   Unmarked
    4539     3982     5316    13837   Total   

   Count   Missed   %Right
   35888     2483    93.08   ASCII Spacing Characters
    8668      965    88.87   ASCII Special Symbols   
    6038      800    86.75   ASCII Digits            
   11546      583    94.95   ASCII Uppercase Letters 
  169839     3690    97.83   ASCII Lowercase Letters 
  231979     8521    96.33   Total  

3.b. cuneiform branch without dictionary

UNLV-ISRI OCR Accuracy Report Version 5.1                                    
-----------------------------------------                                    
  231979   Characters                                                        
   14426   Errors                                                            
   93.78%  Accuracy                                                          

     451   Reject Characters
       0   Suspect Markers  
       0   False Marks      
    0.19%  Characters Marked
   94.52%  Accuracy After Correction

     Ins    Subst      Del   Errors
      94      216     1395     1705   Marked
    4739     3766     4216    12721   Unmarked
    4833     3982     5611    14426   Total   

   Count   Missed   %Right
   35888     2525    92.96   ASCII Spacing Characters
    8668      969    88.82   ASCII Special Symbols   
    6038      802    86.72   ASCII Digits            
   11546      595    94.85   ASCII Uppercase Letters 
  169839     3924    97.69   ASCII Lowercase Letters 
  231979     8815    96.20   Total
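
To compare the six runs side by side, the headline figures can be recomputed from the counts in the reports. The sketch below is in Python and assumes that Accuracy is (Characters - Errors) / Characters and that Accuracy After Correction counts only the Unmarked errors. Both formulas are inferred from the numbers above, not taken from the ISRI tool's documentation. The names RUNS, accuracy and summarize are only illustrative.

```python
# Counts copied from the six reports above.
# Each entry is (total errors, unmarked errors).
CHARACTERS = 231979

RUNS = {
    "1.a official, with dictionary": (11151, 9678),
    "1.b official, without dictionary": (10879, 9531),
    "2.a refactoring, with dictionary": (13086, 11518),
    "2.b refactoring, without dictionary": (13333, 11660),
    "3.a cuneiform branch, with dictionary": (13837, 12132),
    "3.b cuneiform branch, without dictionary": (14426, 12721),
}

BASELINE = "1.a official, with dictionary"


def accuracy(errors, characters=CHARACTERS):
    """Percentage of characters that are not counted as errors (assumed formula)."""
    return 100.0 * (characters - errors) / characters


def summarize(runs=RUNS, baseline=BASELINE):
    """Return (name, accuracy, accuracy after correction, error change vs baseline in %)."""
    base_errors = runs[baseline][0]
    rows = []
    for name, (errors, unmarked) in runs.items():
        change = 100.0 * (errors - base_errors) / base_errors
        rows.append((
            name,
            round(accuracy(errors), 2),
            round(accuracy(unmarked), 2),
            round(change, 1),
        ))
    return rows


if __name__ == "__main__":
    for name, acc, corrected, change in summarize():
        print(f"{name:42s} {acc:6.2f}% {corrected:6.2f}% {change:+6.1f}%")
```

The last column shows how many more (or fewer) errors each run has relative to the official build with dictionary, which is easier to read than the small differences between the accuracy percentages.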

-- 
OCR quality drops
https://bugs.launchpad.net/bugs/344790
You received this bug notification because you are a member of Cuneiform
Linux, which is the registrant for Cuneiform for Linux.

Status in Linux port of Cuneiform: New

Bug description:
OCR quality drops during porting.
Look at the recognition result for stdj4.tif, line 7, smart text format:

was (stdj4.txt.initial)
mli i f r nin. Ithas

is (stdj4.txt.puma)
m li i f r nin Ithas
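
The difference between the two lines is easy to miss by eye. A minimal Python sketch using the standard difflib module lists the character-level changes. The two strings are copied exactly as they appear in this message (they may already have lost characters in the archive), and the helper name char_changes is only illustrative.

```python
from difflib import SequenceMatcher


def char_changes(before, after):
    """Return (operation, removed text, added text) for every differing span."""
    matcher = SequenceMatcher(None, before, after, autojunk=False)
    return [
        (tag, before[i1:i2], after[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


# Line 7 of stdj4 as quoted in the bug description.
initial = "mli i f r nin. Ithas"  # stdj4.txt.initial
puma = "m li i f r nin Ithas"     # stdj4.txt.puma

if __name__ == "__main__":
    for tag, removed, added in char_changes(initial, puma):
        print(f"{tag}: removed={removed!r} added={added!r}")
```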