
sikuli-driver team mailing list archive

[Bug 710586] Re: X 1.0rc3: Region.text() -- known problems and needed improvements

 

Let me briefly summarize the progress on the OCR research we are doing
for Sikuli.

1. I recently implemented a new OCR algorithm designed for small
screen text (from the paper "Recognition of Screen-Rendered
Text", ICPR '06). However, it turns out this algorithm doesn't perform
as well as the authors claimed in the paper. It's even worse than
Tesseract OCR, so for now we will continue using Tesseract as Sikuli's
OCR engine.

2. We are migrating from Tesseract 2 to Tesseract 3. One significant
advantage of Tesseract 3 is that it supports many more languages such as
Chinese and Japanese. We are also working on making a simple OCR trainer
so Sikuli users can train the OCR engine using the fonts installed on
their systems.

3. Improving OCR performance is very tricky. There are lots of parameters
to tune and preprocessing steps that could be applied to improve it. We
put a collection of screenshots with labeled ground truth in our source
repo, so everyone can try to improve the OCR algorithm and simply run the
tests to see whether it really gets better or worse. You are welcome to
fork our code and try any possible improvements, or even provide more
labeled screenshots to make our data set more diverse.
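
To illustrate the "run the tests to see whether it gets better or worse"
idea, here is a minimal sketch of how recognized text could be scored
against labeled ground truth. This is not the test harness in our repo;
the function names (edit_distance, char_accuracy, mean_accuracy) and the
choice of character-level edit distance as the metric are assumptions
made for illustration only.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1,          # delete a character from a
                           cur[j - 1] + 1,       # insert a character into a
                           prev[j - 1] + cost))  # substitute or match
        prev = cur
    return prev[-1]


def char_accuracy(recognized, truth):
    """Score in [0, 1]: 1.0 means the OCR output equals the ground truth."""
    if not truth:
        return 1.0 if not recognized else 0.0
    errors = edit_distance(recognized, truth)
    return max(0.0, 1.0 - errors / float(len(truth)))


def mean_accuracy(pairs):
    """Average char_accuracy over (recognized, truth) pairs."""
    scores = [char_accuracy(rec, truth) for rec, truth in pairs]
    return sum(scores) / len(scores)
```

Comparing mean_accuracy over the same labeled screenshots before and after
a change gives a single number showing whether the change helped or hurt.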

-- 
You received this bug notification because you are a member of Sikuli
Drivers, which is subscribed to Sikuli.
https://bugs.launchpad.net/bugs/710586

Title:
  X 1.0rc3: Region.text() -- known problems and needed improvements

Status in Sikuli:
  In Progress

Bug description:
  ******* this report is a summary of known problems and feature
  requests

  The text recognition feature (OCR - Region.text()) together with the
  possibility to find text in an image is still experimental and under
  development.

  These are the currently reported bugs:
  bug 777660: text recognition errors with some fonts
  bug 783082: [request] want font parameters for text recognition
  bug 735434: Text extraction from Images fails in some cases on colored backgrounds
  bug 695616: Inconsistency in text recognition and matching, especially with integers-as-text!
  bug 695650: find(text).text() does not return same text
  bug 701005: text() always returns text with trailing x'200A20'
  bug 701012: text() does not return all intervening blanks, add's others
  bug 795391: [request] OCR/tesseract: allow new training sets for other languages and more tesseract features

  Other observed oddities
  -- there are problems with text that is not in English
  -- very small and very large fonts may not work
  -- multiline text causes problems
  -- intervening/preceding/trailing graphics and symbols are interpreted as if they were text

  Tip when using Region.text():
  Currently you get the best results when the region represents only one line of text and contains only text (no graphics/symbols) in English. If you can influence it: make the text as large as possible.
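
  Until the whitespace problems listed above (bug 701005, bug 701012) are fixed, a script can clean up the returned string before comparing it. The following is only a sketch of such a workaround: normalize_ocr_text and ocr_equal are hypothetical helper names, not part of the Sikuli API, and collapsing whitespace cannot restore blanks that the OCR engine dropped entirely.

```python
import re


def normalize_ocr_text(raw):
    """Collapse whitespace runs to one space and trim both ends."""
    # \s covers spaces, tabs and newlines, so trailing line breaks
    # and doubled blanks are reduced to single spaces first.
    collapsed = re.sub(r"\s+", " ", raw)
    return collapsed.strip()


def ocr_equal(raw, expected):
    """Compare OCR output with an expected string, ignoring whitespace noise."""
    return normalize_ocr_text(raw) == normalize_ocr_text(expected)


# Assumed usage inside a Sikuli script, where `region` is a one-line,
# text-only Region as recommended in the tip above:
#   if ocr_equal(region.text(), "Save As"):
#       ...
```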

  -- additional information:
  Internally, the Tesseract OCR engine (http://code.google.com/p/tesseract-ocr/) is used,
  so its restrictions apply (e.g. minimum font size, ...).
  More information can be found on the Tesseract wiki.

To manage notifications about this bug go to:
https://bugs.launchpad.net/sikuli/+bug/710586/+subscriptions

