
simple-scan-team team mailing list archive

[Bug 483391] Re: Extract text using optical character recognition (OCR)

 

Tesseract 3.0 has finally landed in Precise, and it has layout
recognition, which can produce hOCR files that can in turn be used by
tools such as hocr2pdf to add a (properly positioned) layer of
searchable text to a PDF (as suggested for Milestone 2).

To fit in with Simple Scan's ease of use, I'd suggest not adding an extra OCR button to the toolbar, but simply performing OCR whenever a text document is scanned, possibly with a checkbox in the settings dialog to deactivate the feature.
The settings dialog should also contain a combobox that lists the installed tesseract languages, with the user's language pre-selected. Note that tesseract uses somewhat unusual abbreviations for languages (three-letter codes such as "deu" for German).
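
A rough sketch of how the combobox could be filled, in Python rather than Vala since that is easier for me to write down. It assumes that installed languages show up as *.traineddata files in a tessdata directory whose location depends on the distribution. The mapping table and both function names are made up for illustration and are far from complete.

```python
import locale
from pathlib import Path

# Hypothetical, partial mapping from locale language prefixes to
# tesseract's three-letter language codes. A real table would be longer.
LOCALE_TO_TESSERACT = {
    "en": "eng",
    "de": "deu",
    "fr": "fra",
    "es": "spa",
    "it": "ita",
    "nl": "nld",
}


def installed_languages(tessdata_dir):
    """List language codes found as *.traineddata files in tessdata_dir."""
    return sorted(p.stem for p in Path(tessdata_dir).glob("*.traineddata"))


def preselected_language(installed, locale_name=None):
    """Pick the entry to pre-select in the combobox.

    Falls back to the first installed language if the user's locale
    has no installed match, and to None if nothing is installed.
    """
    if locale_name is None:
        locale_name = locale.getlocale()[0] or ""
    prefix = locale_name.split("_")[0].lower()
    code = LOCALE_TO_TESSERACT.get(prefix)
    if code in installed:
        return code
    return installed[0] if installed else None
```

With only the English and German data installed, preselected_language(["deu", "eng"], "de_DE") picks "deu".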

Then, when scanning a text document, the steps needed to produce a searchable PDF are roughly as follows:
- Preprocess the image by running unpaper on it.
- Run tesseract on the image (for the language selected in the settings dialog), and tell it to produce an hOCR file (instructions -- in German, but easy enough to grasp: http://adnanvatandas.wordpress.com/2010/10/28/update-tesseract-3/ )
- Run hocr2pdf to add info from the hOCR file to the PDF.
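
The three steps above could be chained roughly like this (again a Python sketch, not Vala). Several things here are assumptions on my part: that unpaper reads and writes PNM files, that tesseract 3.0 accepts that file and writes its hOCR output to <outputbase>.html when given the "hocr" config name, and that hocr2pdf reads the hOCR data on stdin and takes the image via -i and the PDF via -o. The function names and intermediate file names are made up.

```python
import subprocess
from pathlib import Path


def build_steps(scan, pdf, lang, workdir):
    """Return the commands to run as (argv, stdin_path) pairs.

    stdin_path is None unless the tool expects a file on stdin.
    """
    workdir = Path(workdir)
    cleaned = workdir / "cleaned.pnm"   # output of unpaper
    ocr_base = workdir / "page"         # tesseract appends the extension
    hocr = workdir / "page.html"        # assumed hOCR output name
    return [
        # 1. Preprocess the scanned image.
        (["unpaper", str(scan), str(cleaned)], None),
        # 2. Recognize text in the chosen language, producing hOCR.
        (["tesseract", str(cleaned), str(ocr_base), "-l", lang, "hocr"], None),
        # 3. Combine the image and the hOCR text layer into a PDF.
        (["hocr2pdf", "-i", str(cleaned), "-o", str(pdf)], hocr),
    ]


def run_steps(steps):
    """Run each command in order, stopping at the first failure."""
    for argv, stdin_path in steps:
        if stdin_path is None:
            subprocess.run(argv, check=True)
        else:
            with open(stdin_path, "rb") as stdin_file:
                subprocess.run(argv, stdin=stdin_file, check=True)
```

Building the command list separately from running it makes it possible to check the arguments without the tools being installed, e.g. run_steps(build_steps("scan.pnm", "out.pdf", "deu", "/tmp/work")).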

Note that I don't know whether these tools really have to be executed as
separate programs, or whether they ship with libraries that can be
called instead.

I don't have much experience with Vala, so I'm afraid I can't implement
this, but I hope this draft is still somewhat helpful.

-- 
You received this bug notification because you are a member of Simple
Scan Development Team, which is the registrant for Simple Scan.
https://bugs.launchpad.net/bugs/483391

Title:
  Extract text using optical character recognition (OCR)

Status in Simple Scan:
  Triaged

Bug description:
  Simple Scan should offer a workflow to do optical character recognition (OCR) on the scanned text.
  It is to be decided what this workflow should look like, but we should do it in two steps:

  Milestone 1: some-ocr-at-all:
  Get a minimum viable product: Add a button to the interface that reads "Recognize Text", and when it is clicked, the current page is saved (in an appropriate format) to /tmp/$something and the most mature OCR tool is invoked with that file as input.

  Milestone 2: integrated-ocr:
  Make the whole thing more integrated, so that Simple Scan scans with settings optimized for OCR, automatically applies relevant image preprocessing, allows selecting the area to work on from within the application, and probably allows exporting to PDF with searchable text and neat stuff like that.

  List of OCR engines / software that might be evaluated:
  ocropus: http://code.google.com/p/ocropus/source/list
  Cuneiform: https://launchpad.net/cuneiform-linux
  tesseract-ocr: http://code.google.com/p/tesseract-ocr/source/list
  Ocrad: http://www.gnu.org/software/ocrad/
  OCRFeeder: https://live.gnome.org/OCRFeeder

  Original Description:
  Add a "Text" profile that automatically runs the scan through OCR and saves in .txt format
