simple-scan-team team mailing list archive

[Bug 483391] Re: Extract text using optical character recognition (OCR)

To: simple-scan-team@xxxxxxxxxxxxxxxxxxx
From: papukaija <483391@xxxxxxxxxxxxxxxxxx>
Date: Sat, 25 Feb 2012 23:24:02 -0000
Reply-to: Bug 483391 <483391@xxxxxxxxxxxxxxxxxx>
Sender: bounces@xxxxxxxxxxxxx

** Tags added: patch

-- 
You received this bug notification because you are a member of Simple
Scan Development Team, which is the registrant for Simple Scan.
https://bugs.launchpad.net/bugs/483391

Title:
  Extract text using optical character recognition (OCR)

Status in Simple Scan:
  Triaged

Bug description:
  Simple Scan should offer a workflow for running optical character recognition (OCR) on scanned text.
  The exact workflow is still to be decided, but it should be implemented in two steps:

  Milestone 1: some-ocr-at-all:
  Get a minimum viable product: add a "Recognize Text" button to the interface. When it is clicked, the current page is saved (in an appropriate format) to /tmp/$something and the most mature OCR tool is invoked with that file as input.
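
  The Milestone 1 flow (save the page to a temporary file, then hand it to an external OCR tool) could look roughly like the sketch below. It is only an illustration, not Simple Scan code: it assumes the tesseract command-line tool is installed and used in its `tesseract IMAGE OUTPUTBASE` form, which writes OUTPUTBASE.txt. The function name `recognize_text`, the `run` parameter, and the file names are hypothetical.

```python
import os
import subprocess
import tempfile


def recognize_text(image_bytes, run=subprocess.run):
    """Save a scanned page to a temporary file, run OCR on it, return the text.

    `run` is injectable so the external tool can be replaced (hypothetical
    design choice for this sketch; it defaults to subprocess.run).
    """
    # The temporary directory lives under the system temp location
    # (usually /tmp on Linux) and is removed when the block exits.
    with tempfile.TemporaryDirectory(prefix="simple-scan-ocr-") as workdir:
        page_path = os.path.join(workdir, "page.png")
        output_base = os.path.join(workdir, "page")

        # Step 1: save the current page to a temporary file.
        with open(page_path, "wb") as page_file:
            page_file.write(image_bytes)

        # Step 2: invoke the OCR tool with that file as input.
        # Assumption: tesseract is on PATH and writes <output_base>.txt.
        run(["tesseract", page_path, output_base], check=True)

        # Step 3: read back the recognized text.
        with open(output_base + ".txt", encoding="utf-8") as text_file:
            return text_file.read()
```

  A "Recognize Text" button handler would pass the encoded page image to `recognize_text` and show or save the returned string. Swapping in another engine from the list below would only change the command that is run.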

  Milestone 2: integrated-ocr:
  Make the whole thing more integrated: Simple Scan scans with settings optimized for OCR, automatically applies relevant image preprocessing, allows selecting the area to work on from within the application, and possibly allows exporting to PDF with searchable text and similar features.

  List of OCR engines / software that might be evaluated:

  ocropus: http://code.google.com/p/ocropus/source/list
  Cuneiform: https://launchpad.net/cuneiform-linux
  tesseract-ocr: http://code.google.com/p/tesseract-ocr/source/list
  Ocrad: http://www.gnu.org/software/ocrad/
  OCRFeeder: https://live.gnome.org/OCRFeeder

  Original Description:
  Add a "Text" profile that automatically runs the scan through OCR and saves the result in .txt format.

To manage notifications about this bug go to:
https://bugs.launchpad.net/simple-scan/+bug/483391/+subscriptions