sikuli-driver team mailing list archive

Re: [Question #632761]: [HowTo] use external Tesseract install for OCR (version 3+) --- workaround

To: sikuli-driver@xxxxxxxxxxxxxxxxxxx
From: RaiMan <question632761@xxxxxxxxxxxxxxxxxxxxx>
Date: Sun, 04 Jun 2017 10:09:59 -0000
Reply-to: question632761@xxxxxxxxxxxxxxxxxxxxx
Sender: bounces@xxxxxxxxxxxxx

Question #632761 on Sikuli changed:
https://answers.launchpad.net/sikuli/+question/632761

Summary changed to:
[HowTo] use external Tesseract install for OCR (version 3+)  --- workaround

Description changed to:
********* Workaround for all who are somewhat dissatisfied with the quality and handling of the built-in Tesseract OCR support, which is based on version 2 features (thanks to Andrew Grabov)
----------------------------------------------------------------------
So I was able to connect to the external Tesseract, and I would say that it works absolutely fine!

I haven't noticed any speed issues in SikuliX script processing, maybe
because the laptop that runs the script is pretty powerful and has an
SSD, so the file operations have no noticeable effect on
performance.

As for OCR quality: it is noticeably better. I have compiled
both versions (3.05 and 4), and both work fine. The good thing is that
once you have a separate installation of Tesseract, you have full
control over it.

And here are the code snippets responsible for extracting the text from
the image:

##############################################

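# Paths to the external Tesseract install (adjust to your machine)
TESSERACT_EXEC = "\"C:\\Program Files\\tesseract3\\tesseractmain.exe\""
TESSERACT_TESSDATA = "\"C:\\Program Files\\tesseract3\\tessdata\""

def getText(region):
    # capture the region to an image file and get its path
    pathToImg = Screen().capture(region).getFilename()
    # quote the path in case it contains spaces
    quotedImg = "\"" + pathToImg + "\""
    # the image path doubles as the output base, so the text lands in <image>.txt
    run(TESSERACT_EXEC + " " + quotedImg + " " + quotedImg + " --tessdata-dir " + TESSERACT_TESSDATA)
    return readFile(pathToImg + ".txt")

def readFile(pathToFile):
    with open(pathToFile, 'r') as f:
        # line breaks are removed without a separator;
        # use "return f.read()" instead to keep them
        return f.read().replace('\n', '')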

##############################################

Works like a charm! ;)
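
For anyone who wants to try the same idea outside the Sikuli IDE, below is a
minimal sketch in plain Python. It is not part of the original workaround: the
function names build_tesseract_command and ocr_image are hypothetical, and it
swaps Sikuli's run() for the standard subprocess module. It assumes the same
command layout as the snippet above (image, output base, then --tessdata-dir).

```python
import subprocess


def build_tesseract_command(exe, image_path, tessdata_dir):
    """Build the argument list: <exe> <image> <outputbase> --tessdata-dir <dir>."""
    # Reusing the image path as the output base makes Tesseract
    # write its result to <image_path>.txt, as in the snippet above.
    return [exe, image_path, image_path, "--tessdata-dir", tessdata_dir]


def ocr_image(exe, image_path, tessdata_dir):
    """Run the external Tesseract and return the text it wrote to disk."""
    cmd = build_tesseract_command(exe, image_path, tessdata_dir)
    # An argument list needs no manual quoting, even for paths with spaces.
    subprocess.check_call(cmd)
    with open(image_path + ".txt", "r") as f:
        return f.read()
```

Passing the command as a list avoids the escaped quotes around
"C:\\Program Files\\..." that the string version needs.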

--------------------------------------------------------------------------------------------------------------------

Hi everyone,

could you please help me figure out why Tesseract ignores whitespace?
The text itself is recognised fine, but the whitespace between two words is missing.
I have already increased the space between the words, but it seems to be set to ignore them.

Can you suggest which Tesseract parameters I should try adjusting?
I have tried several from here: http://www.sk-spell.sk.cx/tesseract-ocr-parameters-in-302-version
but still no success...

Thank you in advance!
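
One thing that may be worth checking before tuning Tesseract itself: readFile()
in the snippet above removes every '\n' without putting anything in its place,
so two words recognised on separate lines come back glued together. Whether
that is the cause here is only an assumption. A minimal sketch of a
replacement, using a hypothetical helper name normalize_ocr_text:

```python
import re


def normalize_ocr_text(raw):
    """Collapse line breaks and repeated whitespace into single spaces."""
    # \s+ matches runs of spaces, tabs and newlines alike
    return re.sub(r"\s+", " ", raw).strip()
```

With this, normalize_ocr_text("Hello\nworld\n") gives "Hello world", where
replace('\n', '') gives "Helloworld". If the missing spaces are between words
on the same line, a Tesseract configuration variable such as
preserve_interword_spaces (passed as -c preserve_interword_spaces=1) may be
worth a try, but that is untested against this setup.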

-- 
You received this question notification because your team Sikuli Drivers
is an answer contact for Sikuli.