sikuli-driver team mailing list archive

Re: [Question #632761]: TesseractOCR ignoring white-spaces?

 

Question #632761 on Sikuli changed:
https://answers.launchpad.net/sikuli/+question/632761

Andrew Grabov posted a new comment:
Just an update on this topic:

I was able to connect to the external Tesseract, and I would say
that it works absolutely fine!

I haven't noticed any slowdown in SikuliX script processing, maybe
because the laptop that runs the script is pretty powerful and has an
SSD, so the file operations have no noticeable effect on
performance.

As for OCR quality: it is noticeably better. I have compiled
both versions (3.05 and 4), and both work fine. The good thing is that
once you have a separate installation of Tesseract, you have full control
over it.
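
As an illustration of that control, here is a minimal sketch of how options could be passed to a standalone tesseract binary. The helper name build_tesseract_command and the example file names are hypothetical and not part of the original script. -l, --tessdata-dir and -c preserve_interword_spaces=1 are Tesseract command-line options as far as I know, but check them against the version you compiled.

```python
def build_tesseract_command(exe, image, out_base,
                            tessdata=None, lang=None, keep_spaces=False):
    """Return the argument list for one tesseract run (hypothetical helper)."""
    # Positional arguments come first: input image, then output base name.
    cmd = [exe, image, out_base]
    if tessdata:
        # Point tesseract at a specific tessdata directory.
        cmd += ["--tessdata-dir", tessdata]
    if lang:
        # Select the recognition language, e.g. "eng".
        cmd += ["-l", lang]
    if keep_spaces:
        # Assumption: this config variable is supported by the installed build.
        cmd += ["-c", "preserve_interword_spaces=1"]
    return cmd


cmd = build_tesseract_command("tesseract", "shot.png", "shot",
                              lang="eng", keep_spaces=True)
print(" ".join(cmd))
```

In plain Python the list could be handed to subprocess.call; in a Sikuli script the joined string could be passed to run() instead, with quotes added around any path that contains spaces.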

And here are some code snippets that are responsible for extracting the
text from the image:

##############################################

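# Paths to the standalone Tesseract install; the embedded quotes protect the space in "Program Files".
TESSERACT_EXEC = "\"C:\\Program Files\\tesseract3\\tesseractmain.exe\""
TESSERACT_TESSDATA = "\"C:\\Program Files\\tesseract3\\tessdata\""

def getText(region):
    # Capture the region to an image file on disk.
    pathToImg = Screen().capture(region).getFilename()
    # The image path doubles as the output base, so Tesseract writes <image path>.txt.
    # Note: the image path itself is not quoted, so it must not contain spaces.
    run(TESSERACT_EXEC + " " + pathToImg + " " + pathToImg + " " + "--tessdata-dir " + TESSERACT_TESSDATA)
    return readFile(pathToImg + ".txt")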


def readFile(pathToFile):
    with open(pathToFile, 'r') as f:
        # Strip the newlines so the recognized text comes back as a single line.
        # Use f.read() alone to keep the line breaks.
        return f.read().replace('\n', '')

##############################################

Works like a charm! ;)
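
One thing to watch for: readFile above deletes the newlines outright, so the last word of one line gets glued to the first word of the next. A possible variant, sketched here as an assumption rather than as part of the original script, joins the lines with a single space instead. The name normalize_ocr_text is hypothetical.

```python
def normalize_ocr_text(text):
    """Join non-empty OCR output lines with single spaces (hypothetical helper)."""
    # strip() drops the trailing newline and any surrounding whitespace on each
    # line; spacing inside a line is left untouched.
    lines = [line.strip() for line in text.splitlines()]
    return " ".join(line for line in lines if line)


print(normalize_ocr_text("Hello\nworld\n"))  # prints "Hello world"
```

For comparison, "Hello\nworld\n".replace('\n', '') gives "Helloworld". Blank lines are dropped by this variant, so it only suits cases where the paragraph structure does not matter.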

-- 
You received this question notification because your team Sikuli Drivers
is an answer contact for Sikuli.