
sikuli-driver team mailing list archive

[Question #679003]: [1.1.4] IDE: OCR Tuning

 

New question #679003 on Sikuli:
https://answers.launchpad.net/sikuli/+question/679003

I am not a developer and am absolutely new to Tesseract. I tried to understand the Tesseract documentation on GitHub, but it is not clear to me which Tesseract functionality can or should be imported into or used in SikuliX (e.g. "copy only the traineddata files of your language into the SikuliX AppData folder").
Now I want to read some folder names in Windows 10 Explorer when storing the output of a web app locally and, based on the OCR result, change the folder or create a new subfolder. I assume that Windows 10 uses Segoe fonts. I have a German special character in my root folder path; the OCR result is: "Dieser PC > Lokaler Datentréger(Cz) > ...". This may also be a SikuliX issue, but I can use a workaround for it.

My issue:
When embedded in a meaningless mixture of digits and letters, a lowercase "l" (as in Lima) always (100%!) gets recognized as a pipe symbol ( | ). In addition, more than 70% of uppercase "O" (as in Oscar) in the same scenario get recognized as "0" (zero), and vice versa for the zero.
Zooming the characters in Windows Explorer to 150% didn't help. I assume the root cause is the use of different fonts.

My questions:
1. How can I tell Tesseract OCR that it should try to recognize Segoe fonts?
2. Until now I have only added the German traineddata files to the Tesseract folder. Are there font sets to add as well?
3. Can I provide a blacklist of characters to Tesseract OCR, saying that there will never be a pipe symbol in the text?
4. What are the standard fonts of the Tesseract version that is embedded in SikuliX 1.1.4? My idea is to switch the standard fonts of Windows Explorer to ones Tesseract handles well (and switch them back afterwards).
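
Regarding question 3: independent of any Tesseract setting, a stopgap on the script side could be to post-process the OCR string. The following is only a minimal sketch in plain Python, assuming that a pipe never legitimately occurs in the folder names; `clean_ocr_text` is a hypothetical helper name, not part of SikuliX or Tesseract, and the sample string is made up.

```python
def clean_ocr_text(text, banned="|", replacement="l"):
    """Replace every banned character in an OCR result with a fixed substitute.

    Assumption: the banned characters never occur in the real text, so each
    occurrence must be a misread of the replacement character.
    """
    return "".join(replacement if ch in banned else ch for ch in text)


# Made-up OCR result in which each lowercase "l" was read as a pipe
raw = "fo|der_x7|q"
print(clean_ocr_text(raw))
```

This only helps with the pipe symbol. The "O"/"0" confusion cannot be repaired this way, because both characters can legitimately occur in the same name.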

Thanks a lot in advance!

