
cuneiform team mailing list archive

[Bug 623438] Re: Font size not correct in merged sandvich PDF

 

@Igor: I searched for quite a while. I don't remember ocrad explicitly
now, but I am quite sure I came across it. I also read in other places
(blog posts) that cuneiform seems to be the only one producing hOCR output.

I would be glad if there were more choices. I have written a general
file converter that currently has a plugin using ABBYY to produce OCRed
PDFs, and I am also writing a plugin for cuneiform. If another engine
were available, I would immediately start a plugin for that one as
well.

@Don: Thanks, I know VLinux. I have a visually impaired friend, and VLinux was also mentioned on the goinglinux podcast.

Back to topic, regarding the sandwich PDF: AFAIK a sandwich PDF places the recognized text below the image, so that each piece of text is tied to the position on the page where it belongs. That is more than having the text as one long string (which is what you usually get when you take the OCR result from a TIFF as plain text without producing a PDF). In theory you could then group text columns so that a screenreader reads them in the order visually impaired users need (I know the issues you are talking about). But as far as I know cuneiform cannot build such groups. The hOCR output positions either each single character or a whole line. I think ABBYY FineReader is currently the best option out there and produces really good results (but it costs money).

@Yury: What he is asking basically is: using cuneiform + hocr2pdf,
would he have a chance to get a PDF that a screenreader (for visually
impaired people) reads in the correct order? E.g. for a page with a
left and a right column of text, it should read the whole left column
first and then the right column, not the first line of the left column,
then the first line of the right column, then the second line of the
left column, and so on.
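To make the reading-order question concrete, here is a minimal sketch of how line boxes from hOCR output could be regrouped into columns before they are handed to a PDF writer. It is only an illustration under assumptions: the helper names (parse_bbox, reading_order) are hypothetical, neither cuneiform nor hocr2pdf is known to do this, and it relies only on the hOCR convention that an element's title attribute carries "bbox x0 y0 x1 y1" in pixels.

```python
import re
from typing import List, Tuple

# One recognized text line: (x0, y0, x1, y1, text), coordinates in pixels.
Line = Tuple[int, int, int, int, str]


def parse_bbox(title: str) -> Tuple[int, int, int, int]:
    """Extract the 'bbox x0 y0 x1 y1' part of an hOCR title attribute."""
    m = re.search(r"bbox\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)", title)
    if m is None:
        raise ValueError("no bbox found in title: %r" % title)
    x0, y0, x1, y1 = (int(g) for g in m.groups())
    return x0, y0, x1, y1


def reading_order(lines: List[Line]) -> List[str]:
    """Return line texts column by column (left to right, top to bottom).

    Hypothetical heuristic: lines whose horizontal extents overlap are
    treated as belonging to the same column.
    """
    columns = []  # each column: {"x0": left, "x1": right, "lines": [...]}
    for line in sorted(lines, key=lambda l: l[0]):
        x0, _, x1, _, _ = line
        for col in columns:
            if x0 < col["x1"] and x1 > col["x0"]:  # horizontal overlap
                col["x0"] = min(col["x0"], x0)
                col["x1"] = max(col["x1"], x1)
                col["lines"].append(line)
                break
        else:
            columns.append({"x0": x0, "x1": x1, "lines": [line]})

    columns.sort(key=lambda c: c["x0"])
    ordered = []
    for col in columns:
        # Within a column, read from top to bottom.
        ordered.extend(l[4] for l in sorted(col["lines"], key=lambda l: l[1]))
    return ordered


# Two columns, two lines each, listed in the naive row-by-row order.
page = [
    (50, 100, 400, 130, "left 1"),
    (450, 100, 800, 130, "right 1"),
    (50, 140, 400, 170, "left 2"),
    (450, 140, 800, 170, "right 2"),
]
print(reading_order(page))
```

This heuristic is deliberately simple: a heading that spans the full page width would overlap both columns and merge them into one, so real layouts need something more careful.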

-- 
Font size not correct in merged sandvich PDF
https://bugs.launchpad.net/bugs/623438
You received this bug notification because you are a member of Cuneiform
Linux, which is the registrant for Cuneiform for Linux.

Status in Linux port of Cuneiform: Invalid
Status in “exactimage” package in Ubuntu: New

Bug description:
After processing with Cuneiform for Linux 1.0.0 and hOCR to PDF converter, version 0.7.4 (which should be the most current version), I get a sandwich PDF that looks nice until I select text.

See the sample 5AADFEE1-0000.* files in the attachment and the result.pdf.
The effect is shown in screen087.png

For another file (Test10pages.pdf) the effect is even worse: basically I cannot really select text to copy any more, because I can only guess where to move the mouse.

It looks like the font size in the HTML is somehow not correct. I am not an expert, but this link might help you:
http://www.emdpi.com/fontsize.html
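
As a rough illustration of why font size and resolution are connected (not a diagnosis of this bug): hOCR bounding boxes are given in pixels, while PDF font sizes are given in points, with 72 points per inch. The sketch below converts a box height to points for a given scan resolution. The function name and the example values are hypothetical, and whether hocr2pdf computes the size this way is an assumption.

```python
def bbox_height_to_points(y0: int, y1: int, dpi: float) -> float:
    """Convert the pixel height of a text box to PDF points (1 pt = 1/72 inch)."""
    if dpi <= 0:
        raise ValueError("dpi must be positive")
    return (y1 - y0) * 72.0 / dpi


# A line box that is 50 pixels tall (hypothetical values).
print(bbox_height_to_points(100, 150, 300))  # scanned at 300 dpi
print(bbox_height_to_points(100, 150, 72))   # same box if 72 dpi is assumed
```

If the converter assumes a different resolution than the one the image was scanned at, the invisible text layer ends up scaled by the ratio of the two resolutions, which would make text selection miss the visible glyphs.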




