
cuneiform team mailing list archive

Re: [Bug 623438] Re: Font size not correct in mergedsandvich PDF

 

I am not an expert in PDF internal formats at this point.  I may need to 
start learning.  I also have an application, actually a long bash script, 
whose capabilities I want to extend so that it can take several scanned 
pages that have had OCR performed and merge the text with the original 
images in a PDF.  The package is called speedy-ocr.

Does having a sandwiched PDF mean that the text is then editable in Adobe, 
as opposed to just attached as a searchable, structured note?  I am writing 
this script to simplify scanning and OCR functionality for the blind and 
visually impaired community.  Screen readers, Orca in this case, will need 
structured text so that the text can be read in the appropriate order, if 
possible.  I do not know yet how much of the structure can be retrieved from 
cuneiform, if any.  For our purposes, the font information is not necessary 
for most users.  They just need to be able to retrieve and store fairly 
accurate text, in the correct reading order, for each page.  Is this type of 
merge different from a sandwiched PDF?  Is this simply attached searchable 
text?
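
For the reading-order question, one possible starting point is the hOCR 
output mentioned in the bug report below.  The following is only a minimal 
Python sketch.  It assumes hOCR-style markup in which each text line is an 
element with class "ocr_line" and a title attribute containing 
"bbox x0 y0 x1 y1"; whether cuneiform's output carries exactly these 
attributes is an assumption, and the names HocrLineParser and 
lines_in_reading_order are hypothetical.  It sorts lines top to bottom, then 
left to right, which is only adequate for single-column pages.

```python
import re
from html.parser import HTMLParser

BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")
VOID_TAGS = {"br", "hr", "img", "meta", "input", "link"}  # tags with no closing tag


class HocrLineParser(HTMLParser):
    """Collects (bbox, text) for every element with class 'ocr_line'."""

    def __init__(self):
        super().__init__()
        self.lines = []    # list of ((x0, y0, x1, y1), text)
        self._depth = 0    # nesting depth inside the current ocr_line
        self._bbox = None
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self._depth:
            self._depth += 1  # nested element inside a line, e.g. a word span
            return
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        match = BBOX_RE.search(attrs.get("title") or "")
        if "ocr_line" in classes and match:
            self._depth = 1
            self._bbox = tuple(int(v) for v in match.groups())
            self._parts = []

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self._depth:
            return
        self._depth -= 1
        if self._depth == 0:
            # Collapse runs of whitespace left over from the markup.
            text = " ".join("".join(self._parts).split())
            self.lines.append((self._bbox, text))

    def handle_data(self, data):
        if self._depth:
            self._parts.append(data)


def lines_in_reading_order(hocr_html):
    """Return line texts sorted by top edge, then left edge (single column)."""
    parser = HocrLineParser()
    parser.feed(hocr_html)
    parser.close()
    ordered = sorted(parser.lines, key=lambda item: (item[0][1], item[0][0]))
    return [text for _, text in ordered]
```

The resulting list could be written to a plain text file per page for a 
screen reader.  Multi-column layouts would need the block or paragraph 
structure as well, if the OCR output provides it.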

We have a distribution of Ubuntu 10.04 Lucid that configures several 
accessibility systems, and a group of developers worldwide is attempting to 
fix GNOME applications for accessibility.  Most of the fixes get sent 
upstream and incorporated into Ubuntu, partly because Luke is now using the 
Vinux distribution as a testbed.  The distribution is called Vinux, and its 
home page is vinux.org.uk.  Our repositories are also on Launchpad.net.

Don Marang

There is just so much stuff in the world that, to me, is devoid of any real 
substance, value, and content that I just try to make sure that I am working 
on things that matter.
Dean Kamen


--------------------------------------------------
From: "Martin Wildam" <623438@xxxxxxxxxxxxxxxxxx>
Sent: Friday, September 10, 2010 4:05 AM
To: <cuneiform@xxxxxxxxxxxxxxxxxxx>
Subject: [Cuneiform] [Bug 623438] Re: Font size not correct in 
mergedsandvich PDF

> I have discussed this with somebody who is an expert in PDF and my
> current understanding is that for creating the PDF the underlying text
> behind the image displayed needs font size, spacing etc information to
> be correctly displayed in the viewer.
>
> I noticed that not only the selection in the viewer does not work
> correctly. Also a lot of words are not found using the internal search
> functionality of viewers (tested with Evince and Adobe Acrobat Reader).
>
> Side note: If I extract the full text using a PDF library I get a
> correct looking text (words separated by spaces, no spaces within
> words).
>
> I think that creating a correct sandvich PDF is crucial and wonder why
> not more people are interested in this. But I also think, that it is not
> easy. I think it would be necessary to get experts in OCR, experts in
> PDF and experts in fonts together to solve this. - The key missing thing
> IMHO is to get font metric (font name, size, spacing, ...) information
> when only having the bounding boxes and contained text. Therefore I
> posted also the link above which I find important.
>
> -- 
> Font size not correct in merged sandvich PDF
> https://bugs.launchpad.net/bugs/623438
> You received this bug notification because you are a member of Cuneiform
> Linux, which is the registrant for Cuneiform for Linux.
>
> Status in Linux port of Cuneiform: Invalid
>
> Bug description:
> After processing with Cuneiform for Linux 1.0.0 and hOCR to PDF converter, 
> version 0.7.4 (should be the most current version) I get a sandvich pdf 
> that looks nice until I select text.
>
> See the sample 5AADFEE1-0000.* files in the attachment and the result.pdf.
> The effect is shown in screen087.png
>
> For another file (Test10pages.pdf) the effect is even worse - basically 
> I cannot really select any text to copy because I can only guess where 
> to move the mouse.
>
> It looks like that the font size in the HTML is somehow not correct - I am 
> not an expert, but this link might help you:
> http://www.emdpi.com/fontsize.html
>
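
The font-metric problem described in the quoted message can be illustrated 
with a little arithmetic.  Below is a rough Python sketch, not the method 
used by any of the tools named in this thread, that estimates a font size 
and a horizontal scaling percentage for an invisible text layer from a 
line's bounding box.  It assumes that the scan resolution in DPI is known, 
that the box height approximates the font size, and that an average glyph is 
about half the font size wide.  The function name estimate_text_layout and 
the 0.5 default are hypothetical.

```python
POINTS_PER_INCH = 72.0  # PDF default user space uses 1/72 inch units


def estimate_text_layout(bbox, text, dpi, avg_glyph_width_em=0.5):
    """Estimate (font_size_pt, horizontal_scale_percent) for one text line.

    bbox is (x0, y0, x1, y1) in scan pixels.  avg_glyph_width_em is an
    assumed average glyph width as a fraction of the font size; a real
    implementation would use the metrics of the font actually embedded.
    """
    if dpi <= 0:
        raise ValueError("dpi must be positive")
    x0, y0, x1, y1 = bbox
    width_pt = (x1 - x0) * POINTS_PER_INCH / dpi
    height_pt = (y1 - y0) * POINTS_PER_INCH / dpi

    # Assumption: the box height stands in for the font size.
    font_size = height_pt

    # Width the text would occupy at that size with the assumed glyph width.
    natural_width = len(text) * avg_glyph_width_em * font_size
    if natural_width <= 0:
        return font_size, 100.0

    # Stretch or squeeze the text so it spans the box horizontally.
    h_scale = 100.0 * width_pt / natural_width
    return font_size, h_scale
```

For example, a line box 600 pixels wide and 50 pixels high at 300 DPI is 
144 pt by 12 pt.  If the hidden text does not end up spanning roughly that 
area, selection and search in a viewer can drift away from the visible 
image, which is the kind of effect described in the bug report.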

-- 
Font size not correct in merged sandvich PDF
https://bugs.launchpad.net/bugs/623438
You received this bug notification because you are a member of Cuneiform
Linux, which is the registrant for Cuneiform for Linux.

Status in Linux port of Cuneiform: Invalid
Status in “exactimage” package in Ubuntu: New
