cuneiform team mailing list archive

Fwd: PDF output

To: cuneiform@xxxxxxxxxxxxxxxxxxx, "Jussi Pakkanen" <jpakkane@xxxxxxxxx>
From: "René Rebe" <rene.rebe@xxxxxxxxx>
Date: Sat, 6 Sep 2008 18:21:28 +0200
In-reply-to: <84c9f6b0809060918g60567847mfd4b504501c4ecc1@mail.gmail.com>
Jussi: sorry for the double post to your address; somehow Gmail managed
to merge your address with the Launchpad one, again.

Hi, again,

On Sat, Sep 6, 2008 at 4:09 PM, René Rebe <rene.rebe@xxxxxxxxx> wrote:
> Hi,
>
> On Tue, Sep 2, 2008 at 1:22 PM, René Rebe <rene.rebe@xxxxxxxxx> wrote:
>> Hi all,
>>
>> I plan to add PDF writing to create searchable PDFs from cuneiform on Linux.
>>
>> So far I "only" did some light scrolling and grep'ing back and forth over
>> the code, and as I have not yet fully memorized its structure I wanted to
>> ask those already more familiar with the code, before I start, about the
>> best place to add such code.
>>
>> So far I identified: cuneiform_src/Kern/rout/src/
>>
>> In which I would start by making a copy of html.cpp to add the corresponding
>> PDF tag writeouts, using ExactImage
>> (http://www.exactcode.de/site/open_source/exactimage/)
>> for the actual PDF structure generation. (ExactImage SVN:HEAD only includes
>> very static pure image writing, but I already rewrote that part and have
>> vector, font, image and multi-page writing in my local working copy.)
>>
>> Any hints welcome,
>
> Ok, a debugger was not too helpful with all the pointers to handles, sigh.
>
> Anyway, I found how to get the layout bounding boxes while within the
> HTML writer:
>
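>                        {
>                                // Emit an hOCR-style bounding box for the current character.
>                                char buf[256] = "";
>                                edRect r = CED_GetCharLayout(hObject);
>                                if (r.left != -1) {  // skip characters without layout
>                                        snprintf(buf, sizeof(buf),
>                                                 "<span title=\"bbox %d %d %d %d\">",
>                                                 r.left, r.top, r.right, r.bottom);
>                                        PUT_STRING(buf);
>                                }
>                        }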
> ...
>
> One issue I found during testing is that the engine does not appear
> to generate line breaks deterministically. For my two-column test text
> another issue arises: sometimes (not always) hyphens are dropped from
> the output when a word break is recognized. Of course, for writing a
> useful PDF some form of "soft hyphen" needs to be generated,
> especially at line breaks, in order to place the text at the correct
> location.
>
> One workaround that immediately comes to mind is to use some form of
> post-processing, where the x position is tracked and missing line
> breaks inserted where the content flow wraps around depending on the
> writing direction. Probably soft hyphens can be inserted when a line
> break is "auto detected" in the middle of a word.
>
> I'll post more complete patches when I get somewhere there.
>
> Are there any preferences where to add this hOCR related HTML
> annotation? Conditionalized into the existing html writer, or as a
> second copy of it adding those bounding boxes and possibly some post
> processing for the issues mentioned above?
>
> Have a nice weekend,
>  René

While adding paragraph bounding boxes I noticed that currently
all paragraphs are apparently created with the layout (bounding box)
coordinates set to -1.

With CEDSection::CreateParagraph instrumented to log the layout, I
only see calls like this:

CreateParagraph: -1, -1, -1, -1
CreateParagraph: -1, -1, -1, -1
CreateParagraph: -1, -1, -1, -1
CreateParagraph: -1, -1, -1, -1
CreateParagraph: -1, -1, -1, -1
CreateParagraph: -1, -1, -1, -1
CreateParagraph: -1, -1, -1, -1
CreateParagraph: -1, -1, -1, -1
CreateParagraph: -1, -1, -1, -1

These -1 values persist until they reach my modified HTML writer,
when it writes out BROWSE_PARAGRAPH_START.

I took a look at the paragraph creation, and all call sites appear not to
initialize the paragraph with real values :-(

I guess I'll work with bounding boxes for each character glyph for now
and construct the line and paragraph boxes outside of cuneiform, at least
to begin with :-(
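
For illustration, building line and paragraph boxes from per-character
boxes outside of cuneiform could look roughly like the sketch below. It
is only a sketch under assumptions: Box, Glyph, Line, buildLines and
paragraphBox are hypothetical names and not part of cuneiform; Box merely
mirrors the left/top/right/bottom fields used above, with left == -1
meaning "no layout". It also assumes left-to-right text with y growing
downward, and starts a new line when x moves backwards or when a glyph
no longer overlaps the current line vertically (for example on a column
change). Soft hyphens are not handled here.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical stand-ins, not cuneiform types. left == -1 means "no layout".
struct Box { int left, top, right, bottom; };
struct Glyph { char ch; Box box; };          // one recognized character
struct Line { std::string text; Box box; };  // text plus the union of its glyph boxes

// Grow acc so that it also covers b.
static void extend(Box& acc, const Box& b) {
    acc.left = std::min(acc.left, b.left);
    acc.top = std::min(acc.top, b.top);
    acc.right = std::max(acc.right, b.right);
    acc.bottom = std::max(acc.bottom, b.bottom);
}

// Split a left-to-right glyph stream into lines.
std::vector<Line> buildLines(const std::vector<Glyph>& glyphs) {
    std::vector<Line> lines;
    int prevLeft = -1;
    for (const Glyph& g : glyphs) {
        if (g.box.left == -1) {
            // No geometry (e.g. a space): keep the character on the current
            // line; it is dropped if no line has been started yet.
            if (!lines.empty()) lines.back().text += g.ch;
            continue;
        }
        bool newLine = lines.empty();
        if (!newLine) {
            const Box& cur = lines.back().box;
            bool movedBack = g.box.left < prevLeft;
            bool noOverlap = g.box.top > cur.bottom || g.box.bottom < cur.top;
            newLine = movedBack || noOverlap;
        }
        if (newLine) {
            lines.push_back(Line{std::string(1, g.ch), g.box});
        } else {
            lines.back().text += g.ch;
            extend(lines.back().box, g.box);
        }
        prevLeft = g.box.left;
    }
    return lines;
}

// Union of all line boxes; all -1 if there are no lines.
Box paragraphBox(const std::vector<Line>& lines) {
    if (lines.empty()) return Box{-1, -1, -1, -1};
    Box acc = lines.front().box;
    for (const Line& l : lines) extend(acc, l.box);
    return acc;
}
```

The glyph stream would be collected from the per-character layout calls
in the writer; how paragraphs are delimited in that stream is left open
here.
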
References

PDF output
From: René Rebe, 2008-09-02