cuneiform team mailing list archive

Re: PDF output

To: cuneiform@xxxxxxxxxxxxxxxxxxx, "Jussi Pakkanen" <jpakkane@xxxxxxxxx>
From: "René Rebe" <rene.rebe@xxxxxxxxx>
Date: Sat, 6 Sep 2008 16:29:02 +0200
In-reply-to: <84c9f6b0809060709o1b0405dye1697a19055a2c@mail.gmail.com>

Hi,

On Tue, Sep 2, 2008 at 1:22 PM, René Rebe <rene.rebe@xxxxxxxxx> wrote:
> Hi all,
>
> I plan to add PDF writing to create searchable PDFs from cuneiform on Linux.
>
> So far I have "only" done some light scrolling and grepping back and forth over
> the code, and as I have not yet fully memorized its structure I wanted to
> ask those already more familiar with the code, before I start, about the
> best place to add such code.
>
> So far I identified: cuneiform_src/Kern/rout/src/
>
> In which I would start by making a copy of html.cpp to add the corresponding
> PDF tag writeouts, using ExactImage
> (http://www.exactcode.de/site/open_source/exactimage/)
> for the actual PDF structure generation. (ExactImage SVN:HEAD only includes
> very static pure image writing, but I already rewrote that part and have
> vector, font, image and multi-page writing in my local working copy already).
>
> Any hints welcome,

Ok, a debugger was not too helpful with all the pointers to handles, sigh.

Anyway, I found how to get the layout bounding boxes while within the
HTML writer:

                       {
                               /* Emit an hOCR-style bbox span if the character has layout info. */
                               char buf[256] = "";
                               edRect r = CED_GetCharLayout(hObject);
                               if (r.left != -1) {
                                       snprintf(buf, sizeof(buf), "<span title=\"bbox %d %d %d %d\">",
                                                r.left, r.top, r.right, r.bottom);
                                       PUT_STRING(buf);
                               }
                       }
...

One issue I found during testing is that the engine does not appear
to generate line breaks deterministically. For my two-column test text
another issue arises: sometimes (not always) hyphens are dropped from
the output when a word break is recognized. Of course, for writing a
useful PDF some form of "soft hyphen" needs to be generated,
together with the line break, in order to place the text at the correct
location.

One workaround that immediately comes to mind is to use some form of
post-processing, where the x position is tracked and missing line
breaks are inserted where the content flow wraps around, depending on the
writing direction. Soft hyphens could probably be inserted when a line
break is "auto detected" in the middle of a word.
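
A rough sketch of that post-processing idea, to make it concrete. It is
not cuneiform code: Glyph and insertMissingBreaks are hypothetical names,
the bounding box is assumed to come from something like
CED_GetCharLayout(), and it assumes left-to-right text with single-byte
characters. It only looks at x, so a jump from the bottom of one column
to the top of the next would still need separate handling.

```cpp
#include <cctype>
#include <string>
#include <vector>

// Hypothetical stand-in for one recognized character and its bounding box.
// left == -1 means "no layout information", as in the snippet above.
struct Glyph {
    char ch;
    int left, top, right, bottom;
};

// Walk the glyphs in output order and insert a line break wherever the
// x position jumps backwards without the engine having emitted one.
// If the break falls between two letters, a soft hyphen (U+00AD, encoded
// as UTF-8) is emitted in front of it.
std::string insertMissingBreaks(const std::vector<Glyph>& glyphs) {
    const std::string softHyphen = "\xC2\xAD";
    std::string out;
    int lastLeft = -1;  // x of the last positioned glyph on the current line

    for (const Glyph& g : glyphs) {
        if (g.ch == '\n') {          // the engine already broke the line here
            out += '\n';
            lastLeft = -1;
            continue;
        }
        if (g.left != -1) {
            if (lastLeft != -1 && g.left < lastLeft) {
                // Content flow wrapped around: a line break is missing.
                bool midWord = !out.empty()
                    && std::isalpha(static_cast<unsigned char>(out.back()))
                    && std::isalpha(static_cast<unsigned char>(g.ch));
                if (midWord)
                    out += softHyphen;
                out += '\n';
            }
            lastLeft = g.left;
        }
        out += g.ch;                 // glyphs without layout pass through as-is
    }
    return out;
}
```

For right-to-left text the comparison would have to be reversed, and the
"middle of a word" test is only a guess based on the two neighbouring
characters.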

I'll post more complete patches when I get somewhere there.

Are there any preferences as to where to add this hOCR-related HTML
annotation? Conditionally in the existing HTML writer, or as a
second copy of it that adds those bounding boxes and possibly some
post-processing for the issues mentioned above?

Have a nice weekend,
 René

References

PDF output
From: René Rebe, 2008-09-02