cuneiform team mailing list archive

Re: PDF output

To: cuneiform@xxxxxxxxxxxxxxxxxxx
From: "René Rebe" <rene.rebe@xxxxxxxxx>
Date: Thu, 4 Sep 2008 11:33:53 +0200
In-reply-to: <84c9f6b0809020422x33f0f5a9l66feb6fff78bfff1@mail.gmail.com>

Hi,

> > In which I would start by making a copy of html.cpp to add the corresponding
> > PDF tag writeouts, using ExactImage
> > (http://www.exactcode.de/site/open_source/exactimage/)
> > for the actual PDF structure generation. (ExactImage SVN:HEAD only includes
> > very static pure image writing, but I already rewrote that part and have any
> > vector, font, image and multi-page writing in my local working copy, already).
> >
> > Any hints welcome,
>
> Exactimage is GPL code. Linking to it is legal but would contaminate
> Cuneiform (which is BSD). For this reason I can't accept it into
> trunk.

Yes, that came to my mind as well after the post. Guess I overlooked
that part as I so seldom come into contact with BSD-licensed code :-)

...

> The easiest way to get PDF output is to convert the RTF output to PDF.
> I would imagine that there are already programs that do this. I'm also
> looking into adding the layout information to the HTML exporter using
> hOCR format. Having a hOCR -> PDF converter would probably be
> beneficial outside Cuneiform as well.

Yes, I agree about the hOCR point. However, I think RTF lacks the
exact positioning a PDF writer needs to layer the text behind the image
in the final PDF.
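
To illustrate the positioning point: a rough sketch (not ExactImage code) of
the mapping a PDF writer needs for each recognized word. It assumes the OCR
bounding box is given in image pixels with the origin at the top left, as in
hOCR, and that the page uses default PDF user space (72 units per inch,
origin at the bottom left). The function name and the dpi parameter are made
up for this example.

```python
def bbox_px_to_pdf_pt(bbox, page_height_px, dpi):
    """Map a pixel bbox (x0, y0, x1, y1), top-left origin, to PDF points.

    Returns (left, bottom, right, top) in default PDF user space,
    where the origin is at the bottom left and one unit is 1/72 inch.
    """
    x0, y0, x1, y1 = bbox
    left = x0 * 72.0 / dpi
    right = x1 * 72.0 / dpi
    # Flip the vertical axis: the lower pixel edge (y1) becomes the PDF bottom.
    bottom = (page_height_px - y1) * 72.0 / dpi
    top = (page_height_px - y0) * 72.0 / dpi
    return (left, bottom, right, top)
```

RTF carries no such per-word rectangle, which is why it is a poor starting
point for placing invisible text exactly under the scanned image.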

I'll now add an hOCR (HTML) parser to the PDF writer of ExactImage,
so that one can feed the formatting stream with bounding boxes from
"any" hOCR program and obtain a searchable PDF.
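
For reference, a minimal sketch of the parsing side, in Python rather than
the C++ that ExactImage would use, and not the actual implementation. It
assumes hOCR input where each word is an element with class "ocrx_word" and
a title attribute carrying "bbox x0 y0 x1 y1", optionally followed by other
";"-separated properties. The names HocrWordParser and hocr_words are
invented for this example.

```python
from html.parser import HTMLParser


def parse_bbox(title):
    """Return (x0, y0, x1, y1) from an hOCR title attribute, or None."""
    for prop in title.split(";"):
        fields = prop.split()
        if len(fields) == 5 and fields[0] == "bbox":
            return tuple(int(v) for v in fields[1:])
    return None


class HocrWordParser(HTMLParser):
    """Collect (bbox, text) pairs for every ocrx_word element."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._tag = None    # tag name of the word element being read
        self._depth = 0     # nesting depth of same-named tags inside it
        self._bbox = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self._tag is None:
            classes = (attrs.get("class") or "").split()
            bbox = parse_bbox(attrs.get("title") or "")
            if "ocrx_word" in classes and bbox is not None:
                self._tag, self._depth = tag, 1
                self._bbox, self._text = bbox, []
        elif tag == self._tag:
            self._depth += 1

    def handle_endtag(self, tag):
        if self._tag is not None and tag == self._tag:
            self._depth -= 1
            if self._depth == 0:
                text = "".join(self._text).strip()
                self.words.append((self._bbox, text))
                self._tag = None

    def handle_data(self, data):
        if self._tag is not None:
            self._text.append(data)


def hocr_words(hocr_html):
    """Parse an hOCR document string into a list of (bbox, text) pairs."""
    parser = HocrWordParser()
    parser.feed(hocr_html)
    parser.close()
    return parser.words
```

Each (bbox, text) pair is what the PDF writer would then consume to place
one piece of text behind the page image.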

References

PDF output
From: René Rebe, 2008-09-02