cuneiform team mailing list archive

Re: Creating searchable PDF from Cuneiform OCR results

To: cuneiform@xxxxxxxxxxxxxxxxxxx
From: "Jussi Pakkanen" <jpakkane@xxxxxxxxx>
Date: Wed, 17 Sep 2008 12:21:59 +0300
In-reply-to: <84c9f6b0809160229r6cc6fb86w84c502e70e1b55c9@mail.gmail.com>

On Tue, Sep 16, 2008 at 12:29 PM, René Rebe <rene.rebe@xxxxxxxxx> wrote:

> as I wrote earlier we worked on creating searchable PDFs from Cuneiform
> (or other) OCR results.
>
> ExactImage 0.6(.0) now comes with a revamped PDF writer and hocr2pdf
> front-end. Together with a patch to cuneiform that annotates each recognized
> glyph with a hOCR-like bounding box, this allows the creation of pretty exactly
> positioned, searchable PDF files:

This is very cool. Great work.

> Cuneiform annotated HTML patch (includes already committed <>& fix), which
> is not yet conditional. For merging, it probably should only output the
> additional formatting based on some additional command line switch, e.g.
> --hocr instead of --html or so, but that probably requires changing some
> 20+ files to pass the information down to the point where the HTML is
> written:

I'll look into integrating this. Getting the hOCR/HTML switch should
be quite straightforward, since PUMA_TOHTML and ROUT_FMT_HTML are only
used in six different source files altogether.
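
As a rough illustration of what such a switch could look like at the
point where a character is written, here is a minimal sketch in C. The
names out_format, glyph_box and write_glyph are made up for this example
and are not cuneiform identifiers; the only thing taken from hOCR is the
title="bbox x0 y0 x1 y1" convention. Escaping of <, > and & is left out.

```c
#include <stdio.h>

/* Hypothetical output modes, invented for this sketch. The real code
   selects its format through PUMA_TOHTML / ROUT_FMT_HTML. */
enum out_format { OUT_HTML, OUT_HOCR };

/* Hypothetical bounding box of one recognized glyph, in pixels. */
struct glyph_box { int x0, y0, x1, y1; };

/* Write one recognized character into buf.
   In OUT_HOCR mode the character is wrapped in a span carrying an
   hOCR-style bbox; in OUT_HTML mode it is written as-is.
   Returns the value of snprintf. */
int write_glyph(char *buf, size_t size, enum out_format fmt,
                char ch, struct glyph_box b)
{
    if (fmt == OUT_HOCR)
        return snprintf(buf, size,
                        "<span title=\"bbox %d %d %d %d\">%c</span>",
                        b.x0, b.y0, b.x1, b.y1, ch);
    return snprintf(buf, size, "%c", ch);
}
```

With a single flag like this, the plain HTML output stays unchanged
unless the hOCR mode is explicitly requested.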

> Have fun, patches and inspiration welcome,

I see that you added line feeds after HTML tags to make the output
easier to read. There is a preprocessor macro NEW_LINE for this. Yes,
it is slightly brain-dead but should probably be used for consistency.
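
To show the idea only: the sketch below uses a stand-in NEW_LINE macro
that is assumed to expand to a newline string literal. The actual
definition in the cuneiform sources may well differ, and write_paragraph
is a hypothetical helper, not an existing function.

```c
#include <stdio.h>

/* Stand-in definition for illustration; the real NEW_LINE macro in
   cuneiform may be defined differently. */
#define NEW_LINE "\n"

/* Hypothetical helper: write a paragraph tag pair followed by a line
   break that comes from the macro instead of a hard-coded "\n". */
int write_paragraph(char *buf, size_t size, const char *text)
{
    return snprintf(buf, size, "<p>%s</p>" NEW_LINE, text);
}
```

Routing every line break through one macro keeps the output consistent
if the line ending ever needs to change.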
