
cuneiform team mailing list archive

Re: Creating searchable PDF from Cuneiform OCR results

 

On Tue, Sep 16, 2008 at 12:29 PM, René Rebe <rene.rebe@xxxxxxxxx> wrote:

> as I wrote earlier we worked on creating searchable PDFs from Cuneiform
> (or other) OCR results.
>
> ExactImage 0.6(.0) now comes with a revamped PDF writer and an hocr2pdf
> front-end. Together with a patch to Cuneiform that annotates each recognized
> glyph with an hOCR-like bounding box, this allows the creation of quite
> exactly positioned, searchable PDF files:

This is very cool. Great work.

> Cuneiform annotated HTML patch (includes the already committed <>& fix), which
> is not yet conditional. Before merging, it should probably only output the
> additional formatting when some additional command line switch is given, e.g.
> --hocr instead of --html or so, but that probably requires changing some 20+
> files to pass the information down to the point where the HTML is written:

I'll look into integrating this. Adding the hOCR/HTML switch should
be quite straightforward, since PUMA_TOHTML and ROUT_FMT_HTML are only
used in six different source files altogether.
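
To make the idea concrete, here is a rough sketch of what a format-dependent glyph writer could look like. It is not taken from the patch or from the Cuneiform sources: the names out_format, OUT_HTML, OUT_HOCR, glyph_box and append_glyph are made up for illustration, and the span markup only assumes the usual hOCR convention of a title attribute of the form "bbox x0 y0 x1 y1".

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical format selector. Cuneiform's real constants
   (PUMA_TOHTML, ROUT_FMT_HTML) are not reproduced here. */
enum out_format { OUT_HTML, OUT_HOCR };

/* Bounding box of one recognized glyph, in pixels (assumed layout). */
struct glyph_box { int x0, y0, x1, y1; };

/* Append one glyph to buf, a NUL-terminated buffer of capacity size.
   Returns 0 on success, -1 if the output did not fit (buf may then
   hold a truncated fragment). Escaping of <, > and & is omitted. */
int append_glyph(char *buf, size_t size, enum out_format fmt,
                 char ch, struct glyph_box box)
{
    size_t used = strlen(buf);
    int n;

    assert(used < size);
    if (fmt == OUT_HOCR)
        /* hOCR-like output: wrap the glyph in a span carrying its box. */
        n = snprintf(buf + used, size - used,
                     "<span title=\"bbox %d %d %d %d\">%c</span>",
                     box.x0, box.y0, box.x1, box.y1, ch);
    else
        /* Plain HTML output: just the character itself. */
        n = snprintf(buf + used, size - used, "%c", ch);

    return (n >= 0 && (size_t)n < size - used) ? 0 : -1;
}
```

With this shape only the writer has to know about the format; everything above it just passes the selector through, which is the part that touches the source files using PUMA_TOHTML and ROUT_FMT_HTML.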

> Have fun, patches and inspiration welcome,

I see that you added line feeds after HTML tags to make the output
easier to read. There is a preprocessor macro NEW_LINE for this. Yes,
it is slightly brain-dead but should probably be used for consistency.
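
For illustration, this is roughly the pattern I mean. The macro below is only a stand-in that I wrote for this mail: it takes arguments and appends to a buffer, which the real NEW_LINE in the Cuneiform sources may well not do, and append_text and write_paragraph are hypothetical helpers.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Append text to buf, a NUL-terminated buffer of capacity size.
   Returns 0 on success, -1 if the text does not fit. */
int append_text(char *buf, size_t size, const char *text)
{
    size_t used, need;

    assert(buf != NULL && text != NULL);
    used = strlen(buf);
    need = strlen(text);
    if (used + need + 1 > size)
        return -1;
    memcpy(buf + used, text, need + 1);  /* copies the terminating NUL too */
    return 0;
}

/* Stand-in for the writer's line-break macro (assumed definition). */
#define NEW_LINE(buf, size) append_text((buf), (size), "\n")

/* Emit one paragraph, ending the line through the macro rather than
   a hard-coded "\n", so all line breaks go through one place. */
int write_paragraph(char *buf, size_t size, const char *text)
{
    if (append_text(buf, size, "<p>") != 0) return -1;
    if (append_text(buf, size, text) != 0) return -1;
    if (append_text(buf, size, "</p>") != 0) return -1;
    return NEW_LINE(buf, size);
}
```

The point is only that every line break goes through the same macro, so the line ending can be changed in one spot.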


