cuneiform team mailing list archive

Hocr output status and identified improvements.

To: "cuneiform@xxxxxxxxxxxxxxxxxxx" <cuneiform@xxxxxxxxxxxxxxxxxxx>
From: julien <julien@xxxxxxxxxxxxxxxxxxx>
Date: Thu, 1 Oct 2009 11:15:42 +0000

I am currently using v0.8.

From testing and reading the code, I concluded that the hOCR output can produce these tags:
<br>, <p>, <img>, <b>, <i>, <u> and <span>.

Each <span> wraps a single character, and that character's bounding box is given as an attribute of the span.

My current problem is that I cannot split the characters into strings.
It would be possible to implement a heuristic on top of the output (e.g. "finding whitespace" by looking at "connected bboxes"),
but a better option would be to modify the hOCR output code to support the ocr_line standard.
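
For illustration, here is a minimal sketch of what such a heuristic could look like. It assumes the per-character boxes have already been parsed into (char, x0, y0, x1, y1) tuples in pixel coordinates with the origin at the top left. The function names, the 50% vertical-overlap rule and the gap_ratio threshold are hypothetical choices for this sketch and are not taken from the Cuneiform code; only the ocr_line class and the bbox property come from the hOCR specification.

```python
import html
from typing import List, Tuple

# (character, x0, y0, x1, y1); assumed layout, origin at top-left.
CharBox = Tuple[str, int, int, int, int]


def group_into_lines(chars: List[CharBox]) -> List[List[CharBox]]:
    """Put characters whose vertical extents overlap into the same line."""
    lines: List[List[CharBox]] = []
    for ch in sorted(chars, key=lambda c: (c[2], c[1])):
        for line in lines:
            top = min(c[2] for c in line)
            bottom = max(c[4] for c in line)
            overlap = min(bottom, ch[4]) - max(top, ch[2])
            # Assumption: more than half of the character's height must overlap.
            if overlap > 0.5 * (ch[4] - ch[2]):
                line.append(ch)
                break
        else:
            lines.append([ch])
    # Order each line left to right.
    return [sorted(line, key=lambda c: c[1]) for line in lines]


def split_into_words(line: List[CharBox], gap_ratio: float = 0.3) -> List[str]:
    """Start a new word where the horizontal gap exceeds gap_ratio * line height."""
    if not line:
        return []
    height = max(c[4] for c in line) - min(c[2] for c in line)
    words = [line[0][0]]
    for prev, cur in zip(line, line[1:]):
        gap = cur[1] - prev[3]  # space between neighbouring boxes
        if gap > gap_ratio * height:
            words.append(cur[0])
        else:
            words[-1] += cur[0]
    return words


def ocr_line_span(line: List[CharBox]) -> str:
    """Render one line as an hOCR ocr_line span with the union bounding box."""
    x0 = min(c[1] for c in line)
    y0 = min(c[2] for c in line)
    x1 = max(c[3] for c in line)
    y1 = max(c[4] for c in line)
    text = html.escape(" ".join(split_into_words(line)))
    return f'<span class="ocr_line" title="bbox {x0} {y0} {x1} {y1}">{text}</span>'
```

Calling group_into_lines on the parsed boxes and then ocr_line_span on each resulting line gives one span per text line. The thresholds would need tuning per font and scan resolution, which is part of why emitting ocr_line directly from the recognizer is the more robust route.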

I was about to start modifying the code when I noticed there was a patch to handle ocr_line.
https://lists.launchpad.net/cuneiform/msg00269.html

However, it seems this patch was not merged into v0.8.
Is there a reason it was not merged into v0.8?
What would be the most appropriate way for me to contribute back any effort?
(I suppose starting off from v0.8 and then, once it is stable, asking to have it merged?)

As for which reference should be used to standardize the hOCR output, would it be the following?
http://docs.google.com/View?docid=dfxcv4vc_67g844kf
(It is from 2007.)

Regards
Julien