cuneiform team mailing list archive

Re: Hocr output status and identified improvements.

 

FYI, these are my plans, if you have suggestions please let me know.

My goal is to standardize the hocr output as much as possible.
From what I understand, the format originates from the authors of ocropus.
The standard referred to by ocropus is:
http://docs.google.com/View?docid=dfxcv4vc_67g844kf

Goal: parsing should be the same regardless of whether ocropus or cuneiform produced the hocr.

Current issues with the output from cuneiform's html option:
* the text in <span class='ocr_line'> is empty
* the characters are instead children of the ocr_line tag, each with its own bbox
This is against the hocr spec.

This prevents the output of whitespace (string separators).
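
To make the difference concrete, here is a minimal sketch of a parser that reads ocr_line elements the same way for both shapes of output. It uses only Python's standard html.parser. The two sample fragments, the helper names (parse_bbox, LineExtractor, extract_lines) and the exact markup of the per-character children are assumptions for illustration, not actual cuneiform or ocropus output.

```python
from html.parser import HTMLParser


def parse_bbox(title):
    """Return (x0, y0, x1, y1) from an hOCR title attribute, or None."""
    for prop in (title or "").split(";"):
        parts = prop.split()
        if len(parts) == 5 and parts[0] == "bbox":
            return tuple(int(v) for v in parts[1:])
    return None


class LineExtractor(HTMLParser):
    """Collect (bbox, text) for every span whose class includes ocr_line."""

    def __init__(self):
        super().__init__()
        self.lines = []      # finished (bbox, text) pairs
        self._depth = 0      # span nesting depth inside the current ocr_line
        self._bbox = None
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        attrs = dict(attrs)
        if self._depth:
            # A span nested inside the line, e.g. a per-character child.
            self._depth += 1
        elif "ocr_line" in (attrs.get("class") or "").split():
            self._depth = 1
            self._bbox = parse_bbox(attrs.get("title"))
            self._chunks = []

    def handle_endtag(self, tag):
        if tag != "span" or not self._depth:
            return
        self._depth -= 1
        if self._depth == 0:
            self.lines.append((self._bbox, "".join(self._chunks)))

    def handle_data(self, data):
        if self._depth:
            self._chunks.append(data)


def extract_lines(hocr):
    parser = LineExtractor()
    parser.feed(hocr)
    parser.close()
    return parser.lines


# Hypothetical fragment with the text directly inside the line element.
TEXT_IN_LINE = "<span class='ocr_line' title='bbox 10 20 200 40'>to be</span>"

# Hypothetical fragment with one child span per character and no text nodes.
CHARS_AS_CHILDREN = (
    "<span class='ocr_line' title='bbox 10 20 200 40'>"
    "<span title='bbox 10 20 20 40'>t</span>"
    "<span title='bbox 22 20 32 40'>o</span>"
    "<span title='bbox 50 20 60 40'>b</span>"
    "<span title='bbox 62 20 72 40'>e</span>"
    "</span>"
)

print(extract_lines(TEXT_IN_LINE))
print(extract_lines(CHARS_AS_CHILDREN))
```

With the second fragment the parser can only concatenate the characters, so the space between the two words is lost unless the consumer reconstructs it from the gaps between the bboxes.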

Some other minor issues exist as well. I have also noticed a probability attribute in the cuneiform source.
I hope to be able to include the (negative) log probability per character in the output.
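
As a rough illustration of what I mean by negative log probability, here is a small sketch. It assumes the probability is available as a value in the range (0, 1] and uses the natural logarithm. I have not yet checked what scale the attribute in the cuneiform source actually uses, so both the function names and the input range are assumptions.

```python
import math


def neg_log_prob(p):
    """Return -ln(p) for a probability p with 0 < p <= 1."""
    if not 0.0 < p <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return -math.log(p)


def line_cost(char_probs):
    """Sum the per-character costs; lower means a more confident line."""
    return sum(neg_log_prob(p) for p in char_probs)


# Hypothetical per-character probabilities for a three-character word.
print(line_cost([0.9, 0.8, 0.99]))
```

A certain character (p = 1) costs 0, and the costs of independent characters add up, which is why the log form is convenient to store per character.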

I will work in this branch.
lp:~julien-student/cuneiform-linux/hocroutput


Regards
Julien

________________________________________
From: julien
Sent: 1 October 2009 18:17
To: cuneiform@xxxxxxxxxxxxxxxxxxx
Subject: RE: [Cuneiform] Hocr output status and identified improvements.

I created a branch, hope I did it right.
lp:~julien-student/cuneiform-linux/hocroutput

I have tested and compared the output with the v0.8 release.
Visually, the bounding boxes are correct. No text is missing. The typography output is exactly the same (bold, italics, etc.). I tried it on 4 different images.
I would say it could be merged with the main branch.

I used the minimum amount required from the patch by Dmitry Polevoy to make this work.
It ended up being only the html.cpp file.
https://lists.launchpad.net/cuneiform/msg00269.html

Regards
Julien

________________________________________
From: Jussi Pakkanen [jpakkane@xxxxxxxxx]
Sent: 1 October 2009 13:10
To: julien
Cc: cuneiform@xxxxxxxxxxxxxxxxxxx
Subject: Re: [Cuneiform] Hocr output status and identified improvements.

On Thu, Oct 1, 2009 at 1:58 PM, julien <julien@xxxxxxxxxxxxxxxxxxx> wrote:

> I was about to start modifying the code when I noticed there was a patch to handle ocr_line.
> https://lists.launchpad.net/cuneiform/msg00269.html
>
> However, it seems this patch was not merged into v0.8.
> Is there a reason why it was not merged into v0.8?

I was under the impression that he was going to submit an even better
version. Since nothing happened I probably just forgot about it.

> What would be the most appropriate way for me to contribute back any effort?
> (I suppose starting off from v0.8 and then, once stable, asking to have it merged?)

Check out the newest code from Bazaar and work against that. Plain
patches against bzr head are fine. You can use the fancier options
that bzr offers if you feel like it.

> As for what reference should be used to standardize the hocr, would it be the following reference?
> http://docs.google.com/View?docid=dfxcv4vc_67g844kf

I actually don't know. Anyone?


