cuneiform team mailing list archive

Re: Patch to extend hOCR output

 

On Sun, Feb 22, 2009 at 5:01 PM, Dmitry Polevoy
<openocr.polevoy@xxxxxxxxx> wrote:

> The initial version of the hOCR output was created by Rene Rebe (look at the
> history of \cuneiform-linux\cuneiform_src\Kern\rout\src\html.cpp) and I am
> not a specialist in the HTML encoding format.

I added the UTF-8 encoding part. The writer always outputs UTF-8
because it is the recommended encoding for HTML, and Unicode covers
all the letters, so there is no need to add support for legacy
character sets. I guess we could change the HTML writer function so
that it no longer accepts output charset information at all. Currently
the only caller is the Cuneiform command-line binary, which always
passes UTF-8 as the output format.
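
To illustrate the idea, here is a minimal sketch of what a writer
without a charset argument could look like. The function name and
signature are made up for this example and are not the ones in
html.cpp; it also assumes the title passed in is already UTF-8 and
HTML-escaped.

```cpp
#include <string>

// Hypothetical helper, NOT the actual function in html.cpp.
// It takes no charset parameter: the document is always declared as UTF-8,
// so callers cannot request a legacy character set.
std::string html_header_utf8(const std::string& title) {
    std::string out;
    out += "<html>\n<head>\n";
    // Declare the encoding unconditionally.
    out += "<meta http-equiv=\"Content-Type\" "
           "content=\"text/html; charset=utf-8\">\n";
    // Assumption: title is already valid UTF-8 and HTML-escaped.
    out += "<title>" + title + "</title>\n";
    out += "</head>\n<body>\n";
    return out;
}
```

With something like this the command-line binary would just call
html_header_utf8("...") and there would be no charset value left to
pass around.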


