cuneiform team mailing list archive

Re: Patch to extend hOCR output

To: cuneiform@xxxxxxxxxxxxxxxxxxx
From: Dmitry Polevoy <openocr.polevoy@xxxxxxxxx>
Date: Sun, 22 Feb 2009 18:01:39 +0300
In-reply-to: <1235313763.7780.20.camel@mypride>

I think we should use Doxygen, and I am running some experiments with it 8)
I agree with you: we need to introduce some commenting guidelines.

Converting to UTF-8 is a question to discuss. I have no experience with
UTF-8 source code.
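
To make the conversion concrete, here is a rough sketch of what recoding
CP1251 to UTF-8 does for the Russian letters. The function names are made up
for illustration, and only ASCII and the Russian alphabet are mapped; for the
real source tree a tool like iconv, as mentioned in the quoted message, would
be the practical choice.

```cpp
#include <cassert>
#include <string>

// Append the UTF-8 encoding of a code point below U+0800 (one or two bytes).
static void appendUtf8(std::string& out, unsigned cp) {
    assert(cp < 0x800);  // this sketch never needs three-byte sequences
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Convert CP1251 text to UTF-8 (hypothetical helper, not part of Cuneiform).
// Only ASCII and the Russian letters are mapped; any other byte becomes '?'.
std::string cp1251ToUtf8(const std::string& in) {
    std::string out;
    for (unsigned char b : in) {
        if (b < 0x80) {
            appendUtf8(out, b);                    // ASCII is unchanged
        } else if (b >= 0xC0) {
            appendUtf8(out, 0x0410 + (b - 0xC0));  // 0xC0..0xFF -> U+0410..U+044F
        } else if (b == 0xA8) {
            appendUtf8(out, 0x0401);               // capital letter IO
        } else if (b == 0xB8) {
            appendUtf8(out, 0x0451);               // small letter io
        } else {
            out += '?';                            // not covered by this sketch
        }
    }
    return out;
}
```

Calling cp1251ToUtf8 on the contents of a comment would give text that any
UTF-8 editor can display, which is the point of the proposed conversion.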

The initial version of the hOCR output was created by Rene Rebe (look at the
history of \cuneiform-linux\cuneiform_src\Kern\rout\src\html.cpp), and I am
not a specialist in HTML encoding.
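
On the question of setting the charset when the output is not UTF-8, one
possible approach is a small lookup from the active output code to a charset
name. This is only a sketch: the enum and function names below are
hypothetical stand-ins, not what html.cpp actually contains, and only the
UTF-8 case comes from the patch.

```cpp
#include <string>

// Hypothetical stand-in for the active output code; only the UTF-8 case
// appears in the patch, the other values are assumptions.
enum OutputCode { CODE_UTF8, CODE_CP1251, CODE_KOI8R, CODE_UNKNOWN };

// Map an output code to its charset name, or nullptr if it is not known.
const char* charsetName(OutputCode code) {
    switch (code) {
        case CODE_UTF8:   return "utf-8";
        case CODE_CP1251: return "windows-1251";
        case CODE_KOI8R:  return "koi8-r";
        default:          return nullptr;
    }
}

// Build the meta tag for the given output code. Returns an empty string for
// an unknown code, so no wrong charset is ever declared.
std::string charsetMeta(OutputCode code) {
    const char* name = charsetName(code);
    if (name == nullptr) {
        return std::string();
    }
    return std::string("<meta http-equiv=\"Content-Type\""
                       " content=\"text/html;charset=") + name + "\" >";
}
```

The caller would write the returned string to the output stream only when it
is not empty, which keeps the current behaviour for UTF-8 and extends it to
the other encodings.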

2009/2/22 Yury V. Zaytsev <yury@xxxxxxxxxx>

> Hello, Dmitry!
>
> On Sun, 2009-02-22 at 00:16 +0300, Dmitry Polevoy wrote:
> > Patch to extend hOCR output. Lines info can be useful for OCR testing.
>
> Looks cool! Would you please clarify a few things?
>
>    * What's this stuff:
>
> +/*!
> +\brief \~english Put stream bufer into buffer for OCR results.
> +       \~russian
> +                  .
> +*/
>
>      Looks like the comments are in Doxygen format (I've been using
> JavaDoc for quite some time so it's only a guess)? Also, it seems that
> Russian comments are in CP1251, which brings me to the following
> questions:
>
>        - What do you guys think about converting all of the source
> files to UTF-8? The licence statement and the comments in Russian can
> currently be read only under the Russian edition of Windows (or any other
> Windows workstation set to use the CP1251 locale, which is often not
> the case).
>
> They are painful to decipher under any other system (one needs to run
> iconv -f cp1251 -t utf8 and then recode it back) and can easily be
> corrupted by a non-Russian-speaking developer if the wrong encoding is set...
>
> I think we definitely need to do this before we get some UTF-8-encoded
> stuff in, after which it will be quite difficult to recover.
>
>        - Don't you think we need to introduce some commenting
> guidelines? If it's Doxygen, then it's Doxygen, not that I really care
> about specific choice, but I feel we need to be coherent in this regard.
>
>    * Also the encoding is only set if it's utf-8:
>
> + if (gActiveCode==ROUT_CODE_UTF8)
> + {
> +         outStrm << "<meta http-equiv=\"Content-Type\""
> +                        " content=\"text/html;charset=utf-8\" >" << endl;
> + }
>
>      And if it's not? Is there a way to put the correct encoding in?
>
> --
> Sincerely yours,
> Yury V. Zaytsev

Follow ups

Re: Patch to extend hOCR output
From: Jussi Pakkanen, 2009-03-20

References

Patch to extend hOCR output
From: Dmitry Polevoy, 2009-02-22
Re: Patch to extend hOCR output
From: Yury V. Zaytsev, 2009-02-22