
calibre-devs team mailing list archive

Re: Initial chm2lrf implementation!

 

Hi Alex,

Thanks for the code, I'll look at it in the next couple of days and give you 
my suggestions. One way to do an end run around bad HTML is to convert to EPUB 
instead of LRF, at least if you have a PRS-505. 

One quick suggestion: I see you error out on encoding detection problems. 
BeautifulSoup's encoding detection is somewhat substandard. Try using the 
xml_to_unicode function from calibre.ebooks.chardet instead. 
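
A rough sketch of what that swap might look like, not taken from the attached script. The helper name decode_html and the fallback encodings are hypothetical, and xml_to_unicode is assumed here to return a (text, encoding) pair; it is only importable where calibre is installed, so the sketch degrades to a plain decode chain elsewhere.

```python
def decode_html(raw):
    """Decode raw HTML bytes and return (text, encoding).

    Hypothetical helper: prefers calibre's xml_to_unicode when calibre
    is importable, otherwise falls back to a simple decode chain.
    """
    try:
        # The function named in the message above; needs calibre installed.
        from calibre.ebooks.chardet import xml_to_unicode
    except ImportError:
        xml_to_unicode = None

    if xml_to_unicode is not None:
        # Assumption: the result starts with (decoded text, detected encoding).
        result = xml_to_unicode(raw)
        return result[0], result[1]

    # Fallback (an assumption, not calibre's algorithm): try strict
    # decodes in order instead of raising on the first failure.
    for encoding in ("utf-8", "cp1252"):
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # latin-1 maps every byte value, so this last step cannot fail.
    return raw.decode("latin-1"), "latin-1"
```

The point of the wrapper is that a page with a missing or wrong charset declaration still yields text to convert, rather than aborting the whole CHM.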

Once the code stabilizes, I'll be happy to include it with calibre. 

Kovid.

On Sunday 19 October 2008 09:35:22 Alex Bramley wrote:
> Hello list,
>
> Attached is a piece of code i've been messing with on-and-off for the
> last couple of weekends, which uses PyCHM/CHMlib to implement a CHM ->
> LRF converter. The output isn't *great* as of yet; the HTML inside CHM
> files is often quite nasty (lots and lots of nested tables, no
> standardisation or metadata, etc), and i've found it often just causes
> my prs-505 to reboot itself. This code works as a stand-alone script
> as long as calibre is installed -- just run "./chm2lrf.py -o
> output.lrf mychm.chm" and leave it to do its funky thing.
>
> As my python is very rusty -- i'm a SysAdmin by trade, so generally
> hack perl ;p -- the code could probably use some tidying, any hints or
> suggestions for improvement would be much appreciated. More work is
> definitely needed to clean up the HTML extracted from the CHM file,
> but i'm not quite sure what kind of markup is permitted inside LRF
> files, so i'm not sure where to focus effort from here onwards.
> Removing tables is definitely required, I regularly get problems with
> "table too large" kind of errors from html2lrf...
>
> I've also started idling in #calibre on freenode, my nick is
> "fluffle". No-one else is there full-time as of yet, though someone
> did drop by temporarily. Come say hi! ;)
>
> --alex
>
>

-- 
_____________________________________

Kovid Goyal  MC 452-48
California Institute of Technology
1200 E California Blvd
Pasadena, CA 91125

cell  : +01 626 390 8699
office: +01 626 395 6595 (449 Lauritsen)
email : kovid@xxxxxxxxxxxxxxxxxx
web   : http://www.kovidgoyal.net
_____________________________________


