calibre-devs team mailing list archive

Re: Initial chm2lrf implementation!

To: calibre-devs@xxxxxxxxxxxxxxxxxxx
From: Kovid Goyal <kovid@xxxxxxxxxxxxxx>
Date: Wed, 22 Oct 2008 12:17:42 -0700
In-reply-to: <8cffb8c80810190935l4639b7e7m220c8ef47e463869@mail.gmail.com>
Organization: Caltech
User-agent: KMail/1.10.90 (Linux/2.6.26-gentoo-r1; KDE/4.1.68; i686; ; )

I've looked over your code. It seems to be fine for the most part. In the
GetFile method you should use os.path.isabs to test for absolute paths in a
platform-neutral way.
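
A minimal sketch of the os.path.isabs check follows. The helper name
resolve_member and its base_dir argument are hypothetical and only stand in
for whatever GetFile actually receives:

```python
import os.path

def resolve_member(base_dir, name):
    """Return a filesystem path for name (hypothetical helper, not from chm2lrf)."""
    # os.path.isabs applies the rules of the platform the script runs on,
    # so there is no need to test for a leading '/' by hand.
    if os.path.isabs(name):
        return name
    # Relative names are resolved against the given base directory.
    return os.path.join(base_dir, name)
```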

In _reformat you should use xml_to_unicode to convert to unicode. Also, when
rewriting img links, you should check for existence of the new link
destination with os.path.exists (I'm assuming that _reformat happens after the
rest of the extraction procedure). You also use '/' as a path separator in
several places; you should switch to using os.sep for platform neutrality
(except when parsing URLs, of course).
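
Roughly, the img link handling could look like the sketch below. The function
name img_src_to_path and the base_dir argument (the directory the CHM was
extracted into) are assumptions for illustration; decoding the markup is not
shown here.

```python
import os

def img_src_to_path(base_dir, src):
    """Map a '/'-separated img src to a local path, or None if it is missing.

    Hypothetical helper: base_dir is assumed to be the extraction directory.
    """
    # URLs always use '/', so split on it explicitly, then let os.path.join
    # insert the separator of the current platform (os.sep).
    parts = [p for p in src.split('/') if p not in ('', '.')]
    if not parts:
        return None
    candidate = os.path.join(base_dir, *parts)
    # Only rewrite the link if the target was actually extracted.
    return candidate if os.path.exists(candidate) else None
```

A caller would leave the original src untouched whenever this returns None.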

To move towards integrating with the rest of calibre (if you want to do that) 
you should add a get_metadata function that accepts a stream (like a file 
object) and returns a calibre.ebooks.metadata.MetaInformation object that has 
all known metadata. Look at the modules in calibre.ebooks.metadata for many 
examples.

You also need a method that accepts the path to the CHM file and the path to a
temporary directory. It should extract the CHM into the temporary directory,
create an OPF file in that directory with all metadata filled in and a <spine>
element that lists the files in linear order. See
calibre.ebooks.metadata.opf2.OPFCreator to help with the creation of the OPF
file. This is essentially chm2oeb.
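
To illustrate the shape of the result only, here is a sketch that writes a
bare-bones OPF with a manifest and a <spine> using just the standard library.
It is a stand-in for OPFCreator, not its interface; the function name
write_minimal_opf, the item ids and the default file name metadata.opf are
made up for the example, and the CHM extraction step is assumed to have
already produced the listed files.

```python
import os
import xml.etree.ElementTree as ET

OPF_NS = 'http://www.idpf.org/2007/opf'
DC_NS = 'http://purl.org/dc/elements/1.1/'

def write_minimal_opf(tdir, title, html_files, opf_name='metadata.opf'):
    """Write an OPF into tdir listing html_files (relative paths) in reading order."""
    ET.register_namespace('', OPF_NS)
    ET.register_namespace('dc', DC_NS)
    package = ET.Element('{%s}package' % OPF_NS, {'version': '2.0'})
    metadata = ET.SubElement(package, '{%s}metadata' % OPF_NS)
    ET.SubElement(metadata, '{%s}title' % DC_NS).text = title
    manifest = ET.SubElement(package, '{%s}manifest' % OPF_NS)
    spine = ET.SubElement(package, '{%s}spine' % OPF_NS)
    for i, rel in enumerate(html_files):
        item_id = 'item%d' % i  # hypothetical id scheme
        # hrefs in an OPF are URLs, so they use '/' regardless of platform.
        href = rel.replace(os.sep, '/')
        ET.SubElement(manifest, '{%s}item' % OPF_NS,
                      {'id': item_id, 'href': href, 'media-type': 'text/html'})
        # The order of itemref elements is the linear reading order.
        ET.SubElement(spine, '{%s}itemref' % OPF_NS, {'idref': item_id})
    path = os.path.join(tdir, opf_name)
    ET.ElementTree(package).write(path, encoding='utf-8', xml_declaration=True)
    return path
```

A real chm2oeb would fill in far more metadata than the title.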

Kovid.

On Sunday 19 October 2008 09:35:22 Alex Bramley wrote:
> Hello list,
>
> Attached is a piece of code i've been messing with on-and-off for the
> last couple of weekends, which uses PyCHM/CHMlib to implement a CHM ->
> LRF converter. The output isn't *great* as of yet; the HTML inside CHM
> files is often quite nasty (lots and lots of nested tables, no
> standardisation or metadata, etc), and i've found it often just causes
> my prs-505 to reboot itself. This code works as a stand-alone script
> as long as calibre is installed -- just run "./chm2lrf.py -o
> output.lrf mychm.chm" and leave it to do its funky thing.
>
> As my python is very rusty -- i'm a SysAdmin by trade, so generally
> hack perl ;p -- the code could probably use some tidying, any hints or
> suggestions for improvement would be much appreciated. More work is
> definitely needed to clean up the HTML extracted from the CHM file,
> but i'm not quite sure what kind of markup is permitted inside LRF
> files, so i'm not sure where to focus effort from here onwards.
> Removing tables is definitely required, I regularly get problems with
> "table too large" kind of errors from html2lrf...
>
> I've also started idling in #calibre on freenode, my nick is
> "fluffle". No-one else is there full-time as of yet, though someone
> did drop by temporarily. Come say hi! ;)
>
> --alex

-- 
_____________________________________

Kovid Goyal  MC 452-48
California Institute of Technology
1200 E California Blvd
Pasadena, CA 91125

cell  : +01 626 390 8699
office: +01 626 395 6595 (449 Lauritsen)
email : kovid@xxxxxxxxxxxxxxxxxx
web   : http://www.kovidgoyal.net
_____________________________________

References

Initial chm2lrf implementation!
From: Alex Bramley, 2008-10-19