calibre-devs team mailing list archive

Branch lp:~llasram/calibre/oeb2lit

To: calibre-devs <calibre-devs@xxxxxxxxxxxxxxxxxxx>
From: "Marshall T. Vandegrift" <llasram@xxxxxxxxx>
Date: Tue, 9 Dec 2008 12:04:27 -0500

Kovid etc.,

I've pushed the current state of my oeb2lit code to a new launchpad
branch at lp:~llasram/calibre/oeb2lit.  I don't think it's /quite/ ready
to merge with the trunk, but the basic functionality is implemented and
integrated.  Issues for discussion:

  - The anchor-hashing algorithm is still unknown.  Without it, links
    into individual HTML streams with more than 6 anchors do not
    work.

  - I integrated the LZX compression code by the somewhat unorthodox
    method of exposing the function addresses as Python `long' values,
    then binding them in Python with the ctypes FFI.  This seems
    reasonable to me, and greatly simplifies providing an OO interface
    to the decompression capabilities, but one way or another the
    compression and decompression code should be brought into parity.

  - I modified the LitReader to normalize URI encoding in extracted
    markup.  This isn't immediately relevant for LIT generation, but I
    did it for parity with the normalization I do on oeb2lit input.
    This makes extracted markup more technically correct, but it is a
    change in behavior.

  - LIT-to-LIT round-tripping does not currently work without whitespace
    corruption.  The issue is that in LIT files -- contrary to normal
    HTML rules -- all whitespace is considered relevant.  To help strip
    unnecessary whitespace I'm using an lxml parser with
    strip_space=True.  Unfortunately, this occasionally strips relevant
    whitespace from LIT-extracted markup -- oops!  I've got a few ideas
    but haven't had a chance to play around with them yet.
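
For anyone unfamiliar with the address-passing approach in the second
item, here is a minimal sketch of the general idea.  It is not the code
in the branch: the names BINOP, exported_address and Adder are
hypothetical, a ctypes callback stands in for the compiled LZX
function, and it is written for Python 3, where `long' is simply int.

```python
import ctypes

# Prototype shared by the "C side" and the Python binding: int f(int, int).
BINOP = ctypes.CFUNCTYPE(ctypes.c_int, ctypes.c_int, ctypes.c_int)


def _add(a, b):
    return a + b


# Stand-in for a function compiled into an extension module.  The
# reference must stay alive for as long as the address is in use.
_native_add = BINOP(_add)


def exported_address():
    """Return the raw function address as a plain Python integer.

    Assumption: a real extension module would expose the address of
    its C function in some similar way.
    """
    return ctypes.cast(_native_add, ctypes.c_void_p).value


class Adder:
    """Small OO wrapper that binds a raw address through ctypes."""

    def __init__(self, address):
        # Calling a CFUNCTYPE prototype with an integer address yields
        # a callable foreign function located at that address.
        self._fn = BINOP(address)

    def add(self, a, b):
        return self._fn(a, b)


adder = Adder(exported_address())
result = adder.add(2, 3)
```

The wrapper class only ever sees an integer, so the same binding code
would work whether the address came from an extension module or from a
shared library loaded some other way.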

So there you go.  Please let me know if you have any comments so far.

-Marshall

Follow ups

Re: Branch lp:~llasram/calibre/oeb2lit
From: Kovid Goyal, 2008-12-09