calibre-devs team mailing list archive
Message #00021
Branch lp:~llasram/calibre/oeb2lit
Kovid etc.,
I've pushed the current state of my oeb2lit code to a new launchpad
branch at lp:~llasram/calibre/oeb2lit. I don't think it's /quite/ ready
to merge with the trunk, but the basic functionality is implemented and
integrated. Issues for discussion:
- The anchor-hashing algorithm is still unknown. Without it,
links into individual HTML streams with more than 6 anchors do not
work.
- I integrated the LZX compression code by the somewhat unorthodox
method of exposing the function addresses as Python `long's, then
binding them in Python with the ctypes FFI. This seems
reasonable to me, and greatly simplifies providing an OO interface
to the decompression capabilities, but one way or another the
compression and decompression code should be brought into parity.
- I modified the LitReader to normalize URI encoding in extracted
markup. This isn't immediately relevant for LIT-generation, but I
did it for parity with the normalization I do on oeb2lit input.
This makes extracted mark-up more technically correct, but it is a
change in behavior.
- LIT-to-LIT round-tripping does not currently work without whitespace
corruption. The issue is that in LIT files -- contrary to normal
HTML rules -- all whitespace is considered relevant. To help strip
unnecessary whitespace I'm using an lxml parser with
strip_space=True. Unfortunately, this occasionally strips relevant
whitespace from LIT-extracted markup -- oops! I've got a few ideas
but haven't had a chance to play around with them yet.
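For anyone unfamiliar with the ctypes trick mentioned in the second item, here is a minimal sketch of the general idea, not the code in the branch. It is written for Python 3, where the address is a plain int rather than a `long'. The add() signature and every name in it are made up for illustration, and a Python callback stands in for the compiled LZX functions so the example runs without a C extension.

```python
import ctypes

# Hypothetical C signature used only for illustration: int add(int, int).
# The real LZX entry points have different signatures.
ADD_PROTO = ctypes.CFUNCTYPE(ctypes.c_int, ctypes.c_int, ctypes.c_int)


def _py_add(a, b):
    return a + b


# Stand-in for a compiled function: wrapping a Python callable gives it a
# real native address. This object must stay alive while the address is used.
_native_add = ADD_PROTO(_py_add)

# What an extension module would export: the address as a plain integer.
address = ctypes.cast(_native_add, ctypes.c_void_p).value

# Binding side: rebuild a callable from nothing but the integer address.
add = ADD_PROTO(address)
result = add(2, 3)
```

The binding side only needs the integer and a matching prototype, which is what makes it easy to wrap the bound functions in ordinary Python classes. The cost is that a prototype that does not match the real C signature will fail at call time rather than at import time.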
So there you go. Please let me know if you have any comments so far.
-Marshall