calibre-devs team mailing list archive

Re: Branch lp:~llasram/calibre/oeb2lit

 

On Tue, Dec 9, 2008 at 1:56 PM, Kovid Goyal <kovid@xxxxxxxxxxxxxx> wrote:

> I get a "not a branch" error when trying to check out the code. And on
> launchpad it says "This branch has not been pushed to yet"

Well, that's a bug in launchpad and/or bazaar.  I first tried to branch
from lp:calibre to lp:~llasram/calibre/oeb2lit and it ground for a bit
before erroring out.  Then I just branched from my local oeb2lit branch
to lp:~llasram/calibre/oeb2lit and it reported success.  I suppose I
should have been suspicious when I tried to merge->push the most recent
trunk changes and it said everything was already up-to-date.

> I have no problems with exposing function pointers to ctypes in
> principle, but will that technique be portable across compilers?

I don't see why it shouldn't be.  I'm giving ctypes exactly what it
would get if it found the address via a library symbol lookup.  Even if
this were on an architecture like Alpha or IA-64 with crazy large
address space trampolines it should still work just fine.
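
To make the idea concrete, here is a minimal sketch of calling a function
through ctypes from a raw address rather than a symbol lookup.  It is not
the code in the branch: using Py_GetVersion from the running interpreter
as the target, and the names `address', `prototype' and `get_version',
are assumptions chosen only so the example is self-contained.

```python
import ctypes
import sys

# Stand-in for an address handed over by an extension module (assumption):
# recover the raw integer address of Py_GetVersion in the running interpreter.
address = ctypes.cast(ctypes.pythonapi.Py_GetVersion, ctypes.c_void_p).value

# Describe the C signature `const char *f(void)` and bind it to that address.
# PYFUNCTYPE keeps the GIL held, the conservative choice for a C-API function.
prototype = ctypes.PYFUNCTYPE(ctypes.c_char_p)
get_version = prototype(address)

# The call goes through the same machinery as a function found by name.
version = get_version().decode("utf-8")
print(version)
```

Once the prototype is bound to the integer, ctypes makes the call the same
way it would for a function it had looked up by name in a library.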

That said, I'm less pleased with it than I was on the bus this morning,
so I'll probably wrap the functions with C/Python bindings after all.

> Why are you using strip_space? To prettify the HTML?

I meant to look up the actual option before I sent the e-mail; it is
`remove_blank_text'.  The issue is that MSReader
treats all whitespace in the markup stream as relevant.  So markup which
is pretty-printed to be like this:

  <div>
    <span>Here is one span</span>
    <span>followed by another</span>
  </div>

comes out rendered like this:

    Here is one span
    followed by another.

I'm currently using `remove_blank_text' plus collapsing sequences of
whitespace to *un*pretty-print, but I'm going to try instead removing
whitespace-only `elem.text's, whitespace-only `elem.tail's of last
children, and whitespace-only `elem.tail's between `display: block'
elements.
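
A rough sketch of that second approach, for illustration only.  It uses
the standard library's ElementTree as a stand-in for lxml (both expose
`text' and `tail'), and it approximates `display: block' with a
hard-coded tag set; `BLOCK_TAGS', `is_blank' and
`strip_insignificant_whitespace' are hypothetical names, not anything in
the branch.

```python
import xml.etree.ElementTree as ET

# Assumption: a fixed tag set stands in for elements styled `display: block`.
BLOCK_TAGS = {"div", "p", "ul", "ol", "li", "h1", "h2", "h3", "blockquote"}

def is_blank(text):
    """True for text that exists but contains only whitespace."""
    return text is not None and text.strip() == ""

def strip_insignificant_whitespace(root):
    for elem in root.iter():
        # 1. Whitespace-only elem.text.
        if is_blank(elem.text):
            elem.text = None
        children = list(elem)
        if not children:
            continue
        # 2. Whitespace-only tail of the last child.
        if is_blank(children[-1].tail):
            children[-1].tail = None
        # 3. Whitespace-only tail between two adjacent block elements.
        for child, following in zip(children, children[1:]):
            if (is_blank(child.tail)
                    and child.tag in BLOCK_TAGS
                    and following.tag in BLOCK_TAGS):
                child.tail = None
    return root

markup = ("<div>\n    <span>Here is one span</span>\n"
          "    <span>followed by another</span>\n  </div>")
root = strip_insignificant_whitespace(ET.fromstring(markup))
result = ET.tostring(root, encoding="unicode")
```

Under these assumptions the whitespace between the two inline spans
survives, while the whitespace just inside the opening and closing <div>
tags is dropped.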

Learning this also impacts how we do LIT-extraction.  Right now
pretty-printing LIT markup uses `remove_blank_text' to make the markup
pretty-printable, which has the aforementioned property of deforming it
in some cases.  I think the easiest, most general solution is to
"protect" any whitespace-only text with a <span/> tag.  The only
downsides are that it makes the extraction somewhat unfaithful to the
source content, and can result in spurious extra <span/>s in books
which, e.g., have a trailing space at the end of every paragraph.

-Marshall


