calibre-devs team mailing list archive

Re: Modularization

To: calibre-devs@xxxxxxxxxxxxxxxxxxx
From: Kovid Goyal <kovid@xxxxxxxxxxxxxx>
Date: Wed, 4 Feb 2009 15:00:29 -0800
In-reply-to: <f337701d0902041358x27eb5564lebd376eeb404475e@mail.gmail.com>
User-agent: KMail/1.11.0 (Linux/2.6.26-gentoo-r3; KDE/4.2.0; i686; ; )
Hmm well my thesis has about half a chapter left (the introduction), so 
hopefully it should be submitted sometime next week. 

In the meantime some random thoughts:

0) The basic structure of the conversion chain is fine, though the devil is in 
the details

0.5) I want to keep the Readers/Writers as simple as possible to make it easy 
for third parties to add plugins for new formats in the future.

1) HTML and CSS should only be parsed once (for speed)

1.1) This will necessitate keeping parsed representations of the entire book 
in memory for the complete conversion chain. For example, since the MOBI 
reader creates a parsed representation of the HTML, the rest of the conversion 
pipeline should be able to use that without having to re-parse. So the 
template of the reader should be modified as follows:
  - Accept stream or pathname
  - Output path to opf file
  - Also optionally output a dictionary that maps absolute path names to
    parsed representations of the contents of the files (i.e. lxml root
    objects and cssutils parsed stylesheets)
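
A minimal sketch of a reader that follows this template. Everything in it 
is an assumption made for illustration: the name read_book, the workdir 
argument, the file names index.html and content.opf, and the placeholder OPF 
contents are all hypothetical, and xml.etree.ElementTree stands in for lxml 
so the sketch runs with the standard library alone.

```python
import os
import xml.etree.ElementTree as ET  # stand-in for lxml in this sketch


def read_book(source, workdir):
    """Hypothetical reader: return (opf_path, parsed).

    `source` is a pathname or an open binary stream holding well-formed
    (X)HTML. `parsed` maps absolute path names to parsed trees so that
    later pipeline stages do not have to re-parse the files.
    """
    # Accept either a pathname or a stream.
    if isinstance(source, (str, os.PathLike)):
        with open(source, "rb") as f:
            raw = f.read()
    else:
        raw = source.read()

    # Write the content file into the working directory.
    html_path = os.path.abspath(os.path.join(workdir, "index.html"))
    with open(html_path, "wb") as f:
        f.write(raw)

    # Write a placeholder OPF (not a spec-compliant one) next to it.
    opf_path = os.path.abspath(os.path.join(workdir, "content.opf"))
    with open(opf_path, "w", encoding="utf-8") as f:
        f.write('<package><manifest>'
                '<item id="main" href="index.html"/>'
                '</manifest></package>')

    # Parse once and hand the tree down the pipeline.
    parsed = {html_path: ET.fromstring(raw)}
    return opf_path, parsed
```

A later stage would look up parsed[path] first and fall back to parsing 
the file from disk only when the reader did not supply a tree for it.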

2) Containers: The zipfile modules (both the one in calibre and the builtin one 
in Python) are rather buggy when it comes to replacing files in zip archives 
(that is why there is a safe_replace method in the calibre zipfile module). So 
I'm not sure how practical this is going to be, though if we can make it work, 
it will be cool. 
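
One conservative way to sidestep in-place replacement is to rewrite the 
whole archive. The sketch below is not calibre's safe_replace; it is a 
hypothetical helper (replace_in_zip) built only on the standard zipfile 
module, assuming the archive is small enough to copy member by member.

```python
import os
import tempfile
import zipfile


def replace_in_zip(zip_path, member_name, new_data):
    """Replace one member by rewriting the archive to a temporary file.

    Hypothetical helper for illustration only. The replaced member ends
    up last in the archive, which matters for formats that care about
    member order.
    """
    target_dir = os.path.dirname(os.path.abspath(zip_path))
    fd, tmp_path = tempfile.mkstemp(suffix=".zip", dir=target_dir)
    os.close(fd)
    try:
        with zipfile.ZipFile(zip_path, "r") as src, \
                zipfile.ZipFile(tmp_path, "w", zipfile.ZIP_DEFLATED) as dst:
            for info in src.infolist():
                if info.filename == member_name:
                    continue  # drop the old copy
                # Passing the ZipInfo keeps each member's own metadata.
                dst.writestr(info, src.read(info.filename))
            dst.writestr(member_name, new_data)
        # Swap the rewritten archive into place in a single step.
        os.replace(tmp_path, zip_path)
    except Exception:
        # Leave the original archive untouched on any failure.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

The original file is only replaced after the new archive has been written 
completely, so a failure part-way through leaves the book intact.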

3) OEBBook: My vote is for having an abstract representation of the book, not 
one that is so closely tied to OEB. Tying ourselves to an internal 
representation that is based on a specification we have no influence over is not 
a good idea. Another reason for this is that the transformation of book -> 
abstract layer -> book increases robustness (at some cost in fidelity). Given 
the general philosophy of calibre, which is to accept arbitrarily bad input 
and do the best job possible, this is a desirable trait. Also the abstraction 
is going to be used by ebook-viewer as well which means it will need support 
for things like bookmarks, annotations, history etc. That said, I haven't 
really looked at OEBBook in detail, so this is not set in stone.
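
To make the idea a little more concrete, here is a rough sketch of what a 
format-neutral book object might carry. The class names (Book, Bookmark) 
and every field are assumptions invented for this example; nothing here 
reflects OEBBook or any existing calibre code.

```python
from dataclasses import dataclass, field


@dataclass
class Bookmark:
    title: str
    spine_index: int  # which spine item the bookmark points into
    fraction: float   # position within that item, from 0.0 to 1.0


@dataclass
class Book:
    """Hypothetical abstract book, independent of any one file format."""
    title: str
    authors: list = field(default_factory=list)
    spine: list = field(default_factory=list)      # ordered content paths
    bookmarks: list = field(default_factory=list)  # viewer state

    def add_bookmark(self, title, spine_index, fraction):
        # Reject bookmarks that do not point at real content.
        if not 0 <= spine_index < len(self.spine):
            raise IndexError("bookmark points outside the spine")
        if not 0.0 <= fraction <= 1.0:
            raise ValueError("fraction must be between 0.0 and 1.0")
        mark = Bookmark(title, spine_index, fraction)
        self.bookmarks.append(mark)
        return mark
```

Readers would fill in such an object, Writers would serialize it, and 
ebook-viewer would use the same object to keep bookmarks and similar state.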

4) Covers: 
We need a sensible way to handle covers. Covers can be of two types: rendered 
and reflowable. Ideally the Readers should output covers in one or both of 
these types. In particular, an EPUB reader should output both and remove the 
cover page from the spine. 

5) Command line Interface
ebook-meta inputfile [options] 
should both read and write metadata, using the metadata plugin system

ebook-convert inputfile outputfile.ext [options]

The available options will change based on the type of inputfile and the type 
of output file. Exactly how this is going to work for both the CLI and the GUI 
is one of those devilish details.
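
One possible shape for the CLI side, sketched with argparse. The option 
tables and both flag names are made up for this example (hypothetical, not 
real calibre options), and the sketch assumes the two file names always come 
first on the command line.

```python
import argparse
import os

# Hypothetical per-format options; real plugins would supply these.
INPUT_OPTIONS = {
    "mobi": [("--mobi-demo-flag", "example option for MOBI input")],
}
OUTPUT_OPTIONS = {
    "epub": [("--epub-demo-flag", "example option for EPUB output")],
}


def extension(path):
    """Return the lower-cased file extension without the dot."""
    return os.path.splitext(path)[1].lstrip(".").lower()


def build_parser(inputfile, outputfile):
    """Build a parser whose options depend on the two file types."""
    parser = argparse.ArgumentParser(prog="ebook-convert")
    parser.add_argument("inputfile")
    parser.add_argument("outputfile")
    options = (INPUT_OPTIONS.get(extension(inputfile), [])
               + OUTPUT_OPTIONS.get(extension(outputfile), []))
    for flag, help_text in options:
        parser.add_argument(flag, action="store_true", help=help_text)
    return parser


def parse(argv):
    """Parse argv, assuming the input and output files come first."""
    if len(argv) < 2:
        raise SystemExit("usage: ebook-convert inputfile outputfile.ext "
                         "[options]")
    return build_parser(argv[0], argv[1]).parse_args(argv)
```

With this layout, parse(["book.mobi", "out.epub", "--epub-demo-flag"]) is 
accepted, while the same flag with an output file of another type is 
rejected by argparse as an unrecognized argument.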

6) Administrative things

I'm going to be developing this in lp:~kovid/calibre/pluginize
so anyone who wants to participate should pull from that branch. 


7) Timeline

a) As a first step I will create ebook-meta. Hopefully it should be done by 
the middle of next week.

b) Start creating the Readers. Hopefully we can arrive at a consensus on the 
design of the Readers by next week. 

c) Once the readers are created we can start work on the container + ebook 
abstraction.

c.5) Migrate ebook-viewer to use the new ebook abstraction

d) Transforms

e) Output format: EPUB, LIT, MOBI and OEB

f) Command line interface

g) Output format: LRF

h) GUI

i) Test suite (ideally this should be developed in parallel with the rest)


Kovid.

On Wednesday 04 February 2009 13:58:20 Marshall T. Vandegrift wrote:
> Hi Kovid etc.,
>
> I'm pretty excited to start refactoring calibre to do conversions in a
> more modular fashion. How's that thesis coming? :-) I'd like to start
> discussing it though.
>
> My basic idea is that all conversion passes from the input format, to an
> internal representation of an OEB book, then to the output
> format. Obviously I think the OEBBook class would be a good candidate
> for the intermediate representation. It needs a bit more work to nicely
> support completely programmatic generation (vs OPF de-serialization) and
> add documentation and test cases, but I'm really happy with how it's
> turned out so far in terms of capturing almost everything expressible in
> OPF, serializing back out to fully spec-compliant OPF, and presenting a
> Pythonic interface to the represented information.
>
> For the flow of content in and out of the OEB representation I see four
> basic duck-types:
>
>   - Readers. These accept a pathname and/or stream and return an OEB. I
>     think they should also provide a default source renderer profile and
>     initial "cleanup" transformation chain, which probably make the most
>     sense as properties of a Reader itself vs. an individual Reader
>     instance.
>
>   - Containers. Provide filesystem-like access to formats which support
>     such access. This isn't a core abstraction, but simplifies Readers
>     which can use them.
>
>   - Transforms. Accept an OEB and a conversion context (source and
>     destination renderer profiles) and modify them in-place.
>
>   - Writers. Accept an OEB, a conversion context, and an output
>     path/stream. Write the output format to the output stream.
>
> The Readers, Transforms, and Writers should all expose any options they
> accept in a stackable, user-exposable fashion.  Then all the current and
> future any2*s become a list of Transforms and a Writer.  Win!
>
> Thoughts?
>
> -Marshall
>
> _______________________________________________
> Mailing list: https://launchpad.net/~calibre-devs
> Post to     : calibre-devs@xxxxxxxxxxxxxxxxxxx
> Unsubscribe : https://launchpad.net/~calibre-devs
> More help   : https://help.launchpad.net/ListHelp
>

-- 
_____________________________________

Kovid Goyal  MC 452-48
California Institute of Technology
1200 E California Blvd
Pasadena, CA 91125

cell  : +01 626 390 8699
office: +01 626 395 6595 (449 Lauritsen)
email : kovid@xxxxxxxxxxxxxxxxxx
web   : http://www.kovidgoyal.net
_____________________________________
References

Modularization
From: Marshall T. Vandegrift, 2009-02-04