calibre-devs team mailing list archive
Message #00102
Re: Modularization
Hmm well my thesis has about half a chapter left (the introduction), so
hopefully it should be submitted sometime next week.
In the meantime some random thoughts:
0) The basic structure of the conversion chain is fine, though the devil is in
the details
0.5) I want to keep the Readers/Writers as simple as possible to make it easy
for third parties to add plugins for new formats in the future.
1) HTML and CSS should only be parsed once (for speed)
1.1) This will necessitate keeping parsed representations of the entire book
in memory for the complete conversion chain. For example, since the MOBI
reader creates a parsed representation of the HTML, the rest of the conversion
pipeline should be able to use that without having to re-parse. So the
template of the reader should be modified as follows:
- Accept stream or pathname
- Output path to opf file
- Also optionally output a dictionary that maps absolute path names to parsed
representations of the contents of the files. (i.e. lxml root objects and
cssutils parsed stylesheets)
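
A minimal sketch of the reader template outlined above, under stated assumptions: the class name ToyHTMLReader, the call signature, and the file names index.html and metadata.opf are all hypothetical, and the standard library's xml.etree stands in for lxml so that the example runs without extra dependencies. It only illustrates the shape of the return value (OPF path plus a dictionary of already-parsed files), not how any real calibre reader works.

```python
import io
import os
import tempfile
import xml.etree.ElementTree as ET  # stand-in for lxml in this sketch


class ToyHTMLReader:
    """Hypothetical reader; names and return shape are assumptions."""

    def __call__(self, source, output_dir):
        # Accept either a pathname or an open binary stream.
        if hasattr(source, "read"):
            raw = source.read()
        else:
            with open(source, "rb") as f:
                raw = f.read()

        html_path = os.path.abspath(os.path.join(output_dir, "index.html"))
        with open(html_path, "wb") as f:
            f.write(raw)

        # Write a token OPF file and return its path.
        opf_path = os.path.abspath(os.path.join(output_dir, "metadata.opf"))
        with open(opf_path, "w", encoding="utf-8") as f:
            f.write('<package><manifest><item href="index.html"/>'
                    '</manifest></package>')

        # Parse once and hand the parsed tree to the rest of the chain,
        # keyed by absolute path so later stages can look it up.
        parsed = {html_path: ET.fromstring(raw)}
        return opf_path, parsed


# Usage: read from a stream, then reuse the parsed tree without re-parsing.
workdir = tempfile.mkdtemp()
stream = io.BytesIO(b"<html><body><p>Hi</p></body></html>")
opf_path, parsed = ToyHTMLReader()(stream, workdir)
html_root = next(iter(parsed.values()))
```

Keying the dictionary by absolute path means a later stage that finds a file in the OPF manifest can check for a parsed copy before touching the disk.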
2) Containers: The zipfile modules (both the one in calibre and the built-in one
in Python) are rather buggy when it comes to replacing files in zip archives
(that is why there is a safe_replace method in the calibre zipfile module). So
I'm not sure how practical this is going to be, though if we can make it work,
it will be cool.
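
One conservative way to replace a file inside a zip archive is to avoid in-place modification and rewrite the whole archive instead. The sketch below uses only the standard library; the function name replace_member is hypothetical, and this is not a description of how calibre's safe_replace is implemented.

```python
import os
import tempfile
import zipfile


def replace_member(zip_path, member_name, new_data):
    """Replace one member by rewriting the archive to a temporary file."""
    target_dir = os.path.dirname(os.path.abspath(zip_path))
    fd, tmp_path = tempfile.mkstemp(suffix=".zip", dir=target_dir)
    os.close(fd)
    try:
        with zipfile.ZipFile(zip_path, "r") as src, \
                zipfile.ZipFile(tmp_path, "w", zipfile.ZIP_DEFLATED) as dst:
            for info in src.infolist():
                if info.filename != member_name:
                    # Copy every other member, keeping its stored metadata.
                    dst.writestr(info, src.read(info.filename))
            # The replaced member ends up last in the new archive.
            dst.writestr(member_name, new_data)
        # Swap the rewritten archive into place in one step.
        os.replace(tmp_path, zip_path)
    except BaseException:
        os.unlink(tmp_path)
        raise


# Usage: build a small archive, then replace one file in it.
workdir = tempfile.mkdtemp()
book = os.path.join(workdir, "book.zip")
with zipfile.ZipFile(book, "w") as zf:
    zf.writestr("a.html", "old")
    zf.writestr("b.css", "p {}")
replace_member(book, "a.html", "new")
with zipfile.ZipFile(book) as zf:
    names = sorted(zf.namelist())
    content = zf.read("a.html")
```

The cost is a full copy of the archive per replacement, and the replaced member moves to the end, which matters for formats that care about member order.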
3) OEBBook: My vote is for having an abstract representation of the book, not
one that is so closely tied to OEB. Tying ourselves to an internal
representation that is based on a specification we have no influence over is not
a good idea. Another reason for this is that the transformation of book ->
abstract layer -> book increases robustness (at some cost in fidelity). Given
the general philosophy of calibre, which is to accept arbitrarily bad input
and do the best job possible, this is a desirable trait. Also the abstraction
is going to be used by ebook-viewer as well which means it will need support
for things like bookmarks, annotations, history etc. That said, I haven't
really looked at OEBBook in detail, so this is not set in stone.
4) Covers:
We need a sensible way to handle covers. Covers can be of two types: rendered
and reflowable. Ideally the Readers should output covers in one or both of these
types. In particular, an EPUB reader should output both and remove the cover
page from the spine.
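
A small sketch of what a reader might hand back for covers. Everything here is an assumption made for illustration: the Covers class, the function split_cover_from_spine, and the reading of "rendered" as an image file and "reflowable" as an HTML cover page.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Covers:
    # Both fields are optional: a reader may supply one, both, or neither.
    rendered: Optional[str] = None    # assumed: path to an image file
    reflowable: Optional[str] = None  # assumed: path to an HTML cover page


def split_cover_from_spine(
    spine: List[str],
    cover_page: Optional[str],
    cover_image: Optional[str],
) -> Tuple[Covers, List[str]]:
    """Return the covers plus the spine with the cover page removed."""
    covers = Covers(rendered=cover_image, reflowable=cover_page)
    remaining = [item for item in spine if item != cover_page]
    return covers, remaining


# Usage: an EPUB-like book that has both kinds of cover.
covers, spine = split_cover_from_spine(
    ["cover.xhtml", "ch1.xhtml", "ch2.xhtml"], "cover.xhtml", "cover.jpg"
)
```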
5) Command line Interface
ebook-meta inputfile [options]
should both read and write metadata, using the metadata plugin system
ebook-convert inputfile outputfile.ext [options]
The available options will change based on the type of inputfile and the type
of output file. Exactly how this is going to work for both the CLI and the GUI
is one of those devilish details.
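
One possible way to make the available options depend on the input and output types is to look at the two file extensions first and build the option parser from per-format tables. The sketch below uses argparse; the tables, the function names, and the flags (--html-demo-depth, --epub-demo-flag, --mobi-demo-title) are invented for illustration and are not real ebook-convert options.

```python
import argparse
import os

# Hypothetical per-format option tables; the flags are illustrative only.
INPUT_OPTIONS = {
    ".html": [("--html-demo-depth", dict(type=int, default=5))],
}
OUTPUT_OPTIONS = {
    ".epub": [("--epub-demo-flag", dict(action="store_true"))],
    ".mobi": [("--mobi-demo-title", dict(default="Contents"))],
}


def build_parser(inputfile, outputfile):
    """Build a parser whose options depend on the two file extensions."""
    parser = argparse.ArgumentParser(prog="ebook-convert")
    parser.add_argument("inputfile")
    parser.add_argument("outputfile")
    in_ext = os.path.splitext(inputfile)[1].lower()
    out_ext = os.path.splitext(outputfile)[1].lower()
    for table, ext, title in (
        (INPUT_OPTIONS, in_ext, "input options"),
        (OUTPUT_OPTIONS, out_ext, "output options"),
    ):
        group = parser.add_argument_group(title)
        for flag, kwargs in table.get(ext, []):
            group.add_argument(flag, **kwargs)
    return parser


def parse(argv):
    # Assumes the two file names come first, as in the usage line above.
    if len(argv) < 2:
        raise SystemExit("usage: ebook-convert inputfile outputfile.ext [options]")
    return build_parser(argv[0], argv[1]).parse_args(argv)


# Usage: the EPUB flag exists only when the output is an EPUB file.
epub_opts = parse(["book.html", "book.epub", "--epub-demo-flag"])
mobi_opts = parse(["book.html", "book.mobi"])
```

A GUI could consume the same tables to decide which widgets to show, which keeps the two front ends consistent.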
6) Administrative things
I'm going to be developing this in lp:~kovid/calibre/pluginize
so anyone who wants to participate should pull from that branch.
7) Timeline
a) As a first step I will create ebook-meta. Hopefully it should be done by the
middle of next week.
b) Start creating the Readers. Hopefully we can arrive at a consensus on the
design of the Readers by next week.
c) Once the readers are created we can start work on the container + ebook
abstraction.
c.5) Migrate ebook-viewer to use the new ebook abstraction
d) Transforms
e) Output format: EPUB, LIT, MOBI and OEB
f) Command line interface
g) Output format: LRF
h) GUI
i) Test suite (ideally this should be developed in parallel with the rest)
Kovid.
On Wednesday 04 February 2009 13:58:20 Marshall T. Vandegrift wrote:
> Hi Kovid etc.,
>
> I'm pretty excited to start refactoring calibre to do conversions in a
> more modular fashion. How's that thesis coming? :-) I'd like to start
> discussing it though.
>
> My basic idea is that all conversion passes from the input format, to an
> internal representation of an OEB book, then to the output
> format. Obviously I think the OEBBook class would be a good candidate
> for the intermediate representation. It needs a bit more work to nicely
> support completely programmatic generation (vs OPF de-serialization) and
> add documentation and test cases, but I'm really happy with how it's
> turned out so far in terms of capturing almost everything expressible in
> OPF, serializing back out to fully spec-compliant OPF, and presenting a
> Pythonic interface to the represented information.
>
> For the flow of content in and out of the OEB representation I see four
> basic duck-types:
>
> - Readers. These accept a pathname and/or stream and return an OEB. I
> think they should also provide a default source renderer profile and
> initial "cleanup" transformation chain, which probably make the most
> sense as properties of a Reader itself vs. an individual Reader
> instance.
>
> - Containers. Provide filesystem-like access to formats which support
> such access. This isn't a core abstraction, but simplifies Readers
> which can use them.
>
> - Transforms. Accept an OEB and a conversion context (source and
> destination renderer profiles) and modify them in-place.
>
> - Writers. Accept an OEB, a conversion context, and an output
> path/stream. Write the output format to the output stream.
>
> The Readers, Transforms, and Writers should all expose any options they
> accept in a stackable, user-exposable fashion. Then all the current and
> future any2*s become a list of Transforms and a Writer. Win!
>
> Thoughts?
>
> -Marshall
--
_____________________________________
Kovid Goyal MC 452-48
California Institute of Technology
1200 E California Blvd
Pasadena, CA 91125
cell : +01 626 390 8699
office: +01 626 395 6595 (449 Lauritsen)
email : kovid@xxxxxxxxxxxxxxxxxx
web : http://www.kovidgoyal.net
_____________________________________