
launchpad-dev team mailing list archive

Re: riptano 0-60


On Wed, Nov 17, 2010 at 1:13 AM, Danilo Šegan <danilo@xxxxxxxxxxxxx> wrote:
> Heya Rob,
>
> On Tue, 16 Nov 2010 at 17:37 +1300, Robert Collins wrote:
>> It's better at writes than reads (because it has an append-only store
>> which does automatic compaction - rather like bzr). If we fit our
>> system on a single DB server *and expect to do so indefinitely* then
>> staying in a relational single-server model is ideal. (We've outgrown
>> a single server for reads, but not for writes - and we have headroom
>> there).
>
> When you say "better at writes vs reads", I wonder if that includes
> updates: with a fully "pre-joined" data set, I can imagine it being even
> slower than reads if it doesn't simply "deprecate" the old row
> ("append-only" suggests it does).  How does it actually work?

For further reading, you could look at the Bigtable and Dynamo papers.

A basic sketch though:
Each Cassandra server has an in-memory index of the rows it holds.
Rows are retrieved from SSTables and MemoryTables.
An SSTable is a highly compacted and indexed file on disk - like a bzr
pack file.
A MemoryTable is essentially a hashmap.
Writes accumulate in a MemoryTable until a flush to disk is triggered via:
 - an explicit API call
 - too much data in the MemoryTable
 - too much time has passed

So writes are essentially:
 - write to a write-ahead-log
 - add to a memory hashtable

And eventually:
 - flush a memory table to disk - this is non-blocking; a worker
thread does it.
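The write path above can be sketched in a few lines of Python. This is
an illustrative model only, not Cassandra's actual API: the class,
the flush threshold, and the JSON log format are all assumptions, and
the flush here runs inline where real Cassandra hands it to a worker
thread.

```python
import json


class Node:
    """Illustrative sketch of the write path: WAL, memtable, flush."""

    FLUSH_THRESHOLD = 4  # rows per memtable (tiny, for illustration)

    def __init__(self, log_path):
        self.log_path = log_path  # write-ahead log file on disk
        self.memtable = {}        # in-memory hashmap: row key -> columns
        self.sstables = []        # flushed, immutable tables (newest last)

    def write(self, key, columns):
        # 1. append to the write-ahead log so the write survives a crash
        with open(self.log_path, "a") as log:
            log.write(json.dumps({key: columns}) + "\n")
        # 2. add to the in-memory hashtable
        self.memtable[key] = columns
        # 3. flush once the memtable holds too much data (real Cassandra
        #    does this on a worker thread so writes never block on it)
        if len(self.memtable) >= self.FLUSH_THRESHOLD:
            self.flush()

    def flush(self):
        # Freeze the memtable into an immutable table and start a fresh
        # one; real Cassandra writes a compacted, indexed SSTable file.
        self.sstables.append(dict(self.memtable))
        self.memtable = {}
```

Note that the hot path is just an append plus a dict insert - which is
why writes are cheap relative to reads.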

Reads then are:
 - query all relevant nodes (some queries can be targeted to a few
specific nodes; others have to go to many to be satisfied - e.g. scans)
 - compare the results depending on the consistency level desired - 1:
any result; quorum: more than half the data holders agree; all: all
data holders agree.

So reads have to do more work (they have to compare), but they also
parallelise across nodes.

When data is replaced, a read on a data-holding node will just serve
the row from the memtable, or from the newest SSTable that has the row.
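The read path - newest version wins locally, then the coordinator
compares replica answers per the consistency level - can be sketched
like so. The function names and the (value, timestamp) pairs are
illustrative assumptions, not Cassandra's real data model:

```python
def local_read(node, key):
    """Newest version wins: check the memtable, then SSTables newest-first."""
    if key in node["memtable"]:
        return node["memtable"][key]
    for sstable in reversed(node["sstables"]):  # newest was flushed last
        if key in sstable:
            return sstable[key]
    return None


def coordinator_read(replicas, key, consistency="quorum"):
    """Ask every replica, then reconcile the answers by timestamp."""
    answers = [local_read(r, key) for r in replicas]
    answers = [a for a in answers if a is not None]
    needed = {"one": 1,
              "quorum": len(replicas) // 2 + 1,
              "all": len(replicas)}[consistency]
    if len(answers) < needed:
        raise RuntimeError("not enough replicas answered")
    # Each answer is a (value, timestamp) pair; the newest timestamp wins.
    return max(answers, key=lambda a: a[1])
```

The comparison step is the extra work reads do that writes don't.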

> That relates to a specific use-case I have in mind: translations sharing
> that we do.  With our current model, updating a single translation in
> one place updates it for a dozen or so "contexts" (i.e. in both Ubuntu
> Lucid and Ubuntu Maverick).  It means we'd have to do a dozen updates to
> replicate the functionality with a fully denormalized model, and if
> updates are slower (they basically include a read, right?) then we'd hit
> a lot of trouble.

Updates are writes - they don't (by default at least) need to read the
old data at all. And if the DB servers were to read the old data, that
would be localised per node holding the result. So, let's say we had 6
nodes (which is what a loose discussion with mdennis suggested we'd
need); then a write of a row would:
 - on 3 machines, add a row to the memtable
 - on the coordinator, wait for 2 machines to ack that they had done the write
 - return
Writing two rows would be the same as one row, twice - but the three
machines would be different: not the other three, but a different
three per row-key hash.
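The placement just described can be sketched as follows. The 6-node
cluster and 3 replicas per row key come from the discussion above;
everything else - the simple hash, the function names - is an
illustrative stand-in (real Cassandra places replicas around a token
ring, not with a bare hash):

```python
import hashlib

NODES = ["node%d" % i for i in range(6)]
REPLICATION_FACTOR = 3


def replicas_for(row_key):
    """Pick 3 consecutive nodes starting at the key's hash position."""
    h = int(hashlib.md5(row_key.encode()).hexdigest(), 16)
    start = h % len(NODES)
    return [NODES[(start + i) % len(NODES)]
            for i in range(REPLICATION_FACTOR)]


def quorum_reached(acks_received):
    """The coordinator returns once a majority (2 of 3) have acked."""
    return acks_received >= REPLICATION_FACTOR // 2 + 1
```

Two different row keys can hash to two different replica sets, which is
what spreads the write load across all six machines.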

If we in the appserver needed to read-then-write, that would be a
little different - but it's also a bit of an anti-pattern in Cassandra,
apparently.

> I can elaborate if you are interested in exploring this use case, but
> it's probably best done through a live chat.
>
> OTOH, if update performance is very good, the read performance for the
> other "direction" (where we collate all translations for a particular
> English string) would be more interesting.  Basically, it's a simplistic
> "translation memory" feature: go through entire DB of different contexts
> and fetch translations for a particular language for that particular
> English string.  That's a feature that's causing us mild issues with
> Postgres atm, and if reads are comparatively slower, we'd be even worse
> off.

Paraphrasing, is it:
from collections import defaultdict

result = defaultdict(set)
for language in all_languages:
    for product in products:
        result[language].add(product.translations[language][english_string])
?

I can imagine storing that normalised and ready to use all the time :)

-Rob


