launchpad-dev team mailing list archive
Message #05658
Re: riptano 0-60
On Wed, Nov 17, 2010 at 1:13 AM, Danilo Šegan <danilo@xxxxxxxxxxxxx> wrote:
> Heya Rob,
>
> On Tue, 16 Nov 2010 at 17:37 +1300, Robert Collins wrote:
>> It's better at writes than at reads (because it has an append-only
>> store, which does automatic compaction - rather like bzr). If we fit
>> our system on a single DB server *and expect to do so indefinitely*
>> then staying in a relational single-server model is ideal. (We've
>> outgrown a single server for reads, but not for writes - and we have
>> headroom there.)
>
> When you say "better at writes vs reads", I wonder if that includes
> updates: with a fully "pre-joined" data set, I can imagine it being even
> slower than reads if it doesn't simply "deprecate" the old row
> ("append-only" suggests it does). How does it actually work?
For further reading, you could look at the Bigtable and Dynamo papers.
A basic sketch, though:
Each Cassandra server has an in-memory index of the rows it holds.
Rows are retrieved from SSTables and memtables.
An SSTable is a highly compacted and indexed file on disk - like a bzr
pack file.
A memtable is essentially a hashmap.
Writes accumulate in a memtable until a flush to disk is triggered by:
- an explicit API call
- too much data in the memtable
- too much time having passed
So writes are essentially:
- write to a write-ahead-log
- add to a memory hashtable
And eventually:
- flush a memtable to disk - which is non-blocking; a worker thread
does this.
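The write path above can be sketched in Python. This is a toy model with invented class names (Memtable, Node), not Cassandra's actual internals: a durable log append, an in-memory hashmap insert, and an eventual flush to an immutable on-disk table.

```python
# Toy sketch of the write path: write-ahead log, memtable, flush to "SSTable".
# All names here are illustrative, not Cassandra's real classes.
import json

class Memtable:
    def __init__(self, flush_threshold=3):
        self.rows = {}                       # in-memory hashmap: row key -> columns
        self.flush_threshold = flush_threshold

    def put(self, key, columns):
        self.rows[key] = {**self.rows.get(key, {}), **columns}

class Node:
    def __init__(self):
        self.wal = []                        # stand-in for the on-disk write-ahead log
        self.memtable = Memtable()
        self.sstables = []                   # immutable flushed snapshots, newest last

    def write(self, key, columns):
        self.wal.append(json.dumps({key: columns}))   # 1. durable log append
        self.memtable.put(key, columns)               # 2. add to the memtable
        if len(self.memtable.rows) >= self.memtable.flush_threshold:
            self.flush()                              # 3. "too much data" trigger

    def flush(self):
        # in Cassandra this happens on a worker thread; here it is inline
        self.sstables.append(self.memtable.rows)      # becomes a new "SSTable"
        self.memtable = Memtable()
        self.wal.clear()                              # log entries no longer needed
```

The flush threshold here is a row count for brevity; the real triggers are the three listed above.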
Reads then are:
- query all relevant nodes (some queries can be targeted to a few
specific nodes; others - e.g. scans - have to go to many to be
satisfied)
- compare the results depending on the consistency level desired -
ONE: any result; QUORUM: more than half the data holders agree; ALL:
all data holders agree
So reads have to do more work (they have to compare), but they also
parallelise across nodes.
When data is replaced, a read on a data-holding node will just serve
the row from the memtable, or from the newest SSTable that has the row.
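As a toy illustration of that read-side comparison (illustrative code, not Cassandra's API), replicas can be modelled as dicts and the consistency check as a vote count:

```python
# Sketch of read reconciliation: query every relevant replica, then require
# agreement per the requested consistency level. Names are illustrative.
def read(replicas, key, consistency="QUORUM"):
    answers = [r.get(key) for r in replicas]   # query all relevant nodes
    if consistency == "ONE":
        return answers[0]                      # any result will do
    needed = len(replicas) if consistency == "ALL" else len(replicas) // 2 + 1
    # compare the results: does any value have enough agreeing holders?
    for value in answers:
        if answers.count(value) >= needed:
            return value
    raise RuntimeError("consistency level %s not satisfied" % consistency)
```

In the real system a mismatch triggers repair rather than an error; the exception here just marks the case where too few holders agree.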
> That relates to a specific use-case I have in mind: translations sharing
> that we do. With our current model, updating a single translation in
> one place updates it for a dozen or so "contexts" (i.e. in both Ubuntu
> Lucid and Ubuntu Maverick). It means we'd have to do a dozen updates to
> replicate the functionality with a fully denormalized model, and if
> updates are slower (they basically include a read, right?) then we'd hit
> a lot of trouble.
Updates are writes - they don't (by default, at least) need to read the
old data at all. And if the DB servers were to read the old data, that
would be localised per node holding the result. So, let's say we had 6
nodes (which is what a loose discussion with mdennis suggested we'd
need); then a write of a row would:
- on 3 machines add a row to the memtable
- on the coordinator, wait for 2 machines to ack that they had done the write
- return
Writing two rows would be the same as one row, twice - but the three
machines would be different: not the other three, but a
three-per-row-key hash.
If we in the appserver needed to read-then-write, that would be a
little different - but it's also a bit of an anti-pattern in
Cassandra, apparently.
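A rough sketch of that coordinator behaviour, with hypothetical helper names (the md5 hashing here merely stands in for Cassandra's token-based placement):

```python
# Sketch of a replicated write on a 6-node ring: pick 3 replicas per row key
# by hash, send the write to each, return once a quorum (2 of 3) has acked.
# Helper names are invented for illustration.
import hashlib

def replicas_for(key, nodes, rf=3):
    # three-per-row-key hash: which 3 of the 6 nodes hold a row depends on
    # the row key, so different rows land on different (overlapping) sets
    start = int(hashlib.md5(key.encode()).hexdigest(), 16) % len(nodes)
    return [nodes[(start + i) % len(nodes)] for i in range(rf)]

def quorum_write(key, columns, nodes, rf=3):
    targets = replicas_for(key, nodes, rf)
    acks = 0
    for node in targets:                             # in reality sent in parallel
        node.setdefault(key, {}).update(columns)     # replica adds row to memtable
        acks += 1                                    # replica acks the write
    quorum = rf // 2 + 1
    return acks >= quorum    # coordinator returns once quorum acks arrive
```

Note there is no read of the old row anywhere in the write path, which is the point being made above.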
> I can elaborate if you are interested in exploring this use case, but
> it's probably best done through a live chat.
>
> OTOH, if update performance is very good, the read performance for the
> other "direction" (where we collate all translations for a particular
> English string) would be more interesting. Basically, it's a simplistic
> "translation memory" feature: go through entire DB of different contexts
> and fetch translations for a particular language for that particular
> English string. That's a feature that's causing us mild issues with
> Postgres atm, and if reads are comparatively slower, we'd be even worse
> off.
Paraphrasing, is it:
result = defaultdict(set)
for language in all_languages:
    for product in products:
        result[language].add(product.translations[language][english_string])
?
I can imagine storing that normalised and ready to use all the time :)
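For instance (purely illustrative names, not an existing Launchpad schema), keeping that collation denormalised and maintained on write could look like:

```python
# Sketch of a pre-collated "translation memory": update it whenever a
# translation is written in any context, so the scan above becomes a lookup.
from collections import defaultdict

# (english_string, language) -> set of known translations
translation_memory = defaultdict(set)

def record_translation(english_string, language, context, translated):
    # hook this in wherever a translation is written, in any context
    translation_memory[(english_string, language)].add(translated)

def suggestions(english_string, language):
    # the feature query: all known translations of this English string
    return translation_memory[(english_string, language)]
```

The `context` argument is unused here on purpose: the collated view deliberately merges all contexts.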
-Rob