launchpad-dev team mailing list archive

Re: riptano 0-60

 

On Wed, Nov 17, 2010 at 5:48 AM, Danilo Šegan <danilo@xxxxxxxxxxxxx> wrote:
> Hi Rob,
>
> Sorry for taking a while.  I'm doing a few things at the same time and
> have written this over the few hours so I am sure I haven't been very
> coherent throughout.

That's fine.

>> So reads have to do more work (they have to compare) but do also
>> parallelise across nodes.
>
> Parallelising helps a lot if it includes data set partitioning as well.
> In case of Cassandra it seems to just ensure no-worse-than status
> instead, especially if you always want a definite state.

Well - an example. Say you're reading 1000 keys. Say we have 6 servers,
a replication factor of 3, and want 'definite state' - that implies
reading 'at quorum'.

Each key is on 3 servers, and on average the keys will be spread evenly.
Quorum is 2, so we need to do 1000*2 = 2000 replica reads to be sure
about the state. That's 2000 reads across 6 servers - about 333 keys
read per server. Compare that with a single-server environment, where
we'd read all 1000 keys off the one server.
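
To make the arithmetic concrete (plain Python, numbers as above, and
assuming keys really are spread evenly):

servers = 6
replication_factor = 3
quorum = replication_factor // 2 + 1      # 2 when RF is 3
keys = 1000
total_reads = keys * quorum               # 2000 replica reads
reads_per_server = total_reads / servers  # ~333 keys read per server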

>> When data is replaced, the read on a data holding node will just serve
>> the row from the memtable, or the newest sstable that has the row.
>
> Right, so update is fast.  With a denormalised model we'd still have to
> do an order of magnitude more updates than today, so I doubt we could
> see a win even there.

We could, if we had to, just throw hardware at it. That's one of the
really nice things about Cassandra. I think we'd need to go into
substantially more detail to see if that's the case though.

>> If we in the appserver needed to read-then-write, that would be a
>> little different - but its also a bit of an anti pattern in Cassandra,
>> apparently.
>
> Right, but I can't imagine an application like LP Translations, where
> you are constantly working with small bits of data (short English
> strings like "Open file..." and their translations) working in any other
> way.

So let's distinguish between 'show the user X and let them give us
back new X' and read-then-write.

An example of read-then-write would be:

AUDITS.set(user, {'login_count': AUDITS.get(user, 'login_count') + 1})
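
The reason that's an anti-pattern: nothing stops two clients
interleaving the read and the write. A rough sketch, with a plain dict
standing in for the (hypothetical) AUDITS store:

counter = {'login_count': 10}

a = counter['login_count']        # client A reads 10
b = counter['login_count']        # client B reads 10
counter['login_count'] = a + 1    # A writes 11
counter['login_count'] = b + 1    # B also writes 11; A's increment is lost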

> For example of queries that we do prior to doing very simple writes you
> can check out getPOTMsgSet*() methods in
> lib/lp/translations/model/pofile.py.

It looks like some of that would be doable using two queues - 'pending'
and 'done' - plus a batch process to eliminate dead weight every now
and then.
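
Very roughly, with hypothetical names and plain Python lists standing
in for the column families:

pending, done = [], []

def submit(key, value):
    pending.append((key, value))   # cheap, write-only path

def process_next():
    done.append(pending.pop(0))

def compact():
    # batch process: keep only the newest value per key
    latest = dict(done)
    done[:] = list(latest.items())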


>> Paraphrasing, is it:
>> result = defaultdict(set)
>> for language in all_languages:
>>     for product in products:
>>         result[language].add(product.translations[language][english_string])
>> ?
>
> Basically, yes.  Except that you'd only do it for a language at a time.
>
> Also, our model is structured slightly differently because of our
> existing use cases (though, with *relational* DBs, that doesn't make
> much difference).  See below.
>
>> I can imagine storing that normalised and ready to use all the time :)
>
> Well, normalised means probably a different thing to us in this context.
> We do more things with this data.  I.e. a normal usage pattern is:
>
> result = []
> product = product1
> for english_string in product.english_strings:
>    result.append((english_string, product.translations[language][english_string]))
>
> (or more simply,
> product.english_strings[english_string].translations[language], which is
> roughly how our model looks today, and which is why we are having
> some issues with the above queries).
>
> So, our entry points are both product.english_strings and
> product.translations.  And, 'normalised' for us today means that when
> these translations are repeated between product1 and product2
> (paraphrasing still), then
> product1.translations[ANY-LANGUAGE][english_string] is equivalent to
> product2.translations[ANY-LANGUAGE][english_string] (or, translated to
> our model, we've got "product1.english_strings[string].translations ===
> product2.english_strings[string].translations").
>
> When a translation is updated on product1, it needs to be automatically
> updated on product2.  That invalidates the option of having normalised
> data set by language, or at least makes it hard.

I should have said 'denormalised and ready to use'. It looks like
having one bucket of strings and an index per product would make
sense.
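
Very roughly, with dicts standing in for the bucket and the index
(a hypothetical layout, not a worked design):

translations = {
    'Open file...': {'de': 'Datei öffnen...'},
}
product_index = {
    'product1': ['Open file...'],
    'product2': ['Open file...'],
}

def translate(product, english_string, language):
    # both products index the same shared entry, so updating
    # translations['Open file...'] updates it for every product
    assert english_string in product_index[product]
    return translations[english_string][language]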

-Rob


