
launchpad-dev team mailing list archive

Re: cassandra day 3

 

On Thu, Nov 18, 2010 at 10:44 PM, Stuart Bishop
<stuart.bishop@xxxxxxxxxxxxx> wrote:
> On Thu, Nov 18, 2010 at 10:28 AM, Robert Collins
> <robertc@xxxxxxxxxxxxxxxxx> wrote:
>
>> The notation I'm going to use is this:
>> 'foo' : the literal value foo.
>> foo: a variable representing foo
>> ...: Repeated things.
>> + prefixing a column name : 'has a secondary index'
>> (Thing) : this row is sorted on Thing. For instance
>> 'Address':value(timestamp) - sorted on the timestamp.
>>
>> ColumnFamily(aka Table): CF|SCF (ColumnFamily or SuperColumnFamily)
>>
>> Row-Key :   [+]ColumnName:(value)
>>
>> Remember too that every concrete column - essentially a stored cell -
>> has a timestamp on it.
>
> Have you found any way of diagramming systems? I'm finding this and the
> Riptano slides pretty unreadable, even for these toy examples.

I'm using a minor tweak on the presenter's preferred style; we can
certainly start using SVGs/wiki pages or some such.
I find it useful to remember that 'CF == table' for all intents and
purposes, except for the static nature of rows in regular SQL
databases.

> oauth - The issue in PostgreSQL is the nonce handling. This would be
> in memcache now except that we are relying on atomic commits to avoid
> race conditions in the replay avoidance stuff. Cassandra will hit the
> same issue. For nonce handling, I think memcache is a better fit -
> volatile is fine so keep it fast and avoid all that disk activity, and
> if a nonce is consumed other clients need to know that immediately
> (rather than waiting for information to replicate around).

I'd like to correct the analysis here though - a write + read @ quorum
is extremely fast in Cassandra: it's a single write to the WAL on each
of the quorum machines (e.g. 2), and reads will be coming from
memtables so will have no IO. Reading at quorum is consistent - it's
not delayed by replication, because Cassandra writes in parallel.
Doing a write at quorum when the replication factor is set to 3 does
the following:
 - the coordinator (the node the client is attached to) determines the
   nodes the row should end up on
 - the coordinator sends all such nodes (3) the request in parallel
 - the coordinator waits for quorum (2) nodes to reply
 - the coordinator responds to the client

The recommended setup has a dedicated disk for the WAL - so all IOs
are linear, no seeking, and the change is guaranteed once it's in the
WAL. Reads then do the following:
 - the coordinator (the node the client is attached to) determines the
   nodes the row should come from
 - the coordinator sends all relevant (3) nodes the request in parallel
 - the coordinator waits for quorum (2) nodes to reply
 - the coordinator checks that both replied with the same value, and if
   they didn't it waits for more replies to see what the real value is
 - the coordinator responds to the client

So in the oauth nonce case, we'd expect to be able to answer
accurately mere milliseconds apart.
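
To make that concrete, here's a minimal sketch of what the nonce check
could look like - assuming the pycassa client and a hypothetical
'Nonces' column family keyed by consumer key (names and schema are
illustrative, not a worked design):

    # Sketch only: quorum write + quorum read of an OAuth nonce. The
    # 'Launchpad' keyspace and 'Nonces' CF are made up for illustration.
    import time
    import pycassa
    from pycassa import ConsistencyLevel

    pool = pycassa.ConnectionPool('Launchpad', ['cassandra1:9160'])
    nonces = pycassa.ColumnFamily(
        pool, 'Nonces',
        read_consistency_level=ConsistencyLevel.QUORUM,
        write_consistency_level=ConsistencyLevel.QUORUM)

    def nonce_already_used(consumer_key, nonce):
        # Quorum read: consistent with any quorum write already made.
        try:
            nonces.get(consumer_key, columns=[nonce])
            return True
        except pycassa.NotFoundException:
            return False

    def record_nonce(consumer_key, nonce):
        # Acknowledged once a quorum of replicas have it in their WAL.
        nonces.insert(consumer_key, {nonce: str(time.time())})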

Also, in terms of scaling, we can have N machines - where N could be
hundreds - all cooperating to handle huge load, whereas with memcache,
to cope with more volume than one machine can handle we'd need a
sharding proxy in front of it.

With that out of the way, what precisely does the nonce handling need
to do? AIUI we want reliable, transient storage for issuing a new
nonce on every API request? What happens if the nonce store were to
get reset without warning? Would all current API clients handle it
transparently? Or would they all barf?

If they'd all barf, I think we probably want to do something to avoid
that - the support / confidence impact of that sort of disruption is
(IMO) pretty significant.

> sessions - seems a decent fit. I'm not sure if the existing setup is a
> problem that needs solving though.

We wanted to pick some good-sized things to think through; I think
sessions are working ok right now.

> memcache - Using memcache is essentially free because of its
> limitations. I don't think Cassandra is a suitable replacement for our
> current volatile-data-only usage of memcache. There have been some
> things we decided memcache was not suitable that Cassandra could be a
> better fit for.

Memcache isn't totally free - we spend /seconds/ chatting to it on
some pages. I need to do some rigorous testing, but the data I have so
far suggests that bugtask:+index is slowed down more than it is sped
up by memcache.

> Is it suitable for replacing the bulk of the Librarian?

We could; OTOH there is a SAN coming which will give our current
implementation legs.

> Disaster recovery will be an issue. We need things in place before we
> put any data we care about into it.

Absolutely; James Troup and Michael Barnett seemed reasonably
confident about the options there. IIRC there are basically three
stories.

Firstly, remaining available can be done in a variety of ways, all of
which are variations on: run multiple data centres with a
network-aware topology, request datacentre-local consistency on writes
and reads, and trust that the other data centre won't be far behind.
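
As a sketch of what that looks like from the client side - assuming
pycassa and a keyspace using a network-aware replication strategy,
both illustrative here - it's just a different consistency level on
the same calls:

    # Sketch only: LOCAL_QUORUM waits for a quorum of replicas in the
    # local data centre and lets the remote one catch up asynchronously.
    from pycassa import ConnectionPool, ColumnFamily, ConsistencyLevel

    pool = ConnectionPool('Launchpad', ['dc1-cassandra1:9160'])
    cf = ColumnFamily(
        pool, 'SomeCF',  # hypothetical column family
        read_consistency_level=ConsistencyLevel.LOCAL_QUORUM,
        write_consistency_level=ConsistencyLevel.LOCAL_QUORUM)
    cf.insert('some-key', {'column': 'value'})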

Secondly, backups - there is a near-instant per-node snapshot facility
in Cassandra: it hardlinks the current data files, and you back those
up at your leisure.

Lastly, replicating into a new cluster would be something like Hadoop
- a massive map across everything. I'm going to write to mdennis to
explore this a little more for the staging and qastaging questions.

> Staging and qa systems will be interesting. I'm not sure how things
> could be integrated. I guess we would need to build a staging
> cassandra database from a snapshot taken after the PostgreSQL dump was
> taken, with missing data being ok because of 'eventually consistent'.

That's my understanding.

> I don't see a win in replacing small systems that are not in trouble.
> We may just as easily avoid the trouble by redesigning for PG or
> memcache than  by redesigning for Cassandra. Adding another moving
> part like Cassandra introduces a lot of moving parts - too much
> overhead for the toy systems. If we want to use it, I'd want to see it
> used for a big system that could do with a performance boost.
> Publishing history in soyuz, Branch/BranchRevision/Revision in
> codehosting, *Message/Message/MessageChunk,
> LibraryFileAlias/LibraryFileContent, full text search, karma.

That's a good point too. I think I touched on integration, but it's
fairly straightforward:
to query across PG and Cassandra, we would have the PG row ids in
Cassandra - either as row keys or as a dedicated index to the natural
keys. Then we'd query as much as we can in PG, issue a multiget or
similar to get things from Cassandra, and reduce in the appserver.
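
A rough sketch of that query shape, assuming pycassa and a
hypothetical 'MessageChunk' column family keyed by the PG Message id
(all names illustrative):

    # Sketch only: narrow the set in PostgreSQL as today, then fetch
    # the bulky columns from Cassandra in one multiget and combine the
    # results in the appserver.
    import pycassa

    pool = pycassa.ConnectionPool('Launchpad', ['cassandra1:9160'])
    chunks_cf = pycassa.ColumnFamily(pool, 'MessageChunk')

    def message_bodies(pg_message_ids):
        # pg_message_ids: the row ids the normal PG query selected.
        rows = chunks_cf.multiget([str(mid) for mid in pg_message_ids])
        # Reduce in the appserver: one body per message, chunks joined.
        return dict(
            (key, ''.join(cols.values())) for key, cols in rows.items())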

This obviously works poorly if we'd get back thousands of rows to
filter / sort from PG - so we'd need to avoid that.

There are a couple of strategies to do that.
One is to have some of the tables present in both PG and Cassandra.
Then when Cassandra can answer more efficiently, query Cassandra; when
PG can answer more efficiently, query PG. For instance, we might want
to do set membership tests for BranchRevision on Cassandra, but
reductions when querying for bugs with branches in PG. If we like the
result we would migrate more and more until nothing needs to query a
given table in PG, at which point we could drop it. We could use this
without migrating, in fact - just add the moving parts and get a
hybrid doing different bits really well.
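
For instance, a sketch of the BranchRevision membership test -
assuming a 'BranchRevision' column family with one row per branch and
one column per revision it contains (again illustrative, not a schema
proposal):

    # Sketch only: "which of these revisions are already in this
    # branch?" becomes one row read asking for just the named columns.
    import pycassa

    pool = pycassa.ConnectionPool('Launchpad', ['cassandra1:9160'])
    branch_revisions = pycassa.ColumnFamily(pool, 'BranchRevision')

    def revisions_in_branch(branch_id, revision_ids):
        # Returns the subset of revision_ids present in the branch.
        try:
            present = branch_revisions.get(
                str(branch_id), columns=[str(r) for r in revision_ids])
        except pycassa.NotFoundException:
            return set()
        return set(present.keys())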

Another is to migrate clusters of tables that are queried together,
where the boundaries have small numbers of ids that we'd need to
cross-query. We could move LibraryFileContent, for instance - we
rarely need to query it, even though we access LFA all the time. And
generally when we access LFA we do so as part of showing the branding
for many people - we could move that and multiget it quite
satisfactorily.

-Rob


