
launchpad-dev team mailing list archive

Re: cassandra day 3

 

On Thu, Nov 18, 2010 at 10:28 AM, Robert Collins
<robertc@xxxxxxxxxxxxxxxxx> wrote:

> The notation I'm going to use is this:
> 'foo' : the literal value foo.
> foo: a variable representing foo
> ...: Repeated things.
> + prefixing a column name : 'has a secondary index'
> (Thing) : this row is sorted on Thing. For instance
> 'Address':value(timestamp) - sorted on the timestamp.
>
> ColumnFamily(aka Table): CF|SCF (ColumnFamily or SuperColumnFamily)
>
> Row-Key :   [+]ColumnName:(value)
>
> Remember too that every concrete column - essentially a stored cell -
> has a timestamp on it.
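
For anyone else struggling with the notation, here is a rough Python
sketch of what it describes: a row key mapping to named columns, each
stored cell carrying a timestamp, with the option of reading the row
sorted on that timestamp. The class and method names (Column, Row, put,
sorted_by_timestamp) and the sample data are made up for illustration
and are not Cassandra's API.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Column:
    """One stored cell: a name, a value, and the timestamp it was written."""
    name: str
    value: str
    timestamp: int


@dataclass
class Row:
    """Row-Key : ColumnName:(value) in the notation above."""
    key: str
    columns: dict = field(default_factory=dict)
    # Column names marked with '+' in the notation (secondary index).
    indexed: set = field(default_factory=set)

    def put(self, name, value, timestamp, indexed=False):
        # Store (or overwrite) the cell for this column name.
        self.columns[name] = Column(name, value, timestamp)
        if indexed:
            self.indexed.add(name)

    def sorted_by_timestamp(self):
        # '(timestamp)' in the notation: the row read back in timestamp order.
        return sorted(self.columns.values(), key=lambda c: c.timestamp)


row = Row("person-1")
row.put("Address", "1 Main St", timestamp=20)
row.put("Email", "a@example.com", timestamp=10, indexed=True)
names_in_order = [c.name for c in row.sorted_by_timestamp()]
```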

Have you seen any way of diagramming these systems? I'm finding this and
the Riptano slides pretty unreadable, even for these toy examples.


> All in all I'm very glad Gary and I were here for the face time with
> Matthew from Riptano - we're in a very good position now in terms of
> understanding what it would take, and whether we'd want to, use
> Cassandra in some capacity going forward.
>
> For the million dollar question though - I think we probably want to
> use Cassandra for some stub systems (e.g. oauth, sessions, oopses,
> memcache replacement), as it has a much better scaling and schema
> evolution story than postgresql - but the lack of transactions and
> fundamentally different design approach needed mean that while
> Cassandra's performance and scaling are very attractive, we'd be nuts
> to try to hook it into Launchpad until our layering is sorted out
> - we'd need a dedicated layer where we could abstract out the overall
> operation vs the transaction/update logic.

oauth - The issue in PostgreSQL is the nonce handling. This would be
in memcache now except that we are relying on atomic commits to avoid
race conditions in the replay avoidance stuff. Cassandra will hit the
same issue. For nonce handling, I think memcache is a better fit -
volatile is fine so keep it fast and avoid all that disk activity, and
if a nonce is consumed other clients need to know that immediately
(rather than waiting for information to replicate around).
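
To illustrate the property nonce handling needs, here is a minimal
sketch assuming a store with an atomic "add if absent" operation, which
is what memcache's add command provides. VolatileStore is a
hypothetical in-process stand-in so the example runs on its own, and
consume_nonce and the key layout are likewise invented for
illustration, not our current code.

```python
import time


class VolatileStore:
    """Hypothetical in-process stand-in for a memcache client.

    Only add() is modelled. A real memcache server performs add
    atomically; this stand-in is not thread-safe.
    """

    def __init__(self):
        self._data = {}

    def add(self, key, value, ttl):
        # Store the value only if the key is absent or has expired.
        now = time.monotonic()
        entry = self._data.get(key)
        if entry is not None and entry[1] > now:
            return False
        self._data[key] = (value, now + ttl)
        return True


def consume_nonce(store, consumer_key, nonce, ttl=300):
    """Return True the first time a nonce is seen, False on a replay."""
    # Whichever request adds the key first wins; later ones are replays.
    return store.add("nonce:%s:%s" % (consumer_key, nonce), 1, ttl)


store = VolatileStore()
first = consume_nonce(store, "consumer-a", "abc123")
replay = consume_nonce(store, "consumer-a", "abc123")
other = consume_nonce(store, "consumer-a", "def456")
```

The point is that the check and the write are a single operation, so
two requests racing on the same nonce cannot both succeed. A
read-then-write against an eventually consistent store does not give
that guarantee.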

sessions - seems a decent fit. I'm not sure if the existing setup is a
problem that needs solving though.

oopses - Probably a better fit than PostgreSQL. Can start with the
reporting side of things if that is a problem. If we can generate the
reports we need, then we can get systems submitting directly to the DB
or via Rabbit.

memcache - Using memcache is essentially free because of its
limitations. I don't think Cassandra is a suitable replacement for our
current volatile-data-only usage of memcache. There have been some
things we decided memcache was not suitable for that Cassandra could
be a better fit for.

Is it suitable for replacing the bulk of the Librarian?

Disaster recovery will be an issue. We need things in place before we
put any data we care about into it.

Staging and qa systems will be interesting. I'm not sure how things
could be integrated. I guess we would need to build a staging
cassandra database from a snapshot taken after the PostgreSQL dump was
taken, with missing data being ok because of 'eventually consistent'.

I don't see a win in replacing small systems that are not in trouble.
We may just as easily avoid the trouble by redesigning for PG or
memcache as by redesigning for Cassandra. Adding another system like
Cassandra introduces a lot of moving parts - too much overhead for the
toy systems. If we want to use it, I'd want to see it used for a big
system that could do with a performance boost. Candidates: publishing
history in soyuz, Branch/BranchRevision/Revision in codehosting,
*Message/Message/MessageChunk, LibraryFileAlias/LibraryFileContent,
full text search, karma.

-- 
Stuart Bishop <stuart@xxxxxxxxxxxxxxxx>
http://www.stuartbishop.net/


