launchpad-dev team mailing list archive

Re: riptano 0-60

To: Robert Collins <robertc@xxxxxxxxxxxxxxxxx>
From: Clint Byrum <clint@xxxxxxxxxx>
Date: Mon, 15 Nov 2010 23:28:50 -0800
Cc: Launchpad Community Development Team <launchpad-dev@xxxxxxxxxxxxxxxxxxx>
In-reply-to: <AANLkTi=kMSWL2b8b2jNHZEr+B_kpc8wy7ak2DO7DBK5O@mail.gmail.com>
Great write-up Robert. A few disjointed responses below..

On Tue, 2010-11-16 at 17:37 +1300, Robert Collins wrote:
> The schema is dynamic: at the database level new columns can be
> defined without downtime, but the database has no transactions: the
> strongest guarantee you can get is that a single write group to a
> single partition will all get put into a single write-ahead-log.
> 

Probably worth it to note that each column is just a name pointing to a
string of bytes. This can be a uint64_t, or a UTF-8 string. So the
"Schema" updates are basically just writing a new row with a column
never seen before, or adding a secondary index.
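
A minimal sketch of the "name pointing to a string of bytes" idea, using
a plain Python dict as a stand-in for a row. The column names and values
are made up for illustration, and no Cassandra client API is shown; the
point is only that the store sees bytes and the application decides how
to interpret them.

```python
import struct


def encode_uint64(value: int) -> bytes:
    """Pack an unsigned 64-bit integer as 8 big-endian bytes."""
    return struct.pack(">Q", value)


def decode_uint64(raw: bytes) -> int:
    """Unpack 8 big-endian bytes back into an unsigned 64-bit integer."""
    return struct.unpack(">Q", raw)[0]


# A row is modelled here as a dict of column name -> raw bytes.
# (Hypothetical column names; this is not a Cassandra client.)
row = {
    b"bugid": encode_uint64(12345),
    b"title": "Crash on startup".encode("utf-8"),
}

# A "schema update" in this model is just writing a column that has
# never been seen before.
row[b"emailaddress"] = "user@example.com".encode("utf-8")
```

Reading a value back means applying the matching decoder, for example
decode_uint64(row[b"bugid"]) or row[b"title"].decode("utf-8").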

It's important to note that the database does "conflict resolution" on
its own, using the timestamps you've given to each bit of data you have
inserted. 0.7 (in beta now) includes some kind of version vector support
to make this more reliable, but the cassandra wiki doesn't mention it.

It's also interesting that this conflict resolution does not happen at
write time. Basically, you read values from two places to enable "read
repair" when a node has gone away and come back. If you read the values,
and they aren't the same, the one with the newer timestamp wins.
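
The read-time resolution described above can be sketched roughly as
follows. This is a simplified model, not Cassandra's actual code:
replicas are plain dicts, Cell and read_with_repair are hypothetical
names, and equal timestamps are not handled specially.

```python
from typing import Dict, List, NamedTuple, Optional


class Cell(NamedTuple):
    value: bytes
    timestamp: int  # supplied by the client at write time


def read_with_repair(
    replicas: List[Dict[bytes, Cell]], column: bytes
) -> Optional[bytes]:
    """Read a column from every replica; the newest timestamp wins.

    Replicas holding a stale or missing copy are overwritten with the
    winning cell, which is the "repair" part.
    """
    cells = [replica[column] for replica in replicas if column in replica]
    if not cells:
        return None
    winner = max(cells, key=lambda cell: cell.timestamp)
    for replica in replicas:
        if replica.get(column) != winner:
            replica[column] = winner
    return winner.value


# A node that went away and came back still holds the older value.
stale = {b"status": Cell(b"New", 100)}
fresh = {b"status": Cell(b"Fix Committed", 200)}
result = read_with_repair([stale, fresh], b"status")
```

After the call, both dicts hold the cell with timestamp 200, so the
stale node has been repaired as a side effect of the read.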

> Counting - assigning numbers based on data in the database - is
> tricky, and there are a few techniques to do it. Running a counting
> service - a single point of failure that manages a lock and can issue
> numbers - is something we'd probably need to do to allocate bugids,
> were we to migrate to Cassandra.
> 

There's been a lot of discussion and work on this for 0.7:

https://issues.apache.org/jira/browse/CASSANDRA-1072

Basically incrementing counters atomically is wanted, and can be solved,
but hasn't been fully solved yet.

> In Cassandra, most indexes are a CF that has row keys that are either
> the key [or some named value] from another CF, and values that are the
> key into another CF. E.g. BugSubscription might have a key of bugid,
> and in every row a column called 'emailaddress' with value being the
> email address subscribed to it. I chose this deliberately to emphasis
> how we might denormalise to make calculating notifications absolutely
> trivial. When someone changes their email address, we'd find their
> subscribed bugs (via a secondary index which would index the
> emailaddress column in BugSubscription) and update those
> subscriptions.
> 

Which is the same thing one would do if they were to update the values
in a relational table with an indexed column... it's just more magical
with an RDBMS.
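
To make the comparison concrete, here is a rough sketch of doing by hand
what an RDBMS index does behind the scenes. Plain dicts stand in for the
BugSubscription CF and its secondary index; change_email and the sample
data are hypothetical, and each bug has a single emailaddress column as
in the quoted example.

```python
from typing import Dict, Set


def change_email(
    subscriptions: Dict[int, Dict[str, str]],
    index: Dict[str, Set[int]],
    old: str,
    new: str,
) -> int:
    """Rewrite every subscription for `old` and keep the index in step.

    Returns the number of subscriptions that were updated.
    """
    # Find the subscribed bugs via the index instead of scanning all rows.
    bugids = index.pop(old, set())
    for bugid in bugids:
        subscriptions[bugid]["emailaddress"] = new
    # Add the moved bugs under the new address, but never create an
    # empty index entry when nothing moved.
    if bugids:
        index.setdefault(new, set()).update(bugids)
    return len(bugids)


# bugid -> columns, plus an index of emailaddress -> bugids.
subscriptions = {
    1: {"emailaddress": "old@example.com"},
    2: {"emailaddress": "other@example.com"},
    3: {"emailaddress": "old@example.com"},
}
index = {"old@example.com": {1, 3}, "other@example.com": {2}}

updated = change_email(subscriptions, index, "old@example.com", "new@example.com")
```

The difference from the relational case is that the application has to
perform both writes itself, and without transactions a failure between
them leaves the index and the data briefly out of step.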

> Costs of using cassandra:
>  - more servers are needed vs existing thing being replaced [because
> its less efficient and needs parallelism]
>  - we'd need to write supporting ware of some sort to automate things
> that are simple sql now, like creating indexes [change the schema,
> generate an automated script to populate the index, update our data
> definition to cause writes to the index]

Eric Evans recently posted to dev@xxxxxxxxxxxxxxxxxxxx about adding a
simple language to cassandra to free people from having to write
utilities like this. The link escapes me at the moment. Something like
YeCQL.

>  - writes need to be change from ACID - where we rollback in the event
> of error to BASE - where everything we write is correct as far as it
> goes and things get made sensible eventually. (Eventually might be
> milliseconds, but its not instant).

For critical things that need to be atomic, shared read/write locking
with something like zookeeper works to serialize writes.
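
As a rough, single-process illustration of serializing writes behind a
lock, the sketch below allocates sequential ids. threading.Lock is only
a stand-in for a distributed lock held in something like zookeeper, and
the in-memory counter stands in for a value stored in the database; the
class name is hypothetical and no zookeeper or Cassandra API is shown.

```python
import threading


class SerializedCounter:
    """Hand out increasing ids, one writer at a time."""

    def __init__(self, start: int = 0) -> None:
        # Stand-in for a distributed lock; a real deployment would need
        # a lock visible to every writer, not just this process.
        self._lock = threading.Lock()
        self._value = start

    def allocate(self) -> int:
        # The read-modify-write happens entirely inside the lock, so no
        # two callers can ever observe the same current value.
        with self._lock:
            current = self._value
            self._value = current + 1
            return self._value


counter = SerializedCounter()
allocated = []


def worker() -> None:
    for _ in range(100):
        allocated.append(counter.allocate())


threads = [threading.Thread(target=worker) for _ in range(8)]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
```

With 8 threads each allocating 100 ids, every id from 1 to 800 is issued
exactly once. The cost is the one Robert notes for a counting service:
everything funnels through a single lock.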

>  - its a pain to package, so we'd need to gain some java glue in buildout.

From the evaluation I did, the 0.7 release will be a little easier to
package than 0.6 was, as the version of thrift being used is a released
one, and not a specific svn revision. The 20 or so missing java deps
(with a few missing deps of their own) would be a few weeks of work to
get building entirely from source, unless we can get an
auto-maven-to-deb thingy working soon.

>  - more operational complexity than we have today (jvm vs CPython)
> 
> Potential benefits of using cassandra:
>  - highly available, scalable platform
>  - real twisted support, should we want that - native async library support
>  - parallelism within single queries
>  - online schema changes [no downtime!]
> 
> Places where Cassandra may make sense for us [short term]:
>  - librarian storage [nb most folk doing s3-like things use simple
> files on N disks for the backing store, metadata in Cassandra : in
> that model we'd just stay with pgsql]
>  - a backend for solr/lucene, the search engine at the top of my list
> for fixing our search story (LEP/Search)

+1 for this ... search being "eventually consistent" is totally
acceptable and this makes Lucene scale to crazy data sizes IIRC.

https://github.com/tjake/Lucandra

>  - could replace memcached, which would give us a higher hit rate
> (because we would be sharing one effective cache)
> 

Not sure it's a great choice here... it is, after all, slower on reads
than on writes.
References

riptano 0-60
From: Robert Collins, 2010-11-16