Re: cassandra day 3

 

On Mon, Nov 22, 2010 at 7:33 AM, Robert Collins
<robert.collins@xxxxxxxxxxxxx> wrote:
> On Thu, Nov 18, 2010 at 10:44 PM, Stuart Bishop
> <stuart.bishop@xxxxxxxxxxxxx> wrote:
>> On Thu, Nov 18, 2010 at 10:28 AM, Robert Collins
>> <robertc@xxxxxxxxxxxxxxxxx> wrote:


>> oauth - The issue in PostgreSQL is the nonce handling. This would be
>> in memcache now except that we are relying on atomic commits to avoid
>> race conditions in the replay avoidance stuff. Cassandra will hit the
>> same issue. For nonce handling, I think memcache is a better fit -
>> volatile is fine so keep it fast and avoid all that disk activity, and
>> if a nonce is consumed other clients need to know that immediately
>> (rather than waiting for information to replicate around).
>
> I'd like to correct the analysis here though - a write + read @ quorum
> is extremely fast in Cassandra: it's a single write to the WAL on each
> of the quorum machines (e.g. 2), and reads will be coming from
> MemoryTables so will have no IO. Reading at quorum is consistent -
> it's not delayed by replication. Cassandra writes in parallel - doing
> a write at quorum when the replication factor is set to 3 does the
> following:
> coordinator (node the client is attached to) determines the nodes for
> the row to end up on
> coordinator sends all such nodes (3) the request in parallel
> coordinator waits for quorum (2) nodes to reply
> coordinator responds to the client

It looks like Cassandra cannot be as fast as memcached because of the
extra hop through the coordinator. It could get close if the
coordinator streams the request on to the replicas while it is still
arriving.

> Recommended setup has a dedicated disk for the WAL - so all IOs are
> linear, no seeking, and the change is guaranteed once it's in the WAL.
> Reads then, do the following:
> coordinator (node the client is attached to) determines the nodes for
> the row to come from
> coordinator sends all relevant (3) nodes the request in parallel
> coordinator waits for quorum (2) nodes to reply
> coordinator checks that both replied with the same value, and if they
> didn't it waits for more to see what the real value is
> coordinator responds to the client

But it needs more in the way of resources. At the moment we have 5 (?)
memcached instances reusing spare capacity on the appservers. If the
write volume is low, we could use the appservers as Cassandra servers
since their (slow) disks are otherwise inactive. I suspect we are
looking at at least three new servers though? Plus staging?


> So in the oauth nonce case, we'd expect to be able to answer
> accurately mere milliseconds apart.

> Also, in terms of scaling, we can have N where N could be hundreds of
> machines all cooperating to handle huge load whereas with memcache to
> cope with more than one machine volume we'd need a sharding proxy in
> front of it.

Memcached doesn't need a sharding proxy. We already have 5 instances
running, and the client hashes each key to one of them, so load is
naturally distributed around the nodes.
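
Just to illustrate what I mean (host names are made up; this is the
standard python-memcached client behaviour, not anything special on
our side):

    import memcache

    # The client takes the full server list and hashes each key to pick
    # a server, so no sharding proxy is needed in front of the pool.
    mc = memcache.Client(['appserver1:11211', 'appserver2:11211',
                          'appserver3:11211'])

    mc.set('some-cache-key', 'value')  # hashed to one of the three servers
    mc.get('some-cache-key')           # same hash, so the same server answers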

> With that out of the way, what precisely does the nonce handling need
> to do? AIUI we want reliable, transient storage for issuing a new
> nonce on every API request? What happens if the nonce store were to
> get reset without warning? Would all current API clients handle it
> transparently? Or would they all barf.

The high traffic is on the OAuthNonce table. As I understand it, for
every OAuth request we get we store the nonce with a timestamp to
protect against replay attacks. If we lose an entry, it isn't really a
drama - it just means that that nonce could be reused (for a practical
attack, you would need to sniff the request, engineer the loss of the
nonce, and replay the request within a particular timeframe; there are
already other attacks that would be easier to perform).

The reason this is still in PG is that we sometimes need to retry
requests - optimistic conflict resolution. If we store the used nonce
in a non-transactional store and the request is retried, the retried
request will always fail because the nonce has been flagged as
consumed. To work around this, we would need to store the nonce
information *after* committing the PG transaction, or delete the nonce
information from the non-transactional store if we need to roll back
the PG transaction.
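
To make the retry problem concrete, here is roughly what a bare
memcached nonce check would look like (the key layout and expiry are
made up for illustration):

    import memcache

    mc = memcache.Client(['memcached1:11211'])
    NONCE_WINDOW = 3600  # how long we care about replays, in seconds

    def consume_nonce(consumer_key, token, nonce, timestamp):
        """Return True if this nonce is fresh, False if it is a replay.

        memcache add() is atomic - it only stores the key if it does
        not already exist - so a second request with the same nonce
        fails here.
        """
        key = 'nonce:%s:%s:%s:%s' % (consumer_key, token, nonce, timestamp)
        return bool(mc.add(key, 1, time=NONCE_WINDOW))

The trouble is exactly the retry case: if the PG transaction is
retried, the second call sees the key and rejects a perfectly
legitimate request.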

Despite the high numbers, the existing nonce store does not appear to
be too much of a burden, so it's been left as it is. I could continue
the memcached branch to store the nonce information in the second
commit phase, essentially making it transactional (PG could still barf
on commit during that phase too since we don't use two-phase commit
despite zope.transaction supporting it, but that should only happen in
catastrophic circumstances).
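
One way to do that, using the after-commit hooks the transaction
package already exposes (the helper names here are made up):

    import memcache
    import transaction

    mc = memcache.Client(['memcached1:11211'])
    NONCE_WINDOW = 3600

    def _record_nonce(committed, key):
        # Only mark the nonce as consumed once PG has really committed;
        # an aborted transaction leaves it unused so the retry can succeed.
        if committed:
            mc.set(key, 1, time=NONCE_WINDOW)

    def check_nonce(consumer_key, token, nonce, timestamp):
        key = 'nonce:%s:%s:%s:%s' % (consumer_key, token, nonce, timestamp)
        if mc.get(key):
            return False  # replay
        transaction.get().addAfterCommitHook(_record_nonce, args=(key,))
        return True

We lose the atomic check-and-set, but the window between the commit
and the set() is tiny and, as above, a lost nonce is not a drama.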

We make no use of advanced memcache APIs yet, so we could drop in a
replacement store such as Cassandra if we decide we don't want to run
the memcached servers.
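
By "no advanced APIs" I mean the whole surface we depend on is roughly
this (the class names are made up, not our actual code):

    import memcache

    class CacheStore:
        """The minimal contract our current usage relies on - anything
        providing get/set/delete could be dropped in behind it."""

        def get(self, key):
            raise NotImplementedError

        def set(self, key, value, expiry=0):
            raise NotImplementedError

        def delete(self, key):
            raise NotImplementedError

    class MemcachedStore(CacheStore):
        def __init__(self, servers):
            self._client = memcache.Client(servers)

        def get(self, key):
            return self._client.get(key)

        def set(self, key, value, expiry=0):
            self._client.set(key, value, time=expiry)

        def delete(self, key):
            self._client.delete(key)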

>> sessions - seems a decent fit. I'm not sure if the existing setup is a
>> problem that needs solving though.
>
> We wanted to pick some good sized things to think through, I think
> sessions are working ok right now.
>
>> memcache - Using memcache is essentially free because of its
>> limitations. I don't think Cassandra is a suitable replacement for our
>> current volatile-data-only usage of memcache. There have been some
>> things we decided memcache was not suitable that Cassandra could be a
>> better fit for.
>
> Memcache isn't totally free - we spend /seconds/ chatting to it on
> some pages. I need to do some rigorous testing, but the data I have so
> far is that bugtask:+index is slowed down more than it is sped up by
> memcache.

Yes, but memcached represents the fastest possible approach to a
shared cache. For all our memcached usages, Cassandra will be slower.
Cassandra still might be fast enough, so if we are using it elsewhere
we can replace memcached with it and drop the extra moving part.

If we see requests talking to memcached for seconds, it would be one of:
  - detecting a dead memcached server, which I don't think happens
unless the sysadmins are restarting servers
  - pickle time (str is just spit out as bytes, but unicode needs to
be encoded for transport)
  - big data. We need instrumentation to see if we should ignore
requests to store big data in the cache.
  - network glitches (I doubt these are happening)

I've got a bug open somewhere describing how we can instrument things
to determine how much time we save using memcache: by recording how
long it took to generate the content originally, we can calculate the
total time saved across all the hits (hopefully this is non-zero ;) ).
Something for the page performance report, since it involves dumping
the stats to the zserver tracelog and summarizing and cross-referencing
the results.
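
Roughly the shape of it (a sketch only - the key layout, log format
and tracelog object are made up, not what the bug actually proposes):

    import time
    import memcache

    mc = memcache.Client(['memcached1:11211'])

    def cached_render(key, render, tracelog, expiry=300):
        """Return cached content, crediting each hit with the render
        time it saved.

        The generation cost is stored alongside the content so the
        tracelog can be summarized later into total time saved.
        """
        hit = mc.get(key)
        if hit is not None:
            content, generation_seconds = hit
            tracelog.write('memcache-hit %s saved=%.3fs\n'
                           % (key, generation_seconds))
            return content
        start = time.time()
        content = render()
        generation_seconds = time.time() - start
        mc.set(key, (content, generation_seconds), time=expiry)
        tracelog.write('memcache-miss %s cost=%.3fs\n'
                       % (key, generation_seconds))
        return content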

>> Is it suitable for replacing the bulk of the Librarian?
>
> We could, OTOH there is a SAN coming which will give our current
> implementation legs.

Replication of the data would be nice too. I'm not sure how often the
Librarian data is being synced to the backup servers, but we certainly
have a data loss window at the moment.


>> Disaster recovery will be an issue. We need things in place before we
>> put any data we care about into it.

> Secondly, backups - there is a near-instant backup facility in
> Cassandra per-node: it takes a hardlink of the current data, and you
> back that up at your leisure.

This sounds useful for building the staging environment using our
existing techniques.


> Lastly, replicating into a new cluster would be something like hadoop
> - a massive map across everything. I'm going to write to mdennis to
> explore this a little more for the staging and qastaging questions.

Not sure how this would work.

I wonder if there are any time machine facilities around or planned? If
old records were not thrown out immediately, the Cassandra model looks
like it could cope quite well with requests for data at a particular
point in time.
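
For example (a sketch only, assuming the pycassa client and a column
family whose comparator sorts column names chronologically - the
keyspace and column family names are made up, and none of this exists
in Launchpad today):

    import pycassa

    pool = pycassa.ConnectionPool('Launchpad', ['cassandra1:9160'])
    history = pycassa.ColumnFamily(pool, 'BugHistory')

    # Each revision is written as a new column named by its timestamp,
    # so old records are kept rather than overwritten.
    history.insert('bug-1', {'2010-11-22T07:33:00': 'serialized record'})

    # "As of time T" is then a reversed slice starting at T: the single
    # newest column at or before that point in time.
    row = history.get('bug-1', column_start='2010-11-23T00:00:00',
                      column_reversed=True, column_count=1)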

>> Staging and qa systems will be interesting. I'm not sure how things
>> could be integrated. I guess we would need to build a staging
>> cassandra database from a snapshot taken after the PostgreSQL dump was
>> taken, with missing data being ok because of 'eventually consistent'.
>
> That's my understanding.

As long as we are aware that the window might be 11 hours (at least
until we move our bulk data out of PG and the backups complete quickly
again).


-- 
Stuart Bishop <stuart@xxxxxxxxxxxxxxxx>
http://www.stuartbishop.net/


