
launchpad-dev team mailing list archive

Re: micro services: HTTP authentication in the datacentre and default protocol.


On Wed, Jun 8, 2011 at 7:30 PM, John Arbash Meinel
<john@xxxxxxxxxxxxxxxxx> wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
>
> ...
>> I would like to fix the postgresql one too; at the moment the way we
>> work with it - due to its design around clustering and schema changes
>> - is to change things once a month, which drives latency for feature
>> work and performance work - we're *just now* landing a change we could
>> have had out there for 3 weeks, if we didn't have a 4 week cycle.
>>
>> Postgresql having defects in this area isn't a reason to bring in
>> other like defects in new components :)
>>
>> -Rob
>
> Absolutely. But my point is that postgres fails on both accounts. If the
> master dies you're screwed, but you also can't stop one machine to
> upgrade while the other keeps churning.

Right. And if we were evaluating DBs today, we would be having a
discussion around precisely this point. There's no guarantee that we
wouldn't instead go for e.g. drizzle with NDB.

> It sounds like Rabbit suffers from the same problem. Though it also
> sounds like a 3s downtime wouldn't be nearly the problem a 5-min
> downtime would be. (and much less than a 90min downtime window.)

3s is 3 times the target window for 99% of requests; it's over half the
total time budget new pages will be allotted.

I would be less concerned with a 500ms failover (entire end->end
event), but would prefer 100ms or so. That's still 10% of our target
request time.

> I didn't know Rabbit particularly well. And I agree you don't want to
> add more bad.
>
> However, if you have 2 Rabbits in active-passive. You stop the second
> one to upgrade it, then you do a 5s downtime to switch, and upgrade the
> first. (The old passive has become the new active). Is the issue that
> you have a complete gap? Is it possible to haproxy this (some sort of
> proxy that would queue up requests for the 5s necessary to switch over,
> without killing them).

That sounds like a great deal of complexity vs just accepting that
rabbit can fail and lose its current queue.

As for queuing requests up, yes, I think we could do that, but hell -
we have HA http services trivially, and if we backend a queue onto
e.g. cassandra we'd have (modulo split brain concerns) a truly HA
queue.

Or telehash, or even onto a DHT directly.
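To make the shape of that idea concrete, here is a minimal sketch of the
enqueue/claim/ack semantics such a queue would expose. The class name and
in-memory dicts are purely illustrative assumptions - in practice each
write would go to a replicated store (e.g. cassandra), which is what would
make the queue HA:

```python
import uuid

class HAQueueSketch:
    """Hypothetical sketch of a queue with at-least-once semantics.
    In-memory here for illustration; a real version would back each
    state change with a write to a replicated store."""

    def __init__(self):
        self.pending = {}   # msg_id -> payload, waiting to be claimed
        self.claimed = {}   # msg_id -> payload, dispatched but not yet acked

    def enqueue(self, payload):
        msg_id = str(uuid.uuid4())
        self.pending[msg_id] = payload   # one replicated write in practice
        return msg_id

    def claim(self):
        # Move a message to 'claimed'. If the worker dies before acking,
        # a reaper could return it to 'pending' after a timeout.
        if not self.pending:
            return None
        msg_id, payload = self.pending.popitem()
        self.claimed[msg_id] = payload
        return msg_id, payload

    def ack(self, msg_id):
        # Worker reports completion; the message is gone for good.
        self.claimed.pop(msg_id, None)
```

The claim/ack split is what gives at-least-once delivery: a message is
only removed once a worker confirms it, so a dead worker means redelivery
rather than loss.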

> Is it that you aren't able to ever create a clean break? (You'll always
> have random new requests that you can't shunt over to the new system,
> because you can't shut down the old system because it is still serving
> the last requests.)

Yes, you'll always have in-flight requests, and so you need to decide
how to handle them. Of particular concern are dispatched work items
which are not idempotent; the queue going away and coming back will
interact badly with a worker needing to report that it handled
something - particularly if the worker fails too after doing the
work...

The persistence side of the design has -long- tendrils. I'm advocating
that we do what we can without persistence - which should be a great
deal of very interesting things.

-Rob
