
launchpad-dev team mailing list archive

Re: micro services: HTTP authentication in the datacentre and default protocol.

 

On Tue, Jun 7, 2011 at 11:18 PM, John Arbash Meinel
<john@xxxxxxxxxxxxxxxxx> wrote:
> ...
>> It's doable, but AFAIK:
>>  - none of the Canonical deployments have this aspect live
>>  - it's susceptible to split-brain failure
>>
>> So I think we'd need to invest considerably more resources to get a
>> resilient HA rabbit. We may want to do that in the medium term, but
>> /many/ of our initial use cases for rabbit are primarily event
>> raising. So I think we can get some early benefit, and make per-case
>> risk assessments for use of its persistence features in the short
>> term.
>>
>> Anecdata: Twitter, who run Kestrel as their queueing system, simply
>> design their code to deal gracefully with a queue server going AWOL
>> (be that crash, boom, whatever).
>>
>> -Rob
>
> How much of HA is because you expect Rabbit to die, and how much of HA
> is because you want a way to deploy without taking down the whole
> system? Clustering seems like it would handle the second case. One
> node's queue is temporarily offline until it is brought back up, but the
> other nodes keep serving. And if you stop accepting new entries while
> you are shutting down, then you never have any messages delayed.

Rabbit never runs active-active, so you can't keep serving while one
node is down: you have to fail over, which means degraded service (at
best) during the failover process (several seconds at least, from what
I can tell).
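To make that concrete, here is a minimal sketch of the best-effort,
event-raising style of use described above. It assumes Python with the
pika AMQP client; the queue name, host default, and function are
hypothetical, not anything in Launchpad. The point is just that a
publish failure (broker crashed, unreachable, or mid-failover) is
logged and swallowed rather than failing the caller:

    # Best-effort event publisher: a sketch only, assuming the pika AMQP
    # client and an illustrative queue name; none of this is Launchpad code.
    import logging

    import pika
    from pika.exceptions import AMQPError

    log = logging.getLogger(__name__)

    QUEUE = "launchpad.events"  # hypothetical queue name


    def publish_event(body, host="localhost"):
        """Publish one event; swallow broker outages instead of failing the caller."""
        try:
            connection = pika.BlockingConnection(pika.ConnectionParameters(host=host))
            try:
                channel = connection.channel()
                channel.queue_declare(queue=QUEUE)
                channel.basic_publish(exchange="", routing_key=QUEUE, body=body)
                return True
            finally:
                connection.close()
        except AMQPError:
            # Broker is down or failing over: drop the event and let the main
            # request carry on (the per-case risk assessment mentioned above).
            log.warning("event not published; broker unavailable")
            return False

For the cases that genuinely need delivery guarantees rather than
fire-and-forget events, you'd want durable queues, publisher confirms
and a real HA story instead of this pattern, which is exactly the
investment trade-off above.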

> If it is that you want to plan for Rabbit (or the machine it is running
> on) to fail non-deterministically, then certainly you need different
> security guarantees.
>
> However, isn't the current Postgres master a "if it goes down we all go
> down for a while" setup? Isn't that machine pretty reliable overall? (It
> certainly also suffers from "we can't softly shut down for upgrades",
> but it seems like the non-deterministic failures are pretty reasonable.)

I would like to fix the PostgreSQL one too. At the moment, because of
its design around clustering and schema changes, the way we work with
it is to change things once a month, which adds latency to feature and
performance work: we're *just now* landing a change that could have
been out there for 3 weeks if we didn't have a 4-week cycle.

PostgreSQL having defects in this area isn't a reason to bring similar
defects into new components :)

-Rob

