
launchpad-dev team mailing list archive

modelling our queuing needs - HA analysis

 

So we kind of got into an analysis-paralysis situation with Rabbit
and persistence. Specifically, we don't know how much HA we need, or
when we need it. That makes it hard to say whether Rabbit is or isn't
HA enough for us.

http://zguide.zeromq.org/page:all is well worth reading - irrespective
of our choices, it's got a very well-reasoned treatment of many issues
and antipatterns that can come up.

I'm borrowing from it quite liberally in putting this model together.

Firstly, let's define some forms of queue use we'd like to have - I'm
going to define three:
pubsub: publisher knows nothing about the subscribers, just fires off
events (there's a small sketch of this shape after these definitions).
We might want durable subscriptions (where a disconnected subscriber
can catch up later). All subscribers get all messages. Subscribers
know what they are subscribed to (but there could be many publishers
connected to the same subscribers).

rpc: send a message, get a reply back. We want N possible senders and
M possible replying workers, with the reply from a worker going back
to the sender automatically.

dispatch: we have work we want to hand off, and we care that the work
actually gets started / handed off.
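
To make the pubsub shape concrete, here's a minimal sketch in the
style of the zguide above (pyzmq, with invented endpoint and topic
names - purely illustrative, not a proposal for which transport we
end up on):

    import zmq

    def publisher():
        # Fires off events; knows nothing about who is subscribed.
        ctx = zmq.Context()
        pub = ctx.socket(zmq.PUB)
        pub.bind("tcp://*:5556")                   # invented endpoint
        pub.send_multipart([b"bug.modified", b"12345"])

    def subscriber():
        # Knows what it is subscribed to; sees every matching message
        # published while connected (no durability in this form).
        ctx = zmq.Context()
        sub = ctx.socket(zmq.SUB)
        sub.connect("tcp://some-publisher:5556")   # invented endpoint
        sub.setsockopt(zmq.SUBSCRIBE, b"bug.")
        return sub.recv_multipart()                # [topic, body]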

Now failure modes:
sender failure: the sender fails before the message is sent
transient worker failure: a worker fails after receiving a message,
but the failure isn't permanent - e.g. the server had a power failure,
or a process crashed.
permanent worker failure: a worker fails after receiving a message,
and cannot be brought back.
connectivity failure: the path between the sender and worker is down
(e.g. network failure between our two primary datacentres, or a
message broker is down).


rpc is the easiest use case to analyze.
In an rpc model, what we want is to make asynchronous function calls
to other systems for parallelism / consolidation. In this scenario
we're typically going to be within one web request: we need extremely
low latency (ms scale) responses.
For a sender failure, the worker should just discard its reply.
For a worker failure, we can fail the web request.
Ditto connectivity failures.
-> we don't need HA at the scale of single messages
However, to keep the web app working and available, we do need HA at
the scale of workers, senders and connectivity.
For workers: If a worker dies, work needs to be dispatched to other
workers immediately.
For senders: Workers could block for a few seconds while detecting
that a sender they were replying to has gone AWOL / had connectivity
issues.
For connectivity: We need to deal with a connectivity issue
immediately: as long as there is *a* working path to a worker, work
should be dispatched, and the delay in detecting a problem needs to be
low (e.g. < 100ms) - we have a goal of 1s for 99% of requests, and
100ms is 10% of that budget.
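
A sketch of what that looks like from the sender's side - pyzmq
again, with an invented endpoint and message, and the 100ms number
simply taken from the budget above:

    import zmq

    ctx = zmq.Context()
    req = ctx.socket(zmq.REQ)
    req.setsockopt(zmq.LINGER, 0)            # don't hang around on close
    req.connect("tcp://rpc-workers:5560")    # invented endpoint
    req.send(b"render fragment for bug 12345")

    poller = zmq.Poller()
    poller.register(req, zmq.POLLIN)
    if poller.poll(100):                     # 100ms = 10% of the 1s budget
        reply = req.recv()
    else:
        # Worker or path is gone: no per-message HA, just fail the
        # web request quickly and clearly instead of waiting.
        req.close()
        raise RuntimeError("rpc reply did not arrive within budget")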

nondurable pubsub is also fairly easy to analyze.
For pubsub we just announce things happening. Some typical uses:
 - an object cache that listens for all modified objects - it's ok to
miss all messages from before the cache starts up, but once running it
wants to get everything.
 - oops report gathering
 - other cache regeneration (such as bugsummary, teamparticipation) -
these are things where a new cache could start empty and rescan to
build up state, and use the notification feed to catch up.
Now, if we use something like this for keeping caches up to date, we
either need time-based heuristics, or complete confidence that all
messages will get through [eventually].
Sender failures would be ok if we use 2pc with the database, or if we
don't include content in the messages (e.g. if we only say 'obj foo
has changed, reread it').
Worker failures would be ok - the workers are designed to start over each time.
Connectivity failures - a connectivity failure would be
indistinguishable from a sender failure if and only if it causes the
send to error. But senders don't know if there are workers; if there
is a broker, the sender could detect the broker's presence; if there
isn't a broker, then spooling locally might be ok if we are confident
that the spool can be recovered.
I suspect we can tolerate a delay on the order of hours to recover
such a stored spool.
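
A sketch of what a subscriber in this style looks like - the cache
and db objects and the message format are placeholders; the point is
just "rescan on startup, notification-only messages afterwards":

    import json

    def on_startup(cache, db):
        # A new cache starts empty and rescans authoritative state
        # from the database, so messages published before startup can
        # safely be missed.
        cache.rebuild_from(db)

    def on_message(cache, db, raw_message):
        # Notification-only: the message says *what* changed, never
        # what the new value is, so the database stays authoritative
        # and a lost or duplicated message at worst costs a re-read.
        event = json.loads(raw_message)   # e.g. {"changed": "bug:12345"}
        cache.invalidate(event["changed"])
        cache.load(event["changed"], db)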

For durable pubsub we'd be adding a significant reliability increase:
we'd be requiring that every message get delivered. Uses for this are
more tightly cohesive caches that never regenerate.
It is, I think, equivalent in needs to the dispatch case.

For the work dispatching model:
 - we could model the work queue in PostgreSQL and just dispatch
triggers (sketched just below). If we do this then the trigger can be
hugely lossy and we won't care.
 - -or- we depend on the messages being received and processed - our
data model's correctness depends on the dispatched work happening.
Let's assume this approach.
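
The first variant looks roughly like this - the table, channel and
connection details are all invented, and it's only here to show why
the trigger can be so lossy:

    import psycopg2

    conn = psycopg2.connect("dbname=launchpad_dev")   # invented DSN

    def enqueue(payload):
        cur = conn.cursor()
        # The queue row commits atomically with the data change that
        # created the work, so the queue never disagrees with the DB.
        cur.execute("INSERT INTO job_queue (payload) VALUES (%s)",
                    (payload,))
        # Purely a wake-up nudge for idle workers.
        cur.execute("NOTIFY job_queue_changed")
        conn.commit()

    def claim_one():
        cur = conn.cursor()
        cur.execute(
            "UPDATE job_queue SET claimed_at = now()"
            " WHERE claimed_at IS NULL"
            "   AND id = (SELECT min(id) FROM job_queue"
            "              WHERE claimed_at IS NULL)"
            " RETURNING id, payload")
        row = cur.fetchone()                          # None when idle
        conn.commit()
        return row

Losing the NOTIFY only costs latency - a worker picks the row up on
its next poll. Back to the second approach, where the messages
themselves matter: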
For a transient worker failure, we'd be ok - the work will happen when
the worker comes back.
For a permanent worker failure, we'd need to redispatch to a different worker.
For a sender failure, we'd need the worker to not get the message -
essentially 2pc.
For a connectivity failure, the 2pc would protect us against work
happening wrongly. However, we would have to be able to dispatch the
work, or else we'd block the related DB transaction (due to 2pc).
-> We'd need similarly fast (100ms) detection of the right route when
dealing with a connectivity failure.
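
If full 2pc proves too heavy, the alternative mentioned in the
summary below - idempotent operations plus cross-checks with the DB -
looks roughly like this (every name in the sketch is a placeholder):

    # Stand-ins: 'completed_jobs' plays the role of completion state
    # recorded in the database alongside the work's own changes.
    completed_jobs = set()

    def do_the_work(job_id):
        pass                             # placeholder for the real job

    def handle(message, ack):
        # 'message' only identifies the work; 'ack' is whichever
        # acknowledge primitive the transport provides.
        job_id = message["job_id"]
        if job_id in completed_jobs:     # cross-check: already done?
            ack()                        # duplicate delivery is harmless
            return
        do_the_work(job_id)
        completed_jobs.add(job_id)       # record completion with the work
        ack()                            # ack only once the work is durable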

Summary:
 - depending on what we're doing, our needs vary
 - we'll want (if not immediately, then in the medium term) fast (<
100ms) detection of routes to get messages to the right place
 - for some uses we want very reliable message store-and-forward with
two phase commit (or idempotent operations + cross-checks with the DB
to detect failed transactions)
 - many uses will not need persistence, but will need precision: work
reliably or fail clearly and quickly.

-Rob

