nova team mailing list archive
Re: rolling upgrades
This looks like a good start, Termie. I would also add the following:
* Handling database schema upgrades. We need a plan for this. :)
* How do we handle partial upgrades? What happens if a failure occurs
between steps 6 and 7 and the Compute Manager does not restart?
* How do we test upgrades? It would be great to get a test
environment and harness in place that specifically tests upgrades and
partial upgrade failures (and hopefully graceful recoveries)
* With the introduction of Microsoft Hyper-V support, we should get
some Windows folks to add a list of considerations we may have
overlooked for Windows hosts
On Mon, Oct 18, 2010 at 5:07 AM, Andy Smith <andyster@xxxxxxxxx> wrote:
> The Problem
> Push Nova upgrades out to an existing cluster with minimal delay without
> dropping requests.
> Specific issues:
> - ComputeManager's run_instance can take quite a while to perform; we don't
> want to pause incoming requests until they have all finished before
> restarting the Manager.
> Rollin Rollin Rollin
> In the general case most of our infrastructure is already rather resilient to
> downtime due to our use of AMQP, but a few things probably need to be added.
> An ideal scenario
> 0. Execute an upgrade command
> 1. New code is fetched and installed (apt-get upgrade)
> 2. Send a SIGTERM to ComputeManager process
> 3. ComputeManager stops ACKing requests from the queue
> 4. ComputeManager SIGTERMs its Worker processes
> 5. Worker processes stop ACKing requests from the queue (filled only by
> 6. ComputeManager exits.
> 7. Supervisor process automatically restarts it || the command restarts it
> 8. When the worker has no more pending jobs it exits.
> 9. When ComputeManager restarts it fills the Worker pool with new Workers as
> old ones exit.
> 10. As soon as there is a fresh Worker, ComputeManager begins farming work to
> it, starting with anything already queued.
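To make steps 2 through 8 above a bit more concrete, here is a minimal single-process sketch. It is not Nova code: the DrainingManager class and its method names are hypothetical, and a plain in-memory queue.Queue stands in for the AMQP consumer. The only idea it shows is that the SIGTERM handler flips a flag, the consume loop stops taking new messages, and a job that was already taken is always finished.

```python
import queue
import signal
import threading


class DrainingManager:
    """Hypothetical stand-in for a manager that drains on SIGTERM."""

    def __init__(self, incoming):
        self.incoming = incoming            # stands in for the AMQP queue
        self.stopping = threading.Event()   # set once SIGTERM arrives
        self.completed = []                 # jobs finished by this process

    def install_signal_handler(self):
        # signal.signal() may only be called from the main thread.
        signal.signal(signal.SIGTERM, self.handle_sigterm)

    def handle_sigterm(self, signum, frame):
        # Only flip a flag here; the consume loop does the actual stopping.
        self.stopping.set()

    def run_job(self, job):
        # Placeholder for real work such as run_instance.
        self.completed.append(job)

    def consume(self):
        # Keep taking messages until a stop has been requested.
        while not self.stopping.is_set():
            try:
                job = self.incoming.get(timeout=0.1)
            except queue.Empty:
                continue
            self.run_job(job)   # a job already taken is always finished
        # Anything still in self.incoming is left for the restarted manager.
```

A real service would call install_signal_handler() once at startup and then consume(). Messages that were never taken stay on the queue, which is what lets the restarted manager pick them up in step 10.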
> How to get there
> 0. Managers need to listen for SIGTERM and manage it.
> This is straightforward with Python's signal module.
> 1. Managers need access to their queue consumers so that they can stop them.
> This should be a relatively minor change in service.py and manager.py
> 2. Managers need to internally keep track of outstanding async calls.
> A DeferredQueue is probably enough, so that it can delay exiting until the
> queue is exhausted.
> 3. ComputeManager, specifically, needs to have detached Worker instances.
> Forking may have some issues with Twisted so some testing will need to be
> done to verify.
> 4. ComputeManager, specifically, needs to communicate with Worker instances.
> This should be fairly straightforward using AMQP routing and topics.
> 5. ComputeManager, specifically, needs to know how many old workers exist.
> This could be as simple as writing PIDs to disk named with a UUID decided
> upon at manager start (so all the workers started by a given manager will
> have the same ID, which would not match the restarted manager). There is
> probably some other clever Linux hack that will do the same thing.
> 6. It seems that all non-ComputeManager services besides the public API can
> get by with just #0 through #2; upgrading the public API is out of scope
> for this proposal.
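For item 5 in the list above, the PID-file idea could look roughly like the sketch below. Everything in it is an assumption made for illustration: the function names, the "<generation>.<pid>.pid" file naming, and the run directory are hypothetical, not anything that exists in Nova today. Each manager picks a fresh UUID at start, each worker writes a file tagged with its manager's UUID, and the restarted manager counts the files that do not carry its own UUID.

```python
import os
import uuid


def new_generation_id():
    # Chosen once per manager start; hex form contains no dots.
    return uuid.uuid4().hex


def register_worker(run_dir, generation_id, pid):
    # One file per worker, named "<generation>.<pid>.pid".
    path = os.path.join(run_dir, "%s.%d.pid" % (generation_id, pid))
    with open(path, "w") as f:
        f.write(str(pid))
    return path


def old_worker_pids(run_dir, current_generation_id):
    # PIDs of workers started by any manager other than the current one.
    pids = []
    for name in os.listdir(run_dir):
        if not name.endswith(".pid"):
            continue
        generation_id, pid, _ext = name.split(".")
        if generation_id != current_generation_id:
            pids.append(int(pid))
    return sorted(pids)
```

Workers would need to remove their own file on exit. A worker that crashes leaves a stale file behind, so a real implementation would probably also want to check that the PID is still alive before counting it.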
> Bonus: We can minimize the backlog for any given ComputeManager by being able
> to drop its priority in Scheduler before initiating the upgrade.