
nova team mailing list archive

Re: rolling upgrades


This looks like a good start, Termie.  I would also add the following:

* Handling database schema upgrades.  We need a plan for this. :)
* How do we handle partial upgrades?  What happens if a failure occurs
between steps 6 and 7 and the Compute Manager does not restart?
* How do we test upgrades?  It would be great to get a test
environment and harness in place that specifically tests upgrades and
partial upgrade failures (and hopefully graceful recoveries).
* With the introduction of Microsoft Hyper-V support, we should get
some Windows folks to add a list of considerations we may have
overlooked for Windows hosts.
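On the testing point, something along these lines might work as a seed for
the harness: a toy worker that finishes its in-flight job when it gets
SIGTERM, plus a test that kills it mid-job and checks for a clean exit.
All names here are made up for illustration; none of this is Nova code.

```python
import signal
import subprocess
import sys
import tempfile
import textwrap
import time

# Toy stand-in for a Worker: it runs one long "job" in small slices and, on
# SIGTERM, finishes the slice in flight and exits cleanly instead of dying
# mid-job.  The marker file is how the test distinguishes a graceful exit
# from being killed outright.
WORKER = textwrap.dedent("""
    import signal, sys, time

    shutting_down = False

    def on_term(signum, frame):
        global shutting_down
        shutting_down = True

    signal.signal(signal.SIGTERM, on_term)

    for _ in range(50):          # simulate a long run_instance-style job
        time.sleep(0.1)
        if shutting_down:
            break

    with open(sys.argv[1], "w") as f:
        f.write("clean-exit")    # proof we were not killed mid-job
""")


def test_graceful_term():
    marker = tempfile.NamedTemporaryFile(delete=False)
    marker.close()
    proc = subprocess.Popen([sys.executable, "-c", WORKER, marker.name])
    time.sleep(0.5)                      # let the job get underway
    proc.send_signal(signal.SIGTERM)
    proc.wait(timeout=10)
    with open(marker.name) as f:
        assert f.read() == "clean-exit"
    print("graceful shutdown verified")


if __name__ == "__main__":
    test_graceful_term()
```

The same shape extends to the partial-failure case: kill the process with
SIGKILL instead and assert that the marker file was *not* written.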


On Mon, Oct 18, 2010 at 5:07 AM, Andy Smith <andyster@xxxxxxxxx> wrote:
> The Problem
> ===========
> Push Nova upgrades out to an existing cluster with minimal delay without
> dropping requests.
> Specific issues:
> - ComputeManager's run_instance can take quite a while to perform; we don't
>   want to pause incoming requests until they have all finished before
>   upgrading the Manager.
> Rollin Rollin Rollin
> ====================
> In the general case most of our infrastructure is already rather resilient
> to downtime due to our use of AMQP, but a few things probably need to be
> added.
> An ideal scenario
> -----------------
> 0. Execute an upgrade command
> 1. New code is fetched and installed (apt-get upgrade)
> 2. Send a SIGTERM to ComputeManager process
> 3. ComputeManager stops ACKing requests from the queue
> 4. ComputeManager SIGTERMs its Worker processes
> 5. Worker processes stop ACKing requests from the queue (filled only by
>    Manager)
> 6. ComputeManager exits.
> 7. Supervisor process automatically restarts it || the command restarts it
> 8. When the worker has no more pending jobs it exits.
> 9. When ComputeManager restarts it fills the Worker pool with new Workers
>    as the old ones exit.
> 10. As soon as there is a fresh Worker, ComputeManager begins farming work
>     to it, starting with anything already queued.
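The stop-ACKing / drain / restart cycle above could be sketched roughly like
this in plain Python.  The Manager class, `consuming` flag, and `pending`
queue are stand-ins for the real AMQP consumer and outstanding async calls,
not Nova's actual service code:

```python
import os
import queue
import signal

class Manager:
    def __init__(self):
        self.pending = queue.Queue()        # outstanding accepted work
        self.consuming = True               # stands in for the ACK loop
        signal.signal(signal.SIGTERM, self._on_sigterm)

    def _on_sigterm(self, signum, frame):
        self.consuming = False              # step 3: stop ACKing new requests

    def drain_and_exit(self):
        # Steps 5-8: finish everything already accepted, then exit so the
        # supervisor (step 7) can bring us back up on the new code.
        while not self.pending.empty():
            self.pending.get()()
        print("drained, exiting")


if __name__ == "__main__":
    m = Manager()
    for i in range(3):
        m.pending.put(lambda i=i: print("finished job %d" % i))
    os.kill(os.getpid(), signal.SIGTERM)    # simulate the upgrade command
    assert m.consuming is False
    m.drain_and_exit()
```

The key property is that SIGTERM only flips a flag; the in-flight work keeps
running to completion before the process goes away.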
> How to get there
> ----------------
> 0. Managers need to listen for SIGTERM and manage it.
>    This is straightforward with python's signal module.
> 1. Managers need access to their queue consumers so that they can stop them.
>    This should be a relatively minor change in service.py and manager.py
> 2. Managers need to internally keep track of outstanding async calls.
>    A DeferredQueue is probably enough, so that it can delay exiting until
>    the queue is exhausted.
> 3. ComputeManager, specifically, needs to have detached Worker instances.
>    Forking may have some issues with Twisted so some testing will need to be
>    done to verify.
> 4. ComputeManager, specifically, needs to communicate with Worker instances.
>    This should be fairly straightforward using AMQP routing and topics.
> 5. ComputeManager, specifically, needs to know how many old workers exist.
>    This could be as simple as writing PIDs to disk named with a UUID decided
>    upon at manager start (so all the workers started by a given manager will
>    have the same ID, which would not match the restarted manager). There is
>    probably some other clever linux hack that will do the same thing.
> 6. It seems that all non-ComputeManager services besides the public API can
>    get by with just #0 through #2; upgrading the public API is out of scope
>    for this proposal.
> Bonus: We can minimize the backlog for any given ComputeManager by being
>        able to drop its priority in Scheduler before initiating the upgrade.
> Thoughts?
> --andy
> _______________________________________________
> Mailing list: https://launchpad.net/~nova
> Post to     : nova@xxxxxxxxxxxxxxxxxxx
> Unsubscribe : https://launchpad.net/~nova
> More help   : https://help.launchpad.net/ListHelp