fuel-dev team mailing list archive

Re: Stop openstack patching feature

 

Hi,

>> Also, let's think and work on possible failures. What if the Fuel Master
node goes off during patching? What is going to be affected? How can we
complete patching when the Fuel Master comes back online?

The question can be summarised as "What if you kill the orchestrator during
the deployment?"
In this case the user will get a hung progress bar in the UI until he removes
the task from nailgun.
And I'm not sure that after that he will be able to continue the deployment
without additional changes in the DB.
Actually the same question applies not only to patching, but to every task
which we run under the orchestrator.
The reason for this is our architecture: the orchestrator was designed as a
worker without persistent state.
But you need to keep the state somewhere in order to complete a task after a
failure.
As far as I understand, Mistral can help us with this issue.
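
To illustrate the kind of persistent state I mean, here is a minimal sketch
(the file name and functions are made up for the example, this is not the
real nailgun/astute code): record per-node progress outside the orchestrator
process, so a restarted orchestrator could pick up only the nodes that are
not yet done.

# Illustrative only: a made-up checkpoint file standing in for state that
# would really live in the nailgun DB; none of this is existing Fuel code.
# The idea: persist per-node patching progress outside the orchestrator
# process, so that after the Fuel Master comes back we can resume the task
# instead of leaving a hung progress bar behind.

import json
import os

CHECKPOINT = "patching_task_state.json"  # hypothetical location

def load_state(node_ids):
    """Return {node_id: status}, starting everything as 'pending' on first run."""
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT) as f:
            return json.load(f)
    return {str(n): "pending" for n in node_ids}

def save_state(state):
    with open(CHECKPOINT, "w") as f:
        json.dump(state, f)

def patch_nodes(node_ids, patch_one):
    """patch_one(node_id) does the actual work (puppet run, etc.)."""
    state = load_state(node_ids)
    for node in node_ids:
        if state.get(str(node)) == "done":
            continue                 # already patched before the failure
        state[str(node)] = "in_progress"
        save_state(state)            # checkpoint before touching the node
        patch_one(node)
        state[str(node)] = "done"
        save_state(state)            # checkpoint after success

As far as I understand, this kind of bookkeeping (plus retries) is roughly
what Mistral would give us out of the box, which is why it looks attractive
here.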

>> Or a compute node under patching breaks for some reason (e.g. disk issues
or memory), how would it affect the patching process? How can we safely
continue patching of other nodes?

Here is how it works now (Vladimir Sharshov, correct me if I'm wrong):
we use the same strategy as for deployment (roughly the logic sketched below).

Error during primary-controller patching - fail the whole patching process
Error during patching of any other role - continue the patching process

And I'm not sure whether the current strategy is right or wrong.
On the one hand, we shouldn't leave the user's env in a half-patched state.
On the other hand, we can break the user's whole cluster because we ignore
the fact that several computes died during the patching procedure.
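
To make the trade-off more concrete, here is a rough sketch of the policy
described above (the function, role names and the threshold constant are
invented for the example, this is not the actual astute code); it also shows
where a "maximum failed computes" limit could plug in if we decide that
silently continuing is too dangerous:

# Illustrative failure policy, not the real astute code.

CRITICAL_ROLES = {"primary-controller"}   # failure here aborts everything
MAX_FAILED_COMPUTES = 2                   # hypothetical limit, not in Fuel today

def handle_node_failure(failed_node, failed_counts):
    """Decide whether patching should go on after one node failed.

    failed_node:   dict with at least a 'role' key
    failed_counts: running tally of failures per role (mutated in place)
    """
    role = failed_node["role"]
    failed_counts[role] = failed_counts.get(role, 0) + 1

    if role in CRITICAL_ROLES:
        return "abort"     # current behaviour: primary controller is fatal

    if role == "compute" and failed_counts[role] > MAX_FAILED_COMPUTES:
        return "abort"     # possible extension: too many dead computes

    return "continue"      # current behaviour for every other role

With the threshold effectively set to infinity, this degenerates into exactly
what we do now.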

Thanks,


On Tue, Sep 9, 2014 at 12:15 PM, Mike Scherbakov <mscherbakov@xxxxxxxxxxxx>
wrote:

> Folks,
> I was the one who initially requested this. I thought it was going to be
> pretty similar to Stop Deployment. It becomes obvious that it is not.
>
> I'm fine if we have it in the API. Though I think what is much more important
> here is an ability for the user to choose a few hosts for patching first,
> in order to check how patching would work on a very small part of the
> cluster. Ideally we would even move workloads to other nodes before doing
> patching. We should certainly disable scheduling of workloads to these
> experimental hosts.
> Then the user can run patching against these nodes, and see how it goes. If
> all goes fine, patching can be applied to the rest of the environment. I do
> not think, though, that we should do all, let's say 100 nodes, at once. This
> sounds dangerous to me. I think we would need to come up with some less
> dangerous scenario.
>
> Also, let's think and work on possible failures. What if the Fuel Master node
> goes off during patching? What is going to be affected? How can we complete
> patching when the Fuel Master comes back online?
>
> Or a compute node under patching breaks for some reason (e.g. disk issues or
> memory), how would it affect the patching process? How can we safely
> continue patching of other nodes?
>
> Thanks,
>
> On Tue, Sep 9, 2014 at 12:08 PM, Vladimir Kuklin <vkuklin@xxxxxxxxxxxx>
> wrote:
>
>> Sorry again. Please look two messages below.
>> On Sep 9, 2014, at 12:06, "Vladimir Kuklin" <
>> vkuklin@xxxxxxxxxxxx> wrote:
>>
>>> Sorry, hit reply instead of reply-all.
>>> On Sep 9, 2014, at 12:05, "Vladimir Kuklin" <
>>> vkuklin@xxxxxxxxxxxx> wrote:
>>>
>>>> +1
>>>>
>>>> Also, I think we should add stop patching at least to the API in order to
>>>> allow advanced users and the service team to do what they want.
>>>> On Sep 9, 2014, at 12:02, "Igor Kalnitsky" <
>>>> ikalnitsky@xxxxxxxxxxxx> wrote:
>>>>
>>>>> What should we do with nodes in case patching is interrupted? I think
>>>>> we need to mark them for re-deployment, since the nodes' state may be
>>>>> broken.
>>>>>
>>>>> Any opinion?
>>>>>
>>>>> - Igor
>>>>>
>>>>> On Mon, Sep 8, 2014 at 3:28 PM, Evgeniy L <eli@xxxxxxxxxxxx> wrote:
>>>>> > Hi,
>>>>> >
>>>>> > We were working on the implementation of an experimental feature
>>>>> > where the user could interrupt the openstack patching procedure [1].
>>>>> >
>>>>> > It's not as easy to implement as we thought it would be.
>>>>> > The current stop deployment mechanism [2] stops puppet, erases
>>>>> > the nodes and reboots them into bootstrap. That's ok for stop
>>>>> > deployment, but it's not ok for patching, because the user
>>>>> > can lose his data. We can rewrite some logic in nailgun
>>>>> > and in the orchestrator to stop puppet and not erase the nodes.
>>>>> > But I'm not sure it would work correctly because such a use
>>>>> > case wasn't tested. And I can see problems like cleaning up
>>>>> > yum/apt-get locks after a puppet interruption.
>>>>> >
>>>>> > As a result I have several questions:
>>>>> > 1. Should we try to make it work for the current release?
>>>>> > 2. If we shouldn't, will we need this feature for future
>>>>> >     releases? Additional design and research is definitely
>>>>> >     required.
>>>>> >
>>>>> > [1] https://bugs.launchpad.net/fuel/+bug/1364907
>>>>> > [2]
>>>>> >
>>>>> https://github.com/stackforge/fuel-astute/blob/b622d9b36dbdd1e03b282b9ee5b7435ba649e711/lib/astute/server/dispatcher.rb#L163-L164
>>>>> >
>>>>> >
>>>>>
>>>>
>>
>
>
> --
> Mike Scherbakov
> #mihgen
>
>
>
