
fuel-dev team mailing list archive

Re: Stop openstack patching feature

 

Hi.

We have to use additional states for clouds under patching/rollback, mostly for the following reasons:

1. If something goes wrong (patching interrupted or failed, orchestrator 'killed in action', some nodes went offline), the orchestrator eventually has to be told that the cloud is in a *dirty* state, i.e. half-patched and barely operational. Additional business logic, if any, could also track this *dirty* state to help cloud operators perform recovery or rollback actions.

2. If something goes wrong completely (recovery or rollback failed), the orchestrator at least has to be told that it should *never* retry patching/rollback actions for the affected cloud because of its *broken* state, just to prevent things from getting even worse; otherwise, the results would be unpredictable. That is what the *fatal* state could be useful for. What should be done afterwards is another question, and it might be up to the operator alone to decide. Perhaps only a manual fallback to backups, with partial data loss, could help here.
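
To make this more concrete, here is a minimal sketch of how such states could be modelled; the state names and helpers below are illustrative assumptions, not existing Nailgun code:

    # Illustrative sketch of the proposed extra cluster states; all names are
    # assumptions, not taken from the Nailgun code base.
    RECOVERABLE_STATES = ('operational', 'dirty')

    def can_start_patching(cluster):
        """Refuse to (re)start patching/rollback for clusters beyond repair."""
        if cluster.state == 'fatal':
            # Never retry automatically; only a manual fallback to backups
            # (with possible partial data loss) can help here.
            return False
        # A 'dirty' (half-patched) cluster may still be recovered or rolled back.
        return cluster.state in RECOVERABLE_STATES

    def on_patching_failure(cluster, recovery_failed=False):
        # Half-patched and barely operational -> 'dirty';
        # a failed recovery/rollback on top of that -> 'fatal'.
        cluster.state = 'fatal' if recovery_failed else 'dirty'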




Regards,

Bogdan Dobrelya.






From: Evgeniy L
Sent: Thursday, September 11, 2014 2:02 PM
To: Mike Scherbakov
Cc: Igor Kalnitsky, fuel-dev





Hi,



>> Also, let's think and work on possible failures. What if the Fuel Master node goes off during patching? What is going to be affected? How can we complete patching when the Fuel Master comes back online?




The question can be summarised as "What if you kill the orchestrator during the deployment?" In that case the user will get a hung progress bar in the UI until he removes the task from Nailgun, and I'm not sure whether after that he would be able to continue the deployment without additional changes in the DB.

Actually, the same question applies not only to patching but to every task we run via the orchestrator. The reason for this is our architecture: the orchestrator was designed as a worker without persistent state, but you need to keep the state somewhere in order to complete a task after a failure. As far as I understand, Mistral can help us with this issue.
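
Just to illustrate what "keep the state somewhere" could mean in practice, a rough sketch (the checkpoint file and field names are made up for illustration; this is not how Astute or Mistral work today):

    import json

    CHECKPOINT = '/var/lib/orchestrator/patching_checkpoint.json'

    def save_checkpoint(task_id, done, pending):
        # Persist progress after every node so a restarted worker can resume.
        with open(CHECKPOINT, 'w') as f:
            json.dump({'task': task_id, 'done': done, 'pending': pending}, f)

    def resume_patching(run_on_node):
        # On worker restart: re-run only the nodes that were not finished yet,
        # instead of leaving the task hung in nailgun.
        with open(CHECKPOINT) as f:
            state = json.load(f)
        while state['pending']:
            node = state['pending'][0]
            run_on_node(node)
            state['done'].append(node)
            state['pending'] = state['pending'][1:]
            save_checkpoint(state['task'], state['done'], state['pending'])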




>> Or a compute node under patching breaks for some reason (e.g. disk or memory issues) - how would it affect the patching process? How can we safely continue patching the other nodes?




Here is how it works now (Vladimir Sharshov, correct me if I'm wrong). We use the same strategy as for deployment:

- Error during primary-controller patching: fail the whole patching process
- Error during patching of other roles: continue the patching process




And I'm not sure whether the current strategy is right or wrong. On the one hand, we shouldn't leave the user's env in a half-patched state. On the other hand, we can break the user's whole cluster if we ignore the fact that several computes died during the patching procedure.
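
To show one possible middle ground between those two extremes, a sketch (not the actual Astute logic; the role check and the failure threshold are assumptions):

    def handle_node_failure(node, cluster):
        if 'primary-controller' in node.roles:
            # A broken primary controller makes the whole env unusable -> abort.
            cluster.abort_patching()
        else:
            # Keep going, but track failures so we don't silently lose
            # a large part of the compute capacity.
            cluster.failed_nodes.append(node)
            if len(cluster.failed_nodes) > cluster.max_tolerated_failures:
                cluster.abort_patching()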





Thanks,






On Tue, Sep 9, 2014 at 12:15 PM, Mike Scherbakov <mscherbakov@xxxxxxxxxxxx> wrote:


Folks,
I was the one who initially requested this. I thought it was going to be pretty similar to Stop Deployment. It becomes obvious that it is not.




I'm fine if we have it in the API. Though I think what is much more important here is the ability for the user to choose a few hosts for patching first, in order to check how patching would work on a very small part of the cluster. Ideally we would even move workloads to other nodes before doing patching. We should definitely disable scheduling of workloads onto these experimental hosts.

Then the user can run patching against these nodes and see how it goes. If all goes fine, patching can be applied to the rest of the environment. I do not think, though, that we should do all of them, let's say 100 nodes, at once. This sounds dangerous to me. I think we would need to come up with some less dangerous scenario.
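
For example, disabling scheduling on the "canary" hosts before patching them could be as simple as the sketch below (assuming python-novaclient and admin credentials; the host names and endpoint are placeholders):

    from novaclient import client

    # Placeholders: real credentials/endpoint would come from the environment.
    nova = client.Client('2', 'admin', 'secret', 'admin',
                         'http://keystone.example.com:5000/v2.0')

    canary_hosts = ['node-1', 'node-2']   # the small subset patched first
    for host in canary_hosts:
        # No new workloads will be scheduled onto these hosts.
        nova.services.disable(host, 'nova-compute')
        # (Live-)migrating existing workloads away would be a separate step.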




Also, let's think and work on possible failures. What if the Fuel Master node goes off during patching? What is going to be affected? How can we complete patching when the Fuel Master comes back online?




Or a compute node under patching breaks for some reason (e.g. disk or memory issues) - how would it affect the patching process? How can we safely continue patching the other nodes?




Thanks,





On Tue, Sep 9, 2014 at 12:08 PM, Vladimir Kuklin <vkuklin@xxxxxxxxxxxx> wrote:


Sorry again. Please look two messages below.


On Sep 9, 2014, at 12:06, "Vladimir Kuklin" <vkuklin@xxxxxxxxxxxx> wrote:


Sorry, hit reply instead of reply-all.

On Sep 9, 2014, at 12:05, "Vladimir Kuklin" <vkuklin@xxxxxxxxxxxx> wrote:


+1

Also, I think we should add stop patching at least to the API, in order to allow advanced users and the service team to do what they want.

On Sep 9, 2014, at 12:02, "Igor Kalnitsky" <ikalnitsky@xxxxxxxxxxxx> wrote:



What should we do with the nodes in case patching is interrupted? I think
we need to mark them for re-deployment, since the nodes' state may be
broken.

Any opinion?

- Igor

On Mon, Sep 8, 2014 at 3:28 PM, Evgeniy L <eli@xxxxxxxxxxxx> wrote:
> Hi,
>
> We were working on the implementation of an experimental feature
> where the user could interrupt the openstack patching procedure [1].
>
> It's not as easy to implement as we thought it would be.
> The current stop deployment mechanism [2] stops puppet, erases
> the nodes and reboots them into bootstrap. That's ok for stop
> deployment, but it's not ok for patching, because the user
> can lose his data. We can rewrite some logic in nailgun
> and in the orchestrator to stop puppet and not erase the nodes.
> But I'm not sure whether it would work correctly, because such a
> use case wasn't tested. And I can foresee problems like cleaning
> up yum/apt-get locks after a puppet interruption.
>
> As a result I have several questions:
> 1. Should we try to make it work for the current release?
> 2. If we shouldn't, will we need this feature for future
>    releases? Additional design and research is definitely
>    required.
>
> [1] https://bugs.launchpad.net/fuel/+bug/1364907
> [2]
> https://github.com/stackforge/fuel-astute/blob/b622d9b36dbdd1e03b282b9ee5b7435ba649e711/lib/astute/server/dispatcher.rb#L163-L164









-- 

Mike Scherbakov
#mihgen



--
Mailing list: https://launchpad.net/~fuel-dev
Post to     : fuel-dev@xxxxxxxxxxxxxxxxxxx
Unsubscribe : https://launchpad.net/~fuel-dev
More help   : https://help.launchpad.net/ListHelp
