
fuel-dev team mailing list archive

Re: Stop deployment concerns

 

+ fuel-dev

We had a meeting on the topic yesterday. Research shows the following.

It would be great to be able to stop deployment at any moment, and then
continue by redeploying only the failed nodes. However:

   - If the network configuration is changed, the environment will not be
   operational after deployment
      - the user may change network CIDRs, and without additional
      functionality in Fuel it is not currently possible to reconfigure
      OpenStack (replace network information in the OpenStack database)
   - If some settings are changed, the result is the same
      - for example, if passwords are changed when controllers are already
      deployed, the computes will get the new information while the
      controllers keep the old

So, we have come to the decision that resetting the whole environment is
essential at the moment. We expect the following workflow:

   1. If it becomes obvious that the deployment will not finish
   successfully, the user goes to the Actions tab and clicks the
   "Reset Environment" button.
   2. The environment changes its status to "Resetting".
   3. All settings on the env become unlocked, and the user is allowed to
   change anything. Settings keep the values they had when the user clicked
   "Deploy".
   4. Resetting the environment implies rebooting all the nodes to the
   bootstrap state. When that is done, the status of the env changes to
   "New", and the "Deploy" button becomes active.
   5. When the user is done with reconfiguration, they click "Deploy". Fuel
   should use the same IP addresses / hostnames as at the time of the initial
   deployment, if no changes are made to networking.
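
To make the workflow above concrete, here is a rough sketch of it as a small
state machine. This is not Fuel/Nailgun code: the class, the method names, the
"network" settings key and the allocate() callback are all hypothetical and
exist only for illustration. It also assumes that reset is triggered while a
deployment is running.

```python
from enum import Enum


class EnvStatus(Enum):
    NEW = "new"
    DEPLOYING = "deploying"
    RESETTING = "resetting"


class Environment:
    """Hypothetical model of the reset workflow; not actual Fuel code."""

    def __init__(self, settings):
        self.status = EnvStatus.NEW
        self.settings = dict(settings)
        self.locked = False
        self.addresses = {}            # node name -> IP from the last deploy
        self._deployed_network = None  # network settings used by that deploy

    def change_setting(self, key, value):
        if self.locked:
            raise RuntimeError("settings are locked while deploying")
        self.settings[key] = value

    def deploy(self, allocate):
        # "Deploy" is only active in the "New" state (step 4).
        if self.status is not EnvStatus.NEW:
            raise RuntimeError("Deploy is only active in the New state")
        network = self.settings.get("network")
        # Step 5: keep the previous addresses unless networking was changed.
        if not self.addresses or network != self._deployed_network:
            self.addresses = allocate(network)
        self._deployed_network = network
        self.locked = True
        self.status = EnvStatus.DEPLOYING

    def reset(self):
        # Steps 1-3: user clicks "Reset Environment"; settings are unlocked
        # and keep the values they had when "Deploy" was clicked.
        if self.status is not EnvStatus.DEPLOYING:
            raise RuntimeError("assumed here: reset only during deployment")
        self.status = EnvStatus.RESETTING
        self.locked = False

    def bootstrap_done(self):
        # Step 4: all nodes have rebooted to the bootstrap state.
        if self.status is not EnvStatus.RESETTING:
            raise RuntimeError("no reset in progress")
        self.status = EnvStatus.NEW
```

With this model, changing only a password between reset() and the next
deploy() leaves the addresses untouched, while changing the "network" value
causes allocate() to be called again.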

Thanks,


On Wed, Nov 20, 2013 at 7:14 PM, Mike Scherbakov
<mscherbakov@xxxxxxxxxxxx> wrote:

> + Evgeniy, Nick
>
>
> On Wed, Nov 20, 2013 at 7:01 PM, David Easter <deaster@xxxxxxxxxxxx> wrote:
>
>> I thought about this some more last night and what about this for a
>> resolution?
>>
>>
>>    1. When stop deployment is done, any successfully deployed nodes are
>>    flagged as successful and would not be reinstalled when Deploy Changes is
>>    pressed again.
>>    2. If a customer wants to reset the environment and start over, they
>>    can use the "Reset environment" option to wipe the partially installed
>>    environment and start over.
>>    3. Otherwise, when Deploy Changes is clicked again, Fuel will try to
>>    deploy only the unfinished or error-state nodes again… just as it does
>>    today.
>>
>> That way, the customer has the option of starting over or just continuing
>> from where they left off.  If controllers or network install failed, Fuel
>> would consider that an unrecoverable error condition and just reinstall
>> those nodes/components again – just like today.
>>
>> This would also help to eliminate the confusion about when to use reset
>> environment.  You'd always use them in tandem if you wanted to start over,
>> i.e. reset environment is used either during an aborted deployment or after
>> a cloud has been up for a while.
>>
>> Thanks,
>>
>> -Dave Easter
>>
>> From: Roman Alekseenkov <ralekseenkov@xxxxxxxxxxxx>
>> Date: Tuesday, November 19, 2013 8:28 AM
>> To: David Easter <deaster@xxxxxxxxxxxx>
>> Cc: Mike Scherbakov <mscherbakov@xxxxxxxxxxxx>
>> Subject: Re: Stop deployment concerns
>>
>> David,
>>
>> I'll let Mike comment regarding how feasible it is to recover from
>> failures of individual nodes. AFAIK, there are quite a few different
>> scenarios, e.g. restarting a failed controller node may be harder than
>> restarting a compute node.
>>
>> Can you rewrite the wiki page with requirements, so Mike and I can review
>> it tomorrow? In particular:
>>
>>    - remove the notion of pause/unpause
>>       - user should either wait until completion of deployment process
>>       - or they can abort in the middle, if they choose so
>>       - if some nodes fail, user should still wait until the whole
>>       deployment finishes before taking any further action, such as
>>       restarting the nodes (that is how it is designed: you accumulate
>>       changes, then "push" them at once using a deploy button)
>>    - explain the use cases & desired paths to recover from them. e.g.
>>       - you are deploying from scratch and the whole deployment fails
>>       - you are deploying from scratch and some nodes fail
>>       - you are doing an incremental deployment and adding a new compute
>>       (ceph, etc) node to the cluster, and it fails
>>       - what happens if you add and remove nodes in a single "push"?
>>       what happens if you attempt to terminate an action like this in the middle?
>>       - etc
>>
>> We need to carefully think through all scenarios.
>>
>> Thanks,
>> Roman
>>
>> On Tuesday, November 19, 2013, David Easter wrote:
>>
>>> Mike / Roman,
>>>
>>>   My understanding in conversations with Services is that there is a
>>> scenario where much of the deployment runs properly – e.g. controllers have
>>> successfully been deployed  and several compute nodes successfully deploy.
>>>  However, a particular node may encounter an error condition that requires
>>> the admin to perform an action – sometimes making an OS level or HW change
>>> but sometimes just having to redo the install against that node.  In fact,
>>> it happens to me all the time on my laptop when installing on VirtualBox
>>> where one of the nodes fails to install due to resource contention.
>>>
>>>   I feel it is a completely valid scenario where the deployment of tens
>>> of nodes may be occurring and one node in the middle of the deployment
>>> fails.   Rather than wiping out all of the nodes and starting over, one
>>> would want to just retry the nodes that failed.  My understanding is that
>>> we can do that today, in fact, after the deployment finishes with errors,
>>> so why would it not also be a valid scenario when the deployment is
>>> stopped?  All I'm saying is that there should be an ability to retry the
>>> nodes that failed or were aborted vs. wiping all the nodes and starting over.
>>>
>>>   So for a particular node, I'm not suggesting that the node itself not
>>> be reset to bootstrap.  However, I'm saying for the environment, everything
>>> should not be reset to bootstrap – only the nodes that failed.
>>>
>>> Thanks,
>>>
>>> -Dave Easter
>>>
>>> From: Mike Scherbakov <mscherbakov@xxxxxxxxxxxx>
>>> Date: Tuesday, November 19, 2013 5:11 AM
>>> To: Roman Alekseenkov <ralekseenkov@xxxxxxxxxxxx>
>>> Cc: David Easter <deaster@xxxxxxxxxxxx>
>>> Subject: Re: Stop deployment concerns
>>>
>>> Roman,
>>> see inline.
>>>
>>>
>>> On Tue, Nov 19, 2013 at 5:04 PM, Roman Alekseenkov <
>>> ralekseenkov@xxxxxxxxxxxx> wrote:
>>>
>>> I strongly disagree with some of the statements. We need to define first
>>> what problem we are trying to solve.
>>>
>>> For me, the key value of "stop deployment" feature is to abort
>>> deployment in case something went wrong (e.g. anaconda failed, and UI is
>>> stuck for 90 minutes waiting for its completion). Forcing user to wait
>>> without the ability to "stop deployment", knowing that something is already
>>> wrong, is not smart.
>>>
>>> That is the same thing I describe with "resetting": we abort the
>>> deployment and reset the env to the initial stage. The customer does not wait.
>>>
>>>
>>> I don't care about "resuming" deployment after it has been stopped. All
>>> I care about is a quick way to "stop" -> "reset" -> "install from scratch".
>>> Thus, I don't care much about terminating apt-get in the middle, etc. as
>>> the nodes will be wiped out anyway.
>>>
>>> Neither do we in "resetting", but we do in "stop deployment", which
>>> implies "resuming" after some manipulation of configs.
>>>
>>>
>>> Moreover, I think "abort deployment" is a more appropriate term here
>>> than "stop deployment".
>>>
>>> I agree, but it still means that we abort puppet run / whatever other
>>> phase and hope that we can resume from the point where we aborted.
>>>
>>>
>>> David - let me know if you have a different use case in mind. I just
>>> read the requirements and saw words about pausing the deployment,
>>> restarting the installation process from where it was paused, and so on.
>>> Why did we add these? There doesn't seem to be a reliable way of pausing
>>> and resuming deployment.
>>>
>>> Agree.
>>>
>>> So with all these comments, I tried to say the same in my initial email.
>>> And that's why we suggest concentrating the work on aborting deployment
>>> followed by a reset to bootstrap state, which falls under the "Resetting
>>> environment" feature, and this will be a reliable way of:
>>>
>>>    - aborting deployment
>>>    - modification of configuration (not creating new env, just
>>>    modifying the existing one)
>>>    - starting deployment from the beginning with new configuration
>>>
>>>
>>> Thanks,
>>> Roman
>>>
>>>
>>> On Tuesday, November 19, 2013, Mike Scherbakov wrote:
>>>
>>> David, Roman,
>>> research behind the Stop Deployment<https://mirantis.jira.com/wiki/display/PRD/Stop+Deployment> feature,
>>> which is the #4 must-have in 4.0, raises a number of concerns:
>>>
>>> 1. this feature is required only for developers (or maybe services),
>>> because in this case the user will not be able to reconfigure the cluster
>>> via rest-api (i.e. UI, CLI) after deployment was stopped. If we allow
>>> reconfiguration, then deployment is likely to fail in 90% of cases.
>>> 2. we cannot interrupt network configuration while it is in progress; to
>>> resolve this issue we need some kind of recovery mechanism for networks
>>> 3. also we cannot interrupt apt-get (and maybe yum) because it creates a
>>> lock file and puppet will fail when we try to run it a second time
>>>
>>> It is also considered risky by the QA team: they expect a lot of issues
>>> here, and the dev team also sees possible delays in delivery and a lot of
>>> risk in it.
>>>
>>> *Our suggestion is the following:*
>>>
>>>    1. We track Stop D
>>>
>>>
>
>
> --
> Mike Scherbakov
> #mihgen
>



-- 
Mike Scherbakov
#mihgen
