fuel-dev team mailing list archive

Re: Stop deployment concerns

 

Christian had an issue with a customer 3.2 deployment where someone added
three extra controllers that were supposed to be computes. Removing the
controllers left broken swift, haproxy, galera and rabbit configurations on
the remaining controllers. They ended up deleting the deployment and
starting over (mostly because the swift ring builder files were gone).
Some services will need special scripts to delete resources properly, like
Ceph MON/OSD (although they can be yanked out, except for the primary node).
Some services will require all other nodes containing the service to be
re-configured, such as swift, haproxy, galera and rabbit. And it could get
even worse when things like the primary-monitor are replaced. Without such
cleanup scripts and re-configuration support, we should not allow the
removal of the controller in an operation like this.
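
A minimal sketch of the kind of guard this implies, assuming a simple table of
services and what their removal needs. The function names and the table are
hypothetical, for illustration only; they are not Fuel or Nailgun API.

```python
# Hypothetical sketch: names and the service table are illustrative only.
# "peers"  = every remaining node running the service must be re-configured
# "script" = a dedicated cleanup script is needed when the node is removed
CLEANUP_NEEDS = {
    "swift": "peers",
    "haproxy": "peers",
    "galera": "peers",
    "rabbit": "peers",
    "ceph-mon": "script",
    "ceph-osd": "script",
}


def removal_blockers(node_services, supported_cleanups=frozenset()):
    """Return the services that make removing this node unsafe.

    A service blocks removal when it needs cleanup and no cleanup
    procedure for it is listed in supported_cleanups.
    """
    return sorted(
        service
        for service in node_services
        if service in CLEANUP_NEEDS and service not in supported_cleanups
    )


def can_remove(node_services, supported_cleanups=frozenset()):
    """True when no service on the node blocks its removal."""
    return not removal_blockers(node_services, supported_cleanups)
```

With this, a compute-only node passes the check, while a node running swift or
haproxy is refused until a cleanup procedure for those services exists.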

Andrew
Mirantis


On Mon, Nov 25, 2013 at 2:53 AM, Bogdan Dobrelya <bdobrelia@xxxxxxxxxxxx> wrote:

>  On 11/23/2013 08:02 AM, Andrew Woodward wrote:
>
>  I've thought this over some more, and I think there are the following
> usage patterns around this.
>
>  Given that all nodes have had the OS installed OK (because no nodes move
> to puppet unless all nodes finish OS install, which should be another topic
> to work on):
>
>  Case 1: A trivial node fails to deploy (i.e., not a controller)
>
>  Current reaction: nothing. The deployment should be labeled as a 'fail'
> after all nodes complete; after correcting the issue you can click deploy
> and the failed roles run again. [Actually, thinking on this, I'm not sure
> how granular the node re-run is; if it isn't this granular, it should be.]
>
>  Is the desired reaction really that we want to stop the deployment to
> correct one minuscule server when the deployment on the whole was probably
> successful? On that topic, this scenario should probably be considered a
> 'warning: one or more services failed but the deployment was successful in
> general'.
>
>  Case 2: A critical node fails to deploy (i.e., primary-controller, maybe
> any controller too)
>
>  Current reaction: nothing, deployment continues until all puppet roles
> finish, cluster is failed and probably very broken. Clicking deploy again
> will restart all failed roles same as in case 1.
>
>  Desired reaction: this should stop automatically; there is no reason to
> waste any more time on a deployment that won't work and has to be
> completely re-run.
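
A rough sketch of the Case 1 / Case 2 distinction, assuming the set of
"critical" roles is configurable. The role names and return values here are
assumptions for illustration, not existing Fuel behaviour.

```python
# Assumption: which roles count as critical. The thread names the
# primary-controller and says "maybe any controller too".
CRITICAL_ROLES = {"primary-controller", "controller"}


def next_action(failed_roles):
    """Decide how the deployment should react to the roles that have failed."""
    failed = set(failed_roles)
    if failed & CRITICAL_ROLES:
        # Case 2: a full re-run is needed anyway, so stop right away.
        return "stop"
    if failed:
        # Case 1: keep deploying and report a warning at the end.
        return "warn"
    return "continue"
```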
>
>  Case 3: For some reason nodes X, Y, Z have some problem (maybe a
> misconfigured switch), and the user wants to abort deployment. (David's #1)
>
>  Current reaction: not possible.
>
>  Desired reaction: deployment should stop, with any running processes
> terminated. UI remains locked, user takes action, and then deployment
> restarts from any unfinished or failed roles.
>
>  Case 4: For some reason the user made a large settings error and wants to
> reset the deployment (David's #2)
>
>  Current reaction: must destroy the whole cluster, lose the config and
> manually rebuild.
>
>  Desired reaction: whether the deployment is running, errored, deployed
> successfully, or otherwise, the cluster is reset to the un-deployed state
> and all nodes are reset to the discovered state. All cluster and node
> settings should be retained, so it's in the same state as just before the
> user clicked deploy the first time. All pages are unlocked.
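
To make the intended Case 4 semantics concrete, here is a small sketch using
made-up `Node` and `Cluster` records. The field names and status strings are
assumptions; only the behaviour (statuses reset, settings kept, pages unlocked)
comes from the description above.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    roles: list
    status: str = "discovered"


@dataclass
class Cluster:
    settings: dict
    nodes: list = field(default_factory=list)
    status: str = "new"
    locked: bool = False


def reset_environment(cluster):
    """Return the cluster to its pre-deploy state, keeping every setting."""
    for node in cluster.nodes:
        # Only the status changes; roles and other node settings stay put.
        node.status = "discovered"
    cluster.status = "new"
    cluster.locked = False
    return cluster
```

The reset is the same whatever state the deployment was in, which matches the
"running, errored, deployed, or otherwise" wording.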
>
>  Case 5: Sub-function of 4; user wants to reset one or more trivial
> nodes.
>
>  Current reaction, must remove node, click deploy, wait for node to
> rediscover, add back to cluster and re-deploy.
>
>  Desired reaction: node should go back to bootstrap state but retain all
> settings. Node settings pages (nic, disk, roles) should be unlocked. For
> now, this should not be allowed on a controller, as controllers can't be
> cleaned out of services yet (that would require a cluster reset, Case 4).
>
> Can you elaborate please on why a controller cannot be cleaned up by simply
> rejoining the cluster(s)/roles from scratch? Do non-atomic roles prevent
> this? If so, which ones exactly? Could the OSD/MON roles of Ceph be
> 'redeployed' this way? AFAIK, the rejoin operation is a simple scale-up
> operation and must be supported for every cluster.
>
>
>  Case 6: User wants to change a setting after deployment. (Long term goal)
>
>  Current reaction: not allowed
>
>  Desired reaction, general settings should be allowed to be changed,
> impacted roles should re-run to absorb change.
>
>  Now that we can run multiple roles, this would probably require that
> each parameter be related to the roles it affects, so that once it is
> changed we know which roles to re-run. Some parameters might be bad to
> change and maybe still won't be allowed to change.
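
One way the parameter-to-role relation could look, as a sketch only. The
parameter names, role names and the locked set below are invented examples,
not actual Fuel settings.

```python
# Hypothetical mapping from a setting to the roles that consume it.
PARAM_ROLES = {
    "debug_logging": {"controller", "compute", "cinder"},
    "nova_quota": {"controller"},
}

# Assumed example of a parameter that stays locked after deployment.
LOCKED_PARAMS = {"network_cidr"}


def roles_to_rerun(changed_params):
    """Return the set of roles that must re-run to absorb the changes."""
    locked = sorted(set(changed_params) & LOCKED_PARAMS)
    if locked:
        raise ValueError(f"cannot change after deployment: {', '.join(locked)}")
    roles = set()
    for param in changed_params:
        # Parameters with no known consumers trigger no re-run.
        roles |= PARAM_ROLES.get(param, set())
    return roles
```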
>
>  Andrew
> Mirantis
>
>
> On Fri, Nov 22, 2013 at 12:15 PM, David Easter <deaster@xxxxxxxxxxxx> wrote:
>
>>  I think we have consensus.  Here's the way I'd paraphrase it, so please
>> correct me if I'm wrong:
>>
>>  Customer starts deployment, for example with 3 controllers (HA), 10
>> compute nodes and 5 cinder nodes.  During the deployment, 2 of the compute
>> nodes fail.  The customer does not want to wait for the entire deployment
>> to "finish", so he presses the Stop Deployment button.
>>
>>  At this point, the UI screens *remain locked*, i.e. configurations
>> cannot be changed.  The user can correct the issues on the nodes if they
>> are HW or OS related.  Once corrected, the user can click the Deploy
>> Changes button and Fuel will retry installing any node that did not deploy
>> correctly.  Fuel *will not* redeploy any nodes that successfully
>> installed during the first Deploy Changes effort.
>>
>>  If the user *does* want to make changes to the configuration (e.g. the
>> disk layout on one of the compute nodes), then the user will have to select
>> the *Reset Environment* button which will reset the environment to a
>> state *as if the Deploy Changes had never been clicked*.  The UI will be
>> *unlocked* and all previous choices will be retained.  The user can now
>> make any changes to the environment.  Once the changes are made, the user
>> can click Deploy Changes and Fuel will begin the deployment again from the
>> beginning.
>>
>>
>>  Does that cover the two backlog stories properly?
>>
>>  Thanks,
>>
>>  -Dave Easter
>>
>>   From: Roman Alekseenkov <ralekseenkov@xxxxxxxxxxxx>
>> Date: Friday, November 22, 2013 3:53 AM
>> To: Bogdan Dobrelya <bdobrelia@xxxxxxxxxxxx>
>> Cc: Mike Scherbakov <mscherbakov@xxxxxxxxxxxx>, David Easter <
>> deaster@xxxxxxxxxxxx>, Evgeniy L <eli@xxxxxxxxxxxx>, Nikolay Markov <
>> nmarkov@xxxxxxxxxxxx>, "fuel-dev@xxxxxxxxxxxxxxxxxxx" <
>> fuel-dev@xxxxxxxxxxxxxxxxxxx>
>>
>> Subject: Re: Stop deployment concerns
>>
>>  David,
>>
>>    1. Do we have a consensus here? Can you drive it with the team to
>>    completion?
>>    2. On a separate note, I think we should schedule a call to go
>>    through all the features and discuss requirements, to ensure that you
>>    and the dev team are on the same page.
>>
>>  Thanks,
>> Roman
>>
>> On Friday, November 22, 2013, Bogdan Dobrelya wrote:
>>
>>>  On 11/22/2013 11:16 AM, Mike Scherbakov wrote:
>>>
>>>  + fuel-dev
>>>
>>>  We had a meeting on the topic yesterday. Research shows the following.
>>>
>>>  It would be great to be able to stop deployment at any moment, and
>>> then continue with the redeployment of only the failed nodes. However:
>>>
>>>    - If network configuration is changed - environment will not be
>>>    operational after deployment
>>>       - user may change net CIDRs, and without an additional
>>>       functionality in Fuel it is not currently possible to reconfigure OpenStack
>>>       (replace network information in OpenStack database)
>>>    - If some settings are changed - the same
>>>       - such as passwords, etc. - for example, controllers are already
>>>       deployed, and computes will get new information
>>>
>>> So, we have come to the decision that resetting of the whole environment
>>> is essential at the moment. We expect the following workflow:
>>>
>>>    1. If it becomes obvious that the deployment will not finish
>>>    successfully, the user goes to the Actions tab and clicks the
>>>    "Reset Environment" button.
>>>    2. Environment changes the status to "Resetting"
>>>    3. All settings on env become unlocked, and user is allowed to
>>>    change anything. Settings stay the same as when user clicked "Deploy"
>>>    4. Resetting the environment implies rebooting all the nodes to
>>>    bootstrap state. When it is done, the status of the env is changed to
>>>    "New", and the "Deploy" button becomes active.
>>>    5. When user is done with re-configuration, he clicks "Deploy". Fuel
>>>    should use same IP addresses / hostnames as at the time of initial
>>>    deployment, if no changes are made to networking.
>>>
>>> Thanks,
>>>
>>>
>>> On Wed, Nov 20, 2013 at 7:14 PM, Mike Scherbakov <
>>> mscherbakov@xxxxxxxxxxxx> wrote:
>>>
>>> + Evgeniy, Nick
>>>
>>>
>>> On Wed, Nov 20, 2013 at 7:01 PM, David Easter <deaster@xxxxxxxxxxxx> wrote:
>>>
>>>  I thought about this some more last night and what about this for a
>>> resolution?
>>>
>>>
>>>    1. When stop deployment is done, any successfully deployed nodes are
>>>    flagged as successful and would not be reinstalled when Deploy Changes is
>>>    pressed again.
>>>    2. If a customer wants to reset the environment and start over, they
>>>    can use the "Reset environment" option to wipe the partially installed
>>>    environment and start over.
>>>    3. Otherwise, when Deploy Changes is clicked again, Fuel will try to
>>>    deploy only the unfinished or error-state nodes again… just as it does
>>>    today.
>>>
>>> That way, the customer has the option of starting over or just
>>> continuing from where they left off.  If controllers or network install
>>> failed, Fuel would consider that an unrecoverable error condition and just
>>> reinstall those nodes.
>>>
>>> 1) I believe we should reflect the related Environment Operations changes in the Nailgun API as well:
>>> https://docs.google.com/a/mirantis.com/document/d/1KQPEG62wBF-U-s8mUzAcP3_rLKOBgyEyUY9e9yKE49U/edit#heading=h.qcspsp3wasyy
>>> 2) Having the ability to reset a given node, as well as the deployment,
>>> is vital for cluster self-healing. For example, if we have STONITH'ed the
>>> failed controller node and just want to redeploy it from scratch, we might
>>> use the Nailgun API to reset the node to ensure it would be re-provisioned
>>> and re-deployed at the next boot...
>>>
>>>
>>> 1. This feature is required only for developers (or maybe services),
>>> because in this case the user will not be able to reconfigure the cluster
>>> via the REST API (i.e. UI, CLI) after the deployment was stopped. If we
>>> allow configuration, then the deployment is likely to fail in 90% of cases.
>>> 2. We cannot interrupt network configuration while it is in progress; to
>>> resolve this issue we need some kind of recovery mechanism for networks.
>>> 3. Also, we cannot interrupt apt-get (and maybe yum) because it creates a
>>> lock file and puppet will fail when we try to run it for a
>>>
>>>             This body part will be downloaded on demand.
>>>
>>>
>>>
>>> --
>>> Best regards,
>>> Bogdan Dobrelya,
>>> Researcher TechLead, Mirantis, Inc.
>>> +38 (066) 051 07 53
>>> Skype bogdando_at_yahoo.com
>>> 38, Lenina ave.
>>> Kharkov, Ukraine
>>> www.mirantis.com
>>> www.mirantis.ru
>>> bdobrelia@xxxxxxxxxxxx
>>>
>>>
>>
>>
>
>
>  --
> If google has done it, Google did it right!
>
>
>
> --
> Best regards,
> Bogdan Dobrelya,
> Researcher TechLead, Mirantis, Inc.
> +38 (066) 051 07 53
> Skype bogdando_at_yahoo.com
> 38, Lenina ave.
> Kharkov, Ukraine
> www.mirantis.com
> www.mirantis.ru
> bdobrelia@xxxxxxxxxxxx
>
>


-- 
If google has done it, Google did it right!
