fuel-dev team mailing list archive

Re: Stop deployment concerns

To: Mike Scherbakov <mscherbakov@xxxxxxxxxxxx>, David Easter <deaster@xxxxxxxxxxxx>, Evgeniy L <eli@xxxxxxxxxxxx>, Nikolay Markov <nmarkov@xxxxxxxxxxxx>, fuel-dev@xxxxxxxxxxxxxxxxxxx
From: Bogdan Dobrelya <bdobrelia@xxxxxxxxxxxx>
Date: Fri, 22 Nov 2013 11:33:30 +0200
In-reply-to: <CAKYN3rNoehtXAtv4dsFi5_hpcm3pqyQzwsbfFbut8-e5weeocg@mail.gmail.com>
Organization: Mirantis
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:24.0) Gecko/20100101 Thunderbird/24.1.0

On 11/22/2013 11:16 AM, Mike Scherbakov wrote:

+ fuel-dev

We had a meeting on the topic yesterday. Research shows the following.

It would be great to be able to stop deployment at any moment, and then continue by redeploying only the failed nodes. However:


  * If network configuration is changed - environment will not be
    operational after deployment
      o user may change net CIDRs, and without an additional
        functionality in Fuel it is not currently possible to
        reconfigure OpenStack (replace network information in
        OpenStack database)
  * If some settings are changed - the same
      o such as passwords, etc. - for example, controllers are already
        deployed, and computes will get new information

So, we have come to the decision that resetting the whole environment is essential at the moment. We expect the following workflow:


 1. If it becomes obvious that the deployment will not finish
    successfully, the user goes to the Actions tab and clicks the
    "Reset Environment" button.
 2. Environment changes the status to "Resetting"
 3. All settings on env become unlocked, and user is allowed to change
    anything. Settings stay the same as when user clicked "Deploy"
 4. Resetting the environment implies rebooting all the nodes to the
    bootstrap state. When it is done, the status of the env changes to
    "New", and the "Deploy" button becomes active.
 5. When user is done with re-configuration, he clicks "Deploy". Fuel
    should use same IP addresses / hostnames as at the time of initial
    deployment, if no changes are made to networking.
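
The workflow above can be sketched as a small state machine. This is only an illustration under stated assumptions: the `Environment` and `Node` classes, their method names, and the "deployment" status label are hypothetical and are not taken from Nailgun; only the "Resetting" and "New" statuses and the bootstrap state come from the steps above.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Node:
    hostname: str
    ip: str
    state: str = "deploying"  # hypothetical node state label


@dataclass
class Environment:
    status: str = "deployment"  # hypothetical label for a running deployment
    settings_locked: bool = True
    nodes: List[Node] = field(default_factory=list)
    # Addresses from the initial deployment, remembered for step 5.
    assigned: Dict[str, str] = field(default_factory=dict)

    def reset(self) -> None:
        """Steps 1-3: the user clicks "Reset Environment"."""
        self.assigned = {n.hostname: n.ip for n in self.nodes}
        self.status = "Resetting"
        self.settings_locked = False  # settings keep their values, but are editable

    def finish_reset(self) -> None:
        """Step 4: every node is back in bootstrap, so the env becomes "New"."""
        for node in self.nodes:
            node.state = "bootstrap"
        self.status = "New"

    def can_deploy(self) -> bool:
        return self.status == "New"

    def deploy(self, networking_changed: bool = False) -> Dict[str, str]:
        """Step 5: return the addresses to reuse (empty if networking changed)."""
        if not self.can_deploy():
            raise RuntimeError("Deploy is only available in the 'New' status")
        self.status = "deployment"
        self.settings_locked = True
        return {} if networking_changed else dict(self.assigned)
```

The point of keeping `assigned` is the last sentence of step 5: a redeployment with unchanged networking gets the same addresses back, while a changed network configuration forces a fresh allocation.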

Thanks,

On Wed, Nov 20, 2013 at 7:14 PM, Mike Scherbakov <mscherbakov@xxxxxxxxxxxx> wrote:


    + Evgeniy, Nick


    On Wed, Nov 20, 2013 at 7:01 PM, David Easter
    <deaster@xxxxxxxxxxxx> wrote:

        I thought about this some more last night and what about this
        for a resolution?

         1. When stop deployment is done, any successfully deployed
            nodes are flagged as successful and would not be
            reinstalled when Deploy Changes is pressed again.
         2. If a customer wants to reset the environment and start
            over, they can use the "Reset environment" option to wipe
            the partially installed environment and start over.
         3. Otherwise, when Deploy Changes is clicked again, Fuel will
            try to deploy only the unfinished or error-state nodes
            again... just as it does today.

        That way, the customer has the option of starting over or just
        continuing from where they left off.  If controllers or
        network install failed, Fuel would consider that an
        unrecoverable error condition and just reinstall those
        nodes/components again -- just like today.
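
The selection rule in this proposal can be sketched in a few lines. This is a minimal illustration, assuming hypothetical status labels ("ready", "error", "deploying") and a hypothetical helper name; it does not reflect how Fuel actually stores node state.

```python
from typing import Dict, List


def nodes_to_redeploy(statuses: Dict[str, str]) -> List[str]:
    """Return the nodes that Deploy Changes would retry after a stop.

    Nodes flagged as successfully deployed ("ready" here, a hypothetical
    label) are skipped; unfinished and error-state nodes are retried.
    """
    return sorted(name for name, status in statuses.items() if status != "ready")


statuses = {
    "controller-1": "ready",
    "compute-1": "ready",
    "compute-2": "error",      # failed during deployment
    "compute-3": "deploying",  # unfinished when deployment was stopped
}
retry = nodes_to_redeploy(statuses)
```

With "Reset environment", every node would be returned instead, which is the other option described above.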

        This would also help to eliminate the confusion about when to
        use reset environment.  You'd always use them in tandem if you
        wanted to start over -- i.e. reset environment is used either
        during an aborted deployment or after a cloud has been up for
        a while; in either case.

        Thanks,

        -Dave Easter

        From: Roman Alekseenkov <ralekseenkov@xxxxxxxxxxxx>
        Date: Tuesday, November 19, 2013 8:28 AM
        To: David Easter <deaster@xxxxxxxxxxxx>
        Cc: Mike Scherbakov <mscherbakov@xxxxxxxxxxxx>
        Subject: Re: Stop deployment concerns

        David,

        I'll let Mike comment regarding how feasible it is to recover
        from failures of individual nodes. AFAIK, there are quite a
        few different scenarios, e.g. restarting a failed controller
        node may be harder than restarting a compute node.

        Can you rewrite the wiki page with requirements, so Mike and I
        can review it tomorrow? In particular:

          * remove the notion of pause/unpause
              o user should either wait until completion of deployment
                process
              o or they can abort in the middle, if they choose so
              o if some nodes fail, user should still wait until the
                whole deployment finishes before taking any further
                actions, such as restarting the nodes (that is how it
                is designed: you accumulate changes, then "push" them
                at once using a deploy button)
          * explain the use cases & desired paths to recover from
            them. e.g.
              o you are deploying from scratch and the whole
                deployment fails
              o you are deploying from scratch and some nodes fail
              o you are doing an incremental deployment and adding a
                new compute (ceph, etc) node to the cluster, and it fails
              o what happens if you add and remove nodes in a single
                "push"? what happens if you attempt to terminate an
                action like this in the middle?
              o etc

        We need to carefully think through all scenarios.

        Thanks,
        Roman

        On Tuesday, November 19, 2013, David Easter wrote:

            Mike / Roman,

              My understanding in conversations with Services is that
            there is a scenario where much of the deployment runs
            properly -- e.g. controllers have successfully been
            deployed  and several compute nodes successfully deploy.
             However, a particular node may encounter an error
            condition that requires the admin to perform an action --
            sometimes making an OS level or HW change but sometimes
            just having to redo the install against that node.  In
            fact, it happens to me all the time on my laptop when
            installing on VirtualBox where one of the nodes fails to
            install due to resource contention.

              I feel it is a completely valid scenario where the
            deployment of tens of nodes may be occurring and one node
            in the middle of the deployment fails.   Rather than
            wiping out all of the nodes and starting over, one would
            want to just retry the nodes that failed.  My
            understanding is that we can do that today, in fact, after
            the deployment finishes with errors so why would it not
            also be a scenario where the deployment is stopped.  All
            I'm saying is that there should be an ability to retry the
            nodes that failed or were aborted vs. wiping all the nodes
            and starting over.

              So for a particular node, I'm not suggesting that the
            node itself not be reset to bootstrap.  However, I'm
            saying for the environment, everything should not be reset
            to bootstrap -- only the nodes that failed.

            Thanks,

            -Dave Easter

            From: Mike Scherbakov <mscherbakov@xxxxxxxxxxxx>
            Date: Tuesday, November 19, 2013 5:11 AM
            To: Roman Alekseenkov <ralekseenkov@xxxxxxxxxxxx>
            Cc: David Easter <deaster@xxxxxxxxxxxx>
            Subject: Re: Stop deployment concerns

            Roman,
            see inline.


            On Tue, Nov 19, 2013 at 5:04 PM, Roman Alekseenkov
            <ralekseenkov@xxxxxxxxxxxx> wrote:

                I strongly disagree with some of the statements. We
                need to define first what problem we are trying to solve.

                For me, the key value of "stop deployment" feature is
                to abort deployment in case something went wrong (e.g.
                anaconda failed, and UI is stuck for 90 minutes
                waiting for its completion). Forcing user to wait
                without the ability to "stop deployment", knowing that
                something is already wrong, is not smart.

            That is the same thing I mean by "resetting" - we abort
            deployment, and reset the env to the initial stage.
            Customer does not wait.


                I don't care about "resuming" deployment after it has
                been stopped. All I care about is a quick way to
                "stop" -> "reset" -> "install from scratch". Thus, I
                don't care much about terminating apt-get in the
                middle, etc. as the nodes will be wiped out anyway.

            Neither do we in "resetting", but we do in "stop
            deployment", which implies "resuming" after some
            manipulation of configs.


                Moreover, I think "abort deployment" is a more
                appropriate term here than "stop deployment".

            I agree, but it still means that we abort puppet run /
            whatever other phase and hope that we can resume from the
            point where we aborted.


                David - let me know if you have a different use case
                in mind. I just read the requirements and saw words
                about pausing the deployment, restarting the
                installation process from where it was paused, and so
                on. Why did we add these? There doesn't seem to be a
                reliable way of pausing and resuming deployment.

            Agree.

            So with all these comments, I was trying to say the same as
            in my initial email. And that's why we suggest concentrating
            the work on aborting deployment with a further reset to
            bootstrap state, which falls into the "Resetting environment"
            feature - and this will be a reliable way for:

              * aborting deployment
              * modification of configuration (not creating new env,
                just modifying the existing one)
              * starting deployment from the beginning with new
                configuration


                Thanks,
                Roman


                On Tuesday, November 19, 2013, Mike Scherbakov wrote:

                    David, Roman,
                    research behind the Stop Deployment
                    <https://mirantis.jira.com/wiki/display/PRD/Stop+Deployment> feature,
                    which is the #4 must-have in 4.0, shows a number of
                    concerns:

1) I believe we should reflect the related Environment Operations changes
in the Nailgun API as well:
https://docs.google.com/a/mirantis.com/document/d/1KQPEG62wBF-U-s8mUzAcP3_rLKOBgyEyUY9e9yKE49U/edit#heading=h.qcspsp3wasyy

2) Having the ability to reset a given node, as well as the deployment,
is vital for cluster self-healing. For example, if we have STONITH'ed the
failed controller node and just want to redeploy it from scratch, we
might use the Nailgun API to reset the node to ensure it would be
re-provisioned and re-deployed at the next boot...



                    1. this feature is required only for developers (or
                    maybe services), because in this case the user will
                    not be able to reconfigure the cluster via the REST
                    API (i.e. UI, CLI) after deployment was stopped. If
                    we allow configuration, then deployment is likely to
                    fail in 90% of cases.
                    2. we cannot interrupt network configuration while
                    it is in progress; to resolve this issue we need some
                    kind of recovery mechanism for networks
                    3. also we cannot interrupt apt-get (and maybe
                    yum) because it creates a lock file and puppet
                    will fail when we try to run it a second
                    time

                    It is also considered risky by QA team - they
                    expect a lot of issues here, and dev team also
                    sees possible delays in delivery and a lot of risk
                    in it.

                    *Our suggestion is the following:*

                     1. We track Stop D

--Mike Scherbakov

    #mihgen




--
Mike Scherbakov
#mihgen





--
Best regards,
Bogdan Dobrelya,
Researcher TechLead, Mirantis, Inc.
+38 (066) 051 07 53
Skype bogdando_at_yahoo.com
38, Lenina ave.
Kharkov, Ukraine
www.mirantis.com
www.mirantis.ru
bdobrelia@xxxxxxxxxxxx

Follow ups

Re: Stop deployment concerns
From: Roman Alekseenkov, 2013-11-22

References

Re: Stop deployment concerns
From: Mike Scherbakov, 2013-11-22