fuel-dev team mailing list archive

Re: Stop deployment concerns

To: Andrew Woodward <xarses@xxxxxxxxx>, David Easter <deaster@xxxxxxxxxxxx>
From: Bogdan Dobrelya <bdobrelia@xxxxxxxxxxxx>
Date: Mon, 25 Nov 2013 12:53:34 +0200
Cc: "fuel-dev@xxxxxxxxxxxxxxxxxxx" <fuel-dev@xxxxxxxxxxxxxxxxxxx>, Evgeniy L <eli@xxxxxxxxxxxx>
In-reply-to: <CACEfbZhCLZoXbOaRzZdV5B=MhHqEgJa4QU7conhiKwjr7Uy85Q@mail.gmail.com>
Organization: Mirantis
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:24.0) Gecko/20100101 Thunderbird/24.1.0

On 11/23/2013 08:02 AM, Andrew Woodward wrote:

I've thought this over some more, and I think there are the following usage patterns around this.
Given that all nodes have had the OS installed OK (no nodes move to puppet unless all nodes finish OS install, which should be another topic to work on):
Case 1: A trivial node fails to deploy (i.e., not a controller)
Current reaction: nothing. The deployment should be labeled as a 'fail' after all nodes complete; after correcting the issue you can click deploy and the failed roles run again. [Actually, thinking on this, I'm not sure how granular the node re-run is; if it isn't this granular, it should be.]
Desired reaction: do we really want to stop the deployment to correct one minuscule server when the deployment on the whole is probably successful? On that topic, this scenario should probably be considered a 'warning: one or more services failed, but the deployment was successful in general'.
Case 2: A critical node fails to deploy (i.e., primary-controller, maybe any controller too)
Current reaction: nothing. Deployment continues until all puppet roles finish; the cluster is failed and probably very broken. Clicking deploy again will restart all failed roles, same as in Case 1.
Desired reaction: this should stop automatically. There is no reason to waste any more time on a deployment that won't work and has to be completely re-run.
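
Cases 1 and 2 together amount to a per-role failure policy. A minimal Python sketch of that idea follows; the role set `CRITICAL_ROLES`, the `Reaction` names, and the function `reaction_for_failure` are all hypothetical and are not existing Nailgun or Astute code:

```python
from enum import Enum


class Reaction(Enum):
    CONTINUE_WITH_WARNING = "continue_with_warning"  # Case 1: trivial node failed
    STOP_DEPLOYMENT = "stop_deployment"              # Case 2: critical node failed


# Assumption: which roles count as critical is illustrative only.
CRITICAL_ROLES = frozenset({"primary-controller", "controller"})


def reaction_for_failure(failed_roles):
    """Pick the reaction for a node whose roles failed to deploy."""
    if CRITICAL_ROLES.intersection(failed_roles):
        return Reaction.STOP_DEPLOYMENT
    return Reaction.CONTINUE_WITH_WARNING
```

Whether "any controller" belongs in the critical set is exactly the open question in Case 2, so the set is kept as data rather than hard-coded in the logic.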
Case 3: For some reason nodes X, Y, Z have some problem (maybe a misconfigured switch) and the user wants to abort the deployment. (David's #1)
Current reaction: not possible.
Desired reaction: the deployment should stop and any running processes should be terminated. The UI remains locked, the user takes action, and then the deployment restarts from any unfinished or failed roles.
Case 4: For some reason the user made a large settings error and wants to reset the deployment. (David's #2)
Current reaction: must destroy the whole cluster, lose the config, and manually rebuild.
Desired reaction: whether the deployment is running, errored, successfully deployed, or otherwise, the cluster is reset to the un-deployed state and all nodes are reset to the discovered state. All cluster and node settings should be retained, so it's in the same state as just before the user clicked deploy the first time. All pages are unlocked.
Case 5: Sub-function of 4; the user wants to reset one or more trivial nodes.
Current reaction: must remove the node, click deploy, wait for the node to rediscover, add it back to the cluster, and re-deploy.
Desired reaction: the node should go back to the bootstrap state but retain all settings. Node settings pages (NIC, disk, roles) should be unlocked. For now, this should not be allowed on a controller, as controllers can't be cleaned up out of services yet (that would require the cluster reset in #4).

Can you elaborate, please, on why a controller cannot be cleaned up by simply rejoining the cluster(s)/roles from scratch? Do non-atomic roles prevent this? If so, which ones exactly? Could the OSD/MON roles of Ceph be 'redeployed' this way? AFAIK, the rejoin operation is a simple scale-up operation and must be supported for every cluster.


Case 6: User wants to change a setting after deployment. (Long-term goal)

Current reaction: not allowed.

Desired reaction: general settings should be allowed to be changed, and impacted roles should re-run to absorb the change.

Now that we can run multiple roles, this would probably require that each parameter be related to a role, so that once it is changed we know which role to re-run. Some parameters might be bad to change and maybe still won't be allowed to change.
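
The parameter-to-role relation described for Case 6 could be sketched in Python roughly as below. The parameter names, the role sets, and the `IMMUTABLE_PARAMS` list are invented for illustration and do not reflect real Fuel settings:

```python
# Assumption: parameter names and the roles they affect are made up.
PARAM_TO_ROLES = {
    "nova_quota": {"controller"},
    "syslog_server": {"controller", "compute", "cinder"},
    "libvirt_type": {"compute"},
}

# Assumption: parameters considered too risky to change after deployment.
IMMUTABLE_PARAMS = frozenset({"management_cidr"})


def roles_to_rerun(changed_params):
    """Return the set of roles that must re-run for the changed parameters."""
    blocked = IMMUTABLE_PARAMS.intersection(changed_params)
    if blocked:
        raise ValueError(f"cannot change after deployment: {sorted(blocked)}")
    roles = set()
    for param in changed_params:
        # Unknown parameters map to no roles in this sketch.
        roles |= PARAM_TO_ROLES.get(param, set())
    return roles
```

For example, changing only `libvirt_type` would select just the compute role, while touching `management_cidr` is rejected outright.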


Andrew
Mirantis

On Fri, Nov 22, 2013 at 12:15 PM, David Easter <deaster@xxxxxxxxxxxx> wrote:


    I think we have consensus.  Here's the way I'd paraphrase it, so
    please correct me if I'm wrong:

    Customer starts deployment, for example with 3 controllers (HA),
    10 compute nodes and 5 cinder nodes.  During the deployment, 2 of
    the compute nodes fail.  The customer does not want to wait for
    the entire deployment to "finish", so he presses the Stop
    Deployment button.

    At this point, the UI screens *remain locked*, i.e.
    configurations cannot be changed.  The user can correct the issues
    on the nodes if they are HW or OS related.  Once corrected, the
    user can click the Deploy Changes button and Fuel will retry
    installing any node that did not deploy correctly.  Fuel *will
    not* redeploy any nodes that successfully installed during the
    first Deploy Changes effort.

    If the user *does* want to make changes to the configuration (e.g.
    the disk layout on one of the compute nodes), then the user will
    have to select the *Reset Environment* button which will reset the
    environment to a state /as if the Deploy Changes had never been
    clicked/.  The UI will be *unlocked* and all previous choices will
    be retained.  The user can now make any changes to the
    environment.  Once the changes are made, the user can click Deploy
    Changes and Fuel will begin the deployment again from the beginning.
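
The difference between Stop Deployment and Reset Environment in the paraphrase above can be sketched in Python as follows. The classes, the status strings ("discover", "deploying", "ready", "error"), and the method names are assumptions made for illustration, not the actual Nailgun model:

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    status: str = "discover"  # hypothetical statuses: discover, deploying, ready, error


@dataclass
class Environment:
    nodes: list
    locked: bool = False
    settings: dict = field(default_factory=dict)

    def deploy_targets(self):
        """Nodes that Deploy Changes would (re)install: everything not yet 'ready'."""
        return [n for n in self.nodes if n.status != "ready"]

    def stop(self):
        """Stop Deployment: interrupted nodes become 'error'; the UI stays locked."""
        for n in self.nodes:
            if n.status == "deploying":
                n.status = "error"
        self.locked = True

    def reset(self):
        """Reset Environment: all nodes back to 'discover'; settings kept, UI unlocked."""
        for n in self.nodes:
            n.status = "discover"
        self.locked = False
```

After `stop()`, only the failed or interrupted nodes are deployment targets; after `reset()`, every node is a target again while `settings` is left untouched.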


    Does that cover two backlog stories properly?

    Thanks,

    -Dave Easter

    From: Roman Alekseenkov <ralekseenkov@xxxxxxxxxxxx>
    Date: Friday, November 22, 2013 3:53 AM
    To: Bogdan Dobrelya <bdobrelia@xxxxxxxxxxxx>
    Cc: Mike Scherbakov <mscherbakov@xxxxxxxxxxxx>, David Easter
    <deaster@xxxxxxxxxxxx>, Evgeniy L <eli@xxxxxxxxxxxx>, Nikolay Markov
    <nmarkov@xxxxxxxxxxxx>, "fuel-dev@xxxxxxxxxxxxxxxxxxx"
    <fuel-dev@xxxxxxxxxxxxxxxxxxx>

    Subject: Re: Stop deployment concerns

    David,

     1. Do we have a consensus here? Can you drive it with the team to
        completion?
     2. On a separate note, I think we should schedule a call to go
        through all the features and discuss requirements, to ensure
        that you and the dev team are on the same page.

    Thanks,
    Roman

    On Friday, November 22, 2013, Bogdan Dobrelya wrote:

        On 11/22/2013 11:16 AM, Mike Scherbakov wrote:

        + fuel-dev

        We had a meeting on the topic yesterday. Research shows the
        following.

        It would be great to be able to stop deployment at any
        moment, and then continue by redeploying only the failed
        nodes. However:

          * If network configuration is changed - environment will
            not be operational after deployment
              o user may change net CIDRs, and without an additional
                functionality in Fuel it is not currently possible to
                reconfigure OpenStack (replace network information in
                OpenStack database)
          * If some settings are changed - the same
              o such as passwords, etc. - for example, controllers
                are already deployed, and computes will get new
                information

        So, we have come to the decision that resetting of the whole
        environment is essential at the moment. We expect the
        following workflow:

         1. If it becomes obvious that the deployment will not finish
            successfully, the user goes to the Actions tab and clicks the
            "Reset Environment" button.
         2. Environment changes the status to "Resetting"
         3. All settings on env become unlocked, and user is allowed
            to change anything. Settings stay the same as when user
            clicked "Deploy"
         4. Resetting the environment implies rebooting all the nodes
            to bootstrap state. When it is done, the status of the env is
            changed to "New", and the "Deploy" button becomes active.
         5. When user is done with re-configuration, he clicks
            "Deploy". Fuel should use same IP addresses / hostnames
            as at the time of initial deployment, if no changes are
            made to networking.
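
The reset workflow in steps 1-5 might look roughly like this in Python. The dict layout, the helper names `reset_environment` and `assign_ip`, and the `allocate` callback are hypothetical; only the "Resetting" and "New" status names and the bootstrap state come from the steps above:

```python
def reset_environment(env):
    """Walk a dict-based environment through the reset workflow.

    Returns the list of statuses the environment passed through.
    """
    seen = []
    env["status"] = "Resetting"       # step 2
    seen.append(env["status"])
    env["locked"] = False             # step 3: settings unlocked, values untouched
    for node in env["nodes"]:         # step 4: reboot every node to bootstrap
        node["state"] = "bootstrap"   # the node's earlier "ip" entry is kept
    env["status"] = "New"             # step 4: Deploy becomes available again
    seen.append(env["status"])
    return seen


def assign_ip(node, allocate, networking_changed):
    """Step 5: reuse the node's earlier IP unless networking settings changed."""
    if node.get("ip") and not networking_changed:
        return node["ip"]
    node["ip"] = allocate()           # assumption: allocate() returns a fresh address
    return node["ip"]
```

Keeping the old address on the node record is one simple way to satisfy step 5; a real implementation would also have to handle hostnames and address conflicts.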

        Thanks,


        On Wed, Nov 20, 2013 at 7:14 PM, Mike Scherbakov
        <mscherbakov@xxxxxxxxxxxx> wrote:

            + Evgeniy, Nick


            On Wed, Nov 20, 2013 at 7:01 PM, David Easter
            <deaster@xxxxxxxxxxxx> wrote:

                I thought about this some more last night and what
                about this for a resolution?

                 1. When stop deployment is done, any successfully
                    deployed nodes are flagged as successful and would
                    not be reinstalled when Deploy Changes is pressed again.
                 2. If a customer wants to reset the environment and
                    start over, they can use the "Reset environment"
                    option to wipe the partially installed
                    environment and start over.
                 3. Otherwise, when Deploy Changes is clicked again,
                    Fuel will try to deploy only the unfinished or
                    error-state nodes again… just as it does today.

                That way, the customer has the option of starting
                over or just continuing from where they left off.  If
                controllers or network install failed, Fuel would
                consider that an unrecoverable error condition and
                just reinstall those nodes.

        1) I believe we should reflect the related Environment Operations
        changes in the Nailgun API as well:
        https://docs.google.com/a/mirantis.com/document/d/1KQPEG62wBF-U-s8mUzAcP3_rLKOBgyEyUY9e9yKE49U/edit#heading=h.qcspsp3wasyy
        2) Having the ability to reset a given node, as well as the
        deployment, is vital for cluster self-healing. For example, if we
        have STONITH'ed the failed controller node and just want to
        redeploy it from scratch, we might use the Nailgun API to
        reset the node to ensure it would be re-provisioned and
        re-deployed at the next boot...



                            1. this feature is required only for
                            developers (or maybe services), because
                            in this case the user will not be able to
                            reconfigure the cluster via the REST API (i.e.
                            UI, CLI) after deployment was stopped. If
                            we allow configuration, then deployment
                            is likely to fail in 90% of cases.
                            2. we cannot interrupt network
                            configuration while it is in progress; to
                            resolve this issue we need some kind of
                            recovery mechanism for networks
                            3. also we cannot interrupt apt-get (and
                            maybe yum) because it creates a lock file
                            and puppet will fail when we try to
                            run it for a


        --
        Best regards,
        Bogdan Dobrelya,
        Researcher TechLead, Mirantis, Inc.
        +38 (066) 051 07 53
        Skype bogdando_at_yahoo.com
        38, Lenina ave.
        Kharkov, Ukraine
        www.mirantis.com
        www.mirantis.ru
        bdobrelia@xxxxxxxxxxxx





--
If google has done it, Google did it right!




Follow ups

Re: Stop deployment concerns
From: Andrew Woodward, 2013-11-25

References

Re: Stop deployment concerns
From: Roman Alekseenkov, 2013-11-22
Re: Stop deployment concerns
From: David Easter, 2013-11-22
Re: Stop deployment concerns
From: Andrew Woodward, 2013-11-23