+ fuel-dev
We had a meeting on the topic yesterday. Research shows the following.
It would be great to be able to stop deployment at any moment, and
then continue by redeploying only the failed nodes. However:
* If the network configuration is changed, the environment will not be
operational after deployment
o the user may change network CIDRs, and without additional
functionality in Fuel it is not currently possible to
reconfigure OpenStack (replace network information in the
OpenStack database)
* If some settings are changed, the result is the same
o such as passwords: for example, the controllers are already
deployed with the old values, and the computes will get the new ones
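The concern above can be sketched as a simple check: resuming is only safe if nothing changed since "Deploy" was clicked. This is a minimal illustration, not Fuel code; the function names, the flat dict layout and the keys `management_cidr` and `nova_password` are all assumptions made for the example.

```python
def changed_keys(deployed, current):
    """Return the sorted top-level keys whose values differ between two config dicts."""
    keys = set(deployed) | set(current)
    return sorted(k for k in keys if deployed.get(k) != current.get(k))


def can_resume(deployed, current):
    """Treat resuming as safe only when nothing changed since "Deploy" was clicked."""
    return not changed_keys(deployed, current)


# Hypothetical snapshot taken when the user clicked "Deploy".
deployed = {"management_cidr": "192.168.0.0/24", "nova_password": "old-secret"}
# Hypothetical configuration after the user edited it mid-deployment.
current = {"management_cidr": "10.20.0.0/24", "nova_password": "old-secret"}

print(changed_keys(deployed, current))  # ['management_cidr']
print(can_resume(deployed, current))    # False
```

Any difference, whether a CIDR or a password, makes the already deployed nodes inconsistent with the ones still to be deployed, which is why the check is all-or-nothing here.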
So we have decided that resetting the whole environment is essential
for now. We expect the following workflow:
1. If it becomes obvious that the deployment will not finish
successfully, the user goes to the Actions tab and clicks the "Reset
Environment" button.
2. The environment changes its status to "Resetting".
3. All settings on the env become unlocked, and the user is allowed to
change anything. Settings stay the same as when the user clicked "Deploy".
4. Resetting the environment implies rebooting all the nodes to the
bootstrap state. When that is done, the status of the env is changed to
"New", and the "Deploy" button becomes active.
5. When the user is done with re-configuration, they click "Deploy". Fuel
should use the same IP addresses / hostnames as at the time of the initial
deployment, if no changes are made to networking.
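The workflow above can be sketched as a small state machine. This is only an illustration under assumptions: the class, the method names and the "deployment" status are hypothetical (only "Resetting" and "New" are named in this thread), and nothing here is the actual Nailgun API.

```python
# Hypothetical status names; only "Resetting" and "New" appear in the thread.
ALLOWED = {
    ("deployment", "resetting"),  # steps 1-2: user clicks "Reset Environment"
    ("resetting", "new"),         # step 4: all nodes are back in bootstrap
    ("new", "deployment"),        # step 5: user clicks "Deploy"
}


class Environment:
    def __init__(self, status="new"):
        self.status = status
        self.settings_locked = False

    def _move(self, target):
        # Reject any transition the workflow does not describe.
        if (self.status, target) not in ALLOWED:
            raise ValueError(f"cannot go from {self.status!r} to {target!r}")
        self.status = target

    def deploy(self):
        self._move("deployment")
        self.settings_locked = True

    def reset(self):
        self._move("resetting")
        self.settings_locked = False  # step 3: settings become editable again

    def nodes_rebooted_to_bootstrap(self):
        self._move("new")

    @property
    def deploy_button_active(self):
        return self.status == "new"


env = Environment()
env.deploy()
env.reset()
env.nodes_rebooted_to_bootstrap()
print(env.status, env.deploy_button_active)  # new True
```

The sketch only covers a reset during a running deployment; resetting an environment that has already been deployed would need one more allowed transition.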
Thanks,
On Wed, Nov 20, 2013 at 7:14 PM, Mike Scherbakov
<mscherbakov@xxxxxxxxxxxx> wrote:
+ Evgeniy, Nick
On Wed, Nov 20, 2013 at 7:01 PM, David Easter
<deaster@xxxxxxxxxxxx> wrote:
I thought about this some more last night and what about this
for a resolution?
1. When stop deployment is done, any successfully deployed
nodes are flagged as successful and will not be reinstalled
when Deploy Changes is pressed again.
2. If a customer wants to reset the environment and start
over, they can use the "Reset environment" option to wipe
the partially installed environment and start over.
3. Otherwise, when Deploy Changes is clicked again, Fuel will
try to deploy only the unfinished or error-state nodes
again... just as it does today.
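The selection rule in this proposal can be sketched as follows. The status strings ("ready", "error", "provisioning", "discover") and the dict layout are assumptions for the example, not confirmed Fuel values.

```python
def nodes_to_deploy(nodes):
    """Return names of nodes that have not been deployed successfully."""
    return [n["name"] for n in nodes if n["status"] != "ready"]


def reset_environment(nodes):
    """Return copies of all nodes marked as undeployed, i.e. start over."""
    return [dict(n, status="discover") for n in nodes]


# Hypothetical state after "stop deployment" was pressed.
nodes = [
    {"name": "controller-1", "status": "ready"},
    {"name": "compute-1", "status": "error"},
    {"name": "compute-2", "status": "provisioning"},
]

# Option 3: continue, touching only unfinished or failed nodes.
print(nodes_to_deploy(nodes))  # ['compute-1', 'compute-2']

# Option 2: reset first, then every node is deployed again.
print(nodes_to_deploy(reset_environment(nodes)))
```

After a reset every node is selected again, which matches the "start over" path; without a reset only the two non-ready nodes are picked up.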
That way, the customer has the option of starting over or just
continuing from where they left off. If controllers or
network install failed, Fuel would consider that an
unrecoverable error condition and just reinstall those
nodes/components again -- just like today.
This would also help to eliminate the confusion about when to
use reset environment. You'd always use them in tandem if you
wanted to start over -- i.e. reset environment is used either
during an aborted deployment or after a cloud has been up for
a while; either case.
Thanks,
-Dave Easter
From: Roman Alekseenkov <ralekseenkov@xxxxxxxxxxxx>
Date: Tuesday, November 19, 2013 8:28 AM
To: David Easter <deaster@xxxxxxxxxxxx>
Cc: Mike Scherbakov <mscherbakov@xxxxxxxxxxxx>
Subject: Re: Stop deployment concerns
David,
I'll let Mike comment regarding how feasible it is to recover
from failures of individual nodes. AFAIK, there are quite a
few different scenarios, e.g. restarting a failed controller
node may be harder than restarting a compute node.
Can you rewrite the wiki page with requirements, so Mike and I
can review it tomorrow? In particular:
* remove the notion of pause/unpause
o user should either wait until completion of deployment
process
o or they can abort in the middle, if they choose so
o if some nodes fail, user should still wait until the
whole deployment finishes before doing any more
actions, such as restarting the nodes (that is how it
is designed: you accumulate changes, then
"push" them at once using a deploy button)
* explain the use cases & desired paths to recover from
them. e.g.
o you are deploying from scratch and the whole
deployment fails
o you are deploying from scratch and some nodes fail
o you are doing an incremental deployment and adding a
new compute (ceph, etc) node to the cluster, and it fails
o what happens if you add and remove nodes in a single
"push"? what happens if you attempt to terminate an
action like this in the middle?
o etc
We need to carefully think through all scenarios.
Thanks,
Roman
On Tuesday, November 19, 2013, David Easter wrote:
Mike / Roman,
My understanding in conversations with Services is that
there is a scenario where much of the deployment runs
properly -- e.g. controllers have successfully been
deployed and several compute nodes successfully deploy.
However, a particular node may encounter an error
condition that requires the admin to perform an action --
sometimes making an OS level or HW change but sometimes
just having to redo the install against that node. In
fact, it happens to me all the time on my laptop when
installing on VirtualBox where one of the nodes fails to
install due to resource contention.
I feel it is a completely valid scenario where the
deployment of tens of nodes may be occurring and one node
in the middle of the deployment fails. Rather than
wiping out all of the nodes and starting over, one would
want to just retry the nodes that failed. My
understanding is that we can do that today, in fact, after
the deployment finishes with errors, so why would it not
also be possible when the deployment is stopped? All
I'm saying is that there should be an ability to retry the
nodes that failed or were aborted vs. wiping all the nodes
and starting over.
So for a particular node, I'm not suggesting that the
node itself not be reset to bootstrap. However, I'm
saying for the environment, everything should not be reset
to bootstrap -- only the nodes that failed.
Thanks,
-Dave Easter
From: Mike Scherbakov <mscherbakov@xxxxxxxxxxxx>
Date: Tuesday, November 19, 2013 5:11 AM
To: Roman Alekseenkov <ralekseenkov@xxxxxxxxxxxx>
Cc: David Easter <deaster@xxxxxxxxxxxx>
Subject: Re: Stop deployment concerns
Roman,
see inline.
On Tue, Nov 19, 2013 at 5:04 PM, Roman Alekseenkov
<ralekseenkov@xxxxxxxxxxxx> wrote:
I strongly disagree with some of the statements. We
need to define first what problem we are trying to solve.
For me, the key value of "stop deployment" feature is
to abort deployment in case something went wrong (e.g.
anaconda failed, and UI is stuck for 90 minutes
waiting for its completion). Forcing user to wait
without the ability to "stop deployment", knowing that
something is already wrong, is not smart.
That is the same thing I mean by "resetting": we abort the
deployment and reset the env to its initial state.
The customer does not wait.
I don't care about "resuming" deployment after it has
been stopped. All I care about is a quick way to
"stop" -> "reset" -> "install from scratch". Thus, I
don't care much about terminating apt-get in the
middle, etc. as the nodes will be wiped out anyway.
Neither do we in "resetting", but we do in "stop
deployment", which implies "resuming" after some manipulation
of configs.
Moreover, I think "abort deployment" is a more
appropriate term here than "stop deployment".
I agree, but it still means that we abort puppet run /
whatever other phase and hope that we can resume from the
point where we aborted.
David - let me know if you have a different use case
in mind. I just read the requirements and saw words
about pausing the deployment, restarting the
installation process from where it was paused, and so
on. Why did we add these? There doesn't seem to be a
reliable way of pausing and resuming deployment.
Agree.
So with all these comments, I tried to say the same in my
initial email. That is why we suggest concentrating
the work on aborting deployment followed by a reset to the
bootstrap state, which falls into the "Resetting environment"
feature. This will be a reliable way of:
* aborting deployment
* modification of configuration (not creating new env,
just modifying the existing one)
* starting deployment from the beginning with new
configuration
Thanks,
Roman
On Tuesday, November 19, 2013, Mike Scherbakov wrote:
David, Roman,
research behind the Stop Deployment
<https://mirantis.jira.com/wiki/display/PRD/Stop+Deployment> feature,
which is the #4 must-have in 4.0, shows a number of
concerns: