
Re: RFC: "Serialising" power actions

 

On 29/09/14 14:25, Gavin Panella wrote:
> On 23 September 2014 11:02, Mark Shuttleworth <mark@xxxxxxxxxx> wrote:
>> On 17/09/14 09:58, Gavin Panella wrote:
>>>>  * Storing state in the pserv without a means to recover it is a
>>>> recipe for disaster
>>> I guess you mean that a crash or restart in pserv would mean that
>>> in-progress power commands wouldn't be resumed. That's true, but it's
>>> not a disaster. It means that for nodes in all states but DEPLOYED we
>>> need to wait for the periodic power monitor to notice and reissue a
>>> command (see later; it doesn't do this yet). For DEPLOYED nodes, sure,
>>> the command will currently be lost, but these nodes are, one assumes,
>>> under active management, and some process outside of MAAS will notice,
>>> be that a human or a Juju or something else.
>> So (a) guess
> Not really. For states other than deployed we can take a stance on what
> power state the node ought to be in, and get MAAS to converge on that.

"We can take a stance" means we have some interpolation algorithm which
we have to explain to users, and then of course when we change our mind
and make it better we have to explain it again. So most users think
"damn it I have no idea what to expect with this thing".

Much better NOT to take a stance, maintain the queue of things users
told it to do, and be damn good about getting through that queue with
clear feedback to the user as to:

 * who asked for what
 * what the results of each of those efforts were (success or failure)
 * where we are now in the queue
 * what's still to come.

That's dumber but much easier for users to understand - MAAS is Just
Doing What We Told It To Do.
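
To make that concrete, here is a minimal sketch of the kind of
per-command record I mean (Python; the names are purely illustrative,
not a proposal for MAAS's actual schema):

    # Purely illustrative: one persisted record per requested power
    # action, carrying exactly the feedback listed above. Names are
    # hypothetical, not MAAS's real schema.
    from dataclasses import dataclass
    from datetime import datetime
    from enum import Enum


    class Status(Enum):
        QUEUED = "queued"        # what's still to come
        RUNNING = "running"      # where we are now in the queue
        SUCCEEDED = "succeeded"  # the result of a completed effort
        FAILED = "failed"        # ...or of a failed one
        CANCELLED = "cancelled"  # removed from the queue by a user


    @dataclass
    class PowerCommand:
        node: str              # system_id of the target node
        action: str            # "on", "off", ...
        requested_by: str      # who asked for what
        requested_at: datetime
        status: Status = Status.QUEUED
        detail: str = ""       # e.g. the error text when status is FAILED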


>> and (b) hope someone else cleans up the problem?
> It's not ideal, but it's better than what we had before, where MAAS did
> remember all outstanding power commands issued (unless RabbitMQ broke),
> but then ran them concurrently, and didn't give any feedback.

Not a high enough bar to be relevant ;)


> It's not a /disaster/ because we're not going to be restarting cluster
> controllers frequently, and crashes too will hopefully be infrequent.
> Losing in-progress power changes is a relatively small problem compared
> to the above.

Gavin, you have to stop saying things like "relatively infrequent" as a
justification for a bad outcome. "It's OK if it happens relatively
infrequently" is not how our customers think. They pay us SPECIFICALLY
TO AVOID infrequent and painful outages.

We had this same problem with DHCP; don't design for "works almost all
the time" or justify a bad outcome by saying "it should be an infrequent
event". That's just going to cost us all our credibility, which is not
something you can afford to squander on everyone else's behalf.

There is a different mindset required here: take pleasure in making it
DEFINITELY GOOD (knowing there will be bugs to be fixed). You should
feel very uncomfortable if you're extending an argument in favour of
something which "probably won't fail very often", because you're using
language that's just completely unreassuring to our audience.

> Solving this isn't a code problem; it's about the behaviour we'd want:
> restarting in-progress commands when a cluster controller comes back up
> /might/ be the wrong thing to do.

We have the queue of things we want to do, which we make visible to the
user. We say that we are currently NOT pursuing that queue, because the
relevant controller is down (visibility). And we let the user manipulate
the queue (cancel commands in the queue) if they want to have a
different outcome when the controller comes back.

Be explicit and clear.
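
To be concrete, the mechanics could look something like this (again,
purely illustrative names, building on the record sketched above; not
the real code):

    # Illustrative sketch only: the queue is worked strictly in order,
    # is explicitly paused while the controller is down, and queued
    # entries can be cancelled by the user before they run.
    class PowerQueue:
        def __init__(self):
            self.commands = []        # would be persisted in practice
            self.paused_reason = None

        def pause(self, reason):
            # Visibility: record WHY the queue is not being pursued.
            self.paused_reason = reason

        def resume(self):
            self.paused_reason = None

        def cancel(self, command, user):
            # Let the user correct the queue, e.g. while the
            # controller is still down.
            if command.status is Status.QUEUED:
                command.status = Status.CANCELLED
                command.detail = f"cancelled by {user}"

        def run_next(self, execute):
            # Work the head of the queue, recording success or failure
            # so the user can see the result of each effort.
            if self.paused_reason is not None:
                return
            for command in self.commands:
                if command.status is not Status.QUEUED:
                    continue
                command.status = Status.RUNNING
                try:
                    execute(command)
                    command.status = Status.SUCCEEDED
                except Exception as err:
                    command.status = Status.FAILED
                    command.detail = str(err)
                return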

> For example, given notice of an imminent power outage, I send a
> power-off command to all my nodes. The power fails prematurely (or I was
> late issuing the command) and my whole cluster goes suddenly dark. When
> the power is restored the cluster controller needs to do a lengthy fsck,
> or just boots slowly. The nodes in my cluster are okay, and they boot as
> soon as power returns, if set in the BIOS, or I switch them on by hand,
> so that service is restored for my customers. A few minutes later the
> cluster controller finishes booting and it resumes all in-progress power
> commands, turning all my nodes off.
>
> Right now I'm not sure if we can completely codify what to do after an
> outage or crash.

We present the queue, we let the user manipulate and correct it.

>  We might be able to address the hypothetical situation
> above by putting an expiry time on each power command, but for how long
> should that be? That would need discussion and/or experimentation.

No, we just let the user manipulate the queue themselves. THEY can
interpolate and tell us what they actually want if they want to skip
some of the steps.

> Perhaps the next thing /is/ to blindly resume in-progress commands, then
> we can refine iteratively from there.

Be good at doing what we are told to do, reliably. Be good at letting
the user tell us what they want to do.

Mark


