
openstack team mailing list archive

Re: how to deal with failed compute node

 

What about temporary failures where nova-compute is actually still working
but just not reporting?

Something like a network partition (oops, I disconnected the switch...)
where both sides are still working but just unaware of each other. If
nova-api starts cleaning up the running instance, that might be slightly
bad, right? Since it is actually still running and usable.
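
(For illustration only: a minimal sketch of the kind of heartbeat-based
liveness check that decides a service is "dead", where dead really just
means "has not reported recently". The names and the 60-second threshold
are assumptions, not nova's actual API; the point is that a partitioned
but healthy compute looks exactly like a crashed one.)

    import datetime

    SERVICE_DOWN_TIME = datetime.timedelta(seconds=60)  # assumed threshold

    def service_is_up(last_heartbeat, now=None):
        """True if the service reported in recently enough.

        A compute node cut off by a network partition stops reporting
        exactly like a crashed one, so this check alone cannot tell
        "host is dead" apart from "host is fine but unreachable".
        """
        now = now or datetime.datetime.utcnow()
        return (now - last_heartbeat) <= SERVICE_DOWN_TIME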

I'd almost rather stick to the more cautious approach where the instance
can be manually deleted (by the user) via something like a 'force remove'.
Having auto-remove triggered by multiple components (somewhat partially,
it seems) based on service 'aliveness' seems risky at times when VMs are
still working and running but a service has become temporarily
disconnected. Shouldn't whichever VMs are 'alive' continue to be 'alive'
even if a nova service dies? I would suspect customers wouldn't want their
VM's lifetime tied to some external component (in this case
nova-compute's up/downtime).
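
(Again just a sketch of the 'force remove' idea, under the assumption that
cleanup of DB/network/volume bookkeeping only ever runs on an explicit
request; none of these class or method names are nova's real interfaces.)

    class ForceDeleteSketch:
        """Illustrative only: cleanup runs solely on explicit request."""

        def __init__(self, db, network_api, volume_api):
            self.db = db
            self.network_api = network_api
            self.volume_api = volume_api

        def force_delete(self, context, instance):
            # The user/operator has said "this host is gone, remove the VM":
            # release network and volume bookkeeping, then drop the DB record.
            self.network_api.deallocate_for_instance(context, instance)
            self.volume_api.detach_all(context, instance)
            self.db.instance_destroy(context, instance["uuid"])

        def delete(self, context, instance, compute_is_up, cast_to_compute):
            if compute_is_up:
                # Normal path: let the owning nova-compute tear the VM down.
                cast_to_compute(context, instance)
            else:
                # Do not auto-clean on a missed heartbeat; the VM may still
                # be running behind a partition. Require force_delete().
                raise RuntimeError("compute unreachable; use force_delete")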

On 10/17/12 11:23 PM, "gtt116" <gtt116@xxxxxxx> wrote:

>Hi guys,
>
>Today, when terminating an instance, nova-api checks whether the
>nova-compute service is alive. If nova-compute is dead, nova-api just
>deletes the instance from the database, but does not release the fixed
>IP, floating IP, volumes, etc. When the failed nova-compute starts
>again, it will find the erroneously running instance and do the cleanup.
>But until nova-compute is started, the resources associated with the
>dead VM cannot be used; for example, the fixed IP cannot be associated
>with another VM.
>
>So I found a method to quickly clean up these resources. If nova-api
>finds that nova-compute is dead, it then finds another nova-compute that
>is alive. Although the alive nova-compute is not the real host of the
>VM, it can clean up the resources in the database, and even the network
>by making an RPC call to nova-network. It may raise some exceptions, but
>it works. What do you think about this?
>
>Why do we have many nova-compute and nova-network services? I think one
>reason is that when one node fails, another can do some work for it.
>
>Best regards,
>gtt
>
>
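
(As a rough illustration of the flow gtt describes above: when the owning
compute looks dead, delegate the cleanup to any other live nova-compute
over RPC instead of waiting for the dead host to return. Every name here
is hypothetical; this is not nova's actual code.)

    def _pick_live_compute(services, servicegroup, exclude):
        # Hypothetical helper: any other compute whose heartbeat is fresh.
        for service in services:
            if service["host"] != exclude and servicegroup.service_is_up(
                    service["host"]):
                return service["host"]
        return None

    def delete_when_owner_down(context, instance, services, servicegroup,
                               compute_rpc):
        owner = instance["host"]
        if servicegroup.service_is_up(owner):
            # Normal case: the owning host tears the instance down itself.
            compute_rpc.terminate_instance(context, instance, host=owner)
            return

        # Owner looks dead: have another live nova-compute free the fixed
        # IP, floating IP, etc. (via nova-network) and clean the DB records
        # now, instead of leaving them stuck until the dead host comes back.
        helper = _pick_live_compute(services, servicegroup, exclude=owner)
        if helper is None:
            raise RuntimeError("no live nova-compute available for cleanup")
        compute_rpc.cleanup_dead_instance(context, instance, host=helper)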


