openstack team mailing list archive

Re: Nova-compute doesn't start on reboot, only manually

To: openstack@xxxxxxxxxxxxxxxxxxx
From: Russell Bryant <rbryant@xxxxxxxxxx>
Date: Tue, 29 May 2012 12:55:49 -0400
Cc: Chris Behrens <cbehrens@xxxxxxxxxxxx>
In-reply-to: <1338221771-sup-8854@fewbar.com>
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:12.0) Gecko/20120430 Thunderbird/12.0.1

On 05/28/2012 01:21 PM, Clint Byrum wrote:
> Looks to me that you need to make sure the other side of that RPC
> connection is up before nova-compute. I am not familiar with the specifics
> of what Nova needs at startup, but I'd guess this is nova-api or keystone.
> That's a pretty easy thing to do in a single system (just mess with the
> upstart jobs or init scripts) but across multiple systems, you'll need
> some kind of orchestration layer, and even then modeling the dependencies
> on the network with some other tool seems like something just begging
> to break.

In this case, it's nova-compute expecting nova-network to be up and
running when it starts up.  This also causes a problem when restarting
all of the services at the same time, as seen here:

https://bugs.launchpad.net/nova/+bug/999698

> Instead, the timeout should just be multiple minutes during startup, and
> the services should all be able to start in parallel if they are on the
> same box. I always think of one of those HP EcoPOD that is pre-installed
> with everything you need for OpenStack, and just shipped and then turned
> on. You could spend a lot of time trying to get that order just right,
> or you could just have everything extend their timeouts and get as far
> as they can without contact with the other services.
> 
> nova-compute doesn't *know* that the other side is in error, it just
> knows that it is not responding. This is not a problem with nova-compute,
> so why should nova-compute fail so quickly? One could even argue that
> nova-compute should wait *forever* for the other side. From an ops
> standpoint, they're both "down", so why make the operations team take
> two actions when the actual broken service recovers?

The problem is that since nova-network isn't up, the request gets lost.
nova-compute is sitting there waiting for a response to a message that
was most likely never even received.  It's also possible that
nova-network received the message but the service stopped before it
responded (but that is less likely, I think).

The message queues get created by the consumer of messages in nova.  So,
in this case, nova-network creates the queue.  Some possible solutions:

1) We could adjust this code path to just loop around and try again if
it hits a timeout.  We could make the timeout much shorter than the
default, to make recovery quicker.

The downside would be that we're fixing a single place, when this issue
could pop up elsewhere.
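
A rough sketch of what option 1 might look like.  This is not the nova
code: RpcTimeout, call_with_retry and its parameters are hypothetical
names made up for illustration, and "call" stands in for whatever
function actually sends the request and waits for the reply.

```python
import time


class RpcTimeout(Exception):
    """Hypothetical stand-in for whatever the RPC layer raises on timeout."""


def call_with_retry(call, timeout=5.0, retry_delay=1.0, max_attempts=None):
    """Invoke call(timeout) until it returns, retrying on RpcTimeout.

    A short per-attempt timeout means a lost request is re-sent quickly
    once the other side comes up.  max_attempts=None retries forever.
    """
    attempt = 0
    while True:
        attempt += 1
        try:
            return call(timeout)
        except RpcTimeout:
            # Give up only if the caller asked for a bounded number of tries.
            if max_attempts is not None and attempt >= max_attempts:
                raise
            time.sleep(retry_delay)
```

The caller would wrap only the one startup request in this, which is
exactly why it fixes a single place rather than the general case.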

2) We could make it so the sender creates the queue if it doesn't exist.

This is good because it covers all cases.  The bad thing is that we
would not be able to set the queue to be auto-deleted in this case, so
we could end up with a "leak" of unwanted message queues.
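
To make the trade-off in option 2 concrete, here is a toy in-memory
model.  It is not a real broker client and not nova code: ToyBroker and
all of its methods are invented for illustration, and it only mimics the
behavior described above (a message sent to a queue nobody has created
is dropped, and a queue created by the sender is not auto-deleted).

```python
class ToyBroker:
    """In-memory stand-in for a message broker (illustration only)."""

    def __init__(self):
        self.queues = {}       # queue name -> list of pending messages
        self.auto_delete = {}  # queue name -> removed when consumer leaves?

    def declare(self, name, auto_delete):
        # In this toy model, declaring an existing queue changes nothing.
        if name not in self.queues:
            self.queues[name] = []
            self.auto_delete[name] = auto_delete

    def publish(self, name, message, declare_first=False):
        if declare_first:
            # Option 2: the sender creates the queue, without auto-delete.
            self.declare(name, auto_delete=False)
        if name not in self.queues:
            return False  # nobody created the queue, so the message is lost
        self.queues[name].append(message)
        return True

    def consumer_disconnected(self, name):
        # Only auto-delete queues go away with their consumer.
        if self.auto_delete.get(name):
            del self.queues[name]
            del self.auto_delete[name]


broker = ToyBroker()
lost = not broker.publish("network", "request")             # consumer not up
kept = broker.publish("network", "request", declare_first=True)
broker.consumer_disconnected("network")
leaked = "network" in broker.queues                         # queue lingers
```

With the sender-side declare the request waits in the queue until the
consumer shows up, but the queue also outlives the consumer, which is
the "leak" mentioned above.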

I'm tempted to just write a patch that does #1 for now to address the
immediate issue and then do something better later if we come up with
something.

-- 
Russell Bryant

Follow ups

Re: Nova-compute doesn't start on reboot, only manually
From: Alessandro Tagliapietra, 2012-05-30

References

Nova-compute doesn't start on reboot, only manually
From: Alessandro Tagliapietra, 2012-05-28
Re: Nova-compute doesn't start on reboot, only manually
From: Clint Byrum, 2012-05-28