openstack team mailing list archive
Message #12487
Re: Nova-compute doesn't start on reboot, only manually
Thank you Russell, I'll wait for the fix.
Best Regards
2012/5/29 Russell Bryant <rbryant@xxxxxxxxxx>
> On 05/28/2012 01:21 PM, Clint Byrum wrote:
> > Looks to me that you need to make sure the other side of that RPC
> > connection is up before nova-compute. I am not familiar with the
> > specifics of what Nova needs at startup, but I'd guess this is nova-api
> > or keystone.
> > That's a pretty easy thing to do in a single system (just mess with the
> > upstart jobs or init scripts) but across multiple systems, you'll need
> > some kind of orchestration layer, and even then modeling the dependencies
> > on the network with some other tool seems like something just begging
> > to break.
>
> In this case, it's nova-compute expecting nova-network to be up and
> running when it starts up. This also causes a problem when restarting
> all of the services at the same time, as seen here:
>
> https://bugs.launchpad.net/nova/+bug/999698
>
> > Instead, the timeout should just be multiple minutes during startup, and
> > the services should all be able to start in parallel if they are on the
> > same box. I always think of one of those HP EcoPODs that is pre-installed
> > with everything you need for OpenStack, and just shipped and then turned
> > on. You could spend a lot of time trying to get that order just right,
> > or you could just have everything extend their timeouts and get as far
> > as they can without contact with the other services.
> >
> > nova-compute doesn't *know* that the other side is in error, it just
> > knows that it is not responding. This is not a problem with nova-compute,
> > so why should nova-compute fail so quickly? One could even argue that
> > nova-compute should wait *forever* for the other side. From an ops
> > standpoint, they're both "down", so why make the operations team take
> > two actions when the actual broken service recovers?
>
> The problem is that since nova-network isn't up, the request gets lost.
> nova-compute is sitting there waiting for a response to a message that
> was most likely never even received. It's also possible that
> nova-network received the message but the service stopped before it
> responded (but that is less likely, I think).
>
> The message queues get created by the consumer of messages in nova. So,
> in this case, nova-network creates the queue. Some possible solutions:
>
> 1) We could adjust this code path to just loop around and try again if
> it hits a timeout. We could make the timeout much shorter than the
> default, to make recovery quicker.
>
> The downside would be that we're fixing a single place, when this issue
> could pop up elsewhere.
>
> 2) We could make it so the sender creates the queue if it doesn't exist.
>
> This is good because it covers all cases. The bad thing is that we
> would not be able to set the queue to be auto-deleted in this case, so
> we could end up with a "leak" of unwanted message queues.
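
To illustrate why the request gets lost, and what option 2 would change, here is a minimal in-memory sketch. It is not Nova's RPC code and not a real AMQP client: ToyBroker, send, the queue name "network" and the message "get_network_info" are all hypothetical names made up for this example. The only behaviour it assumes is the one described above, namely that a message sent to a queue nobody has created yet is dropped.

```python
class ToyBroker:
    """In-memory stand-in for a message broker (hypothetical, not an AMQP client)."""

    def __init__(self):
        self.queues = {}

    def declare_queue(self, name):
        # Idempotent: declaring an existing queue keeps its messages.
        self.queues.setdefault(name, [])

    def publish(self, name, message):
        # A message for a queue that does not exist is dropped.
        queue = self.queues.get(name)
        if queue is None:
            return False
        queue.append(message)
        return True


def send(broker, queue_name, message, declare_first=False):
    """Publish a message, optionally declaring the queue first (option 2)."""
    if declare_first:
        broker.declare_queue(queue_name)
    return broker.publish(queue_name, message)


broker = ToyBroker()

# Today: the consumer has not started yet, so the request is lost.
lost = send(broker, "network", "get_network_info")

# Option 2: the sender declares the queue, so the request waits there
# until the consumer comes up and reads it.
kept = send(broker, "network", "get_network_info", declare_first=True)
```

In this toy model the queue declared by the sender simply stays around, which mirrors the "leak" concern mentioned for option 2.
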
>
>
> I'm tempted to just write a patch that does #1 for now to address the
> immediate issue and then do something better later if we come up with
> something.
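
A rough sketch of what option 1 could look like, assuming the RPC call takes a timeout argument and raises an exception when it expires. RpcTimeout, call_with_retry, make_flaky_call and all of the parameter names are hypothetical placeholders, not Nova's actual API, and the real patch may well look different.

```python
import time


class RpcTimeout(Exception):
    """Hypothetical stand-in for the timeout error raised by an RPC call."""


def call_with_retry(rpc_call, timeout=5.0, retry_delay=1.0, max_attempts=None):
    """Call rpc_call(timeout=...) again whenever it times out.

    A short timeout means a lost request is noticed and resent quickly.
    max_attempts=None retries forever, which matches the "wait for the
    other side" idea discussed earlier in the thread.
    """
    attempt = 0
    while True:
        attempt += 1
        try:
            return rpc_call(timeout=timeout)
        except RpcTimeout:
            if max_attempts is not None and attempt >= max_attempts:
                raise
            time.sleep(retry_delay)


def make_flaky_call(failures):
    """Build a fake RPC call that times out `failures` times, then succeeds."""
    state = {"remaining": failures}

    def rpc_call(timeout):
        if state["remaining"] > 0:
            state["remaining"] -= 1
            raise RpcTimeout()
        return "network info"

    return rpc_call


# The first two attempts time out (as if nova-network were not up yet),
# the third one gets an answer.
result = call_with_retry(make_flaky_call(2), timeout=0.1, retry_delay=0.0)
```

As noted above, this only covers the one call site that is wrapped; any other call that depends on a service that is not yet running would need the same treatment.
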
>
> --
> Russell Bryant
>