yahoo-eng-team team mailing list archive
Message #28136
[Bug 1402574] Re: No fault-tolerance in nova-scheduler
** Changed in: nova
Status: Fix Committed => Fix Released
** Changed in: nova
Milestone: None => kilo-2
--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1402574
Title:
No fault-tolerance in nova-scheduler
Status in OpenStack Compute (Nova):
Fix Released
Bug description:
If a nova-scheduler service dies while processing a request (see below
for how to reproduce this), the message is not rescheduled to another
scheduler in an HA setup.
Oslo messaging raises a timeout in the conductor:
2014-12-11 07:49:53.565 ERROR nova.scheduler.driver [req-f866a584-ba67-42a8-aec7-5500b631708e admin admin] Exception during scheduler.run_instance
Traceback (most recent call last):
  File "/opt/stack/nova/nova/conductor/manager.py", line 640, in build_instances
    request_spec, filter_properties)
  File "/opt/stack/nova/nova/scheduler/client/__init__.py", line 49, in select_destinations
    context, request_spec, filter_properties)
  File "/opt/stack/nova/nova/scheduler/client/__init__.py", line 35, in __run_method
    return getattr(self.instance, __name)(*args, **kwargs)
  File "/opt/stack/nova/nova/scheduler/client/query.py", line 34, in select_destinations
    context, request_spec, filter_properties)
  File "/opt/stack/nova/nova/scheduler/rpcapi.py", line 118, in select_destinations
    request_spec=request_spec, filter_properties=filter_properties)
  File "/usr/local/lib/python2.7/dist-packages/oslo/messaging/rpc/client.py", line 152, in call
    retry=self.retry)
  File "/usr/local/lib/python2.7/dist-packages/oslo/messaging/transport.py", line 90, in _send
    timeout=timeout, retry=retry)
  File "/usr/local/lib/python2.7/dist-packages/oslo/messaging/_drivers/amqpdriver.py", line 436, in send
    retry=retry)
  File "/usr/local/lib/python2.7/dist-packages/oslo/messaging/_drivers/amqpdriver.py", line 425, in _send
    result = self._waiter.wait(msg_id, timeout)
  File "/usr/local/lib/python2.7/dist-packages/oslo/messaging/_drivers/amqpdriver.py", line 315, in wait
    reply, ending = self._poll_connection(msg_id, timer)
  File "/usr/local/lib/python2.7/dist-packages/oslo/messaging/_drivers/amqpdriver.py", line 264, in _poll_connection
    % msg_id)
MessagingTimeout: Timed out waiting for a reply to message ID aec640c6da0f4cf383b5100ba2441331
The proper behavior would be to retry at least once, even in a
single-machine setup: the message would then be picked up by another
server, or by the same one once it restarts.
The Oslo messaging architecture doesn't support this being handled by
the AMQP server, so message rescheduling has to be implemented in Nova
(by the application logic).
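As a rough illustration of what retrying in the application logic
could look like, here is a minimal sketch. It is not the fix that was
merged into Nova. MessagingTimeout below is a local stand-in for the
oslo.messaging exception, and call_with_retry and
fake_select_destinations are hypothetical names used only for this
example.

```python
import time


class MessagingTimeout(Exception):
    """Local stand-in for oslo.messaging's MessagingTimeout (assumption)."""


def call_with_retry(rpc_call, attempts=2, delay=0.0):
    """Invoke rpc_call(); on MessagingTimeout, re-send up to `attempts` times.

    Hypothetical helper: `rpc_call` is any zero-argument callable that
    performs the blocking RPC and returns its reply.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return rpc_call()
        except MessagingTimeout:
            if attempt == attempts:
                raise  # out of attempts: surface the timeout to the caller
            time.sleep(delay)  # optional pause before re-sending


# Simulated scheduler: the first request is lost, the second is answered.
replies = iter([MessagingTimeout("no reply"), ["host-1"]])


def fake_select_destinations():
    reply = next(replies)
    if isinstance(reply, Exception):
        raise reply
    return reply


result = call_with_retry(fake_select_destinations, attempts=2)
```

A real implementation would also have to consider whether re-sending
the request is safe if the first scheduler did part of the work before
it died.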
To reproduce the error, I added ipdb.set_trace() in
nova/scheduler/filter_scheduler.py:287 before returning selected_hosts
in the _schedule method.
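The effect of a scheduler that accepts a request and then stops before
replying can be imitated without Nova. The sketch below is only an
analogy using Python's standard library: a worker thread takes a
request from a queue and then blocks, much like a process sitting at a
debugger prompt, while the caller waits for a reply with a timeout.
All names are hypothetical.

```python
import queue
import threading


def stalled_scheduler(requests, replies, stall):
    """Take one request, then block before replying."""
    requests.get()
    stall.wait()  # not set while the caller is waiting, so no reply is sent
    replies.put(["host-1"])


requests, replies = queue.Queue(), queue.Queue()
stall = threading.Event()
worker = threading.Thread(
    target=stalled_scheduler, args=(requests, replies, stall), daemon=True
)
worker.start()

# The caller sends a request and waits for the reply with a timeout.
requests.put({"num_instances": 1})
try:
    replies.get(timeout=0.2)
    timed_out = False
except queue.Empty:
    timed_out = True  # analogous to the MessagingTimeout in the conductor

# Release the worker so the thread can finish cleanly.
stall.set()
worker.join()
```

The request has already been consumed by the stalled worker, so no
other worker would see it; the caller's only signal is the timeout.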
To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1402574/+subscriptions