[Bug 1402574] [NEW] No fault-tolerance in nova-scheduler
Public bug reported:
If a nova-scheduler service dies while processing a request (see below
for how to reproduce it), the message is not rescheduled to another
scheduler in an HA setup.
Oslo messaging raises a timeout in the conductor:
2014-12-11 07:49:53.565 ERROR nova.scheduler.driver [req-f866a584-ba67-42a8-aec7-5500b631708e admin admin] Exception during scheduler.run_instance
Traceback (most recent call last):
  File "/opt/stack/nova/nova/conductor/manager.py", line 640, in build_instances
    request_spec, filter_properties)
  File "/opt/stack/nova/nova/scheduler/client/__init__.py", line 49, in select_destinations
    context, request_spec, filter_properties)
  File "/opt/stack/nova/nova/scheduler/client/__init__.py", line 35, in __run_method
    return getattr(self.instance, __name)(*args, **kwargs)
  File "/opt/stack/nova/nova/scheduler/client/query.py", line 34, in select_destinations
    context, request_spec, filter_properties)
  File "/opt/stack/nova/nova/scheduler/rpcapi.py", line 118, in select_destinations
    request_spec=request_spec, filter_properties=filter_properties)
  File "/usr/local/lib/python2.7/dist-packages/oslo/messaging/rpc/client.py", line 152, in call
    retry=self.retry)
  File "/usr/local/lib/python2.7/dist-packages/oslo/messaging/transport.py", line 90, in _send
    timeout=timeout, retry=retry)
  File "/usr/local/lib/python2.7/dist-packages/oslo/messaging/_drivers/amqpdriver.py", line 436, in send
    retry=retry)
  File "/usr/local/lib/python2.7/dist-packages/oslo/messaging/_drivers/amqpdriver.py", line 425, in _send
    result = self._waiter.wait(msg_id, timeout)
  File "/usr/local/lib/python2.7/dist-packages/oslo/messaging/_drivers/amqpdriver.py", line 315, in wait
    reply, ending = self._poll_connection(msg_id, timer)
  File "/usr/local/lib/python2.7/dist-packages/oslo/messaging/_drivers/amqpdriver.py", line 264, in _poll_connection
    % msg_id)
MessagingTimeout: Timed out waiting for a reply to message ID aec640c6da0f4cf383b5100ba2441331
The proper behavior would be to retry at least once, even in a
single-machine setup: the message would then be picked up by another
scheduler, or by the same one once it restarts.
The Oslo messaging architecture does not allow the AMQP server to handle
this, so message rescheduling has to be implemented in Nova (in the
application logic).
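As a rough illustration of what retrying in the application logic could look like, here is a minimal sketch. It is not Nova code and not the proposed fix: call_with_retry, its parameters and fake_select_destinations are hypothetical names, and MessagingTimeout is a local stand-in for the exception shown in the traceback, defined here only so the example runs on its own.

```python
import time


class MessagingTimeout(Exception):
    """Local stand-in for the timeout seen in the traceback (assumption)."""


def call_with_retry(rpc_call, attempts=2, delay=0.0):
    """Invoke rpc_call(), trying again when it times out.

    rpc_call: zero-argument callable wrapping the blocking RPC (hypothetical).
    attempts: total number of tries, including the first one.
    delay: seconds to sleep between tries.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return rpc_call()
        except MessagingTimeout:
            if attempt == attempts:
                raise  # out of tries: surface the timeout to the caller
            time.sleep(delay)


# Usage: a fake scheduler call that times out once, then answers.
calls = {"count": 0}


def fake_select_destinations():
    calls["count"] += 1
    if calls["count"] == 1:
        raise MessagingTimeout("Timed out waiting for a reply")
    return ["host1"]


hosts = call_with_retry(fake_select_destinations, attempts=2)
```

A retry like this is only safe if the call can be repeated without unwanted side effects; whether that holds for the scheduler call in question is not established here.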
To reproduce the error, I added ipdb.set_trace() in
nova/scheduler/filter_scheduler.py:287 before returning selected_hosts
in the _schedule method.
** Affects: nova
Importance: Undecided
Assignee: Grzegorz Grasza (xek)
Status: In Progress
** Tags: nova-conductor nova-scheduler
** Changed in: nova
Assignee: (unassigned) => Grzegorz Grasza (xek)
** Changed in: nova
Status: New => In Progress
--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1402574
Title:
No fault-tolerance in nova-scheduler
Status in OpenStack Compute (Nova):
In Progress