yahoo-eng-team team mailing list archive

[Bug 1402574] [NEW] No fault-tolerance in nova-scheduler

 

Public bug reported:

If a nova-scheduler service dies while processing a request (see below
for how to reproduce it), the message is not rescheduled to another
scheduler in an HA setup.

Oslo messaging raises a timeout in the conductor:

2014-12-11 07:49:53.565 ERROR nova.scheduler.driver [req-f866a584-ba67-42a8-aec7-5500b631708e admin admin] Exception during scheduler.run_instance
 Traceback (most recent call last):
   File "/opt/stack/nova/nova/conductor/manager.py", line 640, in build_instances
     request_spec, filter_properties)
   File "/opt/stack/nova/nova/scheduler/client/__init__.py", line 49, in select_destinations
     context, request_spec, filter_properties)
   File "/opt/stack/nova/nova/scheduler/client/__init__.py", line 35, in __run_method
     return getattr(self.instance, __name)(*args, **kwargs)
   File "/opt/stack/nova/nova/scheduler/client/query.py", line 34, in select_destinations
     context, request_spec, filter_properties)
   File "/opt/stack/nova/nova/scheduler/rpcapi.py", line 118, in select_destinations
     request_spec=request_spec, filter_properties=filter_properties)
   File "/usr/local/lib/python2.7/dist-packages/oslo/messaging/rpc/client.py", line 152, in call
     retry=self.retry)
   File "/usr/local/lib/python2.7/dist-packages/oslo/messaging/transport.py", line 90, in _send
     timeout=timeout, retry=retry)
   File "/usr/local/lib/python2.7/dist-packages/oslo/messaging/_drivers/amqpdriver.py", line 436, in send
     retry=retry)
   File "/usr/local/lib/python2.7/dist-packages/oslo/messaging/_drivers/amqpdriver.py", line 425, in _send
     result = self._waiter.wait(msg_id, timeout)
   File "/usr/local/lib/python2.7/dist-packages/oslo/messaging/_drivers/amqpdriver.py", line 315, in wait
     reply, ending = self._poll_connection(msg_id, timer)
   File "/usr/local/lib/python2.7/dist-packages/oslo/messaging/_drivers/amqpdriver.py", line 264, in _poll_connection
     % msg_id)
 MessagingTimeout: Timed out waiting for a reply to message ID aec640c6da0f4cf383b5100ba2441331

The proper behavior would be to retry at least once, even in a
single-machine setup: the message would then be picked up by another
scheduler, or by the same one once it restarts.

The Oslo messaging architecture does not allow the AMQP server to
handle this, so message rescheduling has to be implemented in Nova
(in the application logic).
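
The retry described above could be sketched in application code roughly
as follows. This is a minimal illustration under stated assumptions, not
Nova's actual fix: call_with_retry, FlakyScheduler, and the attempts and
delay parameters are hypothetical names, and the local MessagingTimeout
class is only a stand-in for the oslo.messaging exception seen in the
traceback, so the example runs without an AMQP broker.

```python
import time


class MessagingTimeout(Exception):
    """Stand-in for oslo.messaging's MessagingTimeout (assumption:
    defined locally so this sketch runs on its own)."""


def call_with_retry(rpc_call, attempts=2, delay=0.0, sleep=time.sleep):
    """Invoke rpc_call(), retrying when it times out.

    attempts is the total number of tries, so attempts=2 means
    "try once again" after the first timeout.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return rpc_call()
        except MessagingTimeout:
            if attempt == attempts:
                raise  # out of retries: surface the timeout to the caller
            sleep(delay)  # optional pause before sending the request again


class FlakyScheduler(object):
    """Hypothetical scheduler client whose first `failures` calls time out,
    imitating a scheduler that died while processing the request."""

    def __init__(self, failures):
        self.failures = failures
        self.calls = 0

    def select_destinations(self):
        self.calls += 1
        if self.calls <= self.failures:
            raise MessagingTimeout("Timed out waiting for a reply")
        return ["host-1"]


if __name__ == "__main__":
    scheduler = FlakyScheduler(failures=1)
    hosts = call_with_retry(scheduler.select_destinations, attempts=2)
    print(hosts, scheduler.calls)
```

A retry of this kind is only safe if handling the same request twice is
harmless, because a timeout does not prove that the first scheduler
never processed the message.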

To reproduce the error, I added ipdb.set_trace() at
nova/scheduler/filter_scheduler.py:287, just before the _schedule
method returns selected_hosts.

** Affects: nova
     Importance: Undecided
     Assignee: Grzegorz Grasza (xek)
         Status: In Progress


** Tags: nova-conductor nova-scheduler

** Description changed:

  In the case a nova-scheduler server dies during processing (see below
- how I reproduce it), the message is not rescheduled to another one in a
+ how to reproduce it), the message is not rescheduled to another one in a
  HA setup.
  
  Oslo messaging raises a timeout in the conductor:
  
  2014-12-11 07:49:53.565 ERROR nova.scheduler.driver [req-f866a584-ba67-42a8-aec7-5500b631708e admin admin] Exception during scheduler.run_instance
-  Traceback (most recent call last):
-    File "/opt/stack/nova/nova/conductor/manager.py", line 640, in build_instances
-      request_spec, filter_properties)
-    File "/opt/stack/nova/nova/scheduler/client/__init__.py", line 49, in select_destinations
-      context, request_spec, filter_properties)
-    File "/opt/stack/nova/nova/scheduler/client/__init__.py", line 35, in __run_method
-      return getattr(self.instance, __name)(*args, **kwargs) 
-    File "/opt/stack/nova/nova/scheduler/client/query.py", line 34, in select_destinations
-      context, request_spec, filter_properties)
-    File "/opt/stack/nova/nova/scheduler/rpcapi.py", line 118, in select_destinations
-      request_spec=request_spec, filter_properties=filter_properties)
-    File "/usr/local/lib/python2.7/dist-packages/oslo/messaging/rpc/client.py", line 152, in call
-      retry=self.retry)
-    File "/usr/local/lib/python2.7/dist-packages/oslo/messaging/transport.py", line 90, in _send
-      timeout=timeout, retry=retry)
-    File "/usr/local/lib/python2.7/dist-packages/oslo/messaging/_drivers/amqpdriver.py", line 436, in send
-      retry=retry)
-    File "/usr/local/lib/python2.7/dist-packages/oslo/messaging/_drivers/amqpdriver.py", line 425, in _send
-      result = self._waiter.wait(msg_id, timeout)
-    File "/usr/local/lib/python2.7/dist-packages/oslo/messaging/_drivers/amqpdriver.py", line 315, in wait
-      reply, ending = self._poll_connection(msg_id, timer)
-    File "/usr/local/lib/python2.7/dist-packages/oslo/messaging/_drivers/amqpdriver.py", line 264, in _poll_connection
-      % msg_id)
-  MessagingTimeout: Timed out waiting for a reply to message ID aec640c6da0f4cf383b5100ba2441331
+  Traceback (most recent call last):
+    File "/opt/stack/nova/nova/conductor/manager.py", line 640, in build_instances
+      request_spec, filter_properties)
+    File "/opt/stack/nova/nova/scheduler/client/__init__.py", line 49, in select_destinations
+      context, request_spec, filter_properties)
+    File "/opt/stack/nova/nova/scheduler/client/__init__.py", line 35, in __run_method
+      return getattr(self.instance, __name)(*args, **kwargs)
+    File "/opt/stack/nova/nova/scheduler/client/query.py", line 34, in select_destinations
+      context, request_spec, filter_properties)
+    File "/opt/stack/nova/nova/scheduler/rpcapi.py", line 118, in select_destinations
+      request_spec=request_spec, filter_properties=filter_properties)
+    File "/usr/local/lib/python2.7/dist-packages/oslo/messaging/rpc/client.py", line 152, in call
+      retry=self.retry)
+    File "/usr/local/lib/python2.7/dist-packages/oslo/messaging/transport.py", line 90, in _send
+      timeout=timeout, retry=retry)
+    File "/usr/local/lib/python2.7/dist-packages/oslo/messaging/_drivers/amqpdriver.py", line 436, in send
+      retry=retry)
+    File "/usr/local/lib/python2.7/dist-packages/oslo/messaging/_drivers/amqpdriver.py", line 425, in _send
+      result = self._waiter.wait(msg_id, timeout)
+    File "/usr/local/lib/python2.7/dist-packages/oslo/messaging/_drivers/amqpdriver.py", line 315, in wait
+      reply, ending = self._poll_connection(msg_id, timer)
+    File "/usr/local/lib/python2.7/dist-packages/oslo/messaging/_drivers/amqpdriver.py", line 264, in _poll_connection
+      % msg_id)
+  MessagingTimeout: Timed out waiting for a reply to message ID aec640c6da0f4cf383b5100ba2441331
  
  The proper behavior would be to at least try once again, even in a
  single machine setup - the message will be picked up by another server
  or the same one when it restarts.
  
  The Oslo messaging architecture doesn't support this being handled by
  the AMQP server, so message rescheduling has to be implemented in Nova
  (by the application logic).
  
- 
- To reproduce the error, I added ipdb.set_trace() in nova/scheduler/filter_scheduler.py:287 before returning selected_hosts in the _schedule method.
+ To reproduce the error, I added ipdb.set_trace() in
+ nova/scheduler/filter_scheduler.py:287 before returning selected_hosts
+ in the _schedule method.

** Description changed:

- In the case a nova-scheduler server dies during processing (see below
+ In the case a nova-scheduler service dies during processing (see below
  how to reproduce it), the message is not rescheduled to another one in a
  HA setup.
  
  Oslo messaging raises a timeout in the conductor:
  
  2014-12-11 07:49:53.565 ERROR nova.scheduler.driver [req-f866a584-ba67-42a8-aec7-5500b631708e admin admin] Exception during scheduler.run_instance
   Traceback (most recent call last):
     File "/opt/stack/nova/nova/conductor/manager.py", line 640, in build_instances
       request_spec, filter_properties)
     File "/opt/stack/nova/nova/scheduler/client/__init__.py", line 49, in select_destinations
       context, request_spec, filter_properties)
     File "/opt/stack/nova/nova/scheduler/client/__init__.py", line 35, in __run_method
       return getattr(self.instance, __name)(*args, **kwargs)
     File "/opt/stack/nova/nova/scheduler/client/query.py", line 34, in select_destinations
       context, request_spec, filter_properties)
     File "/opt/stack/nova/nova/scheduler/rpcapi.py", line 118, in select_destinations
       request_spec=request_spec, filter_properties=filter_properties)
     File "/usr/local/lib/python2.7/dist-packages/oslo/messaging/rpc/client.py", line 152, in call
       retry=self.retry)
     File "/usr/local/lib/python2.7/dist-packages/oslo/messaging/transport.py", line 90, in _send
       timeout=timeout, retry=retry)
     File "/usr/local/lib/python2.7/dist-packages/oslo/messaging/_drivers/amqpdriver.py", line 436, in send
       retry=retry)
     File "/usr/local/lib/python2.7/dist-packages/oslo/messaging/_drivers/amqpdriver.py", line 425, in _send
       result = self._waiter.wait(msg_id, timeout)
     File "/usr/local/lib/python2.7/dist-packages/oslo/messaging/_drivers/amqpdriver.py", line 315, in wait
       reply, ending = self._poll_connection(msg_id, timer)
     File "/usr/local/lib/python2.7/dist-packages/oslo/messaging/_drivers/amqpdriver.py", line 264, in _poll_connection
       % msg_id)
   MessagingTimeout: Timed out waiting for a reply to message ID aec640c6da0f4cf383b5100ba2441331
  
  The proper behavior would be to at least try once again, even in a
  single machine setup - the message will be picked up by another server
  or the same one when it restarts.
  
  The Oslo messaging architecture doesn't support this being handled by
  the AMQP server, so message rescheduling has to be implemented in Nova
  (by the application logic).
  
  To reproduce the error, I added ipdb.set_trace() in
  nova/scheduler/filter_scheduler.py:287 before returning selected_hosts
  in the _schedule method.

** Changed in: nova
     Assignee: (unassigned) => Grzegorz Grasza (xek)

** Changed in: nova
       Status: New => In Progress

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1402574

Title:
  No fault-tolerance in nova-scheduler

Status in OpenStack Compute (Nova):
  In Progress


To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1402574/+subscriptions

