yahoo-eng-team team mailing list archive

[Bug 1402574] Re: No fault-tolerance in nova-scheduler

 

** Changed in: nova
       Status: Fix Committed => Fix Released

** Changed in: nova
    Milestone: None => kilo-2

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1402574

Title:
  No fault-tolerance in nova-scheduler

Status in OpenStack Compute (Nova):
  Fix Released

Bug description:
  If a nova-scheduler service dies while processing a request (see
  below for how to reproduce this), the message is not rescheduled to
  another scheduler in an HA setup.

  Oslo messaging raises a timeout in the conductor:

  2014-12-11 07:49:53.565 ERROR nova.scheduler.driver [req-f866a584-ba67-42a8-aec7-5500b631708e admin admin] Exception during scheduler.run_instance
   Traceback (most recent call last):
     File "/opt/stack/nova/nova/conductor/manager.py", line 640, in build_instances
       request_spec, filter_properties)
     File "/opt/stack/nova/nova/scheduler/client/__init__.py", line 49, in select_destinations
       context, request_spec, filter_properties)
     File "/opt/stack/nova/nova/scheduler/client/__init__.py", line 35, in __run_method
       return getattr(self.instance, __name)(*args, **kwargs)
     File "/opt/stack/nova/nova/scheduler/client/query.py", line 34, in select_destinations
       context, request_spec, filter_properties)
     File "/opt/stack/nova/nova/scheduler/rpcapi.py", line 118, in select_destinations
       request_spec=request_spec, filter_properties=filter_properties)
     File "/usr/local/lib/python2.7/dist-packages/oslo/messaging/rpc/client.py", line 152, in call
       retry=self.retry)
     File "/usr/local/lib/python2.7/dist-packages/oslo/messaging/transport.py", line 90, in _send
       timeout=timeout, retry=retry)
     File "/usr/local/lib/python2.7/dist-packages/oslo/messaging/_drivers/amqpdriver.py", line 436, in send
       retry=retry)
     File "/usr/local/lib/python2.7/dist-packages/oslo/messaging/_drivers/amqpdriver.py", line 425, in _send
       result = self._waiter.wait(msg_id, timeout)
     File "/usr/local/lib/python2.7/dist-packages/oslo/messaging/_drivers/amqpdriver.py", line 315, in wait
       reply, ending = self._poll_connection(msg_id, timer)
     File "/usr/local/lib/python2.7/dist-packages/oslo/messaging/_drivers/amqpdriver.py", line 264, in _poll_connection
       % msg_id)
   MessagingTimeout: Timed out waiting for a reply to message ID aec640c6da0f4cf383b5100ba2441331

  The proper behavior would be to retry at least once, even in a
  single-machine setup: the message would then be picked up by another
  server, or by the same one once it restarts.

  The Oslo messaging architecture does not allow the AMQP server to
  handle this, so message rescheduling has to be implemented in Nova's
  application logic.
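
  As a rough illustration of application-level rescheduling, the sketch
  below retries a call when it times out. It is not the fix that was
  committed to Nova: call_with_retry, its attempts and delay parameters,
  and the local MessagingTimeout class are all hypothetical stand-ins so
  that the example runs without oslo.messaging installed. It also
  assumes the request is safe to send twice.

```python
import time


class MessagingTimeout(Exception):
    """Hypothetical stand-in for the timeout raised by the RPC layer."""


def call_with_retry(rpc_call, attempts=2, delay=0.0):
    """Invoke rpc_call(), re-sending the request if it times out.

    rpc_call is any zero-argument callable that performs the request;
    attempts is the total number of tries, including the first one.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return rpc_call()
        except MessagingTimeout:
            if attempt == attempts:
                # Out of retries: surface the timeout to the caller.
                raise
            # Give another (or a restarted) scheduler time to come up.
            time.sleep(delay)
```

  A caller would wrap the scheduler request in a zero-argument callable
  (for example with functools.partial) and pass it to call_with_retry.
  Whether a blind retry is acceptable depends on the request being
  idempotent, which this sketch takes for granted.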

  To reproduce the error, I added an ipdb.set_trace() call at
  nova/scheduler/filter_scheduler.py:287, just before the _schedule
  method returns selected_hosts, so that the scheduler pauses in the
  middle of a request.

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1402574/+subscriptions

