yahoo-eng-team team mailing list archive

[Bug 1308981] Re: Nova-compute does not recover controller switch over

To: yahoo-eng-team@xxxxxxxxxxxxxxxxxxx
From: Sean Dague <sean@xxxxxxxxx>
Date: Wed, 17 Sep 2014 13:45:16 -0000
Reply-to: Bug 1308981 <1308981@xxxxxxxxxxxxxxxxxx>
Sender: bounces@xxxxxxxxxxxxx

** Changed in: nova
       Status: Incomplete => Invalid

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1308981

Title:
  Nova-compute does not recover controller switch over

Status in OpenStack Compute (Nova):
  Invalid

Bug description:
  I have two controllers that form a RabbitMQ cluster, plus one compute
  node. The problem occurs when all nodes are up and I then shut down
  one of the controllers. The exception below is then logged in
  nova.log on the compute node.

  <182>Apr 17 14:36:37 compute-01 nova-nova.compute.resource_tracker INFO: Compute_service record updated for compute-01:compute-01.trelab.tieto.com
  <179>Apr 17 14:37:42 compute-01 nova-nova.servicegroup.drivers.db ERROR: model server went away
  Traceback (most recent call last):
    File "/usr/lib/python2.7/dist-packages/nova/servicegroup/drivers/db.py", line 96, in _report_state
      service.service_ref, state_catalog)
    File "/usr/lib/python2.7/dist-packages/nova/conductor/api.py", line 269, in service_update
      return self._manager.service_update(context, service, values)
    File "/usr/lib/python2.7/dist-packages/nova/conductor/rpcapi.py", line 397, in service_update
      service=service_p, values=values)
    File "/usr/lib/python2.7/dist-packages/nova/rpcclient.py", line 85, in call
      return self._invoke(self.proxy.call, ctxt, method, **kwargs)
    File "/usr/lib/python2.7/dist-packages/nova/rpcclient.py", line 63, in _invoke
      return cast_or_call(ctxt, msg, **self.kwargs)
    File "/usr/lib/python2.7/dist-packages/nova/openstack/common/rpc/proxy.py", line 130, in call
      exc.info, real_topic, msg.get('method'))
  Timeout: Timeout while waiting on RPC response - topic: "conductor", RPC method: "service_update" info: "<unknown>"
  <180>Apr 17 14:38:08 compute-01 nova-nova.compute.resource_tracker AUDIT: Auditing locally available compute resources
  <182>Apr 17 14:40:09 compute-01 nova-nova.compute.manager INFO: Updating bandwidth usage cache
  <180>Apr 17 14:44:39 compute-01 nova-nova.compute.resource_tracker AUDIT: Auditing locally available compute resources

  The compute node goes to the "down" state in nova service-list and
  does not recover. It recovers only when I start the stopped
  controller again. Sometimes nova-compute has to be restarted before
  it recovers.

  The system is at the Havana level. I have upgraded RabbitMQ to
  version 3.2.4 and created a policy so that RabbitMQ uses mirrored
  queues:

  $ rabbitmqctl set_policy HA '^(?!amq\.).*' '{"ha-mode": "all"}'
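
  To make the scope of that policy pattern concrete, here is a minimal
  sketch that tests the same regular expression against a few queue
  names. The queue names and the helper policy_applies are hypothetical,
  chosen only for illustration. RabbitMQ evaluates the pattern with its
  own regex engine, so this assumes Python's re module treats this
  simple negative lookahead the same way.

```python
import re

# Pattern copied from the set_policy command above: it selects every
# queue whose name does not start with "amq.".
POLICY_PATTERN = re.compile(r'^(?!amq\.).*')


def policy_applies(queue_name):
    """Return True if the HA policy pattern matches queue_name.

    Hypothetical helper for illustration; not part of nova or RabbitMQ.
    """
    return POLICY_PATTERN.match(queue_name) is not None


if __name__ == "__main__":
    # Hypothetical queue names, for illustration only.
    for name in ("conductor", "compute.compute-01", "amq.gen-abc123"):
        print(name, policy_applies(name))
```

  Under that assumption, queues such as "conductor" fall under the
  mirroring policy, while names starting with "amq." are left out.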

  rabbitmqctl cluster_status shows both controllers as running nodes.

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1308981/+subscriptions

References

[Bug 1308981] [NEW] Nova-compute does not recover controller switch over
From: Pekka Rinne, 2014-04-17