yahoo-eng-team team mailing list archive

[Bug 1308981] [NEW] Nova-compute does not recover after controller switchover

Public bug reported:

I have two controllers that form a RabbitMQ cluster, plus one compute
node. The problem occurs when all the nodes are up and I then shut down
one of the controllers. The exception below is then logged in nova.log
on the compute node.

<182>Apr 17 14:36:37 compute-01 nova-nova.compute.resource_tracker INFO: Compute_service record updated for compute-01:compute-01.trelab.tieto.com
<179>Apr 17 14:37:42 compute-01 nova-nova.servicegroup.drivers.db ERROR: model server went away
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/nova/servicegroup/drivers/db.py", line 96, in _report_state
    service.service_ref, state_catalog)
  File "/usr/lib/python2.7/dist-packages/nova/conductor/api.py", line 269, in service_update
    return self._manager.service_update(context, service, values)
  File "/usr/lib/python2.7/dist-packages/nova/conductor/rpcapi.py", line 397, in service_update
    service=service_p, values=values)
  File "/usr/lib/python2.7/dist-packages/nova/rpcclient.py", line 85, in call
    return self._invoke(self.proxy.call, ctxt, method, **kwargs)
  File "/usr/lib/python2.7/dist-packages/nova/rpcclient.py", line 63, in _invoke
    return cast_or_call(ctxt, msg, **self.kwargs)
  File "/usr/lib/python2.7/dist-packages/nova/openstack/common/rpc/proxy.py", line 130, in call
    exc.info, real_topic, msg.get('method'))
Timeout: Timeout while waiting on RPC response - topic: "conductor", RPC method: "service_update" info: "<unknown>"
<180>Apr 17 14:38:08 compute-01 nova-nova.compute.resource_tracker AUDIT: Auditing locally available compute resources
<182>Apr 17 14:40:09 compute-01 nova-nova.compute.manager INFO: Updating bandwidth usage cache
<180>Apr 17 14:44:39 compute-01 nova-nova.compute.resource_tracker AUDIT: Auditing locally available compute resources

The compute node goes to the "down" state in nova service-list and does
not recover. It only recovers when I start the other controller again,
and sometimes nova-compute also has to be restarted before it recovers.
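
For context, this is how I read the pattern behind that traceback: each
periodic state report does a synchronous service_update RPC call to
nova-conductor, the timeout is caught and only logged, and the service
then counts as "down" until a later report gets through. A standalone
sketch of that pattern (an illustration only, not the actual Nova code;
the function names are stand-ins):

# Standalone illustration (not actual Nova code) of the heartbeat pattern
# seen in the traceback: each periodic tick calls service_update() over RPC,
# a timeout is caught and only logged, and the service shows as "down" in
# nova service-list until a later call happens to get through.
import logging
import time

logging.basicConfig(level=logging.INFO)
LOG = logging.getLogger('heartbeat-sketch')


def service_update(values):
    # Stand-in for the conductor RPC call; raise to mimic the RPC Timeout.
    raise RuntimeError('Timeout while waiting on RPC response')


def report_state(state):
    try:
        service_update({'report_count': state['report_count'] + 1})
        state['report_count'] += 1
        if state['model_disconnected']:
            state['model_disconnected'] = False
            LOG.error('Recovered model server connection!')
    except Exception:
        # The exception is swallowed here, so the service simply stops
        # heartbeating until a later report succeeds; only the first
        # failure is logged as "model server went away".
        if not state['model_disconnected']:
            state['model_disconnected'] = True
            LOG.exception('model server went away')


state = {'report_count': 0, 'model_disconnected': False}
for _ in range(3):
    report_state(state)  # normally run every report_interval seconds
    time.sleep(1)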

The system is at Havana level. I have upgraded RabbitMQ to version
3.2.4 and created a policy so that mirrored queues are used in
RabbitMQ:

$ rabbitmqctl set_policy HA '^(?!amq\.).*' '{"ha-mode": "all"}'
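
The policy can be checked afterwards with:

$ rabbitmqctl list_policies

which should list the HA policy with the ^(?!amq\.).* pattern and ha-mode "all".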

rabbitmqctl cluster_status shows both controllers as running nodes.
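
For reference, failover on the client side also depends on the RabbitMQ
options in nova.conf on the compute node. A rough sketch of the relevant
Havana-era [DEFAULT] options (hostnames and values are placeholders, not
copied from my configuration):

[DEFAULT]
rabbit_hosts = controller-01:5672,controller-02:5672
rabbit_ha_queues = True
rabbit_retry_interval = 1
rabbit_retry_backoff = 2
# 0 means keep retrying the connection forever
rabbit_max_retries = 0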

** Affects: nova
     Importance: Undecided
         Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1308981

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1308981/+subscriptions