yahoo-eng-team team mailing list archive
Message #13688
[Bug 1308981] [NEW] Nova-compute does not recover controller switch over
Public bug reported:
I have two controllers that form a RabbitMQ cluster, plus one compute
node. The problem occurs when all the nodes are up and I then shut down
one of the controllers. The exception below is then logged in nova.log
on the compute node.
<182>Apr 17 14:36:37 compute-01 nova-nova.compute.resource_tracker INFO: Compute_service record updated for compute-01:compute-01.trelab.tieto.com
<179>Apr 17 14:37:42 compute-01 nova-nova.servicegroup.drivers.db ERROR: model server went away
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/nova/servicegroup/drivers/db.py", line 96, in _report_state
    service.service_ref, state_catalog)
  File "/usr/lib/python2.7/dist-packages/nova/conductor/api.py", line 269, in service_update
    return self._manager.service_update(context, service, values)
  File "/usr/lib/python2.7/dist-packages/nova/conductor/rpcapi.py", line 397, in service_update
    service=service_p, values=values)
  File "/usr/lib/python2.7/dist-packages/nova/rpcclient.py", line 85, in call
    return self._invoke(self.proxy.call, ctxt, method, **kwargs)
  File "/usr/lib/python2.7/dist-packages/nova/rpcclient.py", line 63, in _invoke
    return cast_or_call(ctxt, msg, **self.kwargs)
  File "/usr/lib/python2.7/dist-packages/nova/openstack/common/rpc/proxy.py", line 130, in call
    exc.info, real_topic, msg.get('method'))
Timeout: Timeout while waiting on RPC response - topic: "conductor", RPC method: "service_update" info: "<unknown>"
<180>Apr 17 14:38:08 compute-01 nova-nova.compute.resource_tracker AUDIT: Auditing locally available compute resources
<182>Apr 17 14:40:09 compute-01 nova-nova.compute.manager INFO: Updating bandwidth usage cache
<180>Apr 17 14:44:39 compute-01 nova-nova.compute.resource_tracker AUDIT: Auditing locally available compute resources
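For context, the timeout above comes from the servicegroup DB driver's periodic heartbeat, which reports the service state to the conductor over RPC. A simplified, illustrative sketch of that mechanism (not the actual Nova code; names are assumed):

import time

def report_state_loop(conductor_api, context, service_ref, report_interval=10):
    # Periodically bump the service's report_count via the conductor.
    # If the RPC call hangs (for example because the AMQP broker holding
    # the reply queue went away), no heartbeat reaches the database.
    while True:
        try:
            state = {'report_count': service_ref['report_count'] + 1}
            # This is the call that raises Timeout in the traceback above.
            conductor_api.service_update(context, service_ref, state)
            service_ref['report_count'] += 1
        except Exception:
            # Logged as "model server went away"; retried on the next tick.
            pass
        time.sleep(report_interval)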
The compute node goes to the "down" state in nova service-list and does
not recover. It recovers only when I start the other controller again;
sometimes nova-compute also has to be restarted before it recovers.
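As far as I understand, "down" in nova service-list is derived from the heartbeat timestamps: the service counts as up only while its last report is newer than the configured service_down_time. A minimal sketch of that check (assuming the default 60-second service_down_time; not the exact Nova code):

import datetime

SERVICE_DOWN_TIME = 60  # seconds; nova.conf option service_down_time

def service_is_up(service, now=None):
    # 'service' is assumed to be a dict with an 'updated_at' datetime set
    # by the heartbeat. When the heartbeat RPC keeps timing out,
    # 'updated_at' goes stale and this returns False, which
    # nova service-list shows as "down".
    now = now or datetime.datetime.utcnow()
    last_heartbeat = service.get('updated_at') or service.get('created_at')
    return (now - last_heartbeat).total_seconds() <= SERVICE_DOWN_TIME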
The system is at Havana level. I have upgraded RabbitMQ to version
3.2.4 and created a policy so that mirrored queues are used in
RabbitMQ:
$ rabbitmqctl set_policy HA '^(?!amq\.).*' '{"ha-mode": "all"}'
rabbitmqctl cluster_status shows both controllers as running nodes.
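For reference, the pattern in that policy mirrors every queue except RabbitMQ's built-in amq.* ones; a quick illustrative check of what the regular expression matches (the queue names here are made up):

import re

# Same pattern as in the set_policy command above.
HA_PATTERN = re.compile(r'^(?!amq\.).*')

for name in ['conductor', 'compute.compute-01', 'amq.gen-abc123']:
    print('%-20s mirrored: %s' % (name, bool(HA_PATTERN.match(name))))
# conductor and compute.compute-01 match (mirrored);
# amq.gen-abc123 does not, so RabbitMQ's internal queues stay unmirrored.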
** Affects: nova
Importance: Undecided
Status: New
--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1308981
To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1308981/+subscriptions