← Back to team overview

yahoo-eng-team team mailing list archive

[Bug 1404241] Re: nova-compute state not updated

 

It sounds like there may be an issue in oslo.messaging with reconnecting

** Also affects: oslo.messaging
   Importance: Undecided
       Status: New

** Changed in: nova
       Status: New => Incomplete

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1404241

Title:
  nova-compute state not updated

Status in OpenStack Compute (Nova):
  Incomplete
Status in Messaging API for OpenStack:
  New

Bug description:
  I'm running 2014.2.1 on CentOS7. 1 controller and 5 compute nodes are
  deployed using packstack.

  Whenever I reboot the controller node, some of nova-compute services
  report state=XXX even after 60 minutes after reboot completed and
  controller node is up and running again:

  [root@juno1 ~(keystone_admin)]# nova-manage service list
  Binary           Host                                 Zone             Status     State Updated_At
  nova-consoleauth juno1                                internal         enabled    :-)   2014-12-19 13:17:48
  nova-scheduler   juno1                                internal         enabled    :-)   2014-12-19 13:17:47
  nova-conductor   juno1                                internal         enabled    :-)   2014-12-19 13:17:47
  nova-cert        juno1                                internal         enabled    :-)   2014-12-19 13:17:48
  nova-compute     juno4                                nova             enabled    XXX   2014-12-19 12:26:56
  nova-compute     juno5                                nova             enabled    :-)   2014-12-19 13:17:47
  nova-compute     juno6                                nova             enabled    :-)   2014-12-19 13:17:46
  nova-compute     juno3                                nova             enabled    :-)   2014-12-19 13:17:46
  nova-compute     juno2                                nova             enabled    XXX   2014-12-19 12:21:52

  Here is the chunk of nova-compute log from juno4:

  2014-12-19 15:46:02.082 5193 INFO oslo.messaging._drivers.impl_rabbit [-] Delaying reconnect for 1.0 seconds...
  2014-12-19 15:46:02.083 5193 ERROR oslo.messaging._drivers.impl_rabbit [-] Failed to consume message from queue: Socket closed
  2014-12-19 15:46:02.083 5193 TRACE oslo.messaging._drivers.impl_rabbit Traceback (most recent call last):
  2014-12-19 15:46:02.083 5193 TRACE oslo.messaging._drivers.impl_rabbit   File "/usr/lib/python2.7/site-packages/oslo/messaging/_drivers/impl_rabbit.py", line 655, in ensure
  2014-12-19 15:46:02.083 5193 TRACE oslo.messaging._drivers.impl_rabbit     return method()
  2014-12-19 15:46:02.083 5193 TRACE oslo.messaging._drivers.impl_rabbit   File "/usr/lib/python2.7/site-packages/oslo/messaging/_drivers/impl_rabbit.py", line 735, in _consume
  2014-12-19 15:46:02.083 5193 TRACE oslo.messaging._drivers.impl_rabbit     return self.connection.drain_events(timeout=timeout)
  2014-12-19 15:46:02.083 5193 TRACE oslo.messaging._drivers.impl_rabbit   File "/usr/lib/python2.7/site-packages/kombu/connection.py", line 281, in drain_events
  2014-12-19 15:46:02.083 5193 TRACE oslo.messaging._drivers.impl_rabbit     return self.transport.drain_events(self.connection, **kwargs)
  2014-12-19 15:46:02.083 5193 TRACE oslo.messaging._drivers.impl_rabbit   File "/usr/lib/python2.7/site-packages/kombu/transport/pyamqp.py", line 94, in drain_events
  2014-12-19 15:46:02.083 5193 TRACE oslo.messaging._drivers.impl_rabbit     return connection.drain_events(**kwargs)
  2014-12-19 15:46:02.083 5193 TRACE oslo.messaging._drivers.impl_rabbit   File "/usr/lib/python2.7/site-packages/amqp/connection.py", line 299, in drain_events
  2014-12-19 15:46:02.083 5193 TRACE oslo.messaging._drivers.impl_rabbit     chanmap, None, timeout=timeout,
  2014-12-19 15:46:02.083 5193 TRACE oslo.messaging._drivers.impl_rabbit   File "/usr/lib/python2.7/site-packages/amqp/connection.py", line 362, in _wait_multiple
  2014-12-19 15:46:02.083 5193 TRACE oslo.messaging._drivers.impl_rabbit     channel, method_sig, args, content = read_timeout(timeout)
  2014-12-19 15:46:02.083 5193 TRACE oslo.messaging._drivers.impl_rabbit   File "/usr/lib/python2.7/site-packages/amqp/connection.py", line 333, in read_timeout
  2014-12-19 15:46:02.083 5193 TRACE oslo.messaging._drivers.impl_rabbit     return self.method_reader.read_method()
  2014-12-19 15:46:02.083 5193 TRACE oslo.messaging._drivers.impl_rabbit   File "/usr/lib/python2.7/site-packages/amqp/method_framing.py", line 189, in read_method
  2014-12-19 15:46:02.083 5193 TRACE oslo.messaging._drivers.impl_rabbit     raise m
  2014-12-19 15:46:02.083 5193 TRACE oslo.messaging._drivers.impl_rabbit IOError: Socket closed
  2014-12-19 15:46:02.083 5193 TRACE oslo.messaging._drivers.impl_rabbit 
  2014-12-19 15:46:02.084 5193 INFO oslo.messaging._drivers.impl_rabbit [-] Delaying reconnect for 1.0 seconds...
  2014-12-19 15:46:03.084 5193 INFO oslo.messaging._drivers.impl_rabbit [-] Connecting to AMQP server on 9.167.185.90:5672
  2014-12-19 15:46:03.096 5193 INFO oslo.messaging._drivers.impl_rabbit [-] Connecting to AMQP server on 9.167.185.90:5672
  2014-12-19 15:46:03.105 5193 ERROR oslo.messaging._drivers.impl_rabbit [-] AMQP server on 9.167.185.90:5672 is unreachable: [Errno 111] ECONNREFUSED. Trying again in 1 seconds.
  2014-12-19 15:46:03.105 5193 ERROR oslo.messaging._drivers.impl_rabbit [-] AMQP server on 9.167.185.90:5672 is unreachable: [Errno 111] ECONNREFUSED. Trying again in 1 seconds.
  2014-12-19 15:46:04.106 5193 INFO oslo.messaging._drivers.impl_rabbit [-] Delaying reconnect for 1.0 seconds...
  2014-12-19 15:46:04.106 5193 INFO oslo.messaging._drivers.impl_rabbit [-] Delaying reconnect for 1.0 seconds...
  2014-12-19 15:46:05.106 5193 INFO oslo.messaging._drivers.impl_rabbit [-] Connecting to AMQP server on 9.167.185.90:5672
  2014-12-19 15:46:05.116 5193 INFO oslo.messaging._drivers.impl_rabbit [-] Connecting to AMQP server on 9.167.185.90:5672
  2014-12-19 15:46:05.157 5193 INFO oslo.messaging._drivers.impl_rabbit [-] Connected to AMQP server on 9.167.185.90:5672
  2014-12-19 15:46:05.159 5193 INFO oslo.messaging._drivers.impl_rabbit [-] Connected to AMQP server on 9.167.185.90:5672
  2014-12-19 15:46:33.229 5193 AUDIT nova.compute.resource_tracker [-] Auditing locally available compute resources
  2014-12-19 15:46:33.781 5193 ERROR oslo.messaging._drivers.impl_rabbit [-] Failed to publish message to topic 'conductor': [Errno 104] Connection reset by peer
  2014-12-19 15:46:33.781 5193 TRACE oslo.messaging._drivers.impl_rabbit Traceback (most recent call last):
  2014-12-19 15:46:33.781 5193 TRACE oslo.messaging._drivers.impl_rabbit   File "/usr/lib/python2.7/site-packages/oslo/messaging/_drivers/impl_rabbit.py", line 655, in ensure
  2014-12-19 15:46:33.781 5193 TRACE oslo.messaging._drivers.impl_rabbit     return method()
  2014-12-19 15:46:33.781 5193 TRACE oslo.messaging._drivers.impl_rabbit   File "/usr/lib/python2.7/site-packages/oslo/messaging/_drivers/impl_rabbit.py", line 752, in _publish
  2014-12-19 15:46:33.781 5193 TRACE oslo.messaging._drivers.impl_rabbit     publisher = cls(self.conf, self.channel, topic=topic, **kwargs)
  2014-12-19 15:46:33.781 5193 TRACE oslo.messaging._drivers.impl_rabbit   File "/usr/lib/python2.7/site-packages/oslo/messaging/_drivers/impl_rabbit.py", line 378, in __init__
  2014-12-19 15:46:33.781 5193 TRACE oslo.messaging._drivers.impl_rabbit     **options)
  2014-12-19 15:46:33.781 5193 TRACE oslo.messaging._drivers.impl_rabbit   File "/usr/lib/python2.7/site-packages/oslo/messaging/_drivers/impl_rabbit.py", line 326, in __init__
  2014-12-19 15:46:33.781 5193 TRACE oslo.messaging._drivers.impl_rabbit     self.reconnect(channel)
  2014-12-19 15:46:33.781 5193 TRACE oslo.messaging._drivers.impl_rabbit   File "/usr/lib/python2.7/site-packages/oslo/messaging/_drivers/impl_rabbit.py", line 334, in reconnect
  2014-12-19 15:46:33.781 5193 TRACE oslo.messaging._drivers.impl_rabbit     routing_key=self.routing_key)
  2014-12-19 15:46:33.781 5193 TRACE oslo.messaging._drivers.impl_rabbit   File "/usr/lib/python2.7/site-packages/kombu/messaging.py", line 83, in __init__
  2014-12-19 15:46:33.781 5193 TRACE oslo.messaging._drivers.impl_rabbit     self.revive(self._channel)
  2014-12-19 15:46:33.781 5193 TRACE oslo.messaging._drivers.impl_rabbit   File "/usr/lib/python2.7/site-packages/kombu/messaging.py", line 214, in revive
  2014-12-19 15:46:33.781 5193 TRACE oslo.messaging._drivers.impl_rabbit     self.declare()
  2014-12-19 15:46:33.781 5193 TRACE oslo.messaging._drivers.impl_rabbit   File "/usr/lib/python2.7/site-packages/kombu/messaging.py", line 100, in declare
  2014-12-19 15:46:33.781 5193 TRACE oslo.messaging._drivers.impl_rabbit     self.exchange.declare()
  2014-12-19 15:46:33.781 5193 TRACE oslo.messaging._drivers.impl_rabbit   File "/usr/lib/python2.7/site-packages/kombu/entity.py", line 163, in declare
  2014-12-19 15:46:33.781 5193 TRACE oslo.messaging._drivers.impl_rabbit     nowait=nowait, passive=passive,
  2014-12-19 15:46:33.781 5193 TRACE oslo.messaging._drivers.impl_rabbit   File "/usr/lib/python2.7/site-packages/amqp/channel.py", line 613, in exchange_declare
  2014-12-19 15:46:33.781 5193 TRACE oslo.messaging._drivers.impl_rabbit     self._send_method((40, 10), args)
  2014-12-19 15:46:33.781 5193 TRACE oslo.messaging._drivers.impl_rabbit   File "/usr/lib/python2.7/site-packages/amqp/abstract_channel.py", line 56, in _send_method
  2014-12-19 15:46:33.781 5193 TRACE oslo.messaging._drivers.impl_rabbit     self.channel_id, method_sig, args, content,
  2014-12-19 15:46:33.781 5193 TRACE oslo.messaging._drivers.impl_rabbit   File "/usr/lib/python2.7/site-packages/amqp/method_framing.py", line 221, in write_method
  2014-12-19 15:46:33.781 5193 TRACE oslo.messaging._drivers.impl_rabbit     write_frame(1, channel, payload)
  2014-12-19 15:46:33.781 5193 TRACE oslo.messaging._drivers.impl_rabbit   File "/usr/lib/python2.7/site-packages/amqp/transport.py", line 177, in write_frame
  2014-12-19 15:46:33.781 5193 TRACE oslo.messaging._drivers.impl_rabbit     frame_type, channel, size, payload, 0xce,
  2014-12-19 15:46:33.781 5193 TRACE oslo.messaging._drivers.impl_rabbit   File "/usr/lib/python2.7/site-packages/eventlet/greenio.py", line 359, in sendall
  2014-12-19 15:46:33.781 5193 TRACE oslo.messaging._drivers.impl_rabbit     tail = self.send(data, flags)
  2014-12-19 15:46:33.781 5193 TRACE oslo.messaging._drivers.impl_rabbit   File "/usr/lib/python2.7/site-packages/eventlet/greenio.py", line 342, in send
  2014-12-19 15:46:33.781 5193 TRACE oslo.messaging._drivers.impl_rabbit     total_sent += fd.send(data[total_sent:], flags)
  2014-12-19 15:46:33.781 5193 TRACE oslo.messaging._drivers.impl_rabbit error: [Errno 104] Connection reset by peer
  2014-12-19 15:46:33.781 5193 TRACE oslo.messaging._drivers.impl_rabbit 
  2014-12-19 15:46:33.782 5193 INFO oslo.messaging._drivers.impl_rabbit [-] Delaying reconnect for 1.0 seconds...
  2014-12-19 15:46:34.782 5193 INFO oslo.messaging._drivers.impl_rabbit [-] Connecting to AMQP server on 9.167.185.90:5672
  2014-12-19 15:46:34.796 5193 INFO oslo.messaging._drivers.impl_rabbit [-] Connected to AMQP server on 9.167.185.90:5672

  I see, that nova-compute lost AMQ connection with controller node because of reboot, but then connection was established again after reboot completed.
  I also can manage running instances on that compute node and see correct instance states.

  The problem can be solved if I do: systemctl restart openstack-nova-
  compute on nodes with XXX state. After that I see expected :-)

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1404241/+subscriptions


References