
yahoo-eng-team team mailing list archive

[Bug 1547544] Re: heat: MessagingTimeout: Timed out waiting for a reply to message ID

 

Looking at the dstat output, the node in question stays above a load
average of 11 for nearly 2 hours, and your error occurs about an hour
into that period.

Realistically, that is simply too much work to ask of the node. We
have found in the gate that once sustained load average goes over 10,
things start to break down. There is no bug fix for this; it is a
consequence of our architecture.
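
As a rough illustration of the "sustained load average over 10" rule of
thumb, here is a minimal Python sketch that scans a series of
load-average samples (for example, values copied out of dstat output)
and reports the longest consecutive stretch above a threshold. The
function name, the plain list-of-floats input, and the 60-second sample
interval are assumptions made for this example; none of them come from
nova or dstat.

```python
from typing import Iterable

# Rule-of-thumb threshold taken from the message above.
LOAD_THRESHOLD = 10.0


def longest_overload_seconds(samples: Iterable[float],
                             interval_s: float = 60.0,
                             threshold: float = LOAD_THRESHOLD) -> float:
    """Return the length, in seconds, of the longest consecutive run of
    samples strictly above `threshold`.

    `samples` is assumed to be evenly spaced, `interval_s` seconds apart.
    """
    longest = 0
    current = 0
    for load in samples:
        # Extend the current run while overloaded, otherwise reset it.
        current = current + 1 if load > threshold else 0
        longest = max(longest, current)
    return longest * interval_s


if __name__ == "__main__":
    # Hypothetical samples, one per minute.
    loads = [4.2, 11.3, 12.0, 11.8, 9.5, 11.1]
    print(longest_overload_seconds(loads))
```

Feeding it the load-average column from a capture like the one attached
to this bug would show whether the overload was a brief spike or, as
here, a sustained condition.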

Marking as Won't Fix, as I don't think there is anything actionable
here. If you have performance improvements in your environment that make
this better, that's great. However, there are limits beyond which the
nova compute worker simply falls over, and there is not much to be done
about it.

** Changed in: nova
       Status: New => Won't Fix

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1547544

Title:
  heat: MessagingTimeout: Timed out waiting for a reply to message ID

Status in OpenStack Compute (nova):
  Won't Fix
Status in oslo.messaging:
  New

Bug description:
  Setup:

  Single controller [48 GB RAM, 16 vCPU, 120 GB disk]
  3 Network Nodes
  100 ESX hypervisors distributed in 10 nova-compute nodes

  Test:

  1. Create /16 network
  2. Heat template which will launch 100 instances on the network created in step 1
  3. Create 10 stacks back to back, without waiting for the previous stack to complete, so that we reach 1000 instances

  Observation:

  Stack creations fail while nova is in run_periodic_tasks, at different
  places such as _heal_instance_info_cache, _sync_scheduler_instance_info,
  _update_available_resource, etc.

  I have attached a sample heat template, heat logs, and the nova compute
  log from one of the hosts.

  
  Logs:

  2016-02-19 04:21:54.691 TRACE nova.compute.manager   File "/usr/local/lib/python2.7/dist-packages/oslo_concurrency/lockutils.py", line 271, in inner
  2016-02-19 04:21:54.691 TRACE nova.compute.manager     return f(*args, **kwargs)
  2016-02-19 04:21:54.691 TRACE nova.compute.manager   File "/opt/stack/nova/nova/compute/resource_tracker.py", line 553, in _update_available_resource
  2016-02-19 04:21:54.691 TRACE nova.compute.manager     context, self.host, self.nodename)
  2016-02-19 04:21:54.691 TRACE nova.compute.manager   File "/usr/local/lib/python2.7/dist-packages/oslo_versionedobjects/base.py", line 174, in wrapper
  2016-02-19 04:21:54.691 TRACE nova.compute.manager     args, kwargs)
  2016-02-19 04:21:54.691 TRACE nova.compute.manager   File "/opt/stack/nova/nova/conductor/rpcapi.py", line 240, in object_class_action_versions
  2016-02-19 04:21:54.691 TRACE nova.compute.manager     args=args, kwargs=kwargs)
  2016-02-19 04:21:54.691 TRACE nova.compute.manager   File "/usr/local/lib/python2.7/dist-packages/oslo_messaging/rpc/client.py", line 158, in call
  2016-02-19 04:21:54.691 TRACE nova.compute.manager     retry=self.retry)
  2016-02-19 04:21:54.691 TRACE nova.compute.manager   File "/usr/local/lib/python2.7/dist-packages/oslo_messaging/transport.py", line 90, in _send
  2016-02-19 04:21:54.691 TRACE nova.compute.manager     timeout=timeout, retry=retry)
  2016-02-19 04:21:54.691 TRACE nova.compute.manager   File "/usr/local/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py", line 465, in send
  2016-02-19 04:21:54.691 TRACE nova.compute.manager     retry=retry)
  2016-02-19 04:21:54.691 TRACE nova.compute.manager   File "/usr/local/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py", line 454, in _send
  2016-02-19 04:21:54.691 TRACE nova.compute.manager     result = self._waiter.wait(msg_id, timeout)
  2016-02-19 04:21:54.691 TRACE nova.compute.manager   File "/usr/local/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py", line 337, in wait
  2016-02-19 04:21:54.691 TRACE nova.compute.manager     message = self.waiters.get(msg_id, timeout=timeout)
  2016-02-19 04:21:54.691 TRACE nova.compute.manager   File "/usr/local/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py", line 239, in get
  2016-02-19 04:21:54.691 TRACE nova.compute.manager     'to message ID %s' % msg_id)
  2016-02-19 04:21:54.691 TRACE nova.compute.manager MessagingTimeout: Timed out waiting for a reply to message ID a87a7f358a0948efa3ab5beb0c8f45e7
  --

  
  stack@esx-compute-9:/opt/stack/nova$ git log -1
  commit d51c5670d8d26e989d92eb29658eed8113034c0f
  Merge: 4fade90 30d5d80
  Author: Jenkins <jenkins@xxxxxxxxxxxxxxxxxxxx>
  Date:   Thu Feb 18 17:56:32 2016 +0000

      Merge "reset task_state after select_destinations failed."
  stack@esx-compute-9:/opt/stack/nova$

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1547544/+subscriptions

