
yahoo-eng-team team mailing list archive

[Bug 1804262] [NEW] ComputeManager._run_image_cache_manager_pass times out when running on NFS

 

Public bug reported:

Description
===========

Under Pike we are operating /var/lib/nova/instances mounted from a clustered Netapp A700 AFF. The share is mounted across the entire nova fleet, currently 29 hosts (10G networking) with ~720 instances.
We mount the share with standard NFS options and are considering actimeo as an improvement, unless it is expected to cause metadata consistency issues:

host:/share /var/lib/nova/instances nfs
rw,relatime,vers=3,rsize=65536,wsize=65536,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=xxxx,mountvers=3,mountport=635,mountproto=udp,local_lock=none,addr=xxxx
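As a rough aid for reasoning about actimeo, the sketch below parses a mount option string like the one above and reports the attribute-cache timeouts that would apply. It assumes the defaults documented in nfs(5) (acregmin=3, acregmax=60, acdirmin=30, acdirmax=60) and that actimeo sets all four to the same value. The function names are hypothetical, and letting actimeo take precedence over individually set options is a simplifying assumption.

```python
# Defaults for the NFS attribute cache, in seconds, as documented in nfs(5).
NFS_AC_DEFAULTS = {"acregmin": 3, "acregmax": 60, "acdirmin": 30, "acdirmax": 60}


def parse_mount_options(option_string):
    """Split 'rw,vers=3,hard' into {'rw': True, 'vers': '3', 'hard': True}."""
    opts = {}
    for item in option_string.split(","):
        key, sep, value = item.partition("=")
        opts[key] = value if sep else True
    return opts


def effective_attr_cache(opts):
    """Return the attribute-cache timeouts implied by the parsed options."""
    timeouts = dict(NFS_AC_DEFAULTS)
    # Individually specified timeouts override the defaults.
    for name in NFS_AC_DEFAULTS:
        if name in opts:
            timeouts[name] = int(opts[name])
    # Assumption: actimeo, when present, wins and sets all four values.
    if "actimeo" in opts:
        timeouts = {name: int(opts["actimeo"]) for name in timeouts}
    return timeouts


if __name__ == "__main__":
    current = parse_mount_options("rw,relatime,vers=3,hard,proto=tcp,timeo=600")
    print(effective_attr_cache(current))
```

With the options we currently use, no attribute-cache option is set, so the defaults apply. Adding a hypothetical actimeo=10 would change all four timeouts to 10 seconds.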

But recently we noticed an increase in the following error:

Error during ComputeManager._run_image_cache_manager_pass: MessagingTimeout: Timed out waiting for a reply t

We mitigated this by increasing rpc_response_timeout.
As a result of the increased errors, we saw the nova-compute service flapping, which caused other issues such as volume attachments being delayed or failing.
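For anyone wanting to check what a host is actually configured with, here is a minimal sketch that reads rpc_response_timeout from nova.conf-style text. It assumes the option lives in the [DEFAULT] section and falls back to 60 seconds, the oslo.messaging default, when it is unset. The helper name and the value 180 in the sample are hypothetical and not the value we deployed.

```python
import configparser

# oslo.messaging default for rpc_response_timeout, in seconds.
OSLO_DEFAULT_RPC_TIMEOUT = 60


def rpc_response_timeout(conf_text):
    """Return rpc_response_timeout from nova.conf-style text, or the default."""
    # interpolation=None keeps '%' in other option values from raising errors;
    # strict=False tolerates duplicate sections or options.
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.read_string(conf_text)
    return parser.getint(
        "DEFAULT", "rpc_response_timeout", fallback=OSLO_DEFAULT_RPC_TIMEOUT
    )


# Hypothetical fragment; 180 is an illustrative value only.
SAMPLE = """
[DEFAULT]
rpc_response_timeout = 180
"""

if __name__ == "__main__":
    print(rpc_response_timeout(SAMPLE))
```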

Am I right in assuming that the resource tracker and service updates happen inside the same thread?
What else can we do to prevent these errors?
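To make the concern behind this question concrete: in a cooperatively scheduled process, one task that blocks without yielding delays every other task, including a periodic heartbeat. The sketch below demonstrates that effect with asyncio as a stand-in. It is not nova code (nova-compute uses eventlet), all names are hypothetical, and whether this is what happens in our case is exactly what we are asking.

```python
import asyncio
import time


async def heartbeat(stamps, interval, count):
    """Record a timestamp every `interval` seconds, `count` times."""
    for _ in range(count):
        stamps.append(time.monotonic())
        await asyncio.sleep(interval)


async def blocking_scan(duration):
    """Stand-in for slow blocking work, such as file I/O on a slow share."""
    time.sleep(duration)  # blocks the whole event loop; never yields


def max_heartbeat_gap(interval=0.05, block=0.3, count=5):
    """Return the largest gap in seconds between two consecutive heartbeats."""

    async def main():
        stamps = []
        await asyncio.gather(
            heartbeat(stamps, interval, count), blocking_scan(block)
        )
        return stamps

    stamps = asyncio.run(main())
    return max(later - earlier for earlier, later in zip(stamps, stamps[1:]))


if __name__ == "__main__":
    print("largest heartbeat gap: %.2fs" % max_heartbeat_gap())
```

Although the heartbeat asks to run every 0.05 seconds, the largest gap comes out at roughly the blocking duration, because the blocking call never hands control back to the scheduler.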

Actual result
=============
2018-11-20 14:09:40.413 4294 ERROR oslo_service.periodic_task [req-73d6cf48-d94a-41e4-a59e-9965fec4972d - - - - -] Error during ComputeManager._run_image_cache_manager_pass: MessagingTimeout: Timed out waiting for a reply to message ID 29820aa832354e788c7d50a533823c2a
2018-11-20 14:09:40.413 4294 ERROR oslo_service.periodic_task   File "/openstack/venvs/nova-r16.2.4/lib/python2.7/site-packages/oslo_service/periodic_task.py", line 220, in run_periodic_tasks
2018-11-20 14:09:40.413 4294 ERROR oslo_service.periodic_task     task(self, context)
2018-11-20 14:09:40.413 4294 ERROR oslo_service.periodic_task   File "/openstack/venvs/nova-r16.2.4/lib/python2.7/site-packages/nova/compute/manager.py", line 7118, in _run_image_cache_manager_pass
2018-11-20 14:09:40.413 4294 ERROR oslo_service.periodic_task     self.driver.manage_image_cache(context, filtered_instances)
2018-11-20 14:09:40.413 4294 ERROR oslo_service.periodic_task   File "/openstack/venvs/nova-r16.2.4/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 7563, in manage_image_cache
2018-11-20 14:09:40.413 4294 ERROR oslo_service.periodic_task     self.image_cache_manager.update(context, all_instances)
2018-11-20 14:09:40.413 4294 ERROR oslo_service.periodic_task   File "/openstack/venvs/nova-r16.2.4/lib/python2.7/site-packages/nova/virt/libvirt/imagecache.py", line 414, in update
2018-11-20 14:09:40.413 4294 ERROR oslo_service.periodic_task     running = self._list_running_instances(context, all_instances)
2018-11-20 14:09:40.413 4294 ERROR oslo_service.periodic_task   File "/openstack/venvs/nova-r16.2.4/lib/python2.7/site-packages/nova/virt/imagecache.py", line 54, in _list_running_instances
2018-11-20 14:09:40.413 4294 ERROR oslo_service.periodic_task     context, [instance.uuid for instance in all_instances])
2018-11-20 14:09:40.413 4294 ERROR oslo_service.periodic_task   File "/openstack/venvs/nova-r16.2.4/lib/python2.7/site-packages/nova/objects/block_device.py", line 333, in bdms_by_instance_uuid
2018-11-20 14:09:40.413 4294 ERROR oslo_service.periodic_task     bdms = cls.get_by_instance_uuids(context, instance_uuids)
2018-11-20 14:09:40.413 4294 ERROR oslo_service.periodic_task   File "/openstack/venvs/nova-r16.2.4/lib/python2.7/site-packages/oslo_versionedobjects/base.py", line 177, in wrapper
2018-11-20 14:09:40.413 4294 ERROR oslo_service.periodic_task     args, kwargs)
2018-11-20 14:09:40.413 4294 ERROR oslo_service.periodic_task   File "/openstack/venvs/nova-r16.2.4/lib/python2.7/site-packages/nova/conductor/rpcapi.py", line 240, in object_class_action_versions
2018-11-20 14:09:40.413 4294 ERROR oslo_service.periodic_task     args=args, kwargs=kwargs)
2018-11-20 14:09:40.413 4294 ERROR oslo_service.periodic_task   File "/openstack/venvs/nova-r16.2.4/lib/python2.7/site-packages/oslo_messaging/rpc/client.py", line 169, in call
2018-11-20 14:09:40.413 4294 ERROR oslo_service.periodic_task     retry=self.retry)
2018-11-20 14:09:40.413 4294 ERROR oslo_service.periodic_task   File "/openstack/venvs/nova-r16.2.4/lib/python2.7/site-packages/oslo_messaging/transport.py", line 123, in _send
2018-11-20 14:09:40.413 4294 ERROR oslo_service.periodic_task     timeout=timeout, retry=retry)
2018-11-20 14:09:40.413 4294 ERROR oslo_service.periodic_task   File "/openstack/venvs/nova-r16.2.4/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 566, in send
2018-11-20 14:09:40.413 4294 ERROR oslo_service.periodic_task     retry=retry)
2018-11-20 14:09:40.413 4294 ERROR oslo_service.periodic_task   File "/openstack/venvs/nova-r16.2.4/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 555, in _send
2018-11-20 14:09:40.413 4294 ERROR oslo_service.periodic_task     result = self._waiter.wait(msg_id, timeout)
2018-11-20 14:09:40.413 4294 ERROR oslo_service.periodic_task   File "/openstack/venvs/nova-r16.2.4/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 447, in wait

Expected result
===============
rpc_response_timeout should remain constant regardless of the number of instances operated under /var/lib/nova/instances

Environment
===========
Ubuntu 16.04.4 LTS (amd64)

pips:
nova==16.1.5.dev57
nova-lxd==16.0.1.dev1
nova-powervm==5.0.4.dev3
python-novaclient==9.1.2

debs:
libvirt-bin  3.6.0-1ubuntu6.8~cloud0
libvirt-clients  3.6.0-1ubuntu6.8~cloud0
libvirt-daemon  3.6.0-1ubuntu6.8~cloud0
libvirt-daemon-system  3.6.0-1ubuntu6.8~cloud0
libvirt0 3.6.0-1ubuntu6.8~cloud0
python-libvirt  3.5.0-1build1~cloud0

** Affects: nova
     Importance: Undecided
         Status: New

** Description changed:

  
- 
- But recently we noticed an increase of  Error during ComputeManager._run_image_cache_manager_pass: MessagingTimeout: Timed out waiting for a reply t
+ But recently we noticed an increase of  Error during
+ ComputeManager._run_image_cache_manager_pass: MessagingTimeout: Timed
+ out waiting for a reply t
  
  
- 
  Expected result
  ===============
- rpc_response_timeout should remain constant regardless of instances operated under /var/log
+ rpc_response_timeout should remain constant regardless of instances operated under /var/lib/nova/instances
  

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1804262

Title:
  ComputeManager._run_image_cache_manager_pass times out when running on
  NFS

Status in OpenStack Compute (nova):
  New


To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1804262/+subscriptions

