
yahoo-eng-team team mailing list archive

[Bug 1646896] [NEW] System hangs when using NFS storage backend with loopback mounts

 

Public bug reported:

Description
===========
When using high-speed disks with an NFS storage backend, the NFS mounts hang indefinitely under high load.

Steps to reproduce
==================
Steps to reproduce the issue, in order:
* Spin up a VM with a mounted Cinder volume from an NFS backend
* Generate some read/write load on the volume
* Occasionally the loopback NFS mounts will hang; the machine and everything else using that mount will hang as well.
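
The read/write load in the steps above can be generated with a few parallel dd writers. A minimal sketch — the TARGET directory is an assumption and should point at a directory on the NFS-backed Cinder volume inside the guest:

```shell
#!/bin/sh
# Minimal parallel-write load generator (illustrative, not the exact
# workload from the report). TARGET is an assumed path: set it to a
# directory on the NFS-backed Cinder volume.
TARGET="${TARGET:-/tmp/nfs-load-test}"
mkdir -p "$TARGET"

# Launch four concurrent writers; conv=fsync forces each file to be
# flushed, so the writes actually reach the NFS mount rather than
# sitting in the page cache.
for i in 1 2 3 4; do
  dd if=/dev/zero of="$TARGET/file$i" bs=1M count=16 conv=fsync 2>/dev/null &
done
wait
ls -lh "$TARGET"
```

Under the reported conditions, a run like this occasionally never returns: the dd processes end up in uninterruptible sleep on the hung mount.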


Expected result
===============
The system should run stably.

Actual result
=============
Occasionally, usually under higher load, the system hangs.

Environment
===========
1. Exact version of OpenStack you are running:
OpenStack Kilo

openstack-nova-compute-2015.1.1-1.el7.noarch
openstack-nova-cert-2015.1.1-1.el7.noarch
python-nova-2015.1.1-1.el7.noarch
openstack-nova-console-2015.1.1-1.el7.noarch
openstack-nova-novncproxy-2015.1.1-1.el7.noarch
openstack-nova-common-2015.1.1-1.el7.noarch
python-novaclient-2.23.0-1.el7.noarch
openstack-nova-scheduler-2015.1.1-1.el7.noarch
openstack-nova-api-2015.1.1-1.el7.noarch
openstack-nova-conductor-2015.1.1-1.el7.noarch


2. Which hypervisor did you use?
   Libvirt + KVM

3. Which storage type did you use?
   NFS

4. Which networking type did you use?
   Neutron with Open vSwitch

Logs & Configs
==============

nova.conf:
[DEFAULT]
notification_driver=ceilometer.compute.nova_notifier
notification_driver=nova.openstack.common.notifier.rpc_notifier
notification_driver =
notification_topics=notifications
rpc_backend=rabbit
internal_service_availability_zone=internal
default_availability_zone=nova
notify_api_faults=False
state_path=/openstack/nova
report_interval=10
enabled_apis=ec2,osapi_compute,metadata
ec2_listen=0.0.0.0
ec2_workers=2
osapi_compute_listen=0.0.0.0
osapi_compute_workers=2
metadata_listen=0.0.0.0
metadata_workers=2
compute_manager=nova.compute.manager.ComputeManager
service_down_time=60
rootwrap_config=/etc/nova/rootwrap.conf
auth_strategy=keystone
use_forwarded_for=False
novncproxy_host=192.168.0.1
novncproxy_port=6080
allow_resize_to_same_host=true
block_device_allocate_retries=1560
heal_instance_info_cache_interval=60
reserved_host_memory_mb=512
network_api_class=nova.network.neutronv2.api.API
default_floating_pool=public
force_snat_range=0.0.0.0/0
metadata_host=192.168.0.1
dhcp_domain=novalocal
security_group_api=neutron
debug=True
verbose=True
log_dir=/var/log/nova
use_syslog=False
cpu_allocation_ratio=16.0
ram_allocation_ratio=1.5
scheduler_default_filters=RetryFilter,AvailabilityZoneFilter,RamFilter,ComputeFilter,ComputeCapabilitiesFilter,ImagePropertiesFilter,CoreFilter
scheduler_driver=nova.scheduler.filter_scheduler.FilterScheduler
compute_driver=libvirt.LibvirtDriver
vif_plugging_is_fatal=True
vif_plugging_timeout=300
firewall_driver=nova.virt.firewall.NoopFirewallDriver
remove_unused_base_images=true
force_raw_images=True
novncproxy_base_url=http://0.0.0.0:6080/vnc_auto.html
vncserver_listen=192.168.0.1
vncserver_proxyclient_address=127.0.0.1
vnc_enabled=True
vnc_keymap=en-us
volume_api_class=nova.volume.cinder.API
amqp_durable_queues=False
sql_connection=mysql:XXXXXXXXXXX
lock_path=/openstack/nova/tmp
osapi_volume_listen=0.0.0.0
[api_database]
[barbican]
[cells]
[cinder]
[conductor]
workers=2
[database]
[ephemeral_storage_encryption]
[glance]
api_servers=192.168.0.1:9292
[guestfs]
[hyperv]
[image_file_url]
[ironic]
[keymgr]
[keystone_authtoken]
auth_uri=http://192.168.0.1:5000/v2.0
identity_uri=http://192.168.0.1:35357
admin_user=nova
admin_password=XXXXXXx
[libvirt]
virt_type=kvm
inject_password=False
inject_key=False
inject_partition=-1
live_migration_uri=qemu+tcp://nova@%s/system
cpu_mode=host-model
disk_cachemodes=file=writethrough,block=writethrough
nfs_mount_options=rw,hard,intr,nolock,vers=4.1,timeo=10
vif_driver=nova.virt.libvirt.vif.LibvirtGenericVIFDriver
[metrics]
[neutron]
......


cinder.conf:

[nfs_ssd]
nfs_used_ratio=0.95
nfs_oversub_ratio=10.0
volume_driver=cinder.volume.drivers.nfs.NfsDriver
nfs_shares_config=/etc/cinder/nfs_shares_ssd.conf
volume_backend_name=nfs_ssd
quota_volumes = -1
nfs_mount_options=rw,hard,intr,nolock


- No notable output in the nova logs.


- System log (dmesg) after a hang:

Nov 24 04:10:41 openstack1.itgix.com kernel: INFO: task qemu-kvm:11726 blocked for more than 120 seconds.
Nov 24 04:10:41 openstack1.itgix.com kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 24 04:10:41 openstack1.itgix.com kernel: qemu-kvm        D ffff88118b1b1f60     0 11726      1 0x00000080
Nov 24 04:10:41 openstack1.itgix.com kernel:  ffff880da4c77c40 0000000000000082 ffff881184b86780 ffff880da4c77fd8
Nov 24 04:10:41 openstack1.itgix.com kernel:  ffff880da4c77fd8 ffff880da4c77fd8 ffff881184b86780 ffff88118b1b1f58
Nov 24 04:10:41 openstack1.itgix.com kernel:  ffff88118b1b1f5c ffff881184b86780 00000000ffffffff ffff88118b1b1f60
Nov 24 04:10:41 openstack1.itgix.com kernel: Call Trace:
Nov 24 04:10:41 openstack1.itgix.com kernel:  [<ffffffff8163bf29>] schedule_preempt_disabled+0x29/0x70
Nov 24 04:10:41 openstack1.itgix.com kernel:  [<ffffffff81639c25>] __mutex_lock_slowpath+0xc5/0x1c0
Nov 24 04:10:41 openstack1.itgix.com kernel:  [<ffffffff812fbff4>] ? timerqueue_del+0x24/0x70
Nov 24 04:10:41 openstack1.itgix.com kernel:  [<ffffffff8163908f>] mutex_lock+0x1f/0x2f
Nov 24 04:10:41 openstack1.itgix.com kernel:  [<ffffffff8116b60a>] generic_file_aio_write+0x4a/0xc0
Nov 24 04:10:41 openstack1.itgix.com kernel:  [<ffffffffa06dc03b>] nfs_file_write+0xbb/0x1d0 [nfs]
Nov 24 04:10:41 openstack1.itgix.com kernel: Call Trace:
Nov 24 04:10:41 openstack1.itgix.com kernel:  [<ffffffff8163bf29>] schedule_preempt_disabled+0x29/0x70
Nov 24 04:10:41 openstack1.itgix.com kernel:  [<ffffffff81639c25>] __mutex_lock_slowpath+0xc5/0x1c0
Nov 24 04:10:41 openstack1.itgix.com kernel:  [<ffffffff8163908f>] mutex_lock+0x1f/0x2f
Nov 24 04:10:41 openstack1.itgix.com kernel:  [<ffffffff8116b60a>] generic_file_aio_write+0x4a/0xc0
Nov 24 04:10:41 openstack1.itgix.com kernel:  [<ffffffffa06dc03b>] nfs_file_write+0xbb/0x1d0 [nfs]
Nov 24 04:10:41 openstack1.itgix.com kernel:  [<ffffffff811dde5d>] do_sync_write+0x8d/0xd0
Nov 24 04:10:41 openstack1.itgix.com kernel:  [<ffffffff811de67d>] vfs_write+0xbd/0x1e0
Nov 24 04:10:41 openstack1.itgix.com kernel:  [<ffffffff811df2d2>] SyS_pwrite64+0x92/0xc0
Nov 24 04:10:41 openstack1.itgix.com kernel:  [<ffffffff81645ec9>] system_call_fastpath+0x16/0x1b
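
When the hang occurs, the stuck qemu-kvm writers sit in uninterruptible sleep (state D), as in the trace above. A quick triage sketch using only standard Linux tools (no OpenStack-specific commands; output naturally varies per host):

```shell
#!/bin/sh
# List tasks stuck in uninterruptible sleep (state D) -- hung NFS
# writers such as the qemu-kvm task above land here.
ps -eo pid,stat,wchan:32,comm | awk '$2 ~ /^D/'

# Show the NFS mounts and their options, to confirm which mount is
# involved. The fallback echo keeps the script usable on hosts
# without NFS mounts.
grep ' nfs' /proc/mounts || echo "no NFS mounts on this host"

# Per-mount NFS RPC statistics; growing retransmission counts point
# at a stalled or unreachable server.
grep -A 3 'fstype nfs' /proc/self/mountstats || echo "no NFS mount statistics"
```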

** Affects: nova
     Importance: Undecided
         Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1646896

Title:
  System hangs when using NFS storage backend with loopback mounts

Status in OpenStack Compute (nova):
  New

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1646896/+subscriptions