Message #43285
[Bug 1525802] [NEW] live migration with multipath cinder volumes crashes node
Public bug reported:
Hello,
When issuing a live migration between KVM nodes that have multipath Cinder volumes attached, the migration sometimes hangs and causes qemu-kvm to crash; the only solution is to restart the KVM node.
Sometimes the live migration gets stuck while it is trying to migrate the active RAM; you will see something like this in nova-compute.log:
http://paste.openstack.org/show/481773/
As you can see, it gets nowhere.
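To confirm that the RAM transfer really is stalled rather than just slow, you can poll the migration job info directly through the libvirt Python bindings. This is only a diagnostic sketch (not part of Nova), and the domain name is a placeholder:

import time
import libvirt

conn = libvirt.open("qemu:///system")
dom = conn.lookupByName("instance-0000abcd")  # placeholder instance name

previous_remaining = None
for _ in range(10):
    # jobInfo() returns [type, timeElapsed, timeRemaining, dataTotal,
    # dataProcessed, dataRemaining, memTotal, memProcessed, memRemaining,
    # fileTotal, fileProcessed, fileRemaining]
    info = dom.jobInfo()
    mem_remaining = info[8]
    print("memory remaining: %d bytes" % mem_remaining)
    if previous_remaining is not None and mem_remaining == previous_remaining:
        print("migration appears stalled: remaining RAM is not shrinking")
    previous_remaining = mem_remaining
    time.sleep(5)

conn.close()

In the paste above the remaining RAM never shrinks between iterations.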
What is happening in the background is that, for some reason, the multipath volumes go into a 'faulty running' state (as seen with 'multipath -ll'), which breaks the block device and causes the qemu-kvm process to hang. The KVM node also tries to run blkid and kpartx, but all of those hang as well, so you can end up with a load of 100+ just from the stuck processes.
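A quick way to see how bad the pile-up is on the node is to count the processes stuck in uninterruptible sleep (the 'D' state in ps), since those are what drive the load average up. A minimal sketch, assuming a Linux host with /proc mounted:

import os

stuck = []
for pid in os.listdir("/proc"):
    if not pid.isdigit():
        continue
    try:
        with open("/proc/%s/stat" % pid) as f:
            data = f.read()
        # the comm field may contain spaces, so parse around the parentheses
        comm = data[data.index("(") + 1:data.rindex(")")]
        state = data[data.rindex(")") + 2]
    except (IOError, ValueError, IndexError):
        continue  # process exited while we were reading it
    if state == "D":
        stuck.append((pid, comm))

print("%d processes in uninterruptible sleep:" % len(stuck))
for pid, comm in stuck:
    print("  %s  %s" % (pid, comm))

The blkid, kpartx and qemu-system-x86 processes shown below all end up in this list once the multipath device goes faulty.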
[1015086.978188] end_request: I/O error, dev sdg, sector 41942912
[1015086.978398] device-mapper: multipath: Failing path 8:96.
[1015088.547034] qbr8eff45f7-ed: port 1(qvb8eff45f7-ed) entered disabled state
[1015088.791695] INFO: task qemu-system-x86:19383 blocked for more than 120 seconds.
[1015088.791940] Tainted: P OX 3.13.0-68-generic #111-Ubuntu
[1015088.792147] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[1015088.792396] qemu-system-x86 D ffff88301f2f3180 0 19383 1 0x00000000
[1015088.792404] ffff8817440ada88 0000000000000086 ffff8817fa574800 ffff8817440adfd8
[1015088.792414] 0000000000013180 0000000000013180 ffff8817fa574800 ffff88301f2f3a18
[1015088.792420] 0000000000000000 ffff882ff7ab5280 0000000000000000 ffff8817fa574800
[1015088.792426] Call Trace:
[1015088.792440] [<ffffffff8172880d>] io_schedule+0x9d/0x140
[1015088.792449] [<ffffffff811fc304>] do_blockdev_direct_IO+0x1ce4/0x2910
[1015088.792456] [<ffffffff811f7170>] ? I_BDEV+0x10/0x10
[1015088.792462] [<ffffffff811fcf85>] __blockdev_direct_IO+0x55/0x60
[1015088.792467] [<ffffffff811f7170>] ? I_BDEV+0x10/0x10
[1015088.792472] [<ffffffff811f7866>] blkdev_direct_IO+0x56/0x60
[1015088.792476] [<ffffffff811f7170>] ? I_BDEV+0x10/0x10
[1015088.792482] [<ffffffff81150ce1>] generic_file_direct_write+0xc1/0x180
[1015088.792487] [<ffffffff811510a5>] __generic_file_aio_write+0x305/0x3d0
[1015088.792492] [<ffffffff811f8146>] blkdev_aio_write+0x46/0x90
[1015088.792501] [<ffffffff811bdc2a>] do_sync_write+0x5a/0x90
[1015088.792507] [<ffffffff811be3b4>] vfs_write+0xb4/0x1f0
[1015088.792512] [<ffffffff811bef62>] SyS_pwrite64+0x72/0xb0
[1015088.792519] [<ffffffff81734cdd>] system_call_fastpath+0x1a/0x1f
root 19410 0.0 0.0 0 0 ? D 08:12 0:00 [blkid]
root 19575 0.0 0.0 0 0 ? D 08:13 0:00 [blkid]
root 19584 0.0 0.0 28276 1076 ? S 08:13 0:00 /sbin/kpartx -a -p -part /dev/mapper/36000d31000a6500000000000000000c6
root 21734 0.0 0.0 28276 1080 ? D 08:15 0:00 /sbin/kpartx -a -p -part /dev/mapper/36000d31000a6500000000000000000c6
root 21735 0.0 0.0 28276 1076 ? S 08:15 0:00 /sbin/kpartx -a -p -part /dev/mapper/36000d31000a6500000000000000000ed
root 22419 0.0 0.0 28276 1076 ? D 08:16 0:00 /sbin/kpartx -a -p -part /dev/mapper/36000d31000a6500000000000000000ed
root 22420 0.0 0.0 28276 1076 ? D 08:16 0:00 /sbin/kpartx -a -p -part /dev/mapper/36000d31000a6500000000000000000c6
root 22864 0.0 0.0 28276 1076 ? D 08:16 0:00 /sbin/kpartx -a -p -part /dev/mapper/36000d31000a6500000000000000000ed
root 22865 0.0 0.0 28276 1076 ? D 08:16 0:00 /sbin/kpartx -a -p -part /dev/mapper/36000d31000a6500000000000000000c6
root 23316 0.0 0.0 28276 1076 ? D 08:17 0:00 /sbin/kpartx -a -p -part /dev/mapper/36000d31000a6500000000000000000c6
root 23317 0.0 0.0 28276 1072 ? D 08:17 0:00 /sbin/kpartx -a -p -part /dev/mapper/36000d31000a6500000000000000000ed
root 23756 0.0 0.0 28276 1076 ? D 08:17 0:00 /sbin/kpartx -a -p -part /dev/mapper/36000d31000a6500000000000000000c6
root 24200 0.0 0.0 28276 1076 ? D 08:18 0:00 /sbin/kpartx -a -p -part /dev/mapper/36000d31000a6500000000000000000c6
root 24637 0.0 0.0 28276 1072 ? D 08:18 0:00 /sbin/kpartx -a -p -part /dev/mapper/36000d31000a6500000000000000000c6
root 25058 0.0 0.0 28276 1076 ? D 08:19 0:00 /sbin/kpartx -a -p -part /dev/mapper/36000d31000a6500000000000000000c6
root@kvm3:~#
Ultimately this causes so many issues on the KVM node that the only fix is a reboot: because of all the libvirt locks you will not be able to stop, restart or destroy the qemu-kvm process, and issuing a kill -9 will not help either.
In the end the live migration fails with something like this:
2015-12-14 08:19:51.577 23821 ERROR nova.compute.manager [req-99771cf6-d17e-49f7-a01d-38201afbce69 212f451de64b4ae89c853f1430510037 e47ebdf3f3934025b37df3b85bdfd565 - - -] [instance: babf696c-55d1-4bde-be83-3124be2ac7f2] Live migration failed.
2015-12-14 08:19:51.577 23821 ERROR nova.compute.manager [instance: babf696c-55d1-4bde-be83-3124be2ac7f2] Traceback (most recent call last):
2015-12-14 08:19:51.577 23821 ERROR nova.compute.manager [instance: babf696c-55d1-4bde-be83-3124be2ac7f2] File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 5108, in _do_live_migration
2015-12-14 08:19:51.577 23821 ERROR nova.compute.manager [instance: babf696c-55d1-4bde-be83-3124be2ac7f2] block_migration, migrate_data)
2015-12-14 08:19:51.577 23821 ERROR nova.compute.manager [instance: babf696c-55d1-4bde-be83-3124be2ac7f2] File "/usr/lib/python2.7/dist-packages/nova/virt/libvirt/driver.py", line 5431, in live_migration
2015-12-14 08:19:51.577 23821 ERROR nova.compute.manager [instance: babf696c-55d1-4bde-be83-3124be2ac7f2] migrate_data)
2015-12-14 08:19:51.577 23821 ERROR nova.compute.manager [instance: babf696c-55d1-4bde-be83-3124be2ac7f2] File "/usr/lib/python2.7/dist-packages/nova/virt/libvirt/driver.py", line 6071, in _live_migration
2015-12-14 08:19:51.577 23821 ERROR nova.compute.manager [instance: babf696c-55d1-4bde-be83-3124be2ac7f2] dom, finish_event)
2015-12-14 08:19:51.577 23821 ERROR nova.compute.manager [instance: babf696c-55d1-4bde-be83-3124be2ac7f2] File "/usr/lib/python2.7/dist-packages/nova/virt/libvirt/driver.py", line 5919, in _live_migration_monitor
2015-12-14 08:19:51.577 23821 ERROR nova.compute.manager [instance: babf696c-55d1-4bde-be83-3124be2ac7f2] dom.abortJob()
2015-12-14 08:19:51.577 23821 ERROR nova.compute.manager [instance: babf696c-55d1-4bde-be83-3124be2ac7f2] File "/usr/lib/python2.7/dist-packages/eventlet/tpool.py", line 183, in doit
2015-12-14 08:19:51.577 23821 ERROR nova.compute.manager [instance: babf696c-55d1-4bde-be83-3124be2ac7f2] result = proxy_call(self._autowrap, f, *args, **kwargs)
2015-12-14 08:19:51.577 23821 ERROR nova.compute.manager [instance: babf696c-55d1-4bde-be83-3124be2ac7f2] File "/usr/lib/python2.7/dist-packages/eventlet/tpool.py", line 141, in proxy_call
2015-12-14 08:19:51.577 23821 ERROR nova.compute.manager [instance: babf696c-55d1-4bde-be83-3124be2ac7f2] rv = execute(f, *args, **kwargs)
2015-12-14 08:19:51.577 23821 ERROR nova.compute.manager [instance: babf696c-55d1-4bde-be83-3124be2ac7f2] File "/usr/lib/python2.7/dist-packages/eventlet/tpool.py", line 122, in execute
2015-12-14 08:19:51.577 23821 ERROR nova.compute.manager [instance: babf696c-55d1-4bde-be83-3124be2ac7f2] six.reraise(c, e, tb)
2015-12-14 08:19:51.577 23821 ERROR nova.compute.manager [instance: babf696c-55d1-4bde-be83-3124be2ac7f2] File "/usr/lib/python2.7/dist-packages/eventlet/tpool.py", line 80, in tworker
2015-12-14 08:19:51.577 23821 ERROR nova.compute.manager [instance: babf696c-55d1-4bde-be83-3124be2ac7f2] rv = meth(*args, **kwargs)
2015-12-14 08:19:51.577 23821 ERROR nova.compute.manager [instance: babf696c-55d1-4bde-be83-3124be2ac7f2] File "/usr/lib/python2.7/dist-packages/libvirt.py", line 482, in abortJob
2015-12-14 08:19:51.577 23821 ERROR nova.compute.manager [instance: babf696c-55d1-4bde-be83-3124be2ac7f2] if ret == -1: raise libvirtError ('virDomainAbortJob() failed', dom=self)
2015-12-14 08:19:51.577 23821 ERROR nova.compute.manager [instance: babf696c-55d1-4bde-be83-3124be2ac7f2] libvirtError: Timed out during operation: cannot acquire state change lock (held by remoteDispatchDomainMigratePerform3)
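For context, the traceback above comes from Nova's migration monitor deciding the job is stuck and calling abortJob() on the domain. A simplified illustration of that call pattern (not Nova's actual code): since the qemu-kvm process is wedged in uninterruptible I/O, the migrate operation (remoteDispatchDomainMigratePerform3) never releases the domain's state change lock, so the abort call times out with exactly the error shown.

import libvirt

conn = libvirt.open("qemu:///system")
dom = conn.lookupByName("instance-0000abcd")  # placeholder domain name

try:
    # Nova's monitor calls this when it gives up on a stalled migration.
    # If the migrate operation still holds the domain's state change lock
    # (because qemu-kvm is stuck in uninterruptible I/O), libvirt times out
    # with "cannot acquire state change lock".
    dom.abortJob()
except libvirt.libvirtError as exc:
    print("abort failed: %s" % exc)

conn.close()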
I'm not sure whether this is the same bug as
https://bugs.launchpad.net/nova/+bug/1419577, but I want to bring some
light to this area since it affects our operations severely.
Best regards
** Affects: nova
Importance: Undecided
Status: New
** Tags: cinder multipath