yahoo-eng-team team mailing list archive
Message #66235
[Bug 1706083] Re: Post-migration, Cinder volumes lose disk cache value, resulting in I/O latency
Reviewed: https://review.openstack.org/485752
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=14c38ac0f253036da79f9d07aedf7dfd5778fde8
Submitter: Jenkins
Branch: master
commit 14c38ac0f253036da79f9d07aedf7dfd5778fde8
Author: Kashyap Chamarthy <kchamart@xxxxxxxxxx>
Date: Thu Jul 20 19:01:23 2017 +0200
libvirt: Post-migration, set cache value for Cinder volume(s)
This was noticed in a downstream bug: when a Nova instance with a Cinder
volume attached (in this case, both the Nova instance storage _and_ the
Cinder volume are located on Ceph) is live-migrated to a target Compute
node, the disk cache value for the Cinder volume gets changed. I.e. the
QEMU command line for the Cinder volume stored on Ceph changes as
follows:
Pre-migration, QEMU command-line for the Nova instance:
[...] -drive file=rbd:volumes/volume-[...],cache=writeback
Post-migration, QEMU command-line for the Nova instance:
[...] -drive file=rbd:volumes/volume-[...],cache=none
Furthermore, Jason Dillaman from Ceph confirms that the RBD cache is
enabled pre-migration:
$ ceph --admin-daemon /var/run/qemu/ceph-client.openstack.[...] \
config get rbd_cache
{
"rbd_cache": "true"
}
And disabled, post-migration:
$ ceph --admin-daemon /var/run/qemu/ceph-client.openstack.[...] \
config get rbd_cache
{
"rbd_cache": "false"
}
This post-migration change in cache value causes increased I/O latency
on the Cinder volume.
From a chat with Daniel Berrangé on IRC: prior to live migration, Nova
rewrites all the <disk> elements and passes the updated guest XML across
to the target libvirt, but it never calls _set_cache_mode() while doing
so. As a result, the `writeback` setting from `nova.conf` is lost and
the default `cache=none` setting is used instead. This mistake (falling
back to the default cache value of 'none') is, of course, corrected once
the guest is later rebooted on the target.
So:
- Call _set_cache_mode() in the _get_volume_config() method, because it
is the callback used by _update_volume_xml() in
nova/virt/libvirt/migration.py (sketched after this commit message).
- Remove the now-duplicate calls to _set_cache_mode() in
_get_guest_storage_config() and attach_volume().
- Fix the broken unit tests; adjust test_get_volume_config() to reflect
the disk cache mode.
Thanks: Jason Dillaman of Ceph for observing the change in cache modes
in a downstream bug analysis, Daniel Berrangé for help in
analysis from a Nova libvirt driver POV, and Stefan Hajnoczi
from QEMU for help on I/O latency instrumentation with `perf`.
Closes-bug: 1706083
Change-Id: I4184382b49dd2193d6a21bfe02ea973d02d8b09f
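To make the mechanics concrete, here is a minimal, self-contained sketch of
the approach described in the commit message above. The class and attribute
names (GuestDiskConfig, LibvirtDriverSketch, disk_cachemodes) are illustrative
stand-ins, not the real Nova types; only the method names _set_cache_mode()
and _get_volume_config(), and their relationship to the migration.py callback,
come from the change itself.

# Minimal sketch, assuming stand-in types -- not the actual Nova source.
class GuestDiskConfig(object):
    """Stand-in for a libvirt guest <disk> config object."""
    def __init__(self, source_type, driver_cache=None):
        self.source_type = source_type    # 'file', 'block' or 'network'
        self.driver_cache = driver_cache  # ends up as cache=... on the QEMU command line


class LibvirtDriverSketch(object):
    def __init__(self, disk_cachemodes):
        # Cache modes per source type, as configured in nova.conf,
        # e.g. {'network': 'writeback'}.
        self.disk_cachemodes = disk_cachemodes

    def _set_cache_mode(self, conf):
        """Apply the operator-configured cache mode for this disk's source type."""
        conf.driver_cache = self.disk_cachemodes.get(
            conf.source_type, conf.driver_cache or 'none')

    def _get_volume_config(self, connection_info, disk_info):
        """Build the disk config for a Cinder volume.

        _update_volume_xml() in nova/virt/libvirt/migration.py uses this
        method as its callback while rewriting <disk> elements for the
        destination, so setting the cache mode *here* preserves it across
        live migration.
        """
        # connection_info/disk_info are unused in this simplified sketch.
        conf = GuestDiskConfig(source_type='network')  # e.g. an RBD-backed volume
        self._set_cache_mode(conf)                     # the essence of the fix
        return conf


if __name__ == '__main__':
    driver = LibvirtDriverSketch({'network': 'writeback'})
    disk = driver._get_volume_config(connection_info={}, disk_info={})
    print(disk.driver_cache)  # -> 'writeback' rather than the default 'none'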
** Changed in: nova
Status: In Progress => Fix Released
--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1706083
Title:
Post-migration, Cinder volumes lose disk cache value, resulting in I/O
latency
Status in OpenStack Compute (nova):
Fix Released
Status in OpenStack Compute (nova) newton series:
Confirmed
Status in OpenStack Compute (nova) ocata series:
Confirmed
Bug description:
Description
===========
[This was initially reported by a Red Hat OSP customer.]
The I/O latency of a Cinder volume increases significantly after live
migration of the instance to which it is attached, and it stays
increased until the VM is stopped and started again. [The VM is booted
with a Cinder volume.]
This is not the case when using a disk from the Nova store backend
[without a Cinder volume] -- or at least the difference after a live
migration is not nearly as significant.
The storage backend is Ceph 2.0.
How reproducible: Consistently
Steps to Reproduce
==================
(0) Both the Nova instances and Cinder volumes are located on Ceph
(1) Create a Nova instance with a Cinder volume attached to it
(2) Live migrate it to a target Compute node
(3) Run `ioping` (`ioping -c 10 .`) on the Cinder volume.
Alternatively, run other I/O benchmarks, such as `fio` with
'direct=1' (which uses non-buffered I/O), as a sanity check to get a
second opinion on the latency (a rough probe is sketched after these
steps).
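For a quick second opinion without installing benchmark tools, a rough
direct-I/O read-latency probe can also be sketched in Python (Linux-only;
the device path, block size and sample count below are assumptions for
illustration -- `ioping` and `fio` remain the proper tools):

# Rough, Linux-only sketch of an O_DIRECT read-latency probe, in the spirit
# of `ioping` / `fio` with direct=1. Run as root inside the guest.
import mmap
import os
import time


def direct_read_latency_us(path, block_size=4096, count=10):
    """Average O_DIRECT read latency in microseconds over `count` reads."""
    buf = mmap.mmap(-1, block_size)                # page-aligned buffer, required by O_DIRECT
    fd = os.open(path, os.O_RDONLY | os.O_DIRECT)  # bypass the guest page cache
    try:
        samples = []
        for _ in range(count):
            os.lseek(fd, 0, os.SEEK_SET)
            start = time.monotonic()
            os.readv(fd, [buf])                    # one aligned 4 KiB read
            samples.append((time.monotonic() - start) * 1e6)
        return sum(samples) / len(samples)
    finally:
        os.close(fd)
        buf.close()


if __name__ == '__main__':
    # e.g. the block device backing the Cinder volume inside the guest
    print('avg read latency: %.1f us' % direct_read_latency_us('/dev/sda1'))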
Actual result
=============
Before live migration, `ioping` output on the Cinder volume attached to the Nova
instance:
[guest]$ sudo ioping -c 10 .
4 KiB <<< . (xfs /dev/sda1): request=1 time=98.0 us (warmup)
4 KiB <<< . (xfs /dev/sda1): request=2 time=135.6 us
4 KiB <<< . (xfs /dev/sda1): request=3 time=155.5 us
4 KiB <<< . (xfs /dev/sda1): request=4 time=161.7 us
4 KiB <<< . (xfs /dev/sda1): request=5 time=148.4 us
4 KiB <<< . (xfs /dev/sda1): request=6 time=354.3 us
4 KiB <<< . (xfs /dev/sda1): request=7 time=138.0 us (fast)
4 KiB <<< . (xfs /dev/sda1): request=8 time=150.7 us
4 KiB <<< . (xfs /dev/sda1): request=9 time=149.6 us
4 KiB <<< . (xfs /dev/sda1): request=10 time=138.6 us (fast)
--- . (xfs /dev/sda1) ioping statistics ---
9 requests completed in 1.53 ms, 36 KiB read, 5.87 k iops, 22.9 MiB/s
generated 10 requests in 9.00 s, 40 KiB, 1 iops, 4.44 KiB/s
min/avg/max/mdev = 135.6 us / 170.3 us / 354.3 us / 65.6 us
After live migration, `ioping` output on the Cinder volume attached to the Nova
instance:
[guest]$ sudo ioping -c 10 .
4 KiB <<< . (xfs /dev/sda1): request=1 time=1.03 ms (warmup)
4 KiB <<< . (xfs /dev/sda1): request=2 time=948.6 us
4 KiB <<< . (xfs /dev/sda1): request=3 time=955.7 us
4 KiB <<< . (xfs /dev/sda1): request=4 time=920.5 us
4 KiB <<< . (xfs /dev/sda1): request=5 time=1.03 ms
4 KiB <<< . (xfs /dev/sda1): request=6 time=838.2 us
4 KiB <<< . (xfs /dev/sda1): request=7 time=1.13 ms (slow)
4 KiB <<< . (xfs /dev/sda1): request=8 time=868.6 us
4 KiB <<< . (xfs /dev/sda1): request=9 time=985.2 us
4 KiB <<< . (xfs /dev/sda1): request=10 time=936.6 us
--- . (xfs /dev/sda1) ioping statistics ---
9 requests completed in 8.61 ms, 36 KiB read, 1.04 k iops, 4.08 MiB/s
generated 10 requests in 9.00 s, 40 KiB, 1 iops, 4.44 KiB/s
min/avg/max/mdev = 838.2 us / 956.9 us / 1.13 ms / 81.0 us
This goes back to an average of 200us again after shutting down and
starting up the instance.
Expected result
===============
No significant increase in I/O latency on Cinder volumes after live
migration.
To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1706083/+subscriptions