[Bug 1530275] Re: Live snapshot is corrupted (possibly race condition?)
Reviewed: https://review.openstack.org/363926
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=6b20239a5d293f55889cd1bffa59e4792c75edbf
Submitter: Jenkins
Branch: master
commit 6b20239a5d293f55889cd1bffa59e4792c75edbf
Author: Sławek Kapłoński <slawek@xxxxxxxxxxxx>
Date: Wed Aug 31 20:28:36 2016 +0000
Fix race condition bug during live_snapshot
    During live_snapshot creation, when nova starts the block_rebase
    operation in libvirt, there is a possibility that the block job has
    not yet started and libvirt's blockJobInfo method will return
    status.end = 0 and status.cur = 0. In such a case the libvirt driver
    does not wait for the block rebase operation to finish, and that
    causes a problem because the created snapshot is corrupted.
    This patch adds a check so that the job is only reported as finished
    when status.end != 0.
Change-Id: I45ac06eae0b1949f746dae305469718649bfcf23
Closes-Bug: #1530275
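For context, the completion check described in the commit message amounts to treating a job with cur == end as finished only when end is non-zero. A minimal sketch of that logic, assuming a job-info structure with cur and end fields as returned by libvirt's blockJobInfo (the helper name below is illustrative, not the exact nova function):

    # Sketch only: illustrates the completion test described above.
    # "job" stands for the cur/end data returned by blockJobInfo;
    # the helper name is hypothetical.
    def is_block_job_complete(job):
        # Right after blockRebase is issued, libvirt may report
        # cur == 0 and end == 0 because the job has not started yet.
        # Treating that state as "complete" is the race being fixed.
        if job.end == 0:
            return False
        return job.cur == job.end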
** Changed in: nova
Status: In Progress => Fix Released
--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1530275
Title:
Live snapshot is corrupted (possibly race condition?)
Status in OpenStack Compute (nova):
Fix Released
Bug description:
We are using nova 2:12.0.0-0ubuntu2~cloud0. Instance disks are stored
in qcow2 files on an ext4 filesystem. When we take a live snapshot,
90% of the time the produced image is corrupted; specifically, the
image is only a few megabytes (e.g. 30 MB) in size, while the disk
size is several GB. Here is the log from a corrupted snapshot:
2015-12-31 01:40:33.304 16805 INFO nova.compute.manager [req-80187ec9-a3e7-4eaf-80d4-1617da40989e 94b1e02c35204ca89bd5aea99ff5ef2b 8341c85ad9ae49408fa25074adba0480 - - -] [instance: f9d52a00-8466-4436-b5b4-f0244d54dfe1] instance snapshotting
2015-12-31 01:40:33.410 16805 INFO nova.virt.libvirt.driver [req-80187ec9-a3e7-4eaf-80d4-1617da40989e 94b1e02c35204ca89bd5aea99ff5ef2b 8341c85ad9ae49408fa25074adba0480 - - -] [instance: f9d52a00-8466-4436-b5b4-f0244d54dfe1] Beginning live snapshot process
2015-12-31 01:40:34.964 16805 INFO nova.virt.libvirt.driver [req-80187ec9-a3e7-4eaf-80d4-1617da40989e 94b1e02c35204ca89bd5aea99ff5ef2b 8341c85ad9ae49408fa25074adba0480 - - -] [instance: f9d52a00-8466-4436-b5b4-f0244d54dfe1] Snapshot extracted, beginning image upload
2015-12-31 01:40:37.029 16805 INFO nova.virt.libvirt.driver [req-80187ec9-a3e7-4eaf-80d4-1617da40989e 94b1e02c35204ca89bd5aea99ff5ef2b 8341c85ad9ae49408fa25074adba0480 - - -] [instance: f9d52a00-8466-4436-b5b4-f0244d54dfe1] Snapshot image upload complete
The entire operation completes in a couple of seconds, which is
unexpected.
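As an aside (not part of the original report), one way to confirm that a downloaded snapshot image is truncated is to compare its qcow2 virtual size with the bytes actually present on disk. A hedged sketch, assuming qemu-img is installed and the snapshot has been saved locally; the path is an example:

    # Hedged sketch: compare the qcow2 virtual size against the size of
    # the downloaded snapshot file.
    import json
    import os
    import subprocess

    def inspect_snapshot(path):
        out = subprocess.check_output(
            ["qemu-img", "info", "--output=json", path])
        info = json.loads(out)
        virtual = info.get("virtual-size", 0)
        on_disk = os.path.getsize(path)
        print("virtual size: %d bytes, file size: %d bytes"
              % (virtual, on_disk))
        # A few-MB file for a multi-GB disk suggests the rebase was
        # aborted before the mirror copy completed.

    inspect_snapshot("/tmp/snapshot.qcow2")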
While testing, I added some sleep calls to the _live_snapshot function
in virt/libvirt/driver.py to debug the problem. A few live snapshot
runs were successful, but I'm not confident that it fixed the problem.
Anyway, here is the code that I changed:
    try:
        # NOTE (rmk): blockRebase cannot be executed on persistent
        #             domains, so we need to temporarily undefine it.
        #             If any part of this block fails, the domain is
        #             re-defined regardless.
        if guest.has_persistent_configuration():
            guest.delete_configuration()

        # NOTE (rmk): Establish a temporary mirror of our root disk and
        #             issue an abort once we have a complete copy.
        dev.rebase(disk_delta, copy=True, reuse_ext=True, shallow=True)

+       time.sleep(10.0)
        while dev.wait_for_job():
-           time.sleep(0.5)
+           time.sleep(5.0)
        dev.abort_job()

        libvirt_utils.chown(disk_delta, os.getuid())
    finally:
        self._host.write_instance_config(xml)
        if require_quiesce:
            self.unquiesce(context, instance, image_meta)
And the resulting log (which shows that the operation now takes well
over the initial 10-second sleep; in other words, before the
modification wait_for_job was returning false immediately, whereas
after the modification it returns true once the initial sleep has
elapsed, and the job appears to run correctly):
2015-12-31 01:42:12.438 21232 INFO nova.compute.manager [req-f3cc4b5b-98b0-4315-b514-de36a07cb8ed 94b1e02c35204ca89bd5aea99ff5ef2b 8341c85ad9ae49408fa25074adba0480 - - -] [instance: f9d52a00-8466-4436-b5b4-f0244d54dfe1] instance snapshotting
2015-12-31 01:42:12.670 21232 INFO nova.virt.libvirt.driver [req-f3cc4b5b-98b0-4315-b514-de36a07cb8ed 94b1e02c35204ca89bd5aea99ff5ef2b 8341c85ad9ae49408fa25074adba0480 - - -] [instance: f9d52a00-8466-4436-b5b4-f0244d54dfe1] Beginning live snapshot process
2015-12-31 01:43:02.411 21232 INFO nova.virt.libvirt.driver [req-f3cc4b5b-98b0-4315-b514-de36a07cb8ed 94b1e02c35204ca89bd5aea99ff5ef2b 8341c85ad9ae49408fa25074adba0480 - - -] [instance: f9d52a00-8466-4436-b5b4-f0244d54dfe1] Snapshot extracted, beginning image upload
2015-12-31 01:44:12.893 21232 INFO nova.virt.libvirt.driver [req-f3cc4b5b-98b0-4315-b514-de36a07cb8ed 94b1e02c35204ca89bd5aea99ff5ef2b 8341c85ad9ae49408fa25074adba0480 - - -] [instance: f9d52a00-8466-4436-b5b4-f0244d54dfe1] Snapshot image upload complete
Since sleeping 10 seconds before polling wait_for_job seemed to
resolve it, I think there may be a race condition where wait_for_job
is called before the job started by the rebase call has been fully
initialized. I have not had a chance to explore that possibility
further though.
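For reference, the suspected race can be reasoned about directly against the libvirt-python API: immediately after blockRebase() the job may not be registered yet, so blockJobInfo() can return an empty dict or end == 0. A hedged sketch of a poll loop that does not treat that state as completion (assuming a libvirt-python virDomain handle "dom" and a disk target such as "vda"; this is not the actual nova code path):

    # Hedged sketch of polling blockJobInfo without mistaking a
    # not-yet-started job for a finished one. No timeout is included
    # for brevity.
    import time

    def wait_for_block_job(dom, disk, poll_interval=0.5):
        while True:
            info = dom.blockJobInfo(disk, 0)
            # Right after blockRebase(), libvirt may report no job yet
            # (empty dict) or end == 0; neither means the copy is done.
            if info and info.get("end", 0) != 0 and info["cur"] == info["end"]:
                return
            time.sleep(poll_interval)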
To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1530275/+subscriptions