yahoo-eng-team team mailing list archive

Thread
Date
[Bug 1530275] [NEW] Live snapshot is corrupted (possibly race condition?)

To: yahoo-eng-team@xxxxxxxxxxxxxxxxxxx
From: Favyen Bastani <fbastani@xxxxxxxxxxxxx>
Date: Thu, 31 Dec 2015 06:57:20 -0000
Reply-to: Bug 1530275 <1530275@xxxxxxxxxxxxxxxxxx>
Sender: bounces@xxxxxxxxxxxxx
Public bug reported:

We are using nova 2:12.0.0-0ubuntu2~cloud0. Instance disks are stored in
qcow2 files on ext4 filesystem. When we live snapshot, 90% of the time
the produced image is corrupted; specifically, the image is only a few
megabytes (e.g. 30 MB) in size, while the disk size is several GB. Here
is the log from a corrupted snapshot:

2015-12-31 01:40:33.304 16805 INFO nova.compute.manager [req-80187ec9-a3e7-4eaf-80d4-1617da40989e 94b1e02c35204ca89bd5aea99ff5ef2b 8341c85ad9ae49408fa25074adba0480 - - -] [instance: f9d52a00-8466-4436-b5b4-f0244d54dfe1] instance snapshotting
2015-12-31 01:40:33.410 16805 INFO nova.virt.libvirt.driver [req-80187ec9-a3e7-4eaf-80d4-1617da40989e 94b1e02c35204ca89bd5aea99ff5ef2b 8341c85ad9ae49408fa25074adba0480 - - -] [instance: f9d52a00-8466-4436-b5b4-f0244d54dfe1] Beginning live snapshot process
2015-12-31 01:40:34.964 16805 INFO nova.virt.libvirt.driver [req-80187ec9-a3e7-4eaf-80d4-1617da40989e 94b1e02c35204ca89bd5aea99ff5ef2b 8341c85ad9ae49408fa25074adba0480 - - -] [instance: f9d52a00-8466-4436-b5b4-f0244d54dfe1] Snapshot extracted, beginning image upload
2015-12-31 01:40:37.029 16805 INFO nova.virt.libvirt.driver [req-80187ec9-a3e7-4eaf-80d4-1617da40989e 94b1e02c35204ca89bd5aea99ff5ef2b 8341c85ad9ae49408fa25074adba0480 - - -] [instance: f9d52a00-8466-4436-b5b4-f0244d54dfe1] Snapshot image upload complete

The entire operation completes in a couple of seconds, which is
unexpected.

While testing, I added some sleep calls to the _live_snapshot function
in virt/libvirt/driver.py to debug the problem. A few live snapshot runs
were successful, but I'm not confident that it fixed the problem.
Anyway, here is the code that I changed:

        try:
            # NOTE (rmk): blockRebase cannot be executed on persistent
            #             domains, so we need to temporarily undefine it.
            #             If any part of this block fails, the domain is
            #             re-defined regardless.
            if guest.has_persistent_configuration():
                guest.delete_configuration()

            # NOTE (rmk): Establish a temporary mirror of our root disk and
            #             issue an abort once we have a complete copy.
            dev.rebase(disk_delta, copy=True, reuse_ext=True, shallow=True)

+            time.sleep(10.0)
            while dev.wait_for_job():
-                time.sleep(0.5)
+                time.sleep(5.0)

            dev.abort_job()
            libvirt_utils.chown(disk_delta, os.getuid())
        finally:
            self._host.write_instance_config(xml)
            if require_quiesce:
                self.unquiesce(context, instance, image_meta)

And the resulting log (which indicates that it is sleeping for not just
the initial 10 second call, but even more than that):

2015-12-31 01:42:12.438 21232 INFO nova.compute.manager [req-f3cc4b5b-98b0-4315-b514-de36a07cb8ed 94b1e02c35204ca89bd5aea99ff5ef2b 8341c85ad9ae49408fa25074adba0480 - - -] [instance: f9d52a00-8466-4436-b5b4-f0244d54dfe1] instance snapshotting
2015-12-31 01:42:12.670 21232 INFO nova.virt.libvirt.driver [req-f3cc4b5b-98b0-4315-b514-de36a07cb8ed 94b1e02c35204ca89bd5aea99ff5ef2b 8341c85ad9ae49408fa25074adba0480 - - -] [instance: f9d52a00-8466-4436-b5b4-f0244d54dfe1] Beginning live snapshot process
2015-12-31 01:43:02.411 21232 INFO nova.virt.libvirt.driver [req-f3cc4b5b-98b0-4315-b514-de36a07cb8ed 94b1e02c35204ca89bd5aea99ff5ef2b 8341c85ad9ae49408fa25074adba0480 - - -] [instance: f9d52a00-8466-4436-b5b4-f0244d54dfe1] Snapshot extracted, beginning image upload
2015-12-31 01:44:12.893 21232 INFO nova.virt.libvirt.driver [req-f3cc4b5b-98b0-4315-b514-de36a07cb8ed 94b1e02c35204ca89bd5aea99ff5ef2b 8341c85ad9ae49408fa25074adba0480 - - -] [instance: f9d52a00-8466-4436-b5b4-f0244d54dfe1] Snapshot image upload complete

Since sleeping 10 seconds before polling wait_for_job seemed to resolve
it, I think there may be a race condition where wait_for_job may be
called before the job is fully initialized from the rebase call. I have
not had a chance to explore that possibility further though.

** Affects: nova
     Importance: Undecided
         Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1530275

Title:
  Live snapshot is corrupted (possibly race condition?)

Status in OpenStack Compute (nova):
  New

Bug description:
  We are using nova 2:12.0.0-0ubuntu2~cloud0. Instance disks are stored
  in qcow2 files on ext4 filesystem. When we live snapshot, 90% of the
  time the produced image is corrupted; specifically, the image is only
  a few megabytes (e.g. 30 MB) in size, while the disk size is several
  GB. Here is the log from a corrupted snapshot:

  2015-12-31 01:40:33.304 16805 INFO nova.compute.manager [req-80187ec9-a3e7-4eaf-80d4-1617da40989e 94b1e02c35204ca89bd5aea99ff5ef2b 8341c85ad9ae49408fa25074adba0480 - - -] [instance: f9d52a00-8466-4436-b5b4-f0244d54dfe1] instance snapshotting
  2015-12-31 01:40:33.410 16805 INFO nova.virt.libvirt.driver [req-80187ec9-a3e7-4eaf-80d4-1617da40989e 94b1e02c35204ca89bd5aea99ff5ef2b 8341c85ad9ae49408fa25074adba0480 - - -] [instance: f9d52a00-8466-4436-b5b4-f0244d54dfe1] Beginning live snapshot process
  2015-12-31 01:40:34.964 16805 INFO nova.virt.libvirt.driver [req-80187ec9-a3e7-4eaf-80d4-1617da40989e 94b1e02c35204ca89bd5aea99ff5ef2b 8341c85ad9ae49408fa25074adba0480 - - -] [instance: f9d52a00-8466-4436-b5b4-f0244d54dfe1] Snapshot extracted, beginning image upload
  2015-12-31 01:40:37.029 16805 INFO nova.virt.libvirt.driver [req-80187ec9-a3e7-4eaf-80d4-1617da40989e 94b1e02c35204ca89bd5aea99ff5ef2b 8341c85ad9ae49408fa25074adba0480 - - -] [instance: f9d52a00-8466-4436-b5b4-f0244d54dfe1] Snapshot image upload complete

  The entire operation completes in a couple of seconds, which is
  unexpected.

  While testing, I added some sleep calls to the _live_snapshot function
  in virt/libvirt/driver.py to debug the problem. A few live snapshot
  runs were successful, but I'm not confident that it fixed the problem.
  Anyway, here is the code that I changed:

          try:
              # NOTE (rmk): blockRebase cannot be executed on persistent
              #             domains, so we need to temporarily undefine it.
              #             If any part of this block fails, the domain is
              #             re-defined regardless.
              if guest.has_persistent_configuration():
                  guest.delete_configuration()

              # NOTE (rmk): Establish a temporary mirror of our root disk and
              #             issue an abort once we have a complete copy.
              dev.rebase(disk_delta, copy=True, reuse_ext=True, shallow=True)

  +            time.sleep(10.0)
              while dev.wait_for_job():
  -                time.sleep(0.5)
  +                time.sleep(5.0)

              dev.abort_job()
              libvirt_utils.chown(disk_delta, os.getuid())
          finally:
              self._host.write_instance_config(xml)
              if require_quiesce:
                  self.unquiesce(context, instance, image_meta)

  And the resulting log (which indicates that it is sleeping for not
  just the initial 10 second call, but even more than that):

  2015-12-31 01:42:12.438 21232 INFO nova.compute.manager [req-f3cc4b5b-98b0-4315-b514-de36a07cb8ed 94b1e02c35204ca89bd5aea99ff5ef2b 8341c85ad9ae49408fa25074adba0480 - - -] [instance: f9d52a00-8466-4436-b5b4-f0244d54dfe1] instance snapshotting
  2015-12-31 01:42:12.670 21232 INFO nova.virt.libvirt.driver [req-f3cc4b5b-98b0-4315-b514-de36a07cb8ed 94b1e02c35204ca89bd5aea99ff5ef2b 8341c85ad9ae49408fa25074adba0480 - - -] [instance: f9d52a00-8466-4436-b5b4-f0244d54dfe1] Beginning live snapshot process
  2015-12-31 01:43:02.411 21232 INFO nova.virt.libvirt.driver [req-f3cc4b5b-98b0-4315-b514-de36a07cb8ed 94b1e02c35204ca89bd5aea99ff5ef2b 8341c85ad9ae49408fa25074adba0480 - - -] [instance: f9d52a00-8466-4436-b5b4-f0244d54dfe1] Snapshot extracted, beginning image upload
  2015-12-31 01:44:12.893 21232 INFO nova.virt.libvirt.driver [req-f3cc4b5b-98b0-4315-b514-de36a07cb8ed 94b1e02c35204ca89bd5aea99ff5ef2b 8341c85ad9ae49408fa25074adba0480 - - -] [instance: f9d52a00-8466-4436-b5b4-f0244d54dfe1] Snapshot image upload complete

  Since sleeping 10 seconds before polling wait_for_job seemed to
  resolve it, I think there may be a race condition where wait_for_job
  may be called before the job is fully initialized from the rebase
  call. I have not had a chance to explore that possibility further
  though.

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1530275/+subscriptions
Follow ups

[Bug 1530275] Re: Live snapshot is corrupted (possibly race condition?)
From: OpenStack Infra, 2016-09-14