[Bug 1530275] Re: Live snapshot is corrupted (possibly race condition?)
Reviewed: https://review.openstack.org/363926
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=6b20239a5d293f55889cd1bffa59e4792c75edbf
Submitter: Jenkins
Branch: master
commit 6b20239a5d293f55889cd1bffa59e4792c75edbf
Author: Sławek Kapłoński <slawek@xxxxxxxxxxxx>
Date: Wed Aug 31 20:28:36 2016 +0000
Fix race condition bug during live_snapshot
    During live_snapshot creation, when nova starts the block_rebase
    operation in libvirt, there is a possibility that the block job has
    not yet started and libvirt's blockJobInfo method will return
    status.end = 0 and status.cur = 0. In such a case the libvirt driver
    does not wait for the block rebase operation to finish, and that
    causes a problem because the created snapshot is corrupted.
    This patch adds a check so that the job is only reported as finished
    when status.end != 0.
Change-Id: I45ac06eae0b1949f746dae305469718649bfcf23
Closes-Bug: #1530275
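For context, the completion check described in the commit message amounts to treating a job with cur == end as finished only when end is non-zero. A minimal sketch of that logic, assuming a job-info structure with cur and end fields as returned by libvirt's blockJobInfo (the helper name below is illustrative, not the exact nova function):

    # Sketch only: illustrates the completion test described above.
    # "job" stands for the cur/end data returned by blockJobInfo;
    # the helper name is hypothetical.
    def is_block_job_complete(job):
        # Right after blockRebase is issued, libvirt may report
        # cur == 0 and end == 0 because the job has not started yet.
        # Treating that state as "complete" is the race being fixed.
        if job.end == 0:
            return False
        return job.cur == job.end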
** Changed in: nova
Status: In Progress => Fix Released
--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1530275
Title:
Live snapshot is corrupted (possibly race condition?)
Status in OpenStack Compute (nova):
Fix Released
Bug description:
We are using nova 2:12.0.0-0ubuntu2~cloud0. Instance disks are stored
in qcow2 files on an ext4 filesystem. When we take a live snapshot,
90% of the time the produced image is corrupted; specifically, the
image is only a few megabytes (e.g. 30 MB) in size, while the disk
size is several GB. Here is the log from a corrupted snapshot:
2015-12-31 01:40:33.304 16805 INFO nova.compute.manager [req-80187ec9-a3e7-4eaf-80d4-1617da40989e 94b1e02c35204ca89bd5aea99ff5ef2b 8341c85ad9ae49408fa25074adba0480 - - -] [instance: f9d52a00-8466-4436-b5b4-f0244d54dfe1] instance snapshotting
2015-12-31 01:40:33.410 16805 INFO nova.virt.libvirt.driver [req-80187ec9-a3e7-4eaf-80d4-1617da40989e 94b1e02c35204ca89bd5aea99ff5ef2b 8341c85ad9ae49408fa25074adba0480 - - -] [instance: f9d52a00-8466-4436-b5b4-f0244d54dfe1] Beginning live snapshot process
2015-12-31 01:40:34.964 16805 INFO nova.virt.libvirt.driver [req-80187ec9-a3e7-4eaf-80d4-1617da40989e 94b1e02c35204ca89bd5aea99ff5ef2b 8341c85ad9ae49408fa25074adba0480 - - -] [instance: f9d52a00-8466-4436-b5b4-f0244d54dfe1] Snapshot extracted, beginning image upload
2015-12-31 01:40:37.029 16805 INFO nova.virt.libvirt.driver [req-80187ec9-a3e7-4eaf-80d4-1617da40989e 94b1e02c35204ca89bd5aea99ff5ef2b 8341c85ad9ae49408fa25074adba0480 - - -] [instance: f9d52a00-8466-4436-b5b4-f0244d54dfe1] Snapshot image upload complete
The entire operation completes in a couple of seconds, which is
unexpected.
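As an aside (not part of the original report), one way to confirm that a downloaded snapshot image is truncated is to compare its qcow2 virtual size with the bytes actually present on disk. A hedged sketch, assuming qemu-img is installed and the snapshot has been saved locally; the path is an example:

    # Hedged sketch: compare the qcow2 virtual size against the size of
    # the downloaded snapshot file.
    import json
    import os
    import subprocess

    def inspect_snapshot(path):
        out = subprocess.check_output(
            ["qemu-img", "info", "--output=json", path])
        info = json.loads(out)
        virtual = info.get("virtual-size", 0)
        on_disk = os.path.getsize(path)
        print("virtual size: %d bytes, file size: %d bytes"
              % (virtual, on_disk))
        # A few-MB file for a multi-GB disk suggests the rebase was
        # aborted before the mirror copy completed.

    inspect_snapshot("/tmp/snapshot.qcow2")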
While testing, I added some sleep calls to the _live_snapshot function
in virt/libvirt/driver.py to debug the problem. A few live snapshot
runs were successful, but I'm not confident that it fixed the problem.
Anyway, here is the code that I changed:
    try:
        # NOTE (rmk): blockRebase cannot be executed on persistent
        #             domains, so we need to temporarily undefine it.
        #             If any part of this block fails, the domain is
        #             re-defined regardless.
        if guest.has_persistent_configuration():
            guest.delete_configuration()

        # NOTE (rmk): Establish a temporary mirror of our root disk and
        #             issue an abort once we have a complete copy.
        dev.rebase(disk_delta, copy=True, reuse_ext=True, shallow=True)

+       time.sleep(10.0)
        while dev.wait_for_job():
-           time.sleep(0.5)
+           time.sleep(5.0)
        dev.abort_job()

        libvirt_utils.chown(disk_delta, os.getuid())
    finally:
        self._host.write_instance_config(xml)
        if require_quiesce:
            self.unquiesce(context, instance, image_meta)
And the resulting log (which shows that the operation now takes well
over the initial 10-second sleep; in other words, before the
modification wait_for_job was returning false immediately, whereas
after the modification it returns true once the initial sleep has
elapsed, and the job appears to run correctly):
2015-12-31 01:42:12.438 21232 INFO nova.compute.manager [req-f3cc4b5b-98b0-4315-b514-de36a07cb8ed 94b1e02c35204ca89bd5aea99ff5ef2b 8341c85ad9ae49408fa25074adba0480 - - -] [instance: f9d52a00-8466-4436-b5b4-f0244d54dfe1] instance snapshotting
2015-12-31 01:42:12.670 21232 INFO nova.virt.libvirt.driver [req-f3cc4b5b-98b0-4315-b514-de36a07cb8ed 94b1e02c35204ca89bd5aea99ff5ef2b 8341c85ad9ae49408fa25074adba0480 - - -] [instance: f9d52a00-8466-4436-b5b4-f0244d54dfe1] Beginning live snapshot process
2015-12-31 01:43:02.411 21232 INFO nova.virt.libvirt.driver [req-f3cc4b5b-98b0-4315-b514-de36a07cb8ed 94b1e02c35204ca89bd5aea99ff5ef2b 8341c85ad9ae49408fa25074adba0480 - - -] [instance: f9d52a00-8466-4436-b5b4-f0244d54dfe1] Snapshot extracted, beginning image upload
2015-12-31 01:44:12.893 21232 INFO nova.virt.libvirt.driver [req-f3cc4b5b-98b0-4315-b514-de36a07cb8ed 94b1e02c35204ca89bd5aea99ff5ef2b 8341c85ad9ae49408fa25074adba0480 - - -] [instance: f9d52a00-8466-4436-b5b4-f0244d54dfe1] Snapshot image upload complete
Since sleeping 10 seconds before polling wait_for_job seemed to
resolve it, I think there may be a race condition where wait_for_job
is called before the job started by the rebase call has been fully
initialized. I have not had a chance to explore that possibility
further though.
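For reference, the suspected race can be reasoned about directly against the libvirt-python API: immediately after blockRebase() the job may not be registered yet, so blockJobInfo() can return an empty dict or end == 0. A hedged sketch of a poll loop that does not treat that state as completion (assuming a libvirt-python virDomain handle "dom" and a disk target such as "vda"; this is not the actual nova code path):

    # Hedged sketch of polling blockJobInfo without mistaking a
    # not-yet-started job for a finished one. No timeout is included
    # for brevity.
    import time

    def wait_for_block_job(dom, disk, poll_interval=0.5):
        while True:
            info = dom.blockJobInfo(disk, 0)
            # Right after blockRebase(), libvirt may report no job yet
            # (empty dict) or end == 0; neither means the copy is done.
            if info and info.get("end", 0) != 0 and info["cur"] == info["end"]:
                return
            time.sleep(poll_interval)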
To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1530275/+subscriptions