← Back to team overview

yahoo-eng-team team mailing list archive

[Bug 1744079] Re: [SRU] disk over-commit still not correctly calculated during live migration

 

** Tags added: sts-sponsor

** Also affects: cloud-archive
   Importance: Undecided
       Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1744079

Title:
  [SRU] disk over-commit still not correctly calculated during live
  migration

Status in Ubuntu Cloud Archive:
  New
Status in OpenStack Compute (nova):
  Fix Released

Bug description:
  [Impact]
  nova compares disk space with disk_available_least field, which is possible to be negative, due to overcommit.

  So the migration may fail because of a "Migration pre-check error:
  Unable to migrate dfcd087a-5dff-439d-8875-2f702f081539: Disk of
  instance is too large(available on destination host:-3221225472 <
  need:22806528)" when trying a migration to another compute that has
  plenty of free space in his disk.

  [Test Case]
  Deploy openstack environment. Make sure there is a negative disk_available_least and a adequate free_disk_gb in one test compute node, then migrate a VM to it with disk-overcommit (openstack server migrate --live <TEST-COMPUTE-NODE> --block-migration --disk-overcommit <VM-NAME>). You will see above migration pre-check error.

  This is the formula to compute disk_available_least and free_disk_gb.

  disk_free_gb = disk_info_dict['free']
  disk_over_committed = self._get_disk_over_committed_size_total()
  available_least = disk_free_gb * units.Gi - disk_over_committed
  data['disk_available_least'] = available_least / units.Gi

  The following command can be used to query the value of
  disk_available_least

  nova hypervisor-show <ID> |grep disk

  Steps to Reproduce:
  1. set disk_allocation_ratio config option > 1.0 
  2. qemu-img resize cirros-0.3.0-x86_64-disk.img +40G
  3. glance image-create --disk-format qcow2 ...
  4. boot VMs based on resized image
  5. we see disk_available_least becomes negative

  [Regression Potential]
  Minimal - we're just changing from the following line:

  disk_available_gb = dst_compute_info['disk_available_least']

  to the following codes:

  if disk_over_commit:
      disk_available_gb = dst_compute_info['free_disk_gb']
  else:
      disk_available_gb = dst_compute_info['disk_available_least']

  When enabling overcommit, disk_available_least is possible to be
  negative, so we should use free_disk_gb instead of it by backporting
  the following two fixes.

  https://git.openstack.org/cgit/openstack/nova/commit/?id=e097c001c8e11110efe8879da57264fcb7bdfdf2
  https://git.openstack.org/cgit/openstack/nova/commit/?id=e2cc275063658b23ed88824100919a6dfccb760d

  This is the code path for check_can_live_migrate_destination:

  _migrate_live(os-migrateLive API, migrate_server.py) -> migrate_server
  -> _live_migrate -> _build_live_migrate_task ->
  _call_livem_checks_on_host -> check_can_live_migrate_destination

  BTW, redhat also has a same bug -
  https://bugzilla.redhat.com/show_bug.cgi?id=1477706

  
  [Original Bug Report]
  Change I8a705114d47384fcd00955d4a4f204072fed57c2 (written by me... sigh) addressed a bug which prevented live migration to a target host with overcommitted disk when made with microversion <2.25. It achieved this, but the fix is still not correct. We now do:

          if disk_over_commit:
              disk_available_gb = dst_compute_info['local_gb']

  Unfortunately local_gb is *total* disk, not available disk. We
  actually want free_disk_gb. Fun fact: due to the way we calculate this
  for filesystems, without taking into account reserved space, this can
  also be negative.

  The test we're currently running is: could we fit this guest's
  allocated disks on the target if the target disk was empty. This is at
  least better than it was before, as we don't spuriously fail early. In
  fact, we're effectively disabling a test which is disabled for
  microversion >=2.25 anyway. IOW we should fix it, but it's probably
  not a high priority.

To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-archive/+bug/1744079/+subscriptions


References