
yahoo-eng-team team mailing list archive

[Bug 1889257] [NEW] Live migration of realtime instances is broken

 

Public bug reported:

Attempting to live migrate an instance with realtime enabled fails on
master (commit d4c857dfcb1). This appears to be a bug with the live
migration of pinned instances feature introduced in Train.

# Steps to reproduce

Create a server using realtime attributes and then attempt to live
migrate it. For example:

  $ openstack flavor create --ram 1024 --disk 0 --vcpu 4 \
    --property 'hw:cpu_policy=dedicated' \
    --property 'hw:cpu_realtime=yes' \
    --property 'hw:cpu_realtime_mask=^0-1' \
    realtime

  $ openstack server create --os-compute-api-version=2.latest \
    --flavor realtime --image cirros-0.5.1-x86_64-disk --nic none \
    --boot-from-volume 1 --wait \
    test.realtime

  $ openstack server migrate --live-migration test.realtime

# Expected result

Instance should be live migrated.

# Actual result

The live migration never happens. Looking at the logs, we see the
following error:

  Traceback (most recent call last):
    File "/usr/local/lib/python3.6/dist-packages/eventlet/hubs/hub.py", line 461, in fire_timers
      timer()
    File "/usr/local/lib/python3.6/dist-packages/eventlet/hubs/timer.py", line 59, in __call__
      cb(*args, **kw)
    File "/usr/local/lib/python3.6/dist-packages/eventlet/event.py", line 175, in _do_send
      waiter.switch(result)
    File "/usr/local/lib/python3.6/dist-packages/eventlet/greenthread.py", line 221, in main
      result = function(*args, **kwargs)
    File "/opt/stack/nova/nova/utils.py", line 670, in context_wrapper
      return func(*args, **kwargs)
    File "/opt/stack/nova/nova/virt/libvirt/driver.py", line 8966, in _live_migration_operation
      #     is still ongoing, or failed
    File "/usr/local/lib/python3.6/dist-packages/oslo_utils/excutils.py", line 220, in __exit__
      self.force_reraise()
    File "/usr/local/lib/python3.6/dist-packages/oslo_utils/excutils.py", line 196, in force_reraise
      six.reraise(self.type_, self.value, self.tb)
    File "/usr/local/lib/python3.6/dist-packages/six.py", line 703, in reraise
      raise value
    File "/opt/stack/nova/nova/virt/libvirt/driver.py", line 8959, in _live_migration_operation
      #  2. src==running, dst==paused
    File "/opt/stack/nova/nova/virt/libvirt/guest.py", line 658, in migrate
      destination, params=params, flags=flags)
    File "/usr/local/lib/python3.6/dist-packages/eventlet/tpool.py", line 190, in doit
      result = proxy_call(self._autowrap, f, *args, **kwargs)
    File "/usr/local/lib/python3.6/dist-packages/eventlet/tpool.py", line 148, in proxy_call
      rv = execute(f, *args, **kwargs)
    File "/usr/local/lib/python3.6/dist-packages/eventlet/tpool.py", line 129, in execute
      six.reraise(c, e, tb)
    File "/usr/local/lib/python3.6/dist-packages/six.py", line 703, in reraise
      raise value
    File "/usr/local/lib/python3.6/dist-packages/eventlet/tpool.py", line 83, in tworker
      rv = meth(*args, **kwargs)
    File "/usr/local/lib/python3.6/dist-packages/libvirt.py", line 1745, in migrateToURI3
      if ret == -1: raise libvirtError ('virDomainMigrateToURI3() failed', dom=self)
  libvirt.libvirtError: vcpussched attributes 'vcpus' must not overlap

Looking further, we see issues with the XML we are generating for the
destination. Compare what we have on the source before updating the
XML for the destination:

  DEBUG nova.virt.libvirt.migration [-] _update_numa_xml input xml=<domain type="kvm">
    ...
    <cputune>
      <shares>4096</shares>
      <vcpupin vcpu="0" cpuset="0"/>
      <vcpupin vcpu="1" cpuset="1"/>
      <vcpupin vcpu="2" cpuset="4"/>
      <vcpupin vcpu="3" cpuset="5"/>
      <emulatorpin cpuset="0-1"/>
      <vcpusched vcpus="2" scheduler="fifo" priority="1"/>
      <vcpusched vcpus="3" scheduler="fifo" priority="1"/>
    </cputune>
    ...
  </domain>
   {{(pid=12600) _update_numa_xml /opt/stack/nova/nova/virt/libvirt/migration.py:97}}

To what we have after the update:

  DEBUG nova.virt.libvirt.migration [-] _update_numa_xml output xml=<domain type="kvm">
    ...
    <cputune>
      <shares>4096</shares>
      <vcpupin vcpu="0" cpuset="0"/>
      <vcpupin vcpu="1" cpuset="1"/>
      <vcpupin vcpu="2" cpuset="4"/>
      <vcpupin vcpu="3" cpuset="5"/>
      <emulatorpin cpuset="0-1"/>
      <vcpusched vcpus="2-3" scheduler="fifo" priority="1"/>
      <vcpusched vcpus="3" scheduler="fifo" priority="1"/>
    </cputune>
    ...
  </domain>
   {{(pid=12600) _update_numa_xml /opt/stack/nova/nova/virt/libvirt/migration.py:131}}

The issue is the 'vcpusched' elements. We assume there is only one of
these elements when updating the XML for the destination [1]. We need
to figure out why there are multiple elements and how best to handle
them (likely by deleting and recreating everything).
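To illustrate the overlap (a minimal standalone sketch, not nova's
actual code; it uses the stdlib ElementTree rather than the lxml calls
nova uses): writing the full realtime mask into only the first
'vcpusched' element, which is what the single-element assumption
amounts to, leaves the second element's range overlapping it:

```python
# Minimal sketch (not nova's code): show how updating only the first
# <vcpusched> element leaves overlapping 'vcpus' ranges behind.
import xml.etree.ElementTree as ET

# <cputune> as rewritten by libvirt: one <vcpusched> per realtime vCPU.
cputune = ET.fromstring("""
<cputune>
  <vcpusched vcpus="2" scheduler="fifo" priority="1"/>
  <vcpusched vcpus="3" scheduler="fifo" priority="1"/>
</cputune>
""")

# Writing the full realtime mask ("2-3") into only the first element
# mirrors the single-element assumption in _update_numa_xml.
cputune.find('vcpusched').set('vcpus', '2-3')

ranges = [e.get('vcpus') for e in cputune.findall('vcpusched')]
print(ranges)  # ['2-3', '3']: vCPU 3 is claimed twice, so libvirt
               # rejects the XML with "vcpussched attributes 'vcpus'
               # must not overlap".
```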

I suspect the reason we didn't spot this is that libvirt is rewriting
the XML on us. This is what nova provides to libvirt at boot:

  DEBUG nova.virt.libvirt.driver [...] [instance: ...] End _get_guest_xml xml=<domain type="kvm">
    ...
    <cputune>
      <shares>4096</shares>
      <emulatorpin cpuset="0-1"/>
      <vcpupin vcpu="0" cpuset="0"/>
      <vcpupin vcpu="1" cpuset="1"/>
      <vcpupin vcpu="2" cpuset="4"/>
      <vcpupin vcpu="3" cpuset="5"/>
      <vcpusched vcpus="2-3" scheduler="fifo" priority="1"/>
    </cputune>
    ...
  </domain>
   {{(pid=12600) _get_guest_xml /opt/stack/nova/nova/virt/libvirt/driver.py:6331}}

but that has changed by the time we get around to recalculating things.

The solution is probably to remove all 'vcpusched' elements and
recreate them, rather than trying to update them in place.
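A rough sketch of that approach (illustrative only; 'rewrite_vcpusched'
is a hypothetical helper, not a nova function, and the range-collapsing
below assumes a contiguous vCPU set):

```python
# Hedged sketch of the proposed fix: drop every existing <vcpusched>
# element and recreate a single one from the destination realtime vCPU
# set, instead of editing the first element in place.
import xml.etree.ElementTree as ET

def rewrite_vcpusched(cputune, realtime_vcpus, scheduler='fifo', priority='1'):
    """Replace all <vcpusched> children with one freshly built element."""
    for elem in cputune.findall('vcpusched'):
        cputune.remove(elem)
    new = ET.SubElement(cputune, 'vcpusched')
    # Collapse the vCPU set into libvirt range syntax, e.g. {2, 3} -> "2-3".
    # (A real fix would need to handle non-contiguous sets too.)
    vcpus = sorted(realtime_vcpus)
    new.set('vcpus',
            '%d-%d' % (vcpus[0], vcpus[-1]) if len(vcpus) > 1
            else str(vcpus[0]))
    new.set('scheduler', scheduler)
    new.set('priority', priority)

cputune = ET.fromstring(
    '<cputune>'
    '<vcpusched vcpus="2" scheduler="fifo" priority="1"/>'
    '<vcpusched vcpus="3" scheduler="fifo" priority="1"/>'
    '</cputune>')
rewrite_vcpusched(cputune, {2, 3})
# Only one <vcpusched vcpus="2-3" .../> remains, so the ranges can no
# longer overlap regardless of how libvirt rewrote the source XML.
```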

[1]
https://github.com/openstack/nova/blob/21.0.0/nova/virt/libvirt/migration.py#L152-L155

** Affects: nova
     Importance: Medium
     Assignee: Stephen Finucane (stephenfinucane)
         Status: Confirmed


** Tags: libvirt live-migration numa realtime

** Changed in: nova
       Status: New => Confirmed

** Changed in: nova
   Importance: Undecided => Medium

** Changed in: nova
     Assignee: (unassigned) => Stephen Finucane (stephenfinucane)

** Tags added: numa

** Tags added: libvirt live-migration

** Tags added: realtime

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1889257


To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1889257/+subscriptions
