Message #83439
[Bug 1889257] Re: Live migration of realtime instances is broken
Reviewed: https://review.opendev.org/743568
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=b6aef1ec4f9848f85e1f367e560c2bdb703fa110
Submitter: Zuul
Branch: master
commit b6aef1ec4f9848f85e1f367e560c2bdb703fa110
Author: Stephen Finucane <stephenfin@xxxxxxxxxx>
Date: Tue Jul 28 16:22:39 2020 +0100
Handle multiple 'vcpusched' elements during live migrate
When live migrating a pinned instance, we recalculate pinning
information for the destination host and then update the instance's XML
before spawning the instance there. As part of the pinning information
recalculation, we must also recalculate information for realtime cores,
which are configured using the '<vcpusched>' element. The
'nova.virt.libvirt.migration._update_numa_xml' function, which handles
this updating, was assuming there would only be one of these elements.
This is a reasonably sane assumption since this is all we create in the
'nova.virt.libvirt.LibvirtDriver._get_guest_numa_config' function used
to generate the initial instance XML. However, a look at the logs shows that
at least some (all?) versions of libvirt actually rewrite the XML we're
providing them. Compare what is returned from '_get_guest_xml':
DEBUG nova.virt.libvirt.driver [...] [instance: ...] End _get_guest_xml xml=<domain type="kvm">
...
<cputune>
<shares>4096</shares>
...
<vcpusched vcpus="2-3" scheduler="fifo" priority="1"/>
</cputune>
...
</domain>
{{(pid=12600) _get_guest_xml /opt/stack/nova/nova/virt/libvirt/driver.py:6331}}
to what is seen when we enter '_update_numa_xml' (or via 'virsh dumpxml'
at any point after instance creation):
DEBUG nova.virt.libvirt.migration [-] _update_numa_xml input xml=<domain type="kvm">
...
<cputune>
<shares>4096</shares>
...
<vcpusched vcpus="2" scheduler="fifo" priority="1"/>
<vcpusched vcpus="3" scheduler="fifo" priority="1"/>
</cputune>
...
</domain>
{{(pid=12600) _update_numa_xml /opt/stack/nova/nova/virt/libvirt/migration.py:97}}
The solution is simple: rather than trying to modify the existing XML,
simply scrap it and rebuild the elements from scratch. We should
probably do this for all elements, but that can/should be tackled
separately.
Change-Id: Ic01603a91f6099f1068af0e955f3e1056021d673
Signed-off-by: Stephen Finucane <stephenfin@xxxxxxxxxx>
Closes-Bug: #1889257
** Changed in: nova
Status: In Progress => Fix Released
--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1889257
Title:
Live migration of realtime instances is broken
Status in OpenStack Compute (nova):
Fix Released
Bug description:
Attempting to live migrate an instance with realtime enabled fails on
master (commit d4c857dfcb1). This appears to be a bug with the live
migration of pinned instances feature introduced in Train.
# Steps to reproduce
Create a server using realtime attributes and then attempt to live
migrate it. For example:
$ openstack flavor create --ram 1024 --disk 0 --vcpus 4 \
--property 'hw:cpu_policy=dedicated' \
--property 'hw:cpu_realtime=yes' \
--property 'hw:cpu_realtime_mask=^0-1' \
realtime
$ openstack server create --os-compute-api-version=2.latest \
--flavor realtime --image cirros-0.5.1-x86_64-disk --nic none \
--boot-from-volume 1 --wait \
test.realtime
$ openstack server migrate --live-migration test.realtime
# Expected result
Instance should be live migrated.
# Actual result
The live migration never happens. Looking at the logs we see the
following error:
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/eventlet/hubs/hub.py", line 461, in fire_timers
timer()
File "/usr/local/lib/python3.6/dist-packages/eventlet/hubs/timer.py", line 59, in __call__
cb(*args, **kw)
File "/usr/local/lib/python3.6/dist-packages/eventlet/event.py", line 175, in _do_send
waiter.switch(result)
File "/usr/local/lib/python3.6/dist-packages/eventlet/greenthread.py", line 221, in main
result = function(*args, **kwargs)
File "/opt/stack/nova/nova/utils.py", line 670, in context_wrapper
return func(*args, **kwargs)
File "/opt/stack/nova/nova/virt/libvirt/driver.py", line 8966, in _live_migration_operation
# is still ongoing, or failed
File "/usr/local/lib/python3.6/dist-packages/oslo_utils/excutils.py", line 220, in __exit__
self.force_reraise()
File "/usr/local/lib/python3.6/dist-packages/oslo_utils/excutils.py", line 196, in force_reraise
six.reraise(self.type_, self.value, self.tb)
File "/usr/local/lib/python3.6/dist-packages/six.py", line 703, in reraise
raise value
File "/opt/stack/nova/nova/virt/libvirt/driver.py", line 8959, in _live_migration_operation
# 2. src==running, dst==paused
File "/opt/stack/nova/nova/virt/libvirt/guest.py", line 658, in migrate
destination, params=params, flags=flags)
File "/usr/local/lib/python3.6/dist-packages/eventlet/tpool.py", line 190, in doit
result = proxy_call(self._autowrap, f, *args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/eventlet/tpool.py", line 148, in proxy_call
rv = execute(f, *args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/eventlet/tpool.py", line 129, in execute
six.reraise(c, e, tb)
File "/usr/local/lib/python3.6/dist-packages/six.py", line 703, in reraise
raise value
File "/usr/local/lib/python3.6/dist-packages/eventlet/tpool.py", line 83, in tworker
rv = meth(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/libvirt.py", line 1745, in migrateToURI3
if ret == -1: raise libvirtError ('virDomainMigrateToURI3() failed', dom=self)
libvirt.libvirtError: vcpussched attributes 'vcpus' must not overlap
Looking further, we see there are issues with the XML we are
generating for the destination. Compare what we have on the source
before updating the XML for the destination:
DEBUG nova.virt.libvirt.migration [-] _update_numa_xml input xml=<domain type="kvm">
...
<cputune>
<shares>4096</shares>
<vcpupin vcpu="0" cpuset="0"/>
<vcpupin vcpu="1" cpuset="1"/>
<vcpupin vcpu="2" cpuset="4"/>
<vcpupin vcpu="3" cpuset="5"/>
<emulatorpin cpuset="0-1"/>
<vcpusched vcpus="2" scheduler="fifo" priority="1"/>
<vcpusched vcpus="3" scheduler="fifo" priority="1"/>
</cputune>
...
</domain>
{{(pid=12600) _update_numa_xml /opt/stack/nova/nova/virt/libvirt/migration.py:97}}
To what we have after the update:
DEBUG nova.virt.libvirt.migration [-] _update_numa_xml output xml=<domain type="kvm">
...
<cputune>
<shares>4096</shares>
<vcpupin vcpu="0" cpuset="0"/>
<vcpupin vcpu="1" cpuset="1"/>
<vcpupin vcpu="2" cpuset="4"/>
<vcpupin vcpu="3" cpuset="5"/>
<emulatorpin cpuset="0-1"/>
<vcpusched vcpus="2-3" scheduler="fifo" priority="1"/>
<vcpusched vcpus="3" scheduler="fifo" priority="1"/>
</cputune>
...
</domain>
{{(pid=12600) _update_numa_xml /opt/stack/nova/nova/virt/libvirt/migration.py:131}}
The issue is the 'vcpusched' elements. We're assuming there is only
one of these elements when updating the XML for the destination [1].
We have to figure out why there are multiple elements and how best to
handle this (likely by deleting and recreating everything).
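The overlap that libvirt rejects can be checked for directly. The following is a minimal sketch, not nova code: the helper names 'parse_cpu_ranges' and 'overlapping_vcpus' are hypothetical, it uses the standard library's ElementTree rather than whatever nova uses internally, and it only understands plain 'N' and 'N-M' entries (not the '^' exclusion syntax).

```python
import xml.etree.ElementTree as ET


def parse_cpu_ranges(spec):
    """Expand a simple CPU spec such as '2-3,5' into a set of ints.

    Hypothetical helper: only 'N' and 'N-M' entries are handled.
    """
    cpus = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            low, high = part.split("-", 1)
            cpus.update(range(int(low), int(high) + 1))
        else:
            cpus.add(int(part))
    return cpus


def overlapping_vcpus(domain_xml):
    """Return the vCPUs named by more than one <vcpusched> element."""
    root = ET.fromstring(domain_xml)
    seen = set()
    overlaps = set()
    for elem in root.findall("./cputune/vcpusched"):
        cpus = parse_cpu_ranges(elem.get("vcpus", ""))
        overlaps |= seen & cpus  # anything already claimed by an earlier element
        seen |= cpus
    return overlaps
```

Run against the '_update_numa_xml' output above, a check like this would flag vCPU 3, which appears in both the rewritten first element ('2-3') and the untouched second element ('3').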
I suspect the reason we didn't spot this is because libvirt is
rewriting the XML on us. This is what nova is providing libvirt upon
boot:
DEBUG nova.virt.libvirt.driver [...] [instance: ...] End _get_guest_xml xml=<domain type="kvm">
...
<cputune>
<shares>4096</shares>
<emulatorpin cpuset="0-1"/>
<vcpupin vcpu="0" cpuset="0"/>
<vcpupin vcpu="1" cpuset="1"/>
<vcpupin vcpu="2" cpuset="4"/>
<vcpupin vcpu="3" cpuset="5"/>
<vcpusched vcpus="2-3" scheduler="fifo" priority="1"/>
</cputune>
...
</domain>
{{(pid=12600) _get_guest_xml /opt/stack/nova/nova/virt/libvirt/driver.py:6331}}
but that's changed by the time we get to recalculating things.
The solution is probably to remove all 'vcpusched' elements and
recreate them, rather than trying to update them in place.
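The remove-and-recreate approach could look roughly like the sketch below. This is an illustration under stated assumptions, not the actual nova change: 'rebuild_vcpusched' and 'format_cpu_ranges' are hypothetical names, the 'fifo' scheduler and priority 1 defaults are simply copied from the logs above, and the standard library's ElementTree stands in for nova's own XML handling.

```python
import xml.etree.ElementTree as ET


def format_cpu_ranges(cpus):
    """Collapse a set of ints into a range string, e.g. {2, 3, 5} -> '2-3,5'."""
    parts = []
    start = prev = None
    for cpu in sorted(cpus):
        if start is None:
            start = prev = cpu
        elif cpu == prev + 1:
            prev = cpu  # extend the current run
        else:
            parts.append(str(start) if start == prev else f"{start}-{prev}")
            start = prev = cpu
    if start is not None:
        parts.append(str(start) if start == prev else f"{start}-{prev}")
    return ",".join(parts)


def rebuild_vcpusched(domain_xml, realtime_vcpus, scheduler="fifo", priority=1):
    """Drop every <vcpusched> under <cputune> and add a single fresh one.

    Hypothetical helper: 'realtime_vcpus' is the set of guest vCPU IDs that
    should be realtime on the destination.
    """
    root = ET.fromstring(domain_xml)
    cputune = root.find("cputune")
    if cputune is None:
        return domain_xml  # nothing to rewrite
    # Remove whatever is there, however many elements libvirt split it into.
    for elem in cputune.findall("vcpusched"):
        cputune.remove(elem)
    if realtime_vcpus:
        ET.SubElement(cputune, "vcpusched", {
            "vcpus": format_cpu_ranges(realtime_vcpus),
            "scheduler": scheduler,
            "priority": str(priority),
        })
    return ET.tostring(root, encoding="unicode")
```

Because the existing elements are discarded rather than edited, the result does not depend on whether libvirt kept one 'vcpus="2-3"' element or split it into one element per vCPU.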
[1]
https://github.com/openstack/nova/blob/21.0.0/nova/virt/libvirt/migration.py#L152-L155
To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1889257/+subscriptions