[Bug 2128665] [NEW] Activate port bindings when instances are paused during live migration

 

Public bug reported:

Description
===========
This is more for gathering some feedback on improving live migrations than a bug that breaks basic functionality. Please let me know if a blueprint should be created for this.

According to this change that was merged back in 2018 [1], during live
migrations, when nova-compute on the source host receives libvirt's
`libvirt.VIR_DOMAIN_EVENT_SUSPENDED_MIGRATED` event, it treats it as a
`virtevent.EVENT_LIFECYCLE_MIGRATION_COMPLETED` event, which causes the
`handle_lifecycle_event` function to run `migrate_instance_start` and
activate the instance's port bindings on the destination host. As the
commit message suggests, this is very helpful for reducing the downtime
of the instance in pre-copy migrations: after the instance has been
paused on the source host and resumed on the destination host, libvirt
still needs to clean up the stale QEMU process on the source host,
which may take additional time to finish and has no effect on the
already-resumed QEMU process on the destination host. Only after the
cleanup is done does the libvirt API return, and only then does
post-live-migration start. Thus, the instance's downtime will be longer
if we only activate port bindings during post-live-migration.
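
To make that path concrete, here is a minimal sketch of the flow being
described; it is an illustration only, not the actual nova code, and the
standalone function signatures are my own simplification:

```python
# Illustrative sketch of the event path described above -- NOT the real nova
# code. The event constants and migrate_instance_start() are the ones
# referenced in this report; the function signatures are simplified.
import libvirt
from nova.virt import event as virtevent


def lifecycle_event_for(event, detail):
    """Map the libvirt suspend event to a nova lifecycle event (per [1])."""
    if (event == libvirt.VIR_DOMAIN_EVENT_SUSPENDED
            and detail == libvirt.VIR_DOMAIN_EVENT_SUSPENDED_MIGRATED):
        # The guest was paused because live migration reached switchover.
        return virtevent.EVENT_LIFECYCLE_MIGRATION_COMPLETED
    return virtevent.EVENT_LIFECYCLE_PAUSED


def handle_lifecycle_event(network_api, context, instance, migration, event):
    """Sketch of the compute-manager side: activate dest bindings early."""
    if event == virtevent.EVENT_LIFECYCLE_MIGRATION_COMPLETED:
        # Activate the destination host port bindings while libvirt is still
        # cleaning up the source QEMU process, instead of waiting for
        # post-live-migration.
        network_api.migrate_instance_start(context, instance, migration)
```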

Then, as Sean discovered in this bug report [2], it is possible for the
migration to fail after the instance has been paused on the source host
(with this suspended event generated), and nova-compute cannot properly
roll back the port binding changes if we have already activated the
dest port bindings. So, in the commit that fixes this issue [3], we
check the migration job status when the
`libvirt.VIR_DOMAIN_EVENT_SUSPENDED_MIGRATED` event is received, to
make sure nova-compute only activates the dest port bindings when the
migration has succeeded (a simplified sketch of this check follows the
workflow below). However, the problem is that by the time the source
host's nova-compute receives this event, some tricky things may happen.

The overall workflow of live migrations on libvirt & QEMU, according to
my reading of the code, is:

1. (source_qemu) pause_instance
2. (source_libvirt) instance_paused_event
3. (dest_qemu) confirm_migration_ok
4. (dest_libvirt) resume_instance
5. (source_libvirt) clean_up_source_instance
6. (source_libvirt) cleanup_ok_and_api_return_success
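
For context, the check added in [3] looks roughly like the following;
this is a simplified sketch based on my reading of the fix, not the
literal nova code:

```python
# Simplified sketch of the check introduced by [3]: when the
# SUSPENDED_MIGRATED event arrives, query the job status first and only emit
# the "migration completed" lifecycle event if libvirt reports the job as
# completed. Not the literal nova code.
import libvirt
from nova.virt import event as virtevent


def transition_for_suspended_migrated(guest):
    job = guest.get_job_info()  # wraps libvirt's domain job-info query
    if job.type == libvirt.VIR_DOMAIN_JOB_COMPLETED:
        return virtevent.EVENT_LIFECYCLE_MIGRATION_COMPLETED
    # Otherwise treat it as a plain pause; post-live-migration (or rollback)
    # will take care of the port bindings instead.
    return virtevent.EVENT_LIFECYCLE_PAUSED
```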

What libvirt writes in its documentation for `VIR_DOMAIN_JOB_COMPLETED`
[4] is "Job has finished, but isn't cleaned up". However, what I found
with my setup is that if you query the domain's job status while the
domain is being cleaned up (i.e., between steps 5 and 6), the query
hangs until the cleanup is finished (i.e., at step 6), most likely
because libvirt holds a lock on the domain during this process.

What this means for `get_job_info` is:

- If the query arrives between steps 1 and 4, the job is not done yet, so the check fails and the port bindings will be activated during post-live-migration.
- If the query arrives between steps 4 and 5, libvirt may return `VIR_DOMAIN_JOB_COMPLETED`.
- If the query arrives after step 5, it blocks until the cleanup is finished, so the port bindings will be activated essentially when the API returns success, which is roughly at the start of post-live-migration as well. (A minimal example of such a query follows this list.)
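
For reference, this is roughly what such a job-status query looks like
with the libvirt Python bindings used directly (the connection URI and
domain name are placeholders; nova wraps a similar call behind
`get_job_info`):

```python
# Minimal standalone example of querying the migration job status. During
# steps 5-6 the jobInfo() call can block until the source-side cleanup
# finishes, which is the behaviour observed in this report.
import libvirt

conn = libvirt.open('qemu:///system')          # placeholder URI
dom = conn.lookupByName('instance-00000001')   # placeholder domain name

info = dom.jobInfo()                           # index 0 is the job type
if info[0] == libvirt.VIR_DOMAIN_JOB_COMPLETED:
    print('libvirt reports the migration job as completed')
else:
    print('job type: %d' % info[0])
```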

Steps 4-5 complete very quickly compared to steps 1-4 and steps 5-6
(especially steps 5-6: it can take a long time for libvirt to clean up
the QEMU process; for example, for a 64 GiB instance backed by regular
4 KiB memory pages, it can take the kernel seconds to free all of the
used memory resources). Thus the chance of nova-compute managing to see
the `VIR_DOMAIN_JOB_COMPLETED` job status without being blocked is very
low, which means the port bindings will most likely still be activated
in post-live-migration. And since our goal is to activate port bindings
as early as possible, this is not ideal.
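
As a back-of-the-envelope illustration of why steps 5-6 can take
seconds (the per-page cost below is a hypothetical ballpark, not a
measurement):

```python
# Rough arithmetic only; the per-page teardown cost is an assumed ballpark.
instance_ram = 64 * 1024**3       # 64 GiB of guest RAM
page_size = 4 * 1024              # regular 4 KiB pages
pages = instance_ram // page_size
print(pages)                      # 16777216 pages to unmap and free

per_page_cost_ns = 100            # hypothetical per-page kernel cost
print(pages * per_page_cost_ns / 1e9, 'seconds')   # ~1.7 s of kernel work
```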

Also note that this (the fact that the query may hang) applies to the
`_live_migration_monitor` loop in the libvirt driver too: by the time
the loop is able to see `VIR_DOMAIN_JOB_COMPLETED`, the libvirt API has
also returned and the cleanup is done.

From my experiment migrating a small instance, the actual port binding
activation request was issued in post-live-migration via
`_live_migration_monitor` (I added a simple statement to log the call
stack before sending port binding activation requests):

```
File "/usr/lib/python3/dist-packages/eventlet/greenpool.py", line 87, in _spawn_n_impl
    func(*args, **kwargs)
File "/usr/lib/python3/dist-packages/futurist/_green.py", line 71, in __call__
    self.work.run()
File "/usr/lib/python3/dist-packages/futurist/_utils.py", line 49, in run
    result = self.fn(*self.args, **self.kwargs)
File "/usr/lib/python3/dist-packages/nova/utils.py", line 664, in context_wrapper
    return func(*args, **kwargs)
File "/usr/lib/python3/dist-packages/nova/compute/manager.py", line 8988, in _do_live_migration
    self.driver.live_migration(context, instance, dest,
File "/usr/lib/python3/dist-packages/nova/virt/libvirt/driver.py", line 10521, in live_migration
    self._live_migration(context, instance, dest,
File "/usr/lib/python3/dist-packages/nova/virt/libvirt/driver.py", line 11060, in _live_migration
    self._live_migration_monitor(context, instance, guest, dest,
File "/usr/lib/python3/dist-packages/nova/virt/libvirt/driver.py", line 10973, in _live_migration_monitor
    post_method(context, instance, dest, block_migration,
File "/usr/lib/python3/dist-packages/nova/compute/manager.py", line 9269, in _post_live_migration_update_host
    self._post_live_migration(
File "/usr/lib/python3/dist-packages/nova/exception_wrapper.py", line 63, in wrapped
    return f(self, context, *args, **kw)
File "/usr/lib/python3/dist-packages/nova/compute/manager.py", line 204, in decorated_function
    return function(self, context, *args, **kwargs)
File "/usr/lib/python3/dist-packages/nova/compute/manager.py", line 9360, in _post_live_migration
    self.network_api.migrate_instance_start(ctxt, instance, migration)
File "/usr/lib/python3/dist-packages/nova/network/neutron.py", line 3178, in migrate_instance_start
    LOG.info("\n".join([line.strip() for line in traceback.format_stack()]))
```

This means that the `get_job_info` check in [3] was not able to
activate the port bindings any sooner.

There are lots of details here, but the TL;DR is that querying the job
status may not be that reliable. If we want to reduce the downtime of
the instance, we should activate port bindings at the
`VIR_DOMAIN_EVENT_SUSPENDED_MIGRATED` event without checking the live
migration status, and roll back the port binding changes if the live
migration fails later. Here are my ideas on improving this:

- Let's activate port bindings as soon as we receive the `libvirt.VIR_DOMAIN_EVENT_SUSPENDED_MIGRATED` event, like how it was done in [1], without checking the status of the live migration.
- And let's add a rollback function to be called in `_rollback_live_migration` when the live migration fails at the end (a rough sketch follows this list), which will:
    - For each port, check whether it has an active port binding on the source host.
    - If yes, we can safely assume that we never received a `libvirt.VIR_DOMAIN_EVENT_SUSPENDED_MIGRATED` event and the port bindings on the destination host were never activated, so we do nothing.
    - If no, then we must have activated the port bindings on the destination host, so let's re-activate the port bindings on the source host.
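
A rough sketch of what that rollback helper could look like, written
against the Neutron port bindings API rather than nova's internal
network API; the endpoint URL and token handling are placeholders, and
error handling plus the exact integration point in
`_rollback_live_migration` are left out:

```python
# Sketch of the proposed rollback step, using the Neutron port bindings
# extension (GET /ports/{port}/bindings and
# PUT /ports/{port}/bindings/{host}/activate). Auth and endpoint are
# placeholders; this is an idea sketch, not a proposed patch.
import requests

NEUTRON = 'http://neutron.example:9696/v2.0'   # placeholder endpoint
HEADERS = {'X-Auth-Token': '<token>'}          # placeholder auth


def rollback_port_bindings(port_ids, source_host):
    for port_id in port_ids:
        resp = requests.get(f'{NEUTRON}/ports/{port_id}/bindings',
                            headers=HEADERS)
        bindings = resp.json().get('bindings', [])
        source = [b for b in bindings if b.get('host') == source_host]
        if source and source[0].get('status') == 'ACTIVE':
            # No SUSPENDED_MIGRATED event was handled, so the dest bindings
            # were never activated: nothing to roll back for this port.
            continue
        # The dest bindings were activated early; re-activate the source one.
        requests.put(
            f'{NEUTRON}/ports/{port_id}/bindings/{source_host}/activate',
            headers=HEADERS)
```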

Basically, I would like to write some code to address the issue that's
mentioned in [3]: "As a result, failed live migrations will
inadvertantly trigger activation of the port bindings on the
destination host, which deactivates the source host port bindings, and
then _rollback_live_migration will delete those activated dest host
port bindings and leave the source host port bindings deactivated."

Before I start testing and submitting a fix, I'd like to gather some
feedback on whether my idea makes sense. Thank you!


Steps to reproduce
==================
Run `openstack server migrate <test-instance> --live-migration --wait` with debug logging enabled and check in the logs when the port binding activation requests are sent.

Expected result
===============
Port binding activation requests should be sent as early as when the instance is paused on the source host.

Actual result
=============
Port binding activation requests are sent roughly at the start of post-live-migration, which leads to additional downtime for the instance.

Environment
===========
OpenStack Release: Caracal/2024.1
nova-compute: 29.2.0
Hypervisor: Libvirt + QEMU/KVM
QEMU: 6.2.0
libvirt: 8.0.0
Kernel: 6.8.0-59-generic
OS: Ubuntu 22.04.5 LTS
Network: calico-felix

[1]: https://review.opendev.org/c/openstack/nova/+/434870
[2]: https://bugs.launchpad.net/nova/+bug/1788014
[3]: https://github.com/openstack/nova/commit/aa87b9c288d316b85079e681e0df24354ec1912c
[4]: https://libvirt.org/html/libvirt-libvirt-domain.html#virDomainJobType

** Affects: nova
     Importance: Undecided
         Status: New
