[Bug 1994002] Re: [SRU] migration was active, but no RAM info was set

These are the steps for a synthetic reproducer with GDB and QEMU.

It's sufficient to validate the change on QEMU and move forward with the SRUs.
(We could also exercise libvirt to reach the actual error message, but that's a bonus.)

I'll check/coordinate on the QEMU uploads next week.

cheers,
Mauricio

...


Original code.

The race condition is: one thread can read 's->state' as SETUP (lines 1078/1083/1086),
and another thread can change it to ACTIVE _before_ the first thread assigns the
(now different) 's->state' to 'info->status' (line 1123), so the returned info will
NOT have the RAM info expected for the ACTIVE status (lines 1087/1096).

	1056 static void fill_source_migration_info(MigrationInfo *info)
	1057 {
	1058     MigrationState *s = migrate_get_current();
	...
	1078     switch (s->state) {
	...
	1083     case MIGRATION_STATUS_SETUP:
	...
	1086         break;
	1087     case MIGRATION_STATUS_ACTIVE:
	...
	1096         populate_ram_info(info, s);
	...
	1123     info->status = s->state;
	1124 }
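
For context, a minimal sketch of the fix pattern being validated here:
read 's->state' once into a local and use that snapshot consistently,
so the reported status can never get ahead of the data populated for it.
(This illustrates the approach, not the verbatim upstream patch; helper
names and line numbers vary by QEMU version.)

        static void fill_source_migration_info(MigrationInfo *info)
        {
            MigrationState *s = migrate_get_current();
            /* Snapshot the state once; a concurrent SETUP -> ACTIVE
             * transition can no longer be observed between the switch
             * below and the final 'info->status' assignment. */
            int state = qatomic_read(&s->state);

            switch (state) {
            case MIGRATION_STATUS_SETUP:
                break;
            case MIGRATION_STATUS_ACTIVE:
                populate_ram_info(info, s);
                break;
            /* ... other cases as in the original ... */
            }
            info->status = state;
        }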

We'll break on the function that changes migration states:

        void migrate_set_state(int *state, int old_state, int new_state)
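
(For reference, in the affected versions this function is essentially a
single atomic compare-and-swap, so every successful state transition
passes through it. Sketched from memory; details differ slightly across
releases:)

        void migrate_set_state(int *state, int old_state, int new_state)
        {
            /* Transition only if nobody else changed the state first. */
            if (qatomic_cmpxchg(state, old_state, new_state) == old_state) {
                migrate_generate_event(new_state);
            }
        }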


Terminal 1)

QEMU process for incoming migration (i.e., the destination):

qemu-system-x86_64 -nodefaults -nographic -S -incoming tcp:0:4444

Terminal 2)

QEMU process for outgoing migration (i.e., the source),
with the QEMU monitor on port 3333,
and GDB in non-stop mode
(plus debuginfod for debug symbols):


gdb \
  -ex 'set non-stop on' -ex 'set pagination off' -ex 'set confirm off' \
  -iex 'set debuginfod enabled on' -iex 'set debuginfod urls "https://debuginfod.ubuntu.com"' \
  qemu-system-x86_64

Set breakpoints on the function above
(and on the line of the 'break' statement under 'case MIGRATION_STATUS_SETUP'):

b migrate_set_state
b migration/migration.c:1086

run -nodefaults -nographic -S -monitor tcp:0:3333,server,wait=off
...

Terminal 3)

Connect to the QEMU monitor with netcat,
and start the migration:

nc 127.0.0.1 3333
...
(qemu) migrate -d tcp:127.0.0.1:4444


Terminal 2)

GDB breaks on change from migration status NONE to SETUP.

Thread 1 "qemu-system-x86" hit Breakpoint 1, migrate_set_state
(state=0x5555566f8628, old_state=0, new_state=1) at
../../migration/migration.c:1746

(gdb) p (MigrationStatus) 0
$1 = MIGRATION_STATUS_NONE

(gdb) p (MigrationStatus) 1
$2 = MIGRATION_STATUS_SETUP

Just continue.

(gdb) c

GDB breaks on change from migration status SETUP to ACTIVE.

Thread 5 "qemu-system-x86" hit Breakpoint 1, migrate_set_state
(state=0x5555566f8628, old_state=1, new_state=4) at
../../migration/migration.c:1746

(gdb) p (MigrationStatus) 1
$2 = MIGRATION_STATUS_SETUP

(gdb) p (MigrationStatus) 4
$3 = MIGRATION_STATUS_ACTIVE
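
(These numeric values come from the qapi-generated MigrationStatus enum.
An illustrative excerpt, consistent with the casts printed in this
session; the full enum is generated from qapi/migration.json, and the
exact numbering depends on the QEMU version:)

        typedef enum MigrationStatus {
            MIGRATION_STATUS_NONE = 0,
            MIGRATION_STATUS_SETUP = 1,
            /* ... */
            MIGRATION_STATUS_ACTIVE = 4,
            /* ... */
        } MigrationStatus;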

Let's explore the race condition.


Terminal 3)

Check the migration information:

(qemu) info migrate


Terminal 2)

GDB breaks on the migration information function.

The status is now observed as SETUP (not yet ACTIVE),
and is not yet propagated to the migration information
to be returned to the monitor.

Thread 1 "qemu-system-x86" hit Breakpoint 2, fill_source_migration_info
(info=0x555556d65c70) at ../../migration/migration.c:1086

(gdb) p (MigrationStatus) s.state
$4 = MIGRATION_STATUS_SETUP

(gdb) p info.status
$5 = MIGRATION_STATUS_NONE


Now, allow the other thread to continue and change the status to ACTIVE.
(Thread 1 has meanwhile been stepped forward to line 1123, just before
the 'info->status' assignment, as the thread listing below shows.)

(gdb) info threads
  Id   Target Id                                          Frame
* 1    Thread 0x7ffff6c21e80 (LWP 995) "qemu-system-x86"  fill_source_migration_info (info=0x555556d65c70) at ../../migration/migration.c:1123
  2    Thread 0x7ffff65ff640 (LWP 998) "qemu-system-x86"  (running)
  3    Thread 0x7ffff5d7c640 (LWP 999) "qemu-system-x86"  (running)
  5    Thread 0x7ffff49ff640 (LWP 1002) "qemu-system-x86" migrate_set_state (state=0x5555566f8628, old_state=1, new_state=4) at ../../migration/migration.c:1746

(gdb) thread 5
[Switching to thread 5 (Thread 0x7ffff49ff640 (LWP 1002))]
#0  migrate_set_state (state=0x5555566f8628, old_state=1, new_state=4) at ../../migration/migration.c:1746
1746    in ../../migration/migration.c

(gdb) continue &
Continuing.

The first thread now observes the ACTIVE status.

(gdb) info threads
  Id   Target Id                                          Frame
  1    Thread 0x7ffff6c21e80 (LWP 995) "qemu-system-x86"  fill_source_migration_info (info=0x555556d65c70) at ../../migration/migration.c:1123
  2    Thread 0x7ffff65ff640 (LWP 998) "qemu-system-x86"  (running)
  3    Thread 0x7ffff5d7c640 (LWP 999) "qemu-system-x86"  (running)
* 5    Thread 0x7ffff49ff640 (LWP 1002) "qemu-system-x86" (running)
(gdb)

(gdb) thread 1
[Switching to thread 1 (Thread 0x7ffff6c21e80 (LWP 995))]
#0  fill_source_migration_info (info=0x555556d65c70) at ../../migration/migration.c:1123
1123    in ../../migration/migration.c

(gdb) p (MigrationStatus) s.state
$7 = MIGRATION_STATUS_ACTIVE


Done: the next statement in the code assigns the ACTIVE
status to the migration information, but it did NOT add
any RAM statistics (since the switch executed for the SETUP status).

Just continue, and check the resulting migration info:

(gdb) c


Terminal 3)
... info migrate
...
Migration status: active
total time: 0 ms
(qemu)

Migration status is active, without any RAM statistics.
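
This is exactly the window libvirt can hit: when 'query-migrate' reports
status 'active' but the reply carries no 'ram' member, libvirt fails with
the error in this bug's title. A hypothetical, simplified sketch of that
check (paraphrased; the real code lives in libvirt's qemu_monitor_json.c):

        /* Hypothetical sketch: how an 'active' status without RAM info
         * becomes "migration was active, but no RAM info was set". */
        #include <stdbool.h>
        #include <stdio.h>
        #include <string.h>

        static int check_migration_reply(const char *status, bool have_ram)
        {
            if (strcmp(status, "active") == 0 && !have_ram) {
                fprintf(stderr, "internal error: migration was active, "
                                "but no RAM info was set\n");
                return -1;
            }
            return 0;
        }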


...
...
...

Note that normally, without exploring the race condition,
the status is 'active' and there _are_ memory statistics:

[just a summary from gdb/monitor steps]

...
(qemu) migrate -d tcp:127.0.0.1:4444

...
Thread 1 "qemu-system-x86" hit Breakpoint 1, migrate_set_state (state=0x5555566f8628, old_state=0, new_state=1) at ../../migration/migration.c:1746
(gdb) c
Thread 5 "qemu-system-x86" hit Breakpoint 1, migrate_set_state (state=0x5555566f8628, old_state=1, new_state=4) at ../../migration/migration.c:1746
(gdb) thread 5
(gdb) c
Thread 5 "qemu-system-x86" hit Breakpoint 1, migrate_set_state (state=0x5555566f8628, old_state=4, new_state=8) at ../../migration/migration.c:1746

(qemu) info migrate
...
Migration status: active
total time: 41387 ms
expected downtime: 300 ms
setup: 22590 ms
transferred ram: 485 kbytes
throughput: 0.18 mbps
remaining ram: 0 kbytes
total ram: 131592 kbytes
duplicate: 32849 pages
skipped: 0 pages
normal: 49 pages
normal bytes: 196 kbytes
dirty sync count: 3
page size: 4 kbytes
multifd bytes: 0 kbytes
pages-per-second: 1453
(qemu)

(gdb) c
...

-- 
You received this bug notification because you are a member of SE SRU
("STS") Sponsors, which is subscribed to the bug report.
https://bugs.launchpad.net/bugs/1994002

Title:
  [SRU] migration was active, but no RAM info was set

Status in Ubuntu Cloud Archive:
  New
Status in Ubuntu Cloud Archive ussuri series:
  New
Status in qemu package in Ubuntu:
  Fix Released
Status in qemu source package in Bionic:
  In Progress
Status in qemu source package in Focal:
  In Progress
Status in qemu source package in Jammy:
  In Progress
Status in qemu source package in Kinetic:
  In Progress

Bug description:
  [Impact]

   * While live-migrating many instances concurrently, libvirt sometimes
  returns `internal error: migration was active, but no RAM info was
  set:`

   * Effects of this bug are mostly observed in large scale clusters
  with a lot of live migration activity.

   * Has second-order effects for consumers of the migration monitor,
  such as libvirt and OpenStack.

  [Test Case]
  Steps to Reproduce:
  1. live evacuate a compute
  2. live migration of one or more instances fails with the above error

  N.B. Due to the nature of this bug, it is difficult to reproduce consistently.
  In an environment where it has been observed, it is estimated to occur in approximately 1 in 1000 migrations.

  [Where problems could occur]
   * In the event of a regression, the migration monitor may report an inconsistent state.

  [Original Bug Description]

  While live-migrating many instances concurrently, libvirt sometimes returns internal error: migration was active, but no RAM info was set:
  ~~~
  2022-03-30 06:08:37.197 7 WARNING nova.virt.libvirt.driver [req-5c3296cf-88ee-4af6-ae6a-ddba99935e23 - - - - -] [instance: af339c99-1182-4489-b15c-21e52f50f724] Error monitoring migration: internal error: migration was active, but no RAM info was set: libvirt.libvirtError: internal error: migration was active, but no RAM info was set
  ~~~

  From upstream bug: https://bugzilla.redhat.com/show_bug.cgi?id=2074205

  [Other Information]
  Related bug: https://bugs.launchpad.net/nova/+bug/1982284

To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-archive/+bug/1994002/+subscriptions