[Bug 1646212] [NEW] unshelved offloaded instance is unexpectedly stopped during ceph job
Public bug reported:
Noticed this during a ceph job tempest run:
http://logs.openstack.org/08/404508/3/check/gate-tempest-dsvm-full-devstack-plugin-ceph-ubuntu-xenial/f2ef8c2/console.html#_2016-11-30_10_23_30_682154
2016-11-30 10:23:30.682154 | tempest.api.compute.servers.test_server_actions.ServerActionsTestJSON.test_shelve_unshelve_server[id-77eba8e0-036e-4635-944b-f7a8f3b78dc9]
2016-11-30 10:23:30.682250 | ------------------------------------------------------------------------------------------------------------------------------------------
2016-11-30 10:23:30.682275 |
2016-11-30 10:23:30.682306 | Captured traceback-1:
2016-11-30 10:23:30.682348 | ~~~~~~~~~~~~~~~~~~~~~
2016-11-30 10:23:30.682397 | Traceback (most recent call last):
2016-11-30 10:23:30.682467 | File "tempest/api/compute/servers/test_server_actions.py", line 58, in tearDown
2016-11-30 10:23:30.682522 | self.server_check_teardown()
2016-11-30 10:23:30.682580 | File "tempest/api/compute/base.py", line 166, in server_check_teardown
2016-11-30 10:23:30.682615 | cls.server_id, 'ACTIVE')
2016-11-30 10:23:30.682697 | File "tempest/common/waiters.py", line 95, in wait_for_server_status
2016-11-30 10:23:30.682751 | raise lib_exc.TimeoutException(message)
2016-11-30 10:23:30.682802 | tempest.lib.exceptions.TimeoutException: Request timed out
2016-11-30 10:23:30.682941 | Details: (ServerActionsTestJSON:tearDown) Server 409d71a2-e0b3-4bdf-b0f9-8e23fee18550 failed to reach ACTIVE status and task state "None" within the required time (196 s). Current status: SHELVED_OFFLOADED. Current task state: spawning.
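For context, the tempest waiter that raises this timeout is essentially a status polling loop. A minimal sketch of the idea (the client call and field names below are illustrative assumptions, not the exact tempest API):

import time

def wait_for_server_status(client, server_id, status, timeout=196, interval=1):
    # Illustrative only: show_server() and the field names are assumptions,
    # not the real tempest client API.
    start = time.time()
    while time.time() - start < timeout:
        server = client.show_server(server_id)['server']
        if (server['status'] == status and
                server.get('OS-EXT-STS:task_state') is None):
            return server
        time.sleep(interval)
    raise Exception('Server %s failed to reach %s status and task state '
                    '"None" within %s seconds' % (server_id, status, timeout))

The failure above is this loop giving up after 196 seconds because the server never got back to ACTIVE with a clear task_state.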
Tracing the lifecycle of the instance through the n-cpu logs, there is a race:
1. shelve
http://logs.openstack.org/08/404508/3/check/gate-tempest-dsvm-full-devstack-plugin-ceph-ubuntu-xenial/f2ef8c2/logs/screen-n-cpu.txt.gz#_2016-11-30_09_51_28_926
2016-11-30 09:51:28.926 11001 INFO nova.compute.manager [req-0d4c19ae-ca59-48e5-a9b7-aa19a1cbec4e tempest-ServerActionsTestJSON-107143809 tempest-ServerActionsTestJSON-107143809] [instance: 409d71a2-e0b3-4bdf-b0f9-8e23fee18550] Shelving
2. stopped
http://logs.openstack.org/08/404508/3/check/gate-tempest-dsvm-full-devstack-plugin-ceph-ubuntu-xenial/f2ef8c2/logs/screen-n-cpu.txt.gz#_2016-11-30_09_51_43_938
2016-11-30 09:51:43.938 11001 INFO nova.virt.libvirt.driver [req-0d4c19ae-ca59-48e5-a9b7-aa19a1cbec4e tempest-ServerActionsTestJSON-107143809 tempest-ServerActionsTestJSON-107143809] [instance: 409d71a2-e0b3-4bdf-b0f9-8e23fee18550] Instance shutdown successfully after 13 seconds.
3. unshelve
http://logs.openstack.org/08/404508/3/check/gate-tempest-dsvm-full-devstack-plugin-ceph-ubuntu-xenial/f2ef8c2/logs/screen-n-cpu.txt.gz#_2016-11-30_09_51_53_803
2016-11-30 09:51:53.803 11001 INFO nova.compute.manager [req-676d94d0-c8ec-495b-be9b-eb9c0ee29a3d tempest-ServerActionsTestJSON-107143809 tempest-ServerActionsTestJSON-107143809] [instance: 409d71a2-e0b3-4bdf-b0f9-8e23fee18550] Unshelving
4. stop
http://logs.openstack.org/08/404508/3/check/gate-tempest-dsvm-full-devstack-plugin-ceph-ubuntu-xenial/f2ef8c2/logs/screen-n-cpu.txt.gz#_2016-11-30_09_51_58_921
2016-11-30 09:51:58.921 11001 DEBUG nova.virt.driver [-] Emitting event <LifecycleEvent: 1480499503.92, 409d71a2-e0b3-4bdf-b0f9-8e23fee18550 => Stopped> emit_event /opt/stack/new/nova/nova/virt/driver.py:1440
2016-11-30 09:51:58.921 11001 INFO nova.compute.manager [-] [instance: 409d71a2-e0b3-4bdf-b0f9-8e23fee18550] VM Stopped (Lifecycle Event)
^ This is 15 seconds after the instance was stopped during shelve, which matches the delay the libvirt driver applies before emitting STOPPED lifecycle events:
https://github.com/openstack/nova/blob/14.0.0/nova/virt/libvirt/host.py#L104
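The libvirt driver holds STOPPED lifecycle events back for that long so that a transient stop (e.g. during a reboot) doesn't get reported to the compute manager. Roughly, the pattern looks like the sketch below (illustrative names, and threading.Timer instead of nova's greenthreads; the real code is in nova/virt/libvirt/host.py at the link above):

import threading

LIFECYCLE_DELAY = 15  # seconds a STOPPED event is held back

class DelayedEventEmitter(object):
    # Sketch of the delayed-STOPPED-event pattern; not the real nova code.

    def __init__(self, emit):
        self._emit = emit          # callback into the compute manager
        self._pending = {}         # instance uuid -> pending STOPPED timer

    def handle_event(self, uuid, transition):
        # A newer event for the same instance (e.g. STARTED during a reboot)
        # cancels a pending STOPPED event.
        timer = self._pending.pop(uuid, None)
        if timer is not None:
            timer.cancel()

        if transition == 'STOPPED':
            # Hold the STOPPED event back; it may turn out to be transient.
            timer = threading.Timer(LIFECYCLE_DELAY, self._emit_delayed,
                                    args=(uuid, transition))
            self._pending[uuid] = timer
            timer.start()
        else:
            self._emit(uuid, transition)

    def _emit_delayed(self, uuid, transition):
        self._pending.pop(uuid, None)
        self._emit(uuid, transition)

In the trace above that is exactly the gap: the guest was shut down at 09:51:43 during the shelve, and the STOPPED event only reaches the compute manager at 09:51:58, after the unshelve in #3 has already started.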
This is probably only a problem in the ceph job because it uses the rbd shallow clone feature, which makes the snapshot taken during shelve-offload super fast. It is so fast that by the time the initial lifecycle event from the shelve/stop operation in #2 is processed, we've already unshelved the instance and it's active with no task_state set, so the compute manager doesn't think it's going through any task and stops it because the libvirt driver says it should be stopped.
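Put differently, when the delayed STOPPED event is finally handled, the compute manager compares what the hypervisor reports against what the database says and "corrects" the mismatch. A simplified sketch of that decision (the real logic is handle_lifecycle_event() / _sync_instance_power_state() in nova/compute/manager.py; the helper and API names below are only illustrative):

ACTIVE = 'active'
SHUTDOWN = 'shutdown'

def sync_instance_power_state(instance, hypervisor_power_state, compute_api):
    # Illustrative sketch, not the real nova code.
    if instance['task_state'] is not None:
        # An operation is in flight; leave the instance alone.
        return

    if instance['vm_state'] == ACTIVE and hypervisor_power_state == SHUTDOWN:
        # Nova believes the instance should be running but the hypervisor
        # says it is stopped, so the mismatch is "fixed" by stopping it via
        # the API. In this race the delayed STOPPED event from the shelve
        # lands after the unshelve has finished and cleared task_state, so
        # the freshly unshelved instance is powered off.
        compute_api.stop(instance)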
** Affects: nova
Importance: Low
Status: Triaged
** Tags: ceph gate race shelve
** Summary changed:
- unshelved offloaded instance is unexpectedly stopped
+ unshelved offloaded instance is unexpectedly stopped during ceph job