[Bug 1646212] [NEW] unshelved offloaded instance is unexpectedly stopped during ceph job
Public bug reported:
Noticed this during a ceph job tempest run:
http://logs.openstack.org/08/404508/3/check/gate-tempest-dsvm-full-devstack-plugin-ceph-ubuntu-xenial/f2ef8c2/console.html#_2016-11-30_10_23_30_682154
2016-11-30 10:23:30.682154 | tempest.api.compute.servers.test_server_actions.ServerActionsTestJSON.test_shelve_unshelve_server[id-77eba8e0-036e-4635-944b-f7a8f3b78dc9]
2016-11-30 10:23:30.682250 | ------------------------------------------------------------------------------------------------------------------------------------------
2016-11-30 10:23:30.682275 |
2016-11-30 10:23:30.682306 | Captured traceback-1:
2016-11-30 10:23:30.682348 | ~~~~~~~~~~~~~~~~~~~~~
2016-11-30 10:23:30.682397 | Traceback (most recent call last):
2016-11-30 10:23:30.682467 | File "tempest/api/compute/servers/test_server_actions.py", line 58, in tearDown
2016-11-30 10:23:30.682522 | self.server_check_teardown()
2016-11-30 10:23:30.682580 | File "tempest/api/compute/base.py", line 166, in server_check_teardown
2016-11-30 10:23:30.682615 | cls.server_id, 'ACTIVE')
2016-11-30 10:23:30.682697 | File "tempest/common/waiters.py", line 95, in wait_for_server_status
2016-11-30 10:23:30.682751 | raise lib_exc.TimeoutException(message)
2016-11-30 10:23:30.682802 | tempest.lib.exceptions.TimeoutException: Request timed out
2016-11-30 10:23:30.682941 | Details: (ServerActionsTestJSON:tearDown) Server 409d71a2-e0b3-4bdf-b0f9-8e23fee18550 failed to reach ACTIVE status and task state "None" within the required time (196 s). Current status: SHELVED_OFFLOADED. Current task state: spawning.
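For context, the tempest waiter that raises this timeout is essentially a status polling loop. A minimal sketch of the idea (the client call and field names below are illustrative assumptions, not the exact tempest API):

import time

def wait_for_server_status(client, server_id, status, timeout=196, interval=1):
    # Illustrative only: show_server() and the field names are assumptions,
    # not the real tempest client API.
    start = time.time()
    while time.time() - start < timeout:
        server = client.show_server(server_id)['server']
        if (server['status'] == status and
                server.get('OS-EXT-STS:task_state') is None):
            return server
        time.sleep(interval)
    raise Exception('Server %s failed to reach %s status and task state '
                    '"None" within %s seconds' % (server_id, status, timeout))

The failure above is this loop giving up after 196 seconds because the server never got back to ACTIVE with a clear task_state.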
Tracing the lifecycle of the instance through the n-cpu logs, there is a race:
1. shelve
http://logs.openstack.org/08/404508/3/check/gate-tempest-dsvm-full-devstack-plugin-ceph-ubuntu-xenial/f2ef8c2/logs/screen-n-cpu.txt.gz#_2016-11-30_09_51_28_926
2016-11-30 09:51:28.926 11001 INFO nova.compute.manager [req-0d4c19ae-ca59-48e5-a9b7-aa19a1cbec4e tempest-ServerActionsTestJSON-107143809 tempest-ServerActionsTestJSON-107143809] [instance: 409d71a2-e0b3-4bdf-b0f9-8e23fee18550] Shelving
2. stopped
http://logs.openstack.org/08/404508/3/check/gate-tempest-dsvm-full-devstack-plugin-ceph-ubuntu-xenial/f2ef8c2/logs/screen-n-cpu.txt.gz#_2016-11-30_09_51_43_938
2016-11-30 09:51:43.938 11001 INFO nova.virt.libvirt.driver [req-0d4c19ae-ca59-48e5-a9b7-aa19a1cbec4e tempest-ServerActionsTestJSON-107143809 tempest-ServerActionsTestJSON-107143809] [instance: 409d71a2-e0b3-4bdf-b0f9-8e23fee18550] Instance shutdown successfully after 13 seconds.
3. unshelve
http://logs.openstack.org/08/404508/3/check/gate-tempest-dsvm-full-devstack-plugin-ceph-ubuntu-xenial/f2ef8c2/logs/screen-n-cpu.txt.gz#_2016-11-30_09_51_53_803
2016-11-30 09:51:53.803 11001 INFO nova.compute.manager [req-676d94d0-c8ec-495b-be9b-eb9c0ee29a3d tempest-ServerActionsTestJSON-107143809 tempest-ServerActionsTestJSON-107143809] [instance: 409d71a2-e0b3-4bdf-b0f9-8e23fee18550] Unshelving
4. stop
http://logs.openstack.org/08/404508/3/check/gate-tempest-dsvm-full-devstack-plugin-ceph-ubuntu-xenial/f2ef8c2/logs/screen-n-cpu.txt.gz#_2016-11-30_09_51_58_921
2016-11-30 09:51:58.921 11001 DEBUG nova.virt.driver [-] Emitting event <LifecycleEvent: 1480499503.92, 409d71a2-e0b3-4bdf-b0f9-8e23fee18550 => Stopped> emit_event /opt/stack/new/nova/nova/virt/driver.py:1440
2016-11-30 09:51:58.921 11001 INFO nova.compute.manager [-] [instance: 409d71a2-e0b3-4bdf-b0f9-8e23fee18550] VM Stopped (Lifecycle Event)
^ This is 15 seconds after the instance was stopped during shelve, which matches the delay the libvirt driver applies before emitting STOPPED lifecycle events:
https://github.com/openstack/nova/blob/14.0.0/nova/virt/libvirt/host.py#L104
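The libvirt driver holds STOPPED lifecycle events back for that long so that a transient stop (e.g. during a reboot) doesn't get reported to the compute manager. Roughly, the pattern looks like the sketch below (illustrative names, and threading.Timer instead of nova's greenthreads; the real code is in nova/virt/libvirt/host.py at the link above):

import threading

LIFECYCLE_DELAY = 15  # seconds a STOPPED event is held back

class DelayedEventEmitter(object):
    # Sketch of the delayed-STOPPED-event pattern; not the real nova code.

    def __init__(self, emit):
        self._emit = emit          # callback into the compute manager
        self._pending = {}         # instance uuid -> pending STOPPED timer

    def handle_event(self, uuid, transition):
        # A newer event for the same instance (e.g. STARTED during a reboot)
        # cancels a pending STOPPED event.
        timer = self._pending.pop(uuid, None)
        if timer is not None:
            timer.cancel()

        if transition == 'STOPPED':
            # Hold the STOPPED event back; it may turn out to be transient.
            timer = threading.Timer(LIFECYCLE_DELAY, self._emit_delayed,
                                    args=(uuid, transition))
            self._pending[uuid] = timer
            timer.start()
        else:
            self._emit(uuid, transition)

    def _emit_delayed(self, uuid, transition):
        self._pending.pop(uuid, None)
        self._emit(uuid, transition)

In the trace above that is exactly the gap: the guest was shut down at 09:51:43 during the shelve, and the STOPPED event only reaches the compute manager at 09:51:58, after the unshelve in #3 has already started.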
This is probably only a problem in the ceph job because it uses the rbd shallow clone feature, which makes the snapshot taken during shelve-offload super fast. It is so fast that by the time the initial lifecycle event from the shelve/stop operation in #2 is processed, we've already unshelved the instance and it's active with no task_state set, so the compute manager doesn't think it's going through any task and stops it because the libvirt driver says it should be stopped.
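Put differently, when the delayed STOPPED event is finally handled, the compute manager compares what the hypervisor reports against what the database says and "corrects" the mismatch. A simplified sketch of that decision (the real logic is handle_lifecycle_event() / _sync_instance_power_state() in nova/compute/manager.py; the helper and API names below are only illustrative):

ACTIVE = 'active'
SHUTDOWN = 'shutdown'

def sync_instance_power_state(instance, hypervisor_power_state, compute_api):
    # Illustrative sketch, not the real nova code.
    if instance['task_state'] is not None:
        # An operation is in flight; leave the instance alone.
        return

    if instance['vm_state'] == ACTIVE and hypervisor_power_state == SHUTDOWN:
        # Nova believes the instance should be running but the hypervisor
        # says it is stopped, so the mismatch is "fixed" by stopping it via
        # the API. In this race the delayed STOPPED event from the shelve
        # lands after the unshelve has finished and cleared task_state, so
        # the freshly unshelved instance is powered off.
        compute_api.stop(instance)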
** Affects: nova
Importance: Low
Status: Triaged
** Tags: ceph gate race shelve
** Summary changed:
- unshelved offloaded instance is unexpectedly stopped
+ unshelved offloaded instance is unexpectedly stopped during ceph job