
yahoo-eng-team team mailing list archive

[Bug 1296967] [NEW] instances stuck with task_state of REBOOTING after controller switchover


Public bug reported:


While testing Havana, we ran into a scenario that left two instances stuck with a task_state of REBOOTING following a reboot of the controller:

1) We reboot the controller.
2) Right after it comes back up something calls compute.api.API.reboot() on an instance.
3) That sets instance.task_state = task_states.REBOOTING and then calls instance.save() to update the database.
4) Then it calls self.compute_rpcapi.reboot_instance() which does an rpc cast.
5) That message gets dropped on the floor due to communication issues between the controller and the compute.
6) Now we're stuck with a task_state of REBOOTING. 
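The steps above can be sketched as a small simulation. This is illustrative only: the class and function names below are hypothetical stand-ins for the nova code paths, and the point is just that a cast has no reply, so the caller cannot tell that step 5 happened.

```python
# Sketch of steps 1-6: the API layer persists task_state=REBOOTING
# *before* the RPC cast, so a silently dropped cast leaves the DB
# record stuck.  All names here are illustrative, not actual nova code.

REBOOTING = "rebooting"

class FakeDB:
    def __init__(self):
        self.rows = {}

    def save(self, uuid, **fields):
        self.rows.setdefault(uuid, {}).update(fields)

class FlakyRPC:
    """A cast() that can silently drop messages, as in step 5."""
    def __init__(self, drop=False):
        self.drop = drop
        self.delivered = []

    def cast(self, method, uuid):
        # A cast is fire-and-forget: there is no reply, so a dropped
        # message is invisible to the caller.
        if not self.drop:
            self.delivered.append((method, uuid))

def api_reboot(db, rpc, uuid):
    # Step 3: persist the task_state first ...
    db.save(uuid, task_state=REBOOTING)
    # Step 4: ... then fire-and-forget the cast.
    rpc.cast("reboot_instance", uuid)

db, rpc = FakeDB(), FlakyRPC(drop=True)
api_reboot(db, rpc, "c967e4ef")
# The DB says REBOOTING, but the message never arrived and nothing
# will ever clear the state.
print(db.rows["c967e4ef"]["task_state"], rpc.delivered)
```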

Currently when doing a reboot we set the REBOOTING task_state in the
database in compute-api and then send an RPC cast. That seems risky:
if the message gets lost or the call fails for any reason, the
instance can be stuck in the REBOOTING state forever.  I think it
might make sense to have the power state audit clear the REBOOTING
state if appropriate, but others with more experience should make
that call.


It didn't happen to us, but I think we could get into this state another way:

1) nova-compute was running reboot_instance()
2) we reboot the controller
3) reboot_instance() times out trying to update the instance with the new power state and a task_state of None.
4) Later on in _sync_power_states() we would update the power_state, but nothing would update the task_state.  


The timeline that I have looks like this.  We had some buggy code that
sent reboot requests for all the instances when the controller came
up.  The first two requests are in the controller logs below, and
these are the ones that failed.

controller: (running everything but nova-compute)
nova-api log:

/var/log/nova/nova-api.log.2.gz:2014-03-20 11:33:23.712 8187 INFO nova.compute.api [req-a84e25bd-85b4-478c-a845-7e8034df3ab2 8162b2e247704e218ed13094889a5244 48c9875f2edb4a36bbe598effbe835cf] [instance: c967e4ef-8cf4-4fac-8aab-c5ea5c3c3bb4] API::reboot reboot_type=SOFT
/var/log/nova/nova-api.log.2.gz:2014-03-20 11:33:23.898 8187 INFO nova.osapi_compute.wsgi.server [req-a84e25bd-85b4-478c-a845-7e8034df3ab2 8162b2e247704e218ed13094889a5244 48c9875f2edb4a36bbe598effbe835cf] 192.168.204.195 "POST /v2/48c9875f2edb4a36bbe598effbe835cf/servers/c967e4ef-8cf4-4fac-8aab-c5ea5c3c3bb4/action HTTP/1.1" status: 202 len: 185 time: 0.2299521
/var/log/nova/nova-api.log.2.gz:2014-03-20 11:33:25.152 8128 INFO nova.compute.api [req-429feb82-a50d-4bf0-a9a4-bca036e55356 8162b2e247704e218ed13094889a5244 48c9875f2edb4a36bbe598effbe835cf] [instance: 17169e6d-6693-4e95-9900-ba250dad5a39] API::reboot reboot_type=SOFT
/var/log/nova/nova-api.log.2.gz:2014-03-20 11:33:25.273 8128 INFO nova.osapi_compute.wsgi.server [req-429feb82-a50d-4bf0-a9a4-bca036e55356 8162b2e247704e218ed13094889a5244 48c9875f2edb4a36bbe598effbe835cf] 192.168.204.195 "POST /v2/48c9875f2edb4a36bbe598effbe835cf/servers/17169e6d-6693-4e95-9900-ba250dad5a39/action HTTP/1.1" status: 202 len: 185 time: 0.1583798

After this there are reboot requests for the other instances, and
those succeeded.


Interestingly, we later see this:
/var/log/nova/nova-api.log.2.gz:2014-03-20 11:33:45.476 8134 INFO nova.compute.api [req-2e0b67a0-0cd9-471f-b115-e4f07436f1c4 8162b2e247704e218ed13094889a5244 48c9875f2edb4a36bbe598effbe835cf] [instance: c967e4ef-8cf4-4fac-8aab-c5ea5c3c3bb4] API::reboot reboot_type=SOFT
/var/log/nova/nova-api.log.2.gz:2014-03-20 11:33:45.477 8134 INFO nova.osapi_compute.wsgi.server [req-2e0b67a0-0cd9-471f-b115-e4f07436f1c4 8162b2e247704e218ed13094889a5244 48c9875f2edb4a36bbe598effbe835cf] 192.168.204.195 "POST /v2/48c9875f2edb4a36bbe598effbe835cf/servers/c967e4ef-8cf4-4fac-8aab-c5ea5c3c3bb4/action HTTP/1.1" status: 409 len: 303 time: 0.1177511
/var/log/nova/nova-api.log.2.gz:2014-03-20 11:33:48.831 8143 INFO nova.compute.api [req-afeb680b-91fd-4446-b4d8-fd264541369d 8162b2e247704e218ed13094889a5244 48c9875f2edb4a36bbe598effbe835cf] [instance: 17169e6d-6693-4e95-9900-ba250dad5a39] API::reboot reboot_type=SOFT
/var/log/nova/nova-api.log.2.gz:2014-03-20 11:33:48.832 8143 INFO nova.osapi_compute.wsgi.server [req-afeb680b-91fd-4446-b4d8-fd264541369d 8162b2e247704e218ed13094889a5244 48c9875f2edb4a36bbe598effbe835cf] 192.168.204.195 "POST /v2/48c9875f2edb4a36bbe598effbe835cf/servers/17169e6d-6693-4e95-9900-ba250dad5a39/action HTTP/1.1" status: 409 len: 303 time: 0.0366399


Presumably the 409 responses are because nova thinks that these instances are currently rebooting.
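That matches nova's behaviour of guarding state-changing actions with a task_state check and mapping the resulting exception to HTTP 409 Conflict. The sketch below is a simplified, hypothetical version of that guard, not the real decorator:

```python
# Simplified stand-in for nova's instance-state guard: a reboot is
# only allowed when no other task is in flight (task_state is None),
# and a violation surfaces to the API layer as a 409.

class InstanceInvalidState(Exception):
    """Hypothetical exception carrying the HTTP status the API returns."""
    http_status = 409

def check_instance_state(instance, allowed_task_states=(None,)):
    if instance["task_state"] not in allowed_task_states:
        raise InstanceInvalidState(
            "task_state %r does not allow reboot" % instance["task_state"])

stuck = {"task_state": "rebooting"}
try:
    check_instance_state(stuck)
except InstanceInvalidState as e:
    print(e.http_status)  # the 409 seen in the nova-api log above
```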


compute:
2014-03-20 11:33:14.213 12229 INFO nova.openstack.common.rpc.common [-] Reconnecting to AMQP server on 192.168.204.2:5672
2014-03-20 11:33:14.225 12229 INFO nova.openstack.common.rpc.common [-] Reconnecting to AMQP server on 192.168.204.2:5672
2014-03-20 11:33:14.244 12229 INFO nova.openstack.common.rpc.common [-] Connected to AMQP server on 192.168.204.2:5672
2014-03-20 11:33:14.246 12229 INFO nova.openstack.common.rpc.common [-] Connected to AMQP server on 192.168.204.2:5672
2014-03-20 11:33:26.234 12229 INFO nova.openstack.common.rpc.common [-] Reconnecting to AMQP server on 192.168.204.2:5672
2014-03-20 11:33:26.277 12229 INFO nova.openstack.common.rpc.common [-] Connected to AMQP server on 192.168.204.2:5672
2014-03-20 11:33:29.240 12229 INFO nova.openstack.common.rpc.common [-] Reconnecting to AMQP server on 192.168.204.2:5672
2014-03-20 11:33:29.276 12229 INFO nova.openstack.common.rpc.common [-] Connected to AMQP server on 192.168.204.2:5672
2014-03-20 11:33:35.871 12229 INFO nova.compute.manager [req-a10b008b-c9d0-4f31-8acb-e42fb43b64fe 8162b2e247704e218ed13094889a5244 48c9875f2edb4a36bbe598effbe835cf] [instance: 74a07f0b-0016-42c4-b625-59ef48254e7e] MANAGER::reboot_instance reboot_type=SOFT
2014-03-20 11:33:35.871 12229 AUDIT nova.compute.manager [req-a10b008b-c9d0-4f31-8acb-e42fb43b64fe 8162b2e247704e218ed13094889a5244 48c9875f2edb4a36bbe598effbe835cf] [instance: 74a07f0b-0016-42c4-b625-59ef48254e7e] Rebooting instance
2014-03-20 11:33:36.484 12229 INFO nova.virt.libvirt.driver [req-a10b008b-c9d0-4f31-8acb-e42fb43b64fe 8162b2e247704e218ed13094889a5244 48c9875f2edb4a36bbe598effbe835cf] [instance: 74a07f0b-0016-42c4-b625-59ef48254e7e] LIBVIRT::reboot reboot_type=SOFT
2014-03-20 11:33:38.367 12229 INFO nova.compute.manager [req-dc31d13e-e331-4ed9-a36f-1a58d363b459 8162b2e247704e218ed13094889a5244 48c9875f2edb4a36bbe598effbe835cf] [instance: 62e51539-e7a1-4560-9687-6ab07b953b9f] MANAGER::reboot_instance reboot_type=SOFT
2014-03-20 11:33:38.368 12229 AUDIT nova.compute.manager [req-dc31d13e-e331-4ed9-a36f-1a58d363b459 8162b2e247704e218ed13094889a5244 48c9875f2edb4a36bbe598effbe835cf] [instance: 62e51539-e7a1-4560-9687-6ab07b953b9f] Rebooting instance
2014-03-20 11:33:38.982 12229 INFO nova.virt.libvirt.driver [req-dc31d13e-e331-4ed9-a36f-1a58d363b459 8162b2e247704e218ed13094889a5244 48c9875f2edb4a36bbe598effbe835cf] [instance: 62e51539-e7a1-4560-9687-6ab07b953b9f] LIBVIRT::reboot reboot_type=SOFT
<etc>


As you can see, the two requests that got "lost" were sent during the period after the first batch of AMQP connections, but before the second batch.  I didn't log controller-side timestamps for the successful reboot requests but they were after the two failed ones.

** Affects: nova
     Importance: Undecided
         Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1296967

Title:
  instances stuck with task_state of REBOOTING after controller
  switchover

Status in OpenStack Compute (Nova):
  New


To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1296967/+subscriptions

