
yahoo-eng-team team mailing list archive

[Bug 1296967] Re: instances stuck with task_state of REBOOTING after controller switchover

 

@Chris Friesen:
This bug report was opened against Havana and was confirmed by
Melanie (comment #3) during the Juno cycle. You also state that
the issue only comes up under certain race conditions which are
hard to reproduce. Given the age and conditions of this bug
report, there is almost no chance of making progress here.
I'm going to close it as "Won't Fix". If the issue arises
again, just reopen the report by setting it to "New".

** Changed in: nova
       Status: Incomplete => Won't Fix

** Changed in: nova
   Importance: High => Undecided

** Changed in: nova
    Milestone: ongoing => None

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1296967

Title:
  instances stuck with task_state of REBOOTING after controller
  switchover

Status in OpenStack Compute (nova):
  Won't Fix

Bug description:
  We were doing some testing of Havana and have run into a scenario that
  ended up with two instances stuck with a task_state of REBOOTING
  following a reboot of the controller:

  1) We reboot the controller.
  2) Right after it comes back up something calls compute.api.API.reboot() on an instance.
  3) That sets instance.task_state = task_states.REBOOTING and then calls instance.save() to update the database.
  4) Then it calls self.compute_rpcapi.reboot_instance() which does an rpc cast.
  5) That message gets dropped on the floor due to communication issues between the controller and the compute.
  6) Now we're stuck with a task_state of REBOOTING.
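
  The sequence above can be simulated in a few lines. This is an
  illustrative sketch only; the class and function names below are
  invented for the example and are not the actual nova code paths:

  ```python
  # Simulation of steps 3-6: the task_state is persisted before a
  # fire-and-forget cast, so a dropped message leaves it set forever.

  REBOOTING = "rebooting"

  class FakeInstance:
      """Stands in for the instance row in the database."""
      def __init__(self):
          self.task_state = None

  class LossyBus:
      """A fire-and-forget RPC bus whose casts can be silently dropped."""
      def __init__(self, drop=False):
          self.drop = drop
          self.delivered = []

      def cast(self, msg):
          if not self.drop:
              self.delivered.append(msg)

  def api_reboot(instance, bus):
      # Step 3: the REBOOTING task_state is saved first...
      instance.task_state = REBOOTING
      # Step 4: ...then the cast is sent; nothing confirms delivery.
      bus.cast("reboot_instance")

  def compute_reboot_instance(instance):
      # On the happy path the compute manager eventually clears task_state.
      instance.task_state = None

  # Step 5: the cast is dropped during the controller switchover.
  instance, bus = FakeInstance(), LossyBus(drop=True)
  api_reboot(instance, bus)
  for _msg in bus.delivered:           # never runs: the message was lost
      compute_reboot_instance(instance)
  assert instance.task_state == REBOOTING   # step 6: stuck
  ```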

  Currently when doing a reboot we set the REBOOTING task_state in the
  database in compute-api and then send an RPC cast. That seems awfully
  risky: if that message gets lost or the call fails for any reason, we
  could end up stuck in the REBOOTING state forever.  I think it might
  make sense to have the power state audit clear the REBOOTING state if
  appropriate, but others with more experience should make that call.
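
  A minimal sketch of what such an audit check could look like
  (entirely hypothetical; `audit_clear_stale_reboot` and its arguments
  are invented names, not real nova code):

  ```python
  RUNNING = "running"
  REBOOTING = "rebooting"

  def audit_clear_stale_reboot(instance, hypervisor_power_state,
                               reboot_in_flight):
      """Clear a stale REBOOTING task_state when the guest is plainly
      running and no reboot RPC is actually being processed."""
      if (instance["task_state"] == REBOOTING
              and hypervisor_power_state == RUNNING
              and not reboot_in_flight):
          instance["task_state"] = None
      return instance

  # A stuck instance gets unwedged by the audit...
  stuck = audit_clear_stale_reboot(
      {"task_state": REBOOTING}, RUNNING, reboot_in_flight=False)
  assert stuck["task_state"] is None

  # ...but a genuinely in-progress reboot is left alone.
  busy = audit_clear_stale_reboot(
      {"task_state": REBOOTING}, RUNNING, reboot_in_flight=True)
  assert busy["task_state"] == REBOOTING
  ```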

  It didn't happen to us, but I think we could get into this state
  another way:

  1) nova-compute was running reboot_instance()
  2) we reboot the controller
  3) reboot_instance() times out trying to update the instance with the new power state and a task_state of None.
  4) Later on in _sync_power_states() we would update the power_state, but nothing would update the task_state.
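
  Step 4 is the crux: the periodic sync reconciles only the power
  state, roughly like this (simplified sketch, not the real
  `_sync_power_states()` body):

  ```python
  REBOOTING = "rebooting"

  def sync_power_state(instance, hypervisor_power_state):
      """Reconcile the recorded power_state with the hypervisor's view.
      Note that task_state is deliberately left untouched."""
      instance["power_state"] = hypervisor_power_state
      return instance

  inst = {"power_state": "shutdown", "task_state": REBOOTING}
  sync_power_state(inst, "running")
  assert inst["power_state"] == "running"
  assert inst["task_state"] == REBOOTING   # still stuck after the sync
  ```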

  The timeline that I have looks like this.  We had some buggy code that
  sent all the instances for a reboot when the controller came up.  The
  first two are in the controller logs below, and these are the ones
  that failed.

  controller: (running everything but nova-compute)
  nova-api log:

  /var/log/nova/nova-api.log.2.gz:2014-03-20 11:33:23.712 8187 INFO nova.compute.api [req-a84e25bd-85b4-478c-a845-7e8034df3ab2 8162b2e247704e218ed13094889a5244 48c9875f2edb4a36bbe598effbe835cf] [instance: c967e4ef-8cf4-4fac-8aab-c5ea5c3c3bb4] API::reboot reboot_type=SOFT
  /var/log/nova/nova-api.log.2.gz:2014-03-20 11:33:23.898 8187 INFO nova.osapi_compute.wsgi.server [req-a84e25bd-85b4-478c-a845-7e8034df3ab2 8162b2e247704e218ed13094889a5244 48c9875f2edb4a36bbe598effbe835cf] 192.168.204.195 "POST /v2/48c9875f2edb4a36bbe598effbe835cf/servers/c967e4ef-8cf4-4fac-8aab-c5ea5c3c3bb4/action HTTP/1.1" status: 202 len: 185 time: 0.2299521
  /var/log/nova/nova-api.log.2.gz:2014-03-20 11:33:25.152 8128 INFO nova.compute.api [req-429feb82-a50d-4bf0-a9a4-bca036e55356 8162b2e247704e218ed13094889a5244 48c9875f2edb4a36bbe598effbe835cf] [instance: 17169e6d-6693-4e95-9900-ba250dad5a39] API::reboot reboot_type=SOFT
  /var/log/nova/nova-api.log.2.gz:2014-03-20 11:33:25.273 8128 INFO nova.osapi_compute.wsgi.server [req-429feb82-a50d-4bf0-a9a4-bca036e55356 8162b2e247704e218ed13094889a5244 48c9875f2edb4a36bbe598effbe835cf] 192.168.204.195 "POST /v2/48c9875f2edb4a36bbe598effbe835cf/servers/17169e6d-6693-4e95-9900-ba250dad5a39/action HTTP/1.1" status: 202 len: 185 time: 0.1583798

  After this there are other reboot requests for the other instances,
  and those ones passed.

  Interestingly, we later see this:
  /var/log/nova/nova-api.log.2.gz:2014-03-20 11:33:45.476 8134 INFO nova.compute.api [req-2e0b67a0-0cd9-471f-b115-e4f07436f1c4 8162b2e247704e218ed13094889a5244 48c9875f2edb4a36bbe598effbe835cf] [instance: c967e4ef-8cf4-4fac-8aab-c5ea5c3c3bb4] API::reboot reboot_type=SOFT
  /var/log/nova/nova-api.log.2.gz:2014-03-20 11:33:45.477 8134 INFO nova.osapi_compute.wsgi.server [req-2e0b67a0-0cd9-471f-b115-e4f07436f1c4 8162b2e247704e218ed13094889a5244 48c9875f2edb4a36bbe598effbe835cf] 192.168.204.195 "POST /v2/48c9875f2edb4a36bbe598effbe835cf/servers/c967e4ef-8cf4-4fac-8aab-c5ea5c3c3bb4/action HTTP/1.1" status: 409 len: 303 time: 0.1177511
  /var/log/nova/nova-api.log.2.gz:2014-03-20 11:33:48.831 8143 INFO nova.compute.api [req-afeb680b-91fd-4446-b4d8-fd264541369d 8162b2e247704e218ed13094889a5244 48c9875f2edb4a36bbe598effbe835cf] [instance: 17169e6d-6693-4e95-9900-ba250dad5a39] API::reboot reboot_type=SOFT
  /var/log/nova/nova-api.log.2.gz:2014-03-20 11:33:48.832 8143 INFO nova.osapi_compute.wsgi.server [req-afeb680b-91fd-4446-b4d8-fd264541369d 8162b2e247704e218ed13094889a5244 48c9875f2edb4a36bbe598effbe835cf] 192.168.204.195 "POST /v2/48c9875f2edb4a36bbe598effbe835cf/servers/17169e6d-6693-4e95-9900-ba250dad5a39/action HTTP/1.1" status: 409 len: 303 time: 0.0366399

  Presumably the 409 responses are because nova thinks that these
  instances are currently rebooting.
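
  That matches how the API guards state-changing actions: a request
  against an instance that already has a task_state set is refused with
  409 Conflict. A rough sketch of that guard (illustrative names, not
  the actual nova check):

  ```python
  class InstanceInvalidState(Exception):
      """Maps to an HTTP 409 Conflict at the API layer."""

  def guard_reboot(task_state):
      # A second reboot while one appears to be in progress is refused.
      if task_state is not None:
          raise InstanceInvalidState(
              "instance busy (task_state=%s)" % task_state)

  try:
      guard_reboot("rebooting")
      status = 202
  except InstanceInvalidState:
      status = 409
  assert status == 409    # the stuck REBOOTING state blocks the retry
  ```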

  compute:
  2014-03-20 11:33:14.213 12229 INFO nova.openstack.common.rpc.common [-] Reconnecting to AMQP server on 192.168.204.2:5672
  2014-03-20 11:33:14.225 12229 INFO nova.openstack.common.rpc.common [-] Reconnecting to AMQP server on 192.168.204.2:5672
  2014-03-20 11:33:14.244 12229 INFO nova.openstack.common.rpc.common [-] Connected to AMQP server on 192.168.204.2:5672
  2014-03-20 11:33:14.246 12229 INFO nova.openstack.common.rpc.common [-] Connected to AMQP server on 192.168.204.2:5672
  2014-03-20 11:33:26.234 12229 INFO nova.openstack.common.rpc.common [-] Reconnecting to AMQP server on 192.168.204.2:5672
  2014-03-20 11:33:26.277 12229 INFO nova.openstack.common.rpc.common [-] Connected to AMQP server on 192.168.204.2:5672
  2014-03-20 11:33:29.240 12229 INFO nova.openstack.common.rpc.common [-] Reconnecting to AMQP server on 192.168.204.2:5672
  2014-03-20 11:33:29.276 12229 INFO nova.openstack.common.rpc.common [-] Connected to AMQP server on 192.168.204.2:5672
  2014-03-20 11:33:35.871 12229 INFO nova.compute.manager [req-a10b008b-c9d0-4f31-8acb-e42fb43b64fe 8162b2e247704e218ed13094889a5244 48c9875f2edb4a36bbe598effbe835cf] [instance: 74a07f0b-0016-42c4-b625-59ef48254e7e] MANAGER::reboot_instance reboot_type=SOFT
  2014-03-20 11:33:35.871 12229 AUDIT nova.compute.manager [req-a10b008b-c9d0-4f31-8acb-e42fb43b64fe 8162b2e247704e218ed13094889a5244 48c9875f2edb4a36bbe598effbe835cf] [instance: 74a07f0b-0016-42c4-b625-59ef48254e7e] Rebooting instance
  2014-03-20 11:33:36.484 12229 INFO nova.virt.libvirt.driver [req-a10b008b-c9d0-4f31-8acb-e42fb43b64fe 8162b2e247704e218ed13094889a5244 48c9875f2edb4a36bbe598effbe835cf] [instance: 74a07f0b-0016-42c4-b625-59ef48254e7e] LIBVIRT::reboot reboot_type=SOFT
  2014-03-20 11:33:38.367 12229 INFO nova.compute.manager [req-dc31d13e-e331-4ed9-a36f-1a58d363b459 8162b2e247704e218ed13094889a5244 48c9875f2edb4a36bbe598effbe835cf] [instance: 62e51539-e7a1-4560-9687-6ab07b953b9f] MANAGER::reboot_instance reboot_type=SOFT
  2014-03-20 11:33:38.368 12229 AUDIT nova.compute.manager [req-dc31d13e-e331-4ed9-a36f-1a58d363b459 8162b2e247704e218ed13094889a5244 48c9875f2edb4a36bbe598effbe835cf] [instance: 62e51539-e7a1-4560-9687-6ab07b953b9f] Rebooting instance
  2014-03-20 11:33:38.982 12229 INFO nova.virt.libvirt.driver [req-dc31d13e-e331-4ed9-a36f-1a58d363b459 8162b2e247704e218ed13094889a5244 48c9875f2edb4a36bbe598effbe835cf] [instance: 62e51539-e7a1-4560-9687-6ab07b953b9f] LIBVIRT::reboot reboot_type=SOFT
  <etc>

  As you can see, the two requests that got "lost" were sent during the
  period after the first batch of AMQP connections, but before the
  second batch.  I didn't log controller-side timestamps for the
  successful reboot requests but they were after the two failed ones.

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1296967/+subscriptions

