yahoo-eng-team team mailing list archive

[Bug 1755981] [NEW] powering off and on an instance can result in instance boot failure due to serial port handling race

To: yahoo-eng-team@xxxxxxxxxxxxxxxxxxx
From: Chris Friesen <chris.friesen@xxxxxxxxxxxxx>
Date: Thu, 15 Mar 2018 05:27:40 -0000
Reply-to: Bug 1755981 <1755981@xxxxxxxxxxxxxxxxxx>
Sender: bounces@xxxxxxxxxxxxx
Public bug reported:

The following is specific to the libvirt driver.

When we call power_off() it calls _destroy(), which in turn calls
self._get_serial_ports_from_guest() and loops over all the serial ports
calling serial_console.release_port() on each.  This removes the host
TCP port from ALLOCATED_PORTS (which is the set of allocated ports on
the host).

Then when we call power_on(), it again calls _destroy(), which again
calls self._get_serial_ports_from_guest().  This will return the same
set of ports that it did before.  This is a problem, because those ports
could have been allocated to another instance in the meantime!

So in the case where one or more of those ports has since been allocated
to another instance, we still call serial_console.release_port() on them
and remove them from ALLOCATED_PORTS, even though the other instance is
still using them.

Then as part of power_on() we create new XML with new serial ports,
which could select the ports that we just removed from ALLOCATED_PORTS
(which are actually in use by another instance).  When qemu tries to
bind to such a port the bind fails, causing the instance to error out
and stay in the SHUTOFF state.
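
To make the sequence concrete, here is a toy model of the race.  It is
only a sketch and not nova's code: ALLOCATED_PORTS and release_port()
are named after the real ones, but their bodies are simplified, and
ToyInstance, guest_ports, acquire_port() and PORT_RANGE are
hypothetical stand-ins for the driver, the ports listed in the guest
XML, and port selection.

```python
# Toy model of the race described above; NOT nova's implementation.
ALLOCATED_PORTS = set()            # host TCP ports believed to be in use
PORT_RANGE = range(10000, 10003)   # hypothetical serial console port range


def acquire_port():
    """Hand out the lowest port not recorded in ALLOCATED_PORTS."""
    for port in PORT_RANGE:
        if port not in ALLOCATED_PORTS:
            ALLOCATED_PORTS.add(port)
            return port
    raise RuntimeError("no free serial ports")


def release_port(port):
    """Forget a port, with no check of which instance owns it."""
    ALLOCATED_PORTS.discard(port)


class ToyInstance:
    def __init__(self, name):
        self.name = name
        self.guest_ports = []      # ports listed in the guest definition

    def _destroy(self):
        # Releases whatever the (possibly stale) guest definition lists.
        for port in self.guest_ports:
            release_port(port)

    def power_off(self):
        self._destroy()            # guest_ports is left untouched

    def power_on(self):
        self._destroy()            # releases the stale ports a second time
        self.guest_ports = [acquire_port()]


a = ToyInstance("a")
b = ToyInstance("b")
a.power_on()     # a is given a port
a.power_off()    # the port is released, but a.guest_ports still lists it
b.power_on()     # b is given the port that a just released
a.power_on()     # a releases b's port, then is handed the same port
collision = a.guest_ports == b.guest_ports
```

In this model both instances end up recorded against the same port.
On a real host that is the point where qemu would fail to bind.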

One possible solution would be to call guest.detach_device() on the
"serial" and "console" devices from the guest in the power_off()
routine.  That way when we call _destroy() in the power_on() routine
there wouldn't be any devices returned by
_get_serial_ports_from_guest().  This is a bit messy though, so if
anyone has any better ideas I'd like to hear about them.
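
The effect of that idea can be sketched in a toy model.  This is not
nova's code and it does not call guest.detach_device(): clearing the
hypothetical guest_ports list in power_off() merely stands in for
detaching the "serial" and "console" devices.  ToyInstance,
acquire_port() and PORT_RANGE are likewise assumptions made for
illustration.

```python
# Toy model of the proposed fix; NOT nova's implementation.
ALLOCATED_PORTS = set()            # host TCP ports believed to be in use
PORT_RANGE = range(10000, 10003)   # hypothetical serial console port range


def acquire_port():
    """Hand out the lowest port not recorded in ALLOCATED_PORTS."""
    for port in PORT_RANGE:
        if port not in ALLOCATED_PORTS:
            ALLOCATED_PORTS.add(port)
            return port
    raise RuntimeError("no free serial ports")


def release_port(port):
    """Forget a port, with no check of which instance owns it."""
    ALLOCATED_PORTS.discard(port)


class ToyInstance:
    def __init__(self, name):
        self.name = name
        self.guest_ports = []      # ports listed in the guest definition

    def _destroy(self):
        for port in self.guest_ports:
            release_port(port)

    def power_off(self):
        self._destroy()
        # Stand-in for detaching the serial/console devices: the guest
        # definition no longer lists any ports after power off.
        self.guest_ports = []

    def power_on(self):
        self._destroy()            # finds nothing stale to release
        self.guest_ports = [acquire_port()]


a = ToyInstance("a")
b = ToyInstance("b")
a.power_on()     # a is given a port
a.power_off()    # the port is released and forgotten by a
b.power_on()     # b is given the port that a released
a.power_on()     # a releases nothing and is given a different port
```

Here the second _destroy() has nothing to release, so the port held by
the other instance stays in ALLOCATED_PORTS and a different port is
selected.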

** Affects: nova
     Importance: Undecided
         Status: New


** Tags: compute libvirt

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1755981
