yahoo-eng-team team mailing list archive

[Bug 1667735] Re: cloud-init doesn't retry metadata lookups and hangs forever if metadata is down

 

** Changed in: cloud-init (Ubuntu)
       Status: Fix Released => Confirmed

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to cloud-init.
https://bugs.launchpad.net/bugs/1667735

Title:
  cloud-init doesn't retry metadata lookups and hangs forever if
  metadata is down

Status in cloud-init:
  Confirmed
Status in cloud-init package in Ubuntu:
  Confirmed
Status in cloud-init source package in Precise:
  Confirmed
Status in cloud-init source package in Trusty:
  Confirmed

Bug description:
  If a host SmartOS server is rebooted and the metadata service is not
  available, a KVM VM instance that uses cloud-init (via the SmartOS
  datasource) will fail to start.

  If the metadata agent on the host server is not available, the Python
  code for cloud-init blocks forever waiting for data it will never
  receive. This causes the boot process for an instance to hang on
  cloud-init.

  This is a problem whenever the metadata agent is unavailable, for any
  reason, while a SmartOS KVM VM that relies on cloud-init is booting.

  From the engineer that worked on this (note that the svcadm command is
  run on the host SmartOS server):

  You can reproduce this by disabling the metadata service on the
  SmartOS host:

  svcadm disable metadata

  and then boot a KVM VM running an Ubuntu Certified Cloud image such
  as:

  c864f104-624c-43d2-835e-b49a39709b6b (ubuntu-certified-14.04
  20150225.2)

  When you do this, the VM's boot process will hang at cloud-init. If
  you then start the metadata service, cloud-init will not recover.

  One of our engineers who looked at this was able to cause forward
  progress by applying this patch:

  --- /usr/lib/python2.7/dist-packages/cloudinit/sources/DataSourceSmartOS.py.ori	2017-02-23 01:28:28.405885775 +0000
  +++ /usr/lib/python2.7/dist-packages/cloudinit/sources/DataSourceSmartOS.py	2017-02-23 01:35:51.281885775 +0000
  @@ -286,7 +286,7 @@
       if not seed_device:
           raise AttributeError("seed_device value is not set")

  -    ser = serial.Serial(seed_device, timeout=seed_timeout)
  +    ser = serial.Serial(seed_device, timeout=10)
       if not ser.isOpen():
           raise SystemError("Unable to open %s" % seed_device)

  which causes the following strace output:

  [pid  2119] open("/dev/ttyS1", O_RDWR|O_NOCTTY|O_NONBLOCK) = 5
  [pid  2119] ioctl(5, SNDCTL_TMR_TIMEBASE or SNDRV_TIMER_IOCTL_NEXT_DEVICE or TCGETS, {B9600 -opost -isig -icanon -echo ...}) = 0
  [pid  2119] write(5, "GET user-script\n", 16) = 16
  [pid  2119] select(6, [5], [], [], {10, 0}) = 0 (Timeout)
  [pid  2119] close(5)                    = 0
  [pid  2119] open("/dev/ttyS1", O_RDWR|O_NOCTTY|O_NONBLOCK) = 5
  [pid  2119] ioctl(5, SNDCTL_TMR_TIMEBASE or SNDRV_TIMER_IOCTL_NEXT_DEVICE or TCGETS, {B9600 -opost -isig -icanon -echo ...}) = 0
  [pid  2119] write(5, "GET iptables_disable\n", 21) = 21
  [pid  2119] select(6, [5], [], [], {10, 0}) = 0 (Timeout)
  [pid  2119] close(5)                    = 0

  instead of:

  [pid  1977] open("/dev/ttyS1", O_RDWR|O_NOCTTY|O_NONBLOCK) = 5
  [pid  1977] ioctl(5, SNDCTL_TMR_TIMEBASE or SNDRV_TIMER_IOCTL_NEXT_DEVICE or TCGETS, {B9600 -opost -isig -icanon -echo ...}) = 0
  [pid  1977] write(5, "GET base64_keys\n", 16) = 16
  [pid  1977] select(6, [5], [], [], NULL

  which you get without the patch (notice the NULL for the timeout
  parameter). The code that gets blocked in this version of cloud-init
  is:

      ser.write("GET %s\n" % noun.rstrip())
      status = str(ser.readline()).rstrip()

  in cloudinit/sources/DataSourceSmartOS.py. The ser.readline()
  documentation says

  (https://pyserial.readthedocs.io/en/latest/shortintro.html#readline):

  Be careful when using readline(). Do specify a timeout when opening
  the serial port otherwise it could block forever if no newline
  character is received. Also note that readlines() only works with a
  timeout. readlines() depends on having a timeout and interprets that
  as EOF (end of file). It raises an exception if the port is not opened
  correctly.

  which is exactly the situation we've hit here.

  It might be better to keep a timeout but, when it is hit with no
  answer, retry the GET rather than move on to the next key. A negative
  answer (NOTFOUND, for example) should not be retried; only a missing
  answer (because metadata is unavailable) should be.
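
  A minimal sketch of that retry behaviour, not the actual cloud-init
  fix. The names get_with_retry, MetadataUnavailable, retries and delay
  are hypothetical, and the sketch assumes Python 3 (bytes on the wire)
  and a port object whose readline() returns an empty result when the
  timeout expires, as a pyserial port opened with a timeout does:

```python
import itertools
import time


class MetadataUnavailable(Exception):
    """Hypothetical error: no answer arrived within the allowed attempts."""


def get_with_retry(ser, noun, retries=None, delay=1.0, sleep=time.sleep):
    """Send "GET <noun>" and return the status line, retrying on silence.

    `ser` is any object with write() and readline(), for example a
    serial port opened with a finite timeout. retries=None means keep
    trying until the metadata service answers.
    """
    request = ("GET %s\n" % noun.rstrip()).encode("ascii")
    attempts = itertools.count() if retries is None else range(retries)
    for _ in attempts:
        ser.write(request)
        # With a timeout set, readline() returns b"" if nothing arrived.
        status = ser.readline().decode("ascii", "replace").rstrip()
        if status:
            # Any answer is final, including a negative one (NOTFOUND).
            return status
        # No answer at all: metadata is probably down, so wait and retry.
        sleep(delay)
    raise MetadataUnavailable(
        "no answer to GET %s after %d attempts" % (noun, retries))
```

  Here `ser` could be the serial.Serial(seed_device, timeout=10) object
  from the patch above. Leaving retries at None lets the boot wait until
  metadata is enabled, while a finite value bounds the wait.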

  Once this is resolved, it should be possible to start a VM with
  cloud-init and metadata disabled, then enable metadata some time
  later and have the boot process complete at that time.

To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-init/+bug/1667735/+subscriptions