yahoo-eng-team team mailing list archive

[Bug 1841182] [NEW] cloud-init fails when rebooting EC2 i3.metal instances

 

Public bug reported:

In order to collect boot-speed metrics I deploy/reboot/terminate several
EC2 instances per day. At a high level, this is what the jobs do:

1. Deploy an instance and wait for cloud-init
   to finish using `cloud-init status --wait`
2. Collect and retrieve some logs via SSH/SFTP
3. Reboot the instance using boto3's reboot()
4. Collect some more logs
5. Terminate the instance
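
The job sequence above can be sketched roughly as follows. This is only an
illustration, not the actual job code: `run_boot_cycle`, `run_remote` and
`fetch_logs` are hypothetical names, and `instance` is assumed to behave like
a boto3 `ec2.Instance` resource (it only needs `wait_until_running()`,
`reboot()` and `terminate()`).

```python
def run_boot_cycle(instance, run_remote, fetch_logs):
    """Run one deploy/reboot/terminate measurement cycle.

    instance   -- assumed to behave like a boto3 ec2.Instance resource
    run_remote -- hypothetical helper: runs a shell command over SSH
    fetch_logs -- hypothetical helper: retrieves logs over SSH/SFTP
    """
    # Step 1: wait for the instance, then for cloud-init to finish.
    instance.wait_until_running()
    run_remote("cloud-init status --wait")

    # Step 2: collect logs from the first boot.
    fetch_logs("first-boot")

    # Step 3: reboot. EC2 keeps reporting "running" across a reboot, so the
    # instance state alone cannot confirm that the machine came back up.
    instance.reboot()

    # Step 4: collect logs from the second boot. The helper is assumed to
    # retry until SSH is reachable; a failed reboot shows up here as a timeout.
    fetch_logs("after-reboot")

    # Step 5: clean up.
    instance.terminate()
```

Passing the instance and the two helpers in as arguments keeps the sketch
independent of any particular SSH library and makes the ordering of the steps
easy to check with stand-in objects.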

This works fairly reliably, but on i3.metal instances the instance often
fails to survive the reboot step. After a failed reboot the instance
state appears as "running", but the instance is unreachable via SSH.

By detaching the root volume and attaching it to another instance in the
same availability zone I've been able to inspect the logs, and the
problem is a cloud-init failure. At first glance, it looks like
cloud-init doesn't like /var/lib/cloud/data/set-hostname being empty:


2019-08-23 11:31:27,585 - util.py[DEBUG]: Reading from /var/lib/cloud/data/set-hostname (quiet=False)    
2019-08-23 11:31:27,585 - util.py[DEBUG]: Read 0 bytes from /var/lib/cloud/data/set-hostname    
2019-08-23 11:31:27,585 - util.py[WARNING]: failed stage init-local                 
2019-08-23 11:31:27,586 - util.py[DEBUG]: failed stage init-local                   
Traceback (most recent call last):                                                  
  File "/usr/lib/python3/dist-packages/cloudinit/cmd/main.py", line 653, in status_wrapper    
    ret = functor(name, args)                                                       
  File "/usr/lib/python3/dist-packages/cloudinit/cmd/main.py", line 361, in main_init     
    _maybe_set_hostname(init, stage='local', retry_stage='network')                 
  File "/usr/lib/python3/dist-packages/cloudinit/cmd/main.py", line 709, in _maybe_set_hostname    
    cc_set_hostname.handle('set-hostname', init.cfg, cloud, LOG, None)              
  File "/usr/lib/python3/dist-packages/cloudinit/config/cc_set_hostname.py", line 67, in handle    
    prev_hostname = util.load_json(util.load_file(prev_fn))                         
  File "/usr/lib/python3/dist-packages/cloudinit/util.py", line 1586, in load_json    
    decoded = json.loads(decode_binary(text))                                           
  File "/usr/lib/python3.6/json/__init__.py", line 354, in loads                    
    return _default_decoder.decode(s)                                               
  File "/usr/lib/python3.6/json/decoder.py", line 339, in decode                    
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())                               
  File "/usr/lib/python3.6/json/decoder.py", line 357, in raw_decode                
    raise JSONDecodeError("Expecting value", s, err.value) from None                
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)             


I'm not sure where the actual problem is here. Is set-hostname
supposed to always contain something? Should cloud-init be able to
handle an empty set-hostname? Could the fact that the instance is
rebooted shortly after being deployed affect this?
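
On the second question: `json.loads("")` always raises the
`JSONDecodeError` shown in the traceback, so a zero-byte file can never be
parsed as JSON. Below is a minimal sketch of a loader that tolerates this,
assuming it is acceptable to treat an empty or corrupt cache file as "no
previous hostname recorded". `load_previous_hostname` is a hypothetical name,
not a cloud-init function, and this is not presented as the actual fix.

```python
import json


def load_previous_hostname(path):
    """Return the cached hostname record, or {} if it is missing or unusable."""
    try:
        with open(path, "r", encoding="utf-8") as fh:
            text = fh.read()
    except FileNotFoundError:
        # No cache yet (for example on first boot).
        return {}

    if not text.strip():
        # Zero-byte or whitespace-only file: the case seen in the log above.
        return {}

    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        # Truncated or corrupt content: treat it like a missing cache.
        return {}

    # Anything other than a JSON object is also treated as unusable.
    return data if isinstance(data, dict) else {}
```

With a loader like this, an empty file would make the caller behave as if the
hostname had never been set, instead of aborting the init-local stage.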

The full logs are attached.

** Affects: cloud-init
     Importance: Undecided
         Status: New

** Attachment added: "cloud-init.tar.gz"
   https://bugs.launchpad.net/bugs/1841182/+attachment/5284197/+files/cloud-init.tar.gz

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to cloud-init.
https://bugs.launchpad.net/bugs/1841182
