yahoo-eng-team team mailing list archive
Message #79720
[Bug 1841182] [NEW] cloud-init fails when rebooting EC2 i3.metal instances
Public bug reported:
In order to collect boot-speed metrics I deploy/reboot/terminate several
EC2 instances per day. At a high level, this is what the jobs do:
1. Deploy an instance and wait for cloud-init to finish using `cloud-init status --wait`
2. Collect and retrieve some logs via SSH/SFTP
3. Reboot the instance using boto3's reboot()
4. Collect some more logs
5. Terminate the instance
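The steps above can be sketched roughly as follows. This is a minimal illustration, not the actual job code: `instance` is assumed to be a boto3 EC2 `Instance` resource (which provides `wait_until_running()`, `reboot()` and `terminate()`), while `run_ssh` and `fetch_logs` are hypothetical helpers standing in for the SSH/SFTP parts. Because the instance state stays "running" across a reboot, the sketch assumes a new boot can be detected by comparing the kernel boot ID before and after.

```python
import time


def wait_for_new_boot(run_ssh, old_boot_id, timeout=600.0, poll=10.0):
    """Return True once SSH answers with a boot ID different from old_boot_id."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            boot_id = run_ssh("cat /proc/sys/kernel/random/boot_id").strip()
            if boot_id != old_boot_id:
                return True
        except OSError:
            pass  # SSH is not reachable yet; keep polling
        time.sleep(poll)
    return False


def run_boot_job(instance, run_ssh, fetch_logs, timeout=600.0, poll=10.0):
    """Run steps 1-5 against an already launched instance.

    run_ssh(cmd) -> str and fetch_logs() are hypothetical helpers,
    not part of boto3 or cloud-init.
    """
    report = {"rebooted_ok": False, "logs": []}
    try:
        # Step 1: wait for the instance and for cloud-init to finish.
        instance.wait_until_running()
        run_ssh("cloud-init status --wait")
        # Step 2: collect the first-boot logs.
        report["logs"].append(fetch_logs())
        # Step 3: reboot, remembering the boot ID of the first boot.
        old_boot_id = run_ssh("cat /proc/sys/kernel/random/boot_id").strip()
        instance.reboot()
        report["rebooted_ok"] = wait_for_new_boot(run_ssh, old_boot_id, timeout, poll)
        # Step 4: collect more logs, but only if the reboot succeeded.
        if report["rebooted_ok"]:
            report["logs"].append(fetch_logs())
    finally:
        # Step 5: always terminate, even if an earlier step failed.
        instance.terminate()
    return report
```

In the failing i3.metal case described here, such a job would come back with `rebooted_ok` set to `False` and only the first-boot logs.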
This works in a fairly reliable way, but on i3.metal instances the
instance often fails to survive the reboot step. After a failed reboot
the instance state appears as "running", but it's unreachable via SSH.
By detaching the root volume and attaching it to another instance in the
same availability zone I've been able to inspect the logs, and the problem
is a cloud-init failure. At first glance, the logs suggest that
cloud-init doesn't like /var/lib/cloud/data/set-hostname being empty:
2019-08-23 11:31:27,585 - util.py[DEBUG]: Reading from /var/lib/cloud/data/set-hostname (quiet=False)
2019-08-23 11:31:27,585 - util.py[DEBUG]: Read 0 bytes from /var/lib/cloud/data/set-hostname
2019-08-23 11:31:27,585 - util.py[WARNING]: failed stage init-local
2019-08-23 11:31:27,586 - util.py[DEBUG]: failed stage init-local
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/cloudinit/cmd/main.py", line 653, in status_wrapper
    ret = functor(name, args)
  File "/usr/lib/python3/dist-packages/cloudinit/cmd/main.py", line 361, in main_init
    _maybe_set_hostname(init, stage='local', retry_stage='network')
  File "/usr/lib/python3/dist-packages/cloudinit/cmd/main.py", line 709, in _maybe_set_hostname
    cc_set_hostname.handle('set-hostname', init.cfg, cloud, LOG, None)
  File "/usr/lib/python3/dist-packages/cloudinit/config/cc_set_hostname.py", line 67, in handle
    prev_hostname = util.load_json(util.load_file(prev_fn))
  File "/usr/lib/python3/dist-packages/cloudinit/util.py", line 1586, in load_json
    decoded = json.loads(decode_binary(text))
  File "/usr/lib/python3.6/json/__init__.py", line 354, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python3.6/json/decoder.py", line 339, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib/python3.6/json/decoder.py", line 357, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
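The traceback boils down to `json.loads` being handed an empty string, which always raises `JSONDecodeError`. As an illustration of what tolerating an empty file could look like, here is a minimal sketch. It is not cloud-init's code or its eventual fix; `load_prev_hostname` is a hypothetical name, and treating a missing, empty or corrupt file as "no previous hostname" is an assumption about the desired behaviour.

```python
import json


def load_prev_hostname(path):
    """Return the cached hostname data, or {} if the file is unusable.

    Hypothetical helper: a missing, empty, or non-JSON file is treated
    as "no previous hostname recorded" instead of raising.
    """
    try:
        with open(path, "r", encoding="utf-8") as f:
            text = f.read()
    except FileNotFoundError:
        return {}
    if not text.strip():
        # This is the case seen in the log: "Read 0 bytes".
        return {}
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return {}
    # Anything other than a JSON object is not usable as cached data.
    return data if isinstance(data, dict) else {}
```

With a fallback like this the init-local stage would go on to set the hostname again rather than abort, at the cost of silently ignoring a corrupt cache file.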
I'm not sure where the actual problem is here. Is set-hostname
supposed to always contain something? Should cloud-init be able to
handle an empty set-hostname? Could the fact that the instance is
rebooted shortly after being deployed affect this?
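If the early reboot does matter, one possible mechanism (an assumption on my part; nothing in the logs confirms it) is that the file was created but its contents had not reached the disk when the instance went down. A common way to rule that out for small state files is to write to a temporary file, fsync it, and rename it over the target. The sketch below illustrates that pattern with the standard library only; `write_json_atomic` is a hypothetical name and not a cloud-init function.

```python
import json
import os
import tempfile


def write_json_atomic(path, data):
    """Write data as JSON so that path is never left empty or truncated."""
    directory = os.path.dirname(path) or "."
    # The temporary file must live on the same filesystem as the target,
    # otherwise the final rename would not be atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(data, f)
            f.flush()
            os.fsync(f.fileno())  # push the contents to disk before renaming
        os.replace(tmp_path, path)  # atomically swap the new file into place
    except BaseException:
        # Clean up the temporary file if anything went wrong.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
```

With this pattern a reader sees either the old file or the complete new one. Full durability across power loss would additionally need an fsync on the containing directory, which is left out here.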
The full logs are attached.
** Affects: cloud-init
Importance: Undecided
Status: New
** Attachment added: "cloud-init.tar.gz"
https://bugs.launchpad.net/bugs/1841182/+attachment/5284197/+files/cloud-init.tar.gz
--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to cloud-init.
https://bugs.launchpad.net/bugs/1841182
Title:
cloud-init fails when rebooting EC2 i3.metal instances
Status in cloud-init:
New
To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-init/+bug/1841182/+subscriptions