[Bug 1817035] Re: eth0 lost carrier / down after restart and IP change on older EC2-classic instance
I am going to close out the cloud-images side of this bug as well. The
daily image for a release will contain the fix as soon as it is
released.
** Changed in: cloud-images
Status: New => Invalid
--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to cloud-init.
https://bugs.launchpad.net/bugs/1817035
Title:
eth0 lost carrier / down after restart and IP change on older
EC2-classic instance
Status in cloud-images:
Invalid
Status in cloud-init:
Invalid
Bug description:
I'm experiencing a consistent issue where older EC2 instance types
(e.g. c3.large) launched in EC2-Classic from the bionic AMI lose
network connectivity if they're stopped and subsequently restarted.
They work fine on the first boot, but after a restart they time out
both for SSH and for EC2's status checks, and they appear to have no
outbound connectivity (e.g. to the metadata service). Rebooting does
not resolve the issue, nor does stopping and starting again.
On one occasion when testing, I started the instance again very
quickly and Amazon allocated it the same IP address as before - that
time the instance booted with no problems. Normally, however, the
instance gets a new IP address, so the issue appears to be related to
the IP change.
This is happening consistently with ami-08d658f84a6d84a80
(ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190212.1)
and I've also reproduced it with ami-0c21eb76a5574aa2f
(ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190210).
It does not happen if launching a newer instance type into EC2-VPC.
Steps to reproduce:
* Launch ami-08d658f84a6d84a80 on a c3.large in EC2-Classic, with a security group allowing port 22 from anywhere and all other configuration at AWS defaults (a rough boto3 sketch of this procedure follows the list)
* Wait for the instance to boot, SSH to it and observe everything working normally. Wait for the EC2 status checks to initialise and observe that they pass.
* Stop instance
* Wait a minute or two - if restarted very rapidly AWS may reallocate the previous IP
* Start instance and observe it has been allocated a new IP address
* Wait a few minutes
* Attempt to SSH to the instance and observe the connection times out
* Observe that the EC2 instance reachability status check is failing
* Use the EC2 console to take an instance screenshot and observe that the console is showing the login prompt
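For anyone scripting the reproduction, here is a minimal sketch of the
stop/start cycle using boto3; the instance ID and region are
placeholders, and it assumes the instance was already launched as
described above with credentials configured for boto3.

# Hedged sketch: cycle an existing EC2-Classic instance through
# stop/start and report its new public IP and reachability status.
import boto3

INSTANCE_ID = "i-0123456789abcdef0"   # placeholder, not a real instance
ec2 = boto3.client("ec2", region_name="us-east-1")  # region is an assumption

ec2.stop_instances(InstanceIds=[INSTANCE_ID])
ec2.get_waiter("instance_stopped").wait(InstanceIds=[INSTANCE_ID])

ec2.start_instances(InstanceIds=[INSTANCE_ID])
ec2.get_waiter("instance_running").wait(InstanceIds=[INSTANCE_ID])

# The public IP after restart usually differs from the pre-stop one,
# which is when the problem shows up.
res = ec2.describe_instances(InstanceIds=[INSTANCE_ID])
inst = res["Reservations"][0]["Instances"][0]
print("public IP after restart:", inst.get("PublicIpAddress"))

# The reachability check takes a few minutes to leave 'initializing'.
status = ec2.describe_instance_status(InstanceIds=[INSTANCE_ID])
for s in status.get("InstanceStatuses", []):
    print("instance status:", s["InstanceStatus"]["Status"],
          "/ system status:", s["SystemStatus"]["Status"])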
By attaching the root volume from the broken instance to a new instance, I was able to capture and compare the syslog for the two boots. Both appear broadly similar at first; DHCP works as expected over eth0.
In both boots, systemd-networkd then reports "eth0: lost carrier".
On the successful boot, systemd-networkd reports "eth0: gained carrier"
and "eth0: IPv6 successfully enabled" almost immediately afterwards.
On the failed boot these entries never appear.
Shortly afterwards cloud-init runs; on the successful boot it shows
eth0 up with both IPv4 and IPv6 addresses and valid routing tables. On
the failed boot it shows eth0 down, no IPv4 routing table and an empty
IPv6 routing table.
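The state cloud-init reports can be double-checked on a live (or
rescued) instance with a quick sketch like this; it only assumes the
iproute2 tools that ship with bionic.

# Hedged sketch: snapshot the link state and routing tables for
# comparison with what cloud-init printed on the failed boot.
import subprocess

def show(cmd):
    print("$", " ".join(cmd))
    print(subprocess.check_output(cmd, universal_newlines=True))

# operstate reads "up" on a healthy boot and "down" on the broken one
with open("/sys/class/net/eth0/operstate") as f:
    print("eth0 operstate:", f.read().strip())

show(["ip", "-br", "addr", "show", "eth0"])
show(["ip", "route"])        # empty on the failed boot
show(["ip", "-6", "route"])  # effectively empty on the failed boot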
Later in the log from the failed boot, amazon-ssm-agent.amazon-ssm-agent
reports that it cannot contact the metadata service (dial tcp
169.254.169.254:80: connect: network is unreachable).
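The metadata failure is easy to confirm independently of the SSM agent
by probing the same endpoint directly:

# Hedged sketch: probe the EC2 instance metadata service. On the failed
# boot this fails with "Network is unreachable", matching the
# amazon-ssm-agent error; on a healthy boot it prints the local IPv4.
from urllib.request import urlopen
from urllib.error import URLError

IMDS_URL = "http://169.254.169.254/latest/meta-data/local-ipv4"

try:
    with urlopen(IMDS_URL, timeout=2) as resp:
        print("metadata reachable, local-ipv4 =", resp.read().decode())
except URLError as exc:
    print("metadata unreachable:", exc.reason)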
One thing I did notice is that the images don't appear to have been
configured to disable Predictable Network Interface Names. I've tried
changing that but it didn't resolve the issue. On reflection I think
that's perhaps unrelated, since presumably the interface names don't
change between a stop and start of the same instance on the same EC2
instance type, and the first boot works happily. Also, the logs
consistently show eth0 rather than one of the newer interface names.
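For completeness, a quick way to verify the interface-naming state on a
booted instance is something like the following; it assumes predictable
names are disabled via net.ifnames=0 on the kernel command line.

# Hedged sketch: report whether the kernel was booted with net.ifnames=0
# (predictable interface names disabled) and which interfaces exist.
import os

with open("/proc/cmdline") as f:
    cmdline = f.read().split()

print("net.ifnames=0 present:", "net.ifnames=0" in cmdline)
print("interfaces:", sorted(os.listdir("/sys/class/net")))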
To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-images/+bug/1817035/+subscriptions