yahoo-eng-team team mailing list archive

[Bug 1817035] Re: eth0 lost carrier / down after restart and IP change on older EC2-classic instance

 

I am going to close out the cloud-images side of this bug as well. The
daily image for a release will contain the fix as soon as the fix is
released.

** Changed in: cloud-images
       Status: New => Invalid

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to cloud-init.
https://bugs.launchpad.net/bugs/1817035

Title:
  eth0 lost carrier / down after restart and IP change on older
  EC2-classic instance

Status in cloud-images:
  Invalid
Status in cloud-init:
  Invalid

Bug description:
  I'm experiencing a consistent issue where older EC2 instance types
  (e.g. c3.large) launched in EC2-Classic from the bionic AMI lose
  network connection if they're stopped and subsequently restarted.

  They work fine on the first boot, but when restarted they time out
  both for things like SSH and also for EC2's status checks. They also
  appear to have no outbound connection e.g. to the metadata service
  etc. Rebooting does not resolve the issue, nor does stopping and
  starting again.

  On one occasion when testing, I resumed the instance very quickly and
  Amazon allocated it the same IP address as before - the instance
  booted with no problems. Normally however the instance gets a new IP
  address - so it appears this may be related.

  This is happening consistently with ami-08d658f84a6d84a80
  (ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190212.1)
  and I've also reproduced with ami-0c21eb76a5574aa2f
  (ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190210).

  It does not happen if launching a newer instance type into EC2-VPC.

  Steps to reproduce:

  * Launch ami-08d658f84a6d84a80 on a c3.large in EC2-Classic, with a security group allowing port 22 from anywhere and other configuration all as AWS defaults
  * Wait for instance to boot, SSH to instance and observe all working normally. Wait for EC2 status checks to initialise and observe they pass.
  * Stop instance
  * Wait a minute or two - if restarted very rapidly AWS may reallocate the previous IP
  * Start instance and observe it has been allocated a new IP address
  * Wait a few minutes
  * Attempt to SSH to the instance and observe the connection times out
  * Observe that the EC2 instance reachability status check is failing
  * Use the EC2 console to take an instance screenshot and observe that the console is showing the login prompt

  
  By attaching the root volume from the broken instance to a new instance, I was able to capture and compare the syslog for the two boots. Both appear broadly similar at first, and DHCP works as expected over eth0.

  In both boots, systemd-networkd then reports "eth0: lost carrier".

  On the successful boot, systemd-networkd almost immediately afterwards
  then reports "eth0: gained carrier" and "eth0: IPv6 successfully
  enabled". However on the failed boot these entries never appear.
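
  The comparison above can be automated when checking further syslog
  captures. The following is only a sketch: the function name is
  hypothetical, and it assumes nothing about the log format beyond the
  "lost carrier" and "gained carrier" message text quoted in this
  report. Matching is case-insensitive because the exact capitalisation
  in the real log is an assumption.

```python
import re


def carrier_recovered(lines, iface="eth0"):
    """Report whether the interface regained carrier after its last loss.

    Returns None if no "lost carrier" message was seen, False if the
    last loss was never followed by "gained carrier", and True otherwise.
    """
    # Message fragments as quoted in the bug report (assumed format).
    lost = re.compile(re.escape(iface) + r": lost carrier", re.IGNORECASE)
    gained = re.compile(re.escape(iface) + r": gained carrier", re.IGNORECASE)

    state = None
    for line in lines:
        if lost.search(line):
            state = False
        elif gained.search(line) and state is not None:
            state = True
    return state


# Hypothetical log lines, for illustration only.
good_boot = [
    "systemd-networkd: eth0: lost carrier",
    "systemd-networkd: eth0: gained carrier",
]
failed_boot = [
    "systemd-networkd: eth0: lost carrier",
]
```

  With these sample lines, carrier_recovered(good_boot) is True and
  carrier_recovered(failed_boot) is False, which mirrors the difference
  between the two boots described above.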

  Shortly afterwards cloud-init runs and on the successful boot shows eth0
  up with both IPv4 and IPv6 addresses, and valid routing tables. On the
  failed boot it shows eth0 down, no IPv4 routing table and an empty
  IPv6 routing table.

  Also later on in the log from the failed boot
  amazon-ssm-agent.amazon-ssm-agent reports that it cannot contact the
  metadata service (dial tcp 169.254.169.254:80: connect: network is
  unreachable).
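
  The same reachability check can be repeated by hand from an affected
  instance (for example over a serial or rescue session). This is a
  minimal sketch using only the Python standard library; the function
  name and the timeout value are assumptions, and the address and port
  are the ones shown in the agent's error message.

```python
import socket


def can_connect(host="169.254.169.254", port=80, timeout=2.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers "network is unreachable", refused connections and timeouts.
        return False
```

  On a boot showing the failure described here, can_connect() would be
  expected to return False, since the agent's own connection attempt
  fails with "network is unreachable".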

  One thing I did notice is that the images don't appear to have been
  configured to disable Predictable Network Interface Names. I've tried
  changing that but it didn't resolve the issue. On reflection I think
  that's perhaps unrelated, since presumably the interface names don't
  change between a stop and start of the same instance on the same EC2
  instance type, and the first boot works happily. Also the logs are all
  consistently showing eth0 rather than one of the newer interface
  names.

To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-images/+bug/1817035/+subscriptions