cloud-init team mailing list archive

chicken and egg network problem?

 

Hi,

With version 0.7.8 we have run into a problem where cloud-init-local fails.

From cloud-init-output.log:

2016-10-06 19:31:25,216 - util.py[WARNING]: failed stage init-local
failed run of stage init-local
------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/cloudinit/cmd/main.py", line 521, in status_wrapper
    ret = functor(name, args)
  File "/usr/lib/python2.7/site-packages/cloudinit/cmd/main.py", line 265, in main_init
    init.apply_network_config(bring_up=not args.local)
  File "/usr/lib/python2.7/site-packages/cloudinit/stages.py", line 631, in apply_network_config
    netcfg, src = self._find_networking_config()
  File "/usr/lib/python2.7/site-packages/cloudinit/stages.py", line 628, in _find_networking_config
    return (net.generate_fallback_config(), "fallback")
  File "/usr/lib/python2.7/site-packages/cloudinit/net/__init__.py", line 146, in generate_fallback_config
    carrier = int(sys_netdev_info(interface, 'carrier'))
  File "/usr/lib/python2.7/site-packages/cloudinit/net/__init__.py", line 119, in sys_netdev_info
    data = util.load_file(fname)
  File "/usr/lib/python2.7/site-packages/cloudinit/util.py", line 1269, in load_file
    pipe_in_out(ifh, ofh, chunk_cb=read_cb)
  File "/usr/lib/python2.7/site-packages/cloudinit/util.py", line 1312, in pipe_in_out
    data = in_fh.read(chunk_size)
IOError: [Errno 22] Invalid argument

Getting EINVAL on a read was perplexing. It took me a while to put the
pieces together, and I needed some help from people who know the kernel
code.

So the traceback is triggered when cloud-init code tries to read
"/sys/class/net/eth0/carrier". The relevant code in the kernel is:

static ssize_t carrier_show(struct device *dev,
                            struct device_attribute *attr, char *buf)
{
        struct net_device *netdev = to_net_dev(dev);
        if (netif_running(netdev)) {
                return sprintf(buf, fmt_dec, !!netif_carrier_ok(netdev));
        }
        return -EINVAL;
}

So the read fails because the interface is not up (netif_running() is
false) when we try to read the carrier flag. However, as far as I know,
there is no state that is guaranteed to come after the network device
has been brought up but before it is configured.
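
For illustration, here is a minimal sketch of how a reader of the carrier
file could tolerate this condition instead of raising. This is not the
cloud-init code; read_carrier and its sys_class_net parameter are
hypothetical names I made up for the example, and treating EINVAL as
"unknown" is an assumption about what the caller would want.

```python
import errno
import os


def read_carrier(interface, sys_class_net="/sys/class/net"):
    """Return True/False for the carrier flag, or None if it is unreadable.

    Hypothetical helper, not part of cloud-init. The kernel returns EINVAL
    when the interface is not running, so that case is reported as None
    ("unknown") rather than propagated as an exception.
    """
    path = os.path.join(sys_class_net, interface, "carrier")
    try:
        with open(path) as fh:
            return bool(int(fh.read().strip()))
    except (IOError, OSError) as exc:
        # EINVAL: interface exists but is not up; ENOENT: no such interface.
        if exc.errno in (errno.EINVAL, errno.ENOENT):
            return None
        raise
```

A caller building a fallback config could then skip or deprioritize
interfaces for which this returns None, rather than failing the whole
stage.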

The systemd documentation states that "network-pre.target is a target
that may be used to order services before any network interface is
configured", which is clearly what we want at this point in the
cloud-init execution. Apparently, though, this does not guarantee that
the interface is up, i.e. that we can read /sys/class/net/eth0/carrier.
Unfortunately, systemd appears to offer no ordering guarantee that hits
the desired timing. The other potential option would be
"After=network.target", but the documentation says:

"""
network.target has very little meaning during start-up. It only
indicates that the network management stack is up after it has been
reached. Whether any network interfaces are already configured when it
is reached is undefined.
"""

So if we were to use "After=network.target", it may be too late. Then
again, cloud-init is responsible for configuring the interface, so we
could decide that "network.target" is good enough: we would know that
the network has not been configured, but the interface would be in a
state where "/sys/class/net/eth0/carrier" can be read.

I tested this: the exception went away and cloud-init-local.service
succeeded.

Anyway, there is probably some aspect that I do not yet understand, but
I do know that at present we end up with a failure of the
cloud-init-local service, with a traceback, and that is not a good
situation.

Thoughts/comments?

Thanks,
Robert



-- 
Robert Schweikert                   MAY THE SOURCE BE WITH YOU
Public Cloud Architect                         LINUX
rjschwei@xxxxxxxx
IRC: robjo
