openstack team mailing list archive

DHCP lease errors in VLAN mode

 

TL;DR

To fix issues with failed DHCP leases in VLAN mode, upgrade to dnsmasq 2.61[1].

THE LONG VERSION

There is an issue with the way nova uses dnsmasq in VLAN mode. It starts a separate copy of dnsmasq for each VLAN on the network host (or on every host in multi_host mode). The problem is in the way that dnsmasq binds to an IP address and port[2]. All of the copies can respond to broadcast packets, but unicast packets can only be answered by one of the copies.
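The general effect can be reproduced on Linux without dnsmasq at all. The following is a minimal sketch, not dnsmasq's actual socket setup: it assumes plain UDP sockets on the loopback address and an ephemeral port instead of DHCP port 67, and the helper names (open_shared_udp, count_unicast_receivers) are made up for illustration. Two listeners share one port via SO_REUSEADDR, a single unicast datagram is sent, and the function reports how many listeners actually received it.

```python
import select
import socket


def open_shared_udp(port, addr="127.0.0.1"):
    """Open a UDP socket that allows other sockets to share its port."""
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    s.bind((addr, port))
    return s


def count_unicast_receivers(n_listeners=2):
    """Send one unicast datagram and count the listeners that receive it."""
    # Let the kernel pick a free port, then bind the other listeners to it.
    first = open_shared_udp(0)
    port = first.getsockname()[1]
    listeners = [first] + [open_shared_udp(port) for _ in range(n_listeners - 1)]
    sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sender.sendto(b"renew", ("127.0.0.1", port))
        # Wait up to one second for any listener to become readable.
        readable, _, _ = select.select(listeners, [], [], 1.0)
        return len(readable)
    finally:
        for s in listeners + [sender]:
            s.close()


if __name__ == "__main__":
    print("listeners that saw the unicast packet:", count_unicast_receivers())
```

On Linux only one of the listeners should see the datagram, which is the same situation the dnsmasq copies are in when a guest sends a unicast renew.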

In nova this means that guests from only one project will get responses to their unicast DHCP renew requests. Unicast requests from guests in other projects get ignored. What happens next depends on the guest OS. Linux generally will send a broadcast packet out after the unicast fails, so the only effect is a small (tens of ms) hiccup while the interface is reconfigured. It can be much worse than that, however. I have seen cases where Windows just gives up and ends up with a non-configured interface.

This bug was first noticed by some users of openstack who rolled their own fix. Basically, on Linux, if you set the SO_BINDTODEVICE socket option, it will allow different daemons to share the port and respond to unicast packets, as long as they listen on different interfaces. I managed to communicate with Simon Kelley, the maintainer of dnsmasq, and he has integrated a fix[3] for the issue in the current version[1] of dnsmasq.
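For anyone curious what the socket option looks like in practice, here is a rough sketch in Python. It is not the dnsmasq patch[3] itself, just an illustration of the same idea: the function names and the interface name in the usage note are hypothetical, the fallback value 25 is the Linux constant for SO_BINDTODEVICE, and actually opening the socket typically needs root (both for the option and for port 67).

```python
import socket

IFNAMSIZ = 16  # Linux limit on interface names, including the trailing NUL


def encode_ifname(ifname):
    """Return the NUL-terminated byte string SO_BINDTODEVICE expects."""
    raw = ifname.encode("ascii")
    if not raw or len(raw) >= IFNAMSIZ:
        raise ValueError("interface name must be 1 to 15 characters")
    return raw + b"\0"


def open_dhcp_socket(ifname, port=67):
    """Open a UDP socket on the wildcard address, tied to one interface."""
    # Python exposes the constant on Linux; 25 is its value there.
    so_bindtodevice = getattr(socket, "SO_BINDTODEVICE", 25)
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        # Allow several daemons to bind the same port.
        s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        # Only packets arriving on this interface reach the socket.
        s.setsockopt(socket.SOL_SOCKET, so_bindtodevice, encode_ifname(ifname))
        s.bind(("", port))
    except OSError:
        s.close()
        raise
    return s
```

Calling open_dhcp_socket("vlan100") and open_dhcp_socket("vlan101") as root (the interface names are placeholders) would give two sockets on the same port, each seeing only the traffic of its own interface.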

I don't know how many users out there are using VLAN mode, but you should be able to deal with this issue by upgrading dnsmasq. It would be great if the various distributions could upgrade as well, or at least try to patch in the fix[3]. If upgrading dnsmasq is out of the question, a possible workaround is to minimize lease renewals with something like the following combination of config options.

# release leases immediately on terminate
force_dhcp_release=true
# one week lease time
dhcp_lease_time=604800
# two week disassociate timeout
fixed_ip_disassociate_timeout=1209600
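
To get a feel for how much the longer lease reduces renewals, here is a small back-of-the-envelope sketch. It assumes guests use the default timers from RFC 2131 (unicast renew at half the lease time, broadcast rebind at 7/8 of it); real clients may be configured differently, and the function name is made up for illustration.

```python
def renewal_timers(lease_seconds):
    """Return (T1, T2) in seconds using the RFC 2131 default fractions."""
    t1 = lease_seconds // 2        # client starts unicast renew attempts
    t2 = lease_seconds * 7 // 8    # client falls back to broadcast rebind
    return t1, t2


if __name__ == "__main__":
    lease = 604800  # dhcp_lease_time from the options above (one week)
    t1, t2 = renewal_timers(lease)
    print("unicast renew after %.2f days" % (t1 / 86400))
    print("broadcast rebind after %.3f days" % (t2 / 86400))
```

With the one week lease, a guest would only attempt a unicast renew about every three and a half days, so the failure is hit far less often, though it is not eliminated.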

Vish

[1] http://www.thekelleys.org.uk/dnsmasq/dnsmasq-2.61.tar.gz

[2] http://lists.thekelleys.org.uk/pipermail/dnsmasq-discuss/2011q3/005233.html

[3] http://thekelleys.org.uk/gitweb/?p=dnsmasq.git;a=commitdiff;h=9380ba70d67db6b69f817d8e318de5ba1e990b12
