yahoo-eng-team team mailing list archive
-
yahoo-eng-team team
-
Mailing list archive
-
Message #11515
[Bug 1227237] Re: quantum-dhcp-agent + quantum-dhcp-agent-dnsmasq-lease-update deadlock
** Changed in: neutron
Status: New => Invalid
--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1227237
Title:
quantum-dhcp-agent + quantum-dhcp-agent-dnsmasq-lease-update deadlock
Status in OpenStack Neutron (virtual network service):
Invalid
Bug description:
NB This isn't an issue in Havana since neutron doesn't use a DHCP
lease update script (i.e., neutron runs dnsmasq without the --dhcp-
script=xxx option).
Every week or so on our OpenStack cluster of modest size (about a
dozen nodes, dozens to hundreds of VMs being booted per day),
instances stop receiving replies to their DHCP requests. When we
observe the instances failing to communicate with the DHCP server,
traces of network traffic show that the DHCP requests are making it to
the tap device that dnsmasq is listening on but that dnsmasq isn't
sending any replies. Furthermore, dnsmasq seems to be configured to
send the replies (i.e., the appropriate entries are in its DHCP hosts
file and dnsmasq has been sent SIGHUP). Once the dnsmasq process stops
sending replies, it seems to have stopped indefinitely. Killing the
dnsmasq process and restarting quantum-dhcp-agent, which starts a new
dnsmasq process, gets replies going again.
The dnsmasq process seems to have stopped sending replies because of
its DHCP script quantum-dhcp-agent-dnsmasq-lease-update (i.e., from
the option --dhcp-script=quantum-...). According to man dnsmasq(8),
dnsmasq runs one copy of the lease update script at a time (i.e.,
executes it serially) and some request processing blocks on execution
of the lease update script. Assuming the blocked request processing
includes replying to DHCP requests, then a single deadlocked instance
of quantum-dhcp-agent-dnsmasq-lease-update explains why no DHCP
responses are being made.
Indeed, the script has deadlocked. Here's a snippet of output from 'ps
auxf' showing the relevant dnsmasq and quantum-dhcp-agent-dnsmasq-
lease-update and quantum-dhcp-agent processes:
quantum 16347 0.6 0.0 198756 56084 ? Ss Sep10 71:50 python /usr/bin/quantum-dhcp-agent --config-file=/etc/quantum/quantum.conf --config-file=/etc/quantum/dhcp_agent.ini --log-file=/var/log/quantum/dhcp-agent.log
...
nobody 19945 0.0 0.0 28820 1156 ? S Sep10 6:47 dnsmasq --no-hosts --no-resolv --strict-order --bind-interfaces --interface=tap8591954d-46 --except-interface=lo --pid-file=/var/lib/quantum/dhcp/773075d2-3df6-45e6-8c62-251a49fb8e1f/pid --dhcp-hostsfile=/var/lib/quantum/dhcp/773075d2-3df6-45e6-8c62-251a49fb8e1f/host --dhcp-optsfile=/var/lib/quantum/dhcp/773075d2-3df6-45e6-8c62-251a49fb8e1f/opts --dhcp-script=/usr/bin/quantum-dhcp-agent-dnsmasq-lease-update --leasefile-ro --dhcp-range=set:tag0,172.16.0.0,static,120s --conf-file= --domain=openstacklocal
root 19948 0.0 0.0 28792 472 ? S Sep10 3:02 \_ dnsmasq --no-hosts --no-resolv --strict-order --bind-interfaces --interface=tap8591954d-46 --except-interface=lo --pid-file=/var/lib/quantum/dhcp/773075d2-3df6-45e6-8c62-251a49fb8e1f/pid --dhcp-hostsfile=/var/lib/quantum/dhcp/773075d2-3df6-45e6-8c62-251a49fb8e1f/host --dhcp-optsfile=/var/lib/quantum/dhcp/773075d2-3df6-45e6-8c62-251a49fb8e1f/opts --dhcp-script=/usr/bin/quantum-dhcp-agent-dnsmasq-lease-update --leasefile-ro --dhcp-range=set:tag0,172.16.0.0,static,120s --conf-file= --domain=openstacklocal
root 24722 0.0 0.0 4292 352 ? S Sep17 0:00 \_ quantum-dhcp-agent-dnsmasq-lease-update old fa:16:3e:17:5f:46 172.16.0.129 172-16-0-129
Strace shows that 19945 is continuing to do stuff, but 19948 is
blocked in a wait4 system call waiting for its child, 24722, to exit:
peter@gremlin ~/ssst % sudo strace -p 19948
Process 19948 attached - interrupt to quit
wait4(-1, ^C <unfinished ...>
But 24722 is blocked waiting to connect to the UDS that uses to
communicate with quantum-dhcp-agent:
peter@gremlin ~/ssst % sudo strace -p 24722
Process 24722 attached - interrupt to quit
connect(3, {sa_family=AF_FILE, path="/var/lib/quantum/dhcp/lease_relay"}, 110
But quantum-dhcp-agent, pid 16347, is spinning epoll-ing on some epoll
instance:
peter@gremlin ~/ssst % sudo strace -p 16347
Process 16347 attached - interrupt to quit
epoll_wait(5, {}, 1023, 1549) = 0
read(4, "2+\214~\346\235,\6\264\320\27\"\\\17?\354", 16) = 16
gettid() = 16347
epoll_wait(5, {}, 1023, 826) = 0
epoll_wait(5, {}, 1023, 3999) = 0
epoll_wait(5, {}, 1023, 0) = 0
epoll_wait(5, {}, 1023, 3999) = 0
epoll_wait(5, {}, 1023, 3999) = 0
epoll_wait(5,
Clearly that epoll instance doesn't include the UDS that the lease
update script is trying to connect to in spite of having the file
open:
peter@gremlin ~/ssst % sudo lsof | grep 16347 | grep lease_relay
python 16347 quantum 8u unix 0xffff880070c86e80 0t0 224377723 /var/lib/quantum/dhcp/lease_relay
I haven't looked into why quantum-dhcp-agent isn't epoll-ing on the
UDS (fd 8). Presumably the epoll is the internal eventlet epoll.
We've observed this on Canonical's 2013.1-0ubuntu2~cloud0 package
(which is based on 2013.1 AFAIK) on Ubuntu 12.04 and 2013.1.2 on
CentOS 6.4 from a recent RDO installation. Although I haven't tested
2013.1.3 or the latest stable/grizzly code, I suspect the bug still
exists because no similar bugs have been filed.
To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1227237/+subscriptions