← Back to team overview

yahoo-eng-team team mailing list archive

[Bug 1227237] Re: quantum-dhcp-agent + quantum-dhcp-agent-dnsmasq-lease-update deadlock

 

** Changed in: neutron
       Status: New => Invalid

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1227237

Title:
  quantum-dhcp-agent + quantum-dhcp-agent-dnsmasq-lease-update deadlock

Status in OpenStack Neutron (virtual network service):
  Invalid

Bug description:
  NB This isn't an issue in Havana since neutron doesn't use a DHCP
  lease update script (i.e., neutron runs dnsmasq without the --dhcp-
  script=xxx option).

  Every week or so on our OpenStack cluster of modest size (about a
  dozen nodes, dozens to hundreds of VMs being booted per day),
  instances stop receiving replies to their DHCP requests. When we
  observe the instances failing to communicate with the DHCP server,
  traces of network traffic show that the DHCP requests are making it to
  the tap device that dnsmasq is listening on but that dnsmasq isn't
  sending any replies. Furthermore, dnsmasq seems to be configured to
  send the replies (i.e., the appropriate entries are in its DHCP hosts
  file and dnsmasq has been sent SIGHUP). Once the dnsmasq process stops
  sending replies, it seems to have stopped indefinitely. Killing the
  dnsmasq process and restarting quantum-dhcp-agent, which starts a new
  dnsmasq process, gets replies going again.

  The dnsmasq process seems to have stopped sending replies because of
  its DHCP script quantum-dhcp-agent-dnsmasq-lease-update (i.e., from
  the option --dhcp-script=quantum-...). According to man dnsmasq(8),
  dnsmasq runs one copy of the lease update script at a time (i.e.,
  executes it serially) and some request processing blocks on execution
  of the lease update script. Assuming the blocked request processing
  includes replying to DHCP requests, then a single deadlocked instance
  of quantum-dhcp-agent-dnsmasq-lease-update explains why no DHCP
  responses are being made.

  Indeed, the script has deadlocked. Here's a snippet of output from 'ps
  auxf' showing the relevant dnsmasq and quantum-dhcp-agent-dnsmasq-
  lease-update and quantum-dhcp-agent processes:

  quantum  16347  0.6  0.0 198756 56084 ?        Ss   Sep10  71:50 python /usr/bin/quantum-dhcp-agent --config-file=/etc/quantum/quantum.conf --config-file=/etc/quantum/dhcp_agent.ini --log-file=/var/log/quantum/dhcp-agent.log
  ...
  nobody   19945  0.0  0.0  28820  1156 ?        S    Sep10   6:47 dnsmasq --no-hosts --no-resolv --strict-order --bind-interfaces --interface=tap8591954d-46 --except-interface=lo --pid-file=/var/lib/quantum/dhcp/773075d2-3df6-45e6-8c62-251a49fb8e1f/pid --dhcp-hostsfile=/var/lib/quantum/dhcp/773075d2-3df6-45e6-8c62-251a49fb8e1f/host --dhcp-optsfile=/var/lib/quantum/dhcp/773075d2-3df6-45e6-8c62-251a49fb8e1f/opts --dhcp-script=/usr/bin/quantum-dhcp-agent-dnsmasq-lease-update --leasefile-ro --dhcp-range=set:tag0,172.16.0.0,static,120s --conf-file= --domain=openstacklocal
  root     19948  0.0  0.0  28792   472 ?        S    Sep10   3:02  \_ dnsmasq --no-hosts --no-resolv --strict-order --bind-interfaces --interface=tap8591954d-46 --except-interface=lo --pid-file=/var/lib/quantum/dhcp/773075d2-3df6-45e6-8c62-251a49fb8e1f/pid --dhcp-hostsfile=/var/lib/quantum/dhcp/773075d2-3df6-45e6-8c62-251a49fb8e1f/host --dhcp-optsfile=/var/lib/quantum/dhcp/773075d2-3df6-45e6-8c62-251a49fb8e1f/opts --dhcp-script=/usr/bin/quantum-dhcp-agent-dnsmasq-lease-update --leasefile-ro --dhcp-range=set:tag0,172.16.0.0,static,120s --conf-file= --domain=openstacklocal
  root     24722  0.0  0.0   4292   352 ?        S    Sep17   0:00      \_ quantum-dhcp-agent-dnsmasq-lease-update old fa:16:3e:17:5f:46 172.16.0.129 172-16-0-129

  Strace shows that 19945 is continuing to do stuff, but 19948 is
  blocked in a wait4 system call waiting for its child, 24722, to exit:

  peter@gremlin ~/ssst % sudo strace -p 19948
  Process 19948 attached - interrupt to quit
  wait4(-1, ^C <unfinished ...>

  But 24722 is blocked waiting to connect to the UDS that uses to
  communicate with quantum-dhcp-agent:

  peter@gremlin ~/ssst % sudo strace -p 24722
  Process 24722 attached - interrupt to quit
  connect(3, {sa_family=AF_FILE, path="/var/lib/quantum/dhcp/lease_relay"}, 110

  But quantum-dhcp-agent, pid 16347, is spinning epoll-ing on some epoll
  instance:

  peter@gremlin ~/ssst % sudo strace -p 16347
  Process 16347 attached - interrupt to quit
  epoll_wait(5, {}, 1023, 1549)           = 0
  read(4, "2+\214~\346\235,\6\264\320\27\"\\\17?\354", 16) = 16
  gettid()                                = 16347
  epoll_wait(5, {}, 1023, 826)            = 0
  epoll_wait(5, {}, 1023, 3999)           = 0
  epoll_wait(5, {}, 1023, 0)              = 0
  epoll_wait(5, {}, 1023, 3999)           = 0
  epoll_wait(5, {}, 1023, 3999)           = 0
  epoll_wait(5, 

  Clearly that epoll instance doesn't include the UDS that the lease
  update script is trying to connect to in spite of having the file
  open:

  peter@gremlin ~/ssst % sudo lsof | grep 16347 | grep lease_relay   
  python    16347         quantum    8u     unix 0xffff880070c86e80         0t0  224377723 /var/lib/quantum/dhcp/lease_relay

  I haven't looked into why quantum-dhcp-agent isn't epoll-ing on the
  UDS (fd 8). Presumably the epoll is the internal eventlet epoll.

  We've observed this on Canonical's 2013.1-0ubuntu2~cloud0 package
  (which is based on 2013.1 AFAIK) on Ubuntu 12.04 and 2013.1.2 on
  CentOS 6.4 from a recent RDO installation. Although I haven't tested
  2013.1.3 or the latest stable/grizzly code, I suspect the bug still
  exists because no similar bugs have been filed.

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1227237/+subscriptions