← Back to team overview

yahoo-eng-team team mailing list archive

[Bug 1398988] [NEW] dhcp-agent / network bindings out of sync, stopping dhcp-agent

 

Public bug reported:

dhcp-agent on a neutron host started dying with

---
2014-12-02 20:52:20.739 17097 TRACE neutron.agent.dhcp_agent Traceback (most recent call last):
2014-12-02 20:52:20.739 17097 TRACE neutron.agent.dhcp_agent 
2014-12-02 20:52:20.739 17097 TRACE neutron.agent.dhcp_agent   File "/usr/lib/python2.7/site-packages/neutron/openstack/common/rpc/amqp.py", line 462, in _process_data
2014-12-02 20:52:20.739 17097 TRACE neutron.agent.dhcp_agent     **args)
2014-12-02 20:52:20.739 17097 TRACE neutron.agent.dhcp_agent 
2014-12-02 20:52:20.739 17097 TRACE neutron.agent.dhcp_agent   File "/usr/lib/python2.7/site-packages/neutron/common/rpc.py", line 45, in dispatch
2014-12-02 20:52:20.739 17097 TRACE neutron.agent.dhcp_agent     neutron_ctxt, version, method, namespace, **kwargs)
2014-12-02 20:52:20.739 17097 TRACE neutron.agent.dhcp_agent 
2014-12-02 20:52:20.739 17097 TRACE neutron.agent.dhcp_agent   File "/usr/lib/python2.7/site-packages/neutron/openstack/common/rpc/dispatcher.py", line 172, in dispatch
2014-12-02 20:52:20.739 17097 TRACE neutron.agent.dhcp_agent     result = getattr(proxyobj, method)(ctxt, **kwargs)
2014-12-02 20:52:20.739 17097 TRACE neutron.agent.dhcp_agent 
2014-12-02 20:52:20.739 17097 TRACE neutron.agent.dhcp_agent   File "/usr/lib/python2.7/site-packages/neutron/db/dhcp_rpc_base.py", line 92, in get_active_networks_info
2014-12-02 20:52:20.739 17097 TRACE neutron.agent.dhcp_agent     networks = self._get_active_networks(context, **kwargs)
2014-12-02 20:52:20.739 17097 TRACE neutron.agent.dhcp_agent 
2014-12-02 20:52:20.739 17097 TRACE neutron.agent.dhcp_agent   File "/usr/lib/python2.7/site-packages/neutron/db/dhcp_rpc_base.py", line 42, in _get_active_networks
2014-12-02 20:52:20.739 17097 TRACE neutron.agent.dhcp_agent     plugin.auto_schedule_networks(context, host)
2014-12-02 20:52:20.739 17097 TRACE neutron.agent.dhcp_agent 
2014-12-02 20:52:20.739 17097 TRACE neutron.agent.dhcp_agent   File "/usr/lib/python2.7/site-packages/neutron/db/agentschedulers_db.py", line 222, in auto_schedule_networks
2014-12-02 20:52:20.739 17097 TRACE neutron.agent.dhcp_agent     self.network_scheduler.auto_schedule_networks(self, context, host)
2014-12-02 20:52:20.739 17097 TRACE neutron.agent.dhcp_agent 
2014-12-02 20:52:20.739 17097 TRACE neutron.agent.dhcp_agent   File "/usr/lib/python2.7/site-packages/neutron/scheduler/dhcp_agent_scheduler.py", line 122, in auto_schedule_networks
2014-12-02 20:52:20.739 17097 TRACE neutron.agent.dhcp_agent     context, [net_id], active=True)
2014-12-02 20:52:20.739 17097 TRACE neutron.agent.dhcp_agent 
2014-12-02 20:52:20.739 17097 TRACE neutron.agent.dhcp_agent   File "/usr/lib/python2.7/site-packages/neutron/db/agentschedulers_db.py", line 126, in get_dhcp_agents_hosting_networks
2014-12-02 20:52:20.739 17097 TRACE neutron.agent.dhcp_agent     binding.dhcp_agent)]
2014-12-02 20:52:20.739 17097 TRACE neutron.agent.dhcp_agent 
2014-12-02 20:52:20.739 17097 TRACE neutron.agent.dhcp_agent   File "/usr/lib/python2.7/site-packages/neutron/db/agentschedulers_db.py", line 83, in is_eligible_agent
2014-12-02 20:52:20.739 17097 TRACE neutron.agent.dhcp_agent     agent['heartbeat_timestamp'])
2014-12-02 20:52:20.739 17097 TRACE neutron.agent.dhcp_agent 
2014-12-02 20:52:20.739 17097 TRACE neutron.agent.dhcp_agent TypeError: 'NoneType' object has no attribute '__getitem__'
2014-12-02 20:52:20.739 17097 TRACE neutron.agent.dhcp_agent 
2014-12-02 20:52:20.739 17097 TRACE neutron.agent.dhcp_agent 
---

further investigation on the neutron server found that
networkdhcpbindings seems to have got out of sync

we can see just one dhcp-agent

---
MariaDB [neutron]> select id from agents where agent_type="DHCP agent";
+--------------------------------------+
| id                                   |
+--------------------------------------+
| 6923675d-5616-4ffe-b2c4-4d130f67973f |
+--------------------------------------+
1 row in set (0.00 sec)
---

but in the network bindings, at least 3 are listed

----
MariaDB [neutron]> select DISTINCT(dhcp_agent_id) from networkdhcpagentbindings;
+--------------------------------------+
| dhcp_agent_id                        |
+--------------------------------------+
| 6923675d-5616-4ffe-b2c4-4d130f67973f |
| b23f9f97-da04-4f61-bcfb-f8514e43cefd |
| d3e3ac5b-9962-428a-a9f8-6b2a1aba48d8 |
+--------------------------------------+
3 rows in set (0.00 sec)
---

neutron/db/agentschedulers_db.py:get_dhcp_agents_hosting_networks() [1]
doesn't expect this case of a binding with no associated agent, so it
goes off an tries to find the latest heartbeat for the agent, which
leads to the original traceback.

I unfortunately have no idea how this got out of sync; the host has been
upgraded from juno->kilo and neutron and dhcp-agents have been restarted
many, many times for various reasons.

I found two other references [2],[3]

[1] http://git.openstack.org/cgit/openstack/neutron/tree/neutron/db/agentschedulers_db.py#n107
[2] http://www.gossamer-threads.com/lists/openstack/dev/42060
[3] http://lists.openstack.org/pipermail/openstack/2013-October/002195.html

** Affects: neutron
     Importance: Undecided
     Assignee: Ian Wienand (iwienand)
         Status: In Progress

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1398988

Title:
  dhcp-agent / network bindings out of sync, stopping dhcp-agent

Status in OpenStack Neutron (virtual network service):
  In Progress

Bug description:
  dhcp-agent on a neutron host started dying with

  ---
  2014-12-02 20:52:20.739 17097 TRACE neutron.agent.dhcp_agent Traceback (most recent call last):
  2014-12-02 20:52:20.739 17097 TRACE neutron.agent.dhcp_agent 
  2014-12-02 20:52:20.739 17097 TRACE neutron.agent.dhcp_agent   File "/usr/lib/python2.7/site-packages/neutron/openstack/common/rpc/amqp.py", line 462, in _process_data
  2014-12-02 20:52:20.739 17097 TRACE neutron.agent.dhcp_agent     **args)
  2014-12-02 20:52:20.739 17097 TRACE neutron.agent.dhcp_agent 
  2014-12-02 20:52:20.739 17097 TRACE neutron.agent.dhcp_agent   File "/usr/lib/python2.7/site-packages/neutron/common/rpc.py", line 45, in dispatch
  2014-12-02 20:52:20.739 17097 TRACE neutron.agent.dhcp_agent     neutron_ctxt, version, method, namespace, **kwargs)
  2014-12-02 20:52:20.739 17097 TRACE neutron.agent.dhcp_agent 
  2014-12-02 20:52:20.739 17097 TRACE neutron.agent.dhcp_agent   File "/usr/lib/python2.7/site-packages/neutron/openstack/common/rpc/dispatcher.py", line 172, in dispatch
  2014-12-02 20:52:20.739 17097 TRACE neutron.agent.dhcp_agent     result = getattr(proxyobj, method)(ctxt, **kwargs)
  2014-12-02 20:52:20.739 17097 TRACE neutron.agent.dhcp_agent 
  2014-12-02 20:52:20.739 17097 TRACE neutron.agent.dhcp_agent   File "/usr/lib/python2.7/site-packages/neutron/db/dhcp_rpc_base.py", line 92, in get_active_networks_info
  2014-12-02 20:52:20.739 17097 TRACE neutron.agent.dhcp_agent     networks = self._get_active_networks(context, **kwargs)
  2014-12-02 20:52:20.739 17097 TRACE neutron.agent.dhcp_agent 
  2014-12-02 20:52:20.739 17097 TRACE neutron.agent.dhcp_agent   File "/usr/lib/python2.7/site-packages/neutron/db/dhcp_rpc_base.py", line 42, in _get_active_networks
  2014-12-02 20:52:20.739 17097 TRACE neutron.agent.dhcp_agent     plugin.auto_schedule_networks(context, host)
  2014-12-02 20:52:20.739 17097 TRACE neutron.agent.dhcp_agent 
  2014-12-02 20:52:20.739 17097 TRACE neutron.agent.dhcp_agent   File "/usr/lib/python2.7/site-packages/neutron/db/agentschedulers_db.py", line 222, in auto_schedule_networks
  2014-12-02 20:52:20.739 17097 TRACE neutron.agent.dhcp_agent     self.network_scheduler.auto_schedule_networks(self, context, host)
  2014-12-02 20:52:20.739 17097 TRACE neutron.agent.dhcp_agent 
  2014-12-02 20:52:20.739 17097 TRACE neutron.agent.dhcp_agent   File "/usr/lib/python2.7/site-packages/neutron/scheduler/dhcp_agent_scheduler.py", line 122, in auto_schedule_networks
  2014-12-02 20:52:20.739 17097 TRACE neutron.agent.dhcp_agent     context, [net_id], active=True)
  2014-12-02 20:52:20.739 17097 TRACE neutron.agent.dhcp_agent 
  2014-12-02 20:52:20.739 17097 TRACE neutron.agent.dhcp_agent   File "/usr/lib/python2.7/site-packages/neutron/db/agentschedulers_db.py", line 126, in get_dhcp_agents_hosting_networks
  2014-12-02 20:52:20.739 17097 TRACE neutron.agent.dhcp_agent     binding.dhcp_agent)]
  2014-12-02 20:52:20.739 17097 TRACE neutron.agent.dhcp_agent 
  2014-12-02 20:52:20.739 17097 TRACE neutron.agent.dhcp_agent   File "/usr/lib/python2.7/site-packages/neutron/db/agentschedulers_db.py", line 83, in is_eligible_agent
  2014-12-02 20:52:20.739 17097 TRACE neutron.agent.dhcp_agent     agent['heartbeat_timestamp'])
  2014-12-02 20:52:20.739 17097 TRACE neutron.agent.dhcp_agent 
  2014-12-02 20:52:20.739 17097 TRACE neutron.agent.dhcp_agent TypeError: 'NoneType' object has no attribute '__getitem__'
  2014-12-02 20:52:20.739 17097 TRACE neutron.agent.dhcp_agent 
  2014-12-02 20:52:20.739 17097 TRACE neutron.agent.dhcp_agent 
  ---

  further investigation on the neutron server found that
  networkdhcpbindings seems to have got out of sync

  we can see just one dhcp-agent

  ---
  MariaDB [neutron]> select id from agents where agent_type="DHCP agent";
  +--------------------------------------+
  | id                                   |
  +--------------------------------------+
  | 6923675d-5616-4ffe-b2c4-4d130f67973f |
  +--------------------------------------+
  1 row in set (0.00 sec)
  ---

  but in the network bindings, at least 3 are listed

  ----
  MariaDB [neutron]> select DISTINCT(dhcp_agent_id) from networkdhcpagentbindings;
  +--------------------------------------+
  | dhcp_agent_id                        |
  +--------------------------------------+
  | 6923675d-5616-4ffe-b2c4-4d130f67973f |
  | b23f9f97-da04-4f61-bcfb-f8514e43cefd |
  | d3e3ac5b-9962-428a-a9f8-6b2a1aba48d8 |
  +--------------------------------------+
  3 rows in set (0.00 sec)
  ---

  neutron/db/agentschedulers_db.py:get_dhcp_agents_hosting_networks()
  [1] doesn't expect this case of a binding with no associated agent, so
  it goes off an tries to find the latest heartbeat for the agent, which
  leads to the original traceback.

  I unfortunately have no idea how this got out of sync; the host has
  been upgraded from juno->kilo and neutron and dhcp-agents have been
  restarted many, many times for various reasons.

  I found two other references [2],[3]

  [1] http://git.openstack.org/cgit/openstack/neutron/tree/neutron/db/agentschedulers_db.py#n107
  [2] http://www.gossamer-threads.com/lists/openstack/dev/42060
  [3] http://lists.openstack.org/pipermail/openstack/2013-October/002195.html

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1398988/+subscriptions


Follow ups

References