yahoo-eng-team team mailing list archive
-
yahoo-eng-team team
-
Mailing list archive
-
Message #41656
[Bug 1518995] [NEW] l3-agent router sync unbinds port on down ovs-agent
Public bug reported:
Using ha-routers, we've found that getting network node agents down for
T>agent_down time, and then bringing them up fires a race condition
during ovs-agent and l3-agent boot.
Even if you set a constraint on ovs-agent being up before l3-agent, that
won't work, because still the l3-agent will start and sync routers
before the ovs-agent is able to report himself active back to neutron
server.
A possible solution:
Using systemd notify from the openvswitch-agent, so it won't notify
system-d about readiness until the first heartbeat has been sent to
neutron-server.
Other solutions could be detecting the unbound state ("Device xxxxx not
defined on plugin") from the ovs-agent and trying to rebind it if the
port is present.
How does it reproduce:
1) neutron-servers and agents are down by being set to standby
some time, T>agent_down_time
2) cluster is took out of standby, neutron-servers and agents are
started again, first neutron-server, then agents
3) neutron servers consider the agents to be down (last heartbeat >
agent_down_time), when checking ports in the database, reset them as
binding failed...
4) agents start, try to fetch info from the ports, as those are in
status binding failed... they get no info from the server (or the
interpretation is wrong "Device xxx not defined on plugin")
5) agents mark ports as dead vlan (4095), as the port "does not exist"
on the server
Analysis over the logs:
-- SRV2 tryies to bind port on dead agent (: same happening for SRV1 and
3 that's omitted---
2015-08-25 09:37:11.687000 [SRV2] 16522 DEBUG
neutron.plugins.ml2.managers [req-fd4c17c2-0acc-4bb6-a652-91359eda496b
None] Attempting to bind port 6adcffbf-09d5-4a85-9339-9d6beb2bf82c on
host neutron-n-0 for vnic_type normal with profile bind_port
/usr/lib/python2.7/site-packages/neutron/plugins/ml2/managers.py:574
2015-08-25 09:37:11.687000 [SRV2] 16522 DEBUG
neutron.plugins.ml2.drivers.mech_agent [req-fd4c17c2-0acc-
4bb6-a652-91359eda496b None] Attempting to bind port 6adcffbf-
09d5-4a85-9339-9d6beb2bf82c on network 2c52ca73-b11f-426f-
9b78-185ef15809e3 bind_port /usr/lib/python2.7/site-
packages/neutron/plugins/ml2/drivers/mech_agent.py:57
2015-08-25 09:37:11.689000 [SRV2] 16522 DEBUG
neutron.plugins.ml2.drivers.mech_agent [req-fd4c17c2-0acc-
4bb6-a652-91359eda496b None] Checking agent: {'binary': u'neutron-
openvswitch-agent', 'description': None, 'admin_state_up': True,
'heartbeat_timestamp': datetime.datetime(2015, 8, 25, 1, 35, 5),
'alive': False, 'id': u'196101f4-9680-43bb-8310-9fca49cd4930', 'topic':
u'N/A', 'host': u'neutron-n-0', 'agent_type': u'Open vSwitch agent',
'started_at': datetime.datetime(2015, 8, 22, 6, 55, 59), 'created_at':
datetime.datetime(2015, 8, 12, 10, 23, 16), 'configurations':
{u'arp_responder_enabled': False, u'tunneling_ip': u'', u'devices': 28,
u'l2_population': False, u'tunnel_types': [],
u'enable_distributed_routing': False, u'bridge_mappings': {u'physnet-
external': u'br-ex', u'physnet-tenants': u'br-enp7s0'}}} bind_port
/usr/lib/python2.7/site-
packages/neutron/plugins/ml2/drivers/mech_agent.py:65
2015-08-25 09:37:11.690000 [SRV2] 16522 WARNING
neutron.plugins.ml2.drivers.mech_agent [req-fd4c17c2-0acc-
4bb6-a652-91359eda496b None] Attempting to bind with dead agent:
{'binary': u'neutron-openvswitch-agent', 'description': None,
'admin_state_up': True, 'heartbeat_timestamp': datetime.datetime(2015,
8, 25, 1, 35, 5), 'alive': False, 'id':
u'196101f4-9680-43bb-8310-9fca49cd4930', 'topic': u'N/A', 'host':
u'neutron-n-0', 'agent_type': u'Open vSwitch agent', 'started_at':
datetime.datetime(2015, 8, 22, 6, 55, 59), 'created_at':
datetime.datetime(2015, 8, 12, 10, 23,16), 'configurations':
{u'arp_responder_enabled': False, u'tunneling_ip': u'', u'devices': 28,
u'l2_population': False, u'tunnel_types': [],
u'enable_distributed_routing': False, u'bridge_mappings': {u'physnet-
external': u'br-ex', u'physnet-tenants': u'br-enp7s0'}}}
2015-08-25 09:37:11.690000 [SRV2] 16522 WARNING
neutron.plugins.ml2.managers [req-fd4c17c2-0acc-4bb6-a652-91359eda496b
None] Failed to bind port 6adcffbf-09d5-4a85-9339-9d6beb2bf82c on host
neutron-n-0
2015-08-25 09:37:11.719000 [SRV2] 16522 WARNING
neutron.plugins.ml2.plugin [req-fd4c17c2-0acc-4bb6-a652-91359eda496b
None] In _notify_port_updated(), no bound segment for port 6adcffbf-
09d5-4a85-9339-9d6beb2bf82c on network 2c52ca73-b11f-426f-
9b78-185ef15809e3
-- OVS agent 1 requests info for ports, including 6adcffbf-09d5-4a85-9339-9d6beb2bf82c --
2015-08-25 09:37:16.043000 [OVS1] 12613 DEBUG neutron.common.rpc [req-588c942a-6526-464f-a447-782a5e2d436a None] neutron.plugins.openvswitch.agent.ovs_neutron_agent.OVSPluginApi method call called with arguments (<neutron.context.ContextBase object at 0x326d510>, {'args': {'host': 'neutron-n-2', 'agent_id': 'ovs-agent-neutron-n-2', 'devices': set([u'6adcffbf-09d5-4a85-9339-9d6beb2bf82c', u'2cdba01d-e48a-46a6-82c0-bbbfd8e2b490', u'fa58d644-8162-410b-8b45-747ae8813a61', u'5d5ea109-4807-4df3-bef4-b5d89c3ffebc', u'743ccaa6-7ed9-4195-aabd-3d55006338e1', u'dc662767-61a5-4807-b2ed-a7c76b541fd6', u'b5b271a1-d8b7-4b55-86fd-6316523878a5', u'b2d04f15-f2e2-429c-b948-00f88e1b9cf6', u'34b826df-9787-443c-9bef-084374827a85', u'c12e6e06-ff6a-44dc-b75f-78ec55dd3dd3', u'79b33879-3232-4b3a-a27c-c0a79da10379', u'499d3db7-9737-43a4-bcbf-923f9df5196e', u'55a19f5e-efff-4342-a82a-70ddb33fbefb', u'44dca3a9-44d4-418f-9d60-99dfe37afdc6'])}, 'namespace': None, 'method': 'get_devices_details_list'}) {'version': '1.3'} wrapper /usr/lib/python2.7/site-packages/neutron/common/log.py:33
---- SRV1 notes the port is in binding_failed state ----
2015-08-25 09:37:16.048000 [SRV1] 15065 DEBUG neutron.plugins.ml2.rpc
[req-588c942a-6526-464f-a447-782a5e2d436a None] Device 6adcffbf-
09d5-4a85-9339-9d6beb2bf82c details requested by agent ovs-agent-
neutron-n-2 with host neutron-n-2 get_device_details /usr/lib/python2.7
/site-packages/neutron/plugins/ml2/rpc.py:60
2015-08-25 09:37:16.174000 [SRV1] 15065 WARNING neutron.plugins.ml2.rpc
[req-588c942a-6526-464f-a447-782a5e2d436a None] Device 6adcffbf-
09d5-4a85-9339-9d6beb2bf82c requested by agent ovs-agent-neutron-n-2 on
network 2c52ca73-b11f-426f-9b78-185ef15809e3 not bound, vif_type:
binding_failed
2015-08-25 09:37:16.807000 [OVS1] 12613 WARNING neutron.plugins.openvswitch.agent.ovs_neutron_agent [req-588c942a-6526-464f-a447-782a5e2d436a None] Device 6adcffbf-09d5-4a85-9339-9d6beb2bf82c not defined on plugin
--- OVS1 tags it out ---
2015-08-25 09:37:16.900000 [OVS1] 12613 DEBUG neutron.agent.linux.utils
[req-588c942a-6526-464f-a447-782a5e2d436a None] Running command:
['sudo', 'neutron-rootwrap', '/etc/neutron/rootwrap.conf', 'ovs-vsctl',
'--timeout=10', 'set', 'Port', 'qr-6adcffbf-09', 'tag=4095']
create_process /usr/lib/python2.7/site-
packages/neutron/agent/linux/utils.py:48
** Affects: neutron
Importance: Low
Assignee: Jakub Libosvar (libosvar)
Status: New
** Tags: l3-ha ovs
** Changed in: neutron
Importance: Undecided => Low
** Tags added: ovs
--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1518995
Title:
l3-agent router sync unbinds port on down ovs-agent
Status in neutron:
New
Bug description:
Using ha-routers, we've found that getting network node agents down
for T>agent_down time, and then bringing them up fires a race
condition during ovs-agent and l3-agent boot.
Even if you set a constraint on ovs-agent being up before l3-agent,
that won't work, because still the l3-agent will start and sync
routers before the ovs-agent is able to report himself active back to
neutron server.
A possible solution:
Using systemd notify from the openvswitch-agent, so it won't notify
system-d about readiness until the first heartbeat has been sent to
neutron-server.
Other solutions could be detecting the unbound state ("Device xxxxx
not defined on plugin") from the ovs-agent and trying to rebind it if
the port is present.
How does it reproduce:
1) neutron-servers and agents are down by being set to standby
some time, T>agent_down_time
2) cluster is took out of standby, neutron-servers and agents are
started again, first neutron-server, then agents
3) neutron servers consider the agents to be down (last heartbeat >
agent_down_time), when checking ports in the database, reset them as
binding failed...
4) agents start, try to fetch info from the ports, as those are in
status binding failed... they get no info from the server (or the
interpretation is wrong "Device xxx not defined on plugin")
5) agents mark ports as dead vlan (4095), as the port "does not exist"
on the server
Analysis over the logs:
-- SRV2 tryies to bind port on dead agent (: same happening for SRV1
and 3 that's omitted---
2015-08-25 09:37:11.687000 [SRV2] 16522 DEBUG
neutron.plugins.ml2.managers [req-fd4c17c2-0acc-4bb6-a652-91359eda496b
None] Attempting to bind port 6adcffbf-09d5-4a85-9339-9d6beb2bf82c on
host neutron-n-0 for vnic_type normal with profile bind_port
/usr/lib/python2.7/site-packages/neutron/plugins/ml2/managers.py:574
2015-08-25 09:37:11.687000 [SRV2] 16522 DEBUG
neutron.plugins.ml2.drivers.mech_agent [req-fd4c17c2-0acc-
4bb6-a652-91359eda496b None] Attempting to bind port 6adcffbf-
09d5-4a85-9339-9d6beb2bf82c on network 2c52ca73-b11f-426f-
9b78-185ef15809e3 bind_port /usr/lib/python2.7/site-
packages/neutron/plugins/ml2/drivers/mech_agent.py:57
2015-08-25 09:37:11.689000 [SRV2] 16522 DEBUG
neutron.plugins.ml2.drivers.mech_agent [req-fd4c17c2-0acc-
4bb6-a652-91359eda496b None] Checking agent: {'binary': u'neutron-
openvswitch-agent', 'description': None, 'admin_state_up': True,
'heartbeat_timestamp': datetime.datetime(2015, 8, 25, 1, 35, 5),
'alive': False, 'id': u'196101f4-9680-43bb-8310-9fca49cd4930',
'topic': u'N/A', 'host': u'neutron-n-0', 'agent_type': u'Open vSwitch
agent', 'started_at': datetime.datetime(2015, 8, 22, 6, 55, 59),
'created_at': datetime.datetime(2015, 8, 12, 10, 23, 16),
'configurations': {u'arp_responder_enabled': False, u'tunneling_ip':
u'', u'devices': 28, u'l2_population': False, u'tunnel_types': [],
u'enable_distributed_routing': False, u'bridge_mappings': {u'physnet-
external': u'br-ex', u'physnet-tenants': u'br-enp7s0'}}} bind_port
/usr/lib/python2.7/site-
packages/neutron/plugins/ml2/drivers/mech_agent.py:65
2015-08-25 09:37:11.690000 [SRV2] 16522 WARNING
neutron.plugins.ml2.drivers.mech_agent [req-fd4c17c2-0acc-
4bb6-a652-91359eda496b None] Attempting to bind with dead agent:
{'binary': u'neutron-openvswitch-agent', 'description': None,
'admin_state_up': True, 'heartbeat_timestamp': datetime.datetime(2015,
8, 25, 1, 35, 5), 'alive': False, 'id':
u'196101f4-9680-43bb-8310-9fca49cd4930', 'topic': u'N/A', 'host':
u'neutron-n-0', 'agent_type': u'Open vSwitch agent', 'started_at':
datetime.datetime(2015, 8, 22, 6, 55, 59), 'created_at':
datetime.datetime(2015, 8, 12, 10, 23,16), 'configurations':
{u'arp_responder_enabled': False, u'tunneling_ip': u'', u'devices':
28, u'l2_population': False, u'tunnel_types': [],
u'enable_distributed_routing': False, u'bridge_mappings': {u'physnet-
external': u'br-ex', u'physnet-tenants': u'br-enp7s0'}}}
2015-08-25 09:37:11.690000 [SRV2] 16522 WARNING
neutron.plugins.ml2.managers [req-fd4c17c2-0acc-4bb6-a652-91359eda496b
None] Failed to bind port 6adcffbf-09d5-4a85-9339-9d6beb2bf82c on host
neutron-n-0
2015-08-25 09:37:11.719000 [SRV2] 16522 WARNING
neutron.plugins.ml2.plugin [req-fd4c17c2-0acc-4bb6-a652-91359eda496b
None] In _notify_port_updated(), no bound segment for port 6adcffbf-
09d5-4a85-9339-9d6beb2bf82c on network 2c52ca73-b11f-426f-
9b78-185ef15809e3
-- OVS agent 1 requests info for ports, including 6adcffbf-09d5-4a85-9339-9d6beb2bf82c --
2015-08-25 09:37:16.043000 [OVS1] 12613 DEBUG neutron.common.rpc [req-588c942a-6526-464f-a447-782a5e2d436a None] neutron.plugins.openvswitch.agent.ovs_neutron_agent.OVSPluginApi method call called with arguments (<neutron.context.ContextBase object at 0x326d510>, {'args': {'host': 'neutron-n-2', 'agent_id': 'ovs-agent-neutron-n-2', 'devices': set([u'6adcffbf-09d5-4a85-9339-9d6beb2bf82c', u'2cdba01d-e48a-46a6-82c0-bbbfd8e2b490', u'fa58d644-8162-410b-8b45-747ae8813a61', u'5d5ea109-4807-4df3-bef4-b5d89c3ffebc', u'743ccaa6-7ed9-4195-aabd-3d55006338e1', u'dc662767-61a5-4807-b2ed-a7c76b541fd6', u'b5b271a1-d8b7-4b55-86fd-6316523878a5', u'b2d04f15-f2e2-429c-b948-00f88e1b9cf6', u'34b826df-9787-443c-9bef-084374827a85', u'c12e6e06-ff6a-44dc-b75f-78ec55dd3dd3', u'79b33879-3232-4b3a-a27c-c0a79da10379', u'499d3db7-9737-43a4-bcbf-923f9df5196e', u'55a19f5e-efff-4342-a82a-70ddb33fbefb', u'44dca3a9-44d4-418f-9d60-99dfe37afdc6'])}, 'namespace': None, 'method': 'get_devices_details_list'}) {'version': '1.3'} wrapper /usr/lib/python2.7/site-packages/neutron/common/log.py:33
---- SRV1 notes the port is in binding_failed state ----
2015-08-25 09:37:16.048000 [SRV1] 15065 DEBUG neutron.plugins.ml2.rpc
[req-588c942a-6526-464f-a447-782a5e2d436a None] Device 6adcffbf-
09d5-4a85-9339-9d6beb2bf82c details requested by agent ovs-agent-
neutron-n-2 with host neutron-n-2 get_device_details
/usr/lib/python2.7/site-packages/neutron/plugins/ml2/rpc.py:60
2015-08-25 09:37:16.174000 [SRV1] 15065 WARNING
neutron.plugins.ml2.rpc [req-588c942a-6526-464f-a447-782a5e2d436a
None] Device 6adcffbf-09d5-4a85-9339-9d6beb2bf82c requested by agent
ovs-agent-neutron-n-2 on network 2c52ca73-b11f-426f-9b78-185ef15809e3
not bound, vif_type: binding_failed
2015-08-25 09:37:16.807000 [OVS1] 12613 WARNING neutron.plugins.openvswitch.agent.ovs_neutron_agent [req-588c942a-6526-464f-a447-782a5e2d436a None] Device 6adcffbf-09d5-4a85-9339-9d6beb2bf82c not defined on plugin
--- OVS1 tags it out ---
2015-08-25 09:37:16.900000 [OVS1] 12613 DEBUG
neutron.agent.linux.utils [req-588c942a-6526-464f-a447-782a5e2d436a
None] Running command: ['sudo', 'neutron-rootwrap',
'/etc/neutron/rootwrap.conf', 'ovs-vsctl', '--timeout=10', 'set',
'Port', 'qr-6adcffbf-09', 'tag=4095'] create_process
/usr/lib/python2.7/site-packages/neutron/agent/linux/utils.py:48
To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1518995/+subscriptions
Follow ups