[Bug 1996788] Re: The virtual network is broken on the node after neutron-openvswitch-agent is restarted if RPC requests return an error for a while.
** Tags added: ovs
** Changed in: neutron
Status: New => Opinion
--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1996788
Title:
The virtual network is broken on the node after neutron-openvswitch-
agent is restarted if RPC requests return an error for a while.
Status in neutron:
Opinion
Bug description:
We ran into a problem in our OpenStack cluster: traffic does not go through the virtual network on the node on which the neutron-openvswitch-agent was restarted.
We were upgrading from one OpenStack version to another and, by chance, ended up with an inconsistency between the DB and neutron-server: any port select from the DB returned an error.
For a while the neutron-openvswitch-agent (just after its restart) couldn't get any information via RPC in its rpc_loop iterations because of the DB/neutron-server inconsistency.
But after fixing the database, we were left with a broken virtual network on the node where the neutron-openvswitch-agent had been restarted.
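Roughly speaking, the failure pattern can be pictured with the simplified sketch below; the names here (plugin_rpc, wire_ports, polling_interval) are only illustrative and are not the real ovs_neutron_agent code:
import time


def rpc_loop(plugin_rpc, devices, wire_ports, polling_interval=2):
    """Poll forever; keep retrying the same devices while RPC fails."""
    while True:
        try:
            # While the DB/neutron-server is inconsistent, this call raises,
            # so the iteration ends without (re)wiring any port.
            details = [plugin_rpc.get_device_details(dev) for dev in devices]
        except Exception:
            # The agent just logs the error and retries on the next
            # iteration; the devices stay queued for processing.
            time.sleep(polling_interval)
            continue
        # Once RPC recovers, the ports are finally processed, but port state
        # that sub-components (e.g. the OVS firewall driver) picked up
        # earlier is not necessarily refreshed.
        wire_ports(details)
        time.sleep(polling_interval)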
It seems to me that I have found the problematic place in the neutron-ovs-agent logic.
To demonstrate it, the easiest way is to emulate a failing RPC request from neutron-ovs-agent to neutron-server.
Here are the steps to reproduce on a devstack setup from the master branch.
Two nodes: node0 is the controller, node1 is a compute node.
0) Prepare a vxlan based network and a VM.
[root@node0 ~]# openstack network create net1
[root@node0 ~]# openstack subnet create sub1 --network net1 --subnet-range 192.168.1.0/24
[root@node0 ~]# openstack server create vm1 --network net1 --flavor m1.tiny --image cirros-0.5.2-x86_64-disk --host node1
Just after creating the VM, there is a message in the devstack@q-agt
logs:
Nov 16 09:53:35 node1 neutron-openvswitch-agent[374810]: INFO
neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [None
req-77753b72-cb23-4dae-b68a-7048b63faf8b None None] Assigning 1 as
local vlan for net-id=710bcfcd-44d9-445d-a895-8ec522f64016, seg-id=466
So, the local VLAN used on node1 for this network is `1`.
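For context, this VLAN is only locally significant to node1: the agent keeps an in-memory map from network id to a local VLAN taken from a free pool, roughly as in the illustrative sketch below (LocalVlanAllocator is not the real neutron class). Because the map lives only in agent memory, the same network can end up with a different local VLAN after an agent restart, which is exactly what happens later in this report (2 instead of 1):
class LocalVlanAllocator:
    """Illustrative only: map each network id to a node-local VLAN."""

    def __init__(self, max_vlan=4094):
        self._free = set(range(1, max_vlan + 1))
        self._by_network = {}

    def assign(self, network_id):
        # Reuse the existing tag if the network is already known locally.
        if network_id in self._by_network:
            return self._by_network[network_id]
        vlan = min(self._free)
        self._free.remove(vlan)
        self._by_network[network_id] = vlan
        return vlan


allocator = LocalVlanAllocator()
print(allocator.assign('710bcfcd-44d9-445d-a895-8ec522f64016'))  # -> 1, as in the log above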
A ping from node0 to the VM on node1 works:
[root@node0 ~]# ip netns exec qdhcp-710bcfcd-44d9-445d-a895-8ec522f64016 ping 192.168.1.211
PING 192.168.1.211 (192.168.1.211) 56(84) bytes of data.
64 bytes from 192.168.1.211: icmp_seq=1 ttl=64 time=1.86 ms
64 bytes from 192.168.1.211: icmp_seq=2 ttl=64 time=0.891 ms
1) Now, please don't misunderstand me: I'm not patching the code just to show that something obviously breaks afterwards;
I only want to emulate a problem that is hard to reproduce in a normal way, but which can happen.
So, let's emulate a problem where the get_resource_by_id call (actually an RPC-based method) returns an error just after the neutron-ovs-agent restart:
[root@node1 neutron]# git diff
diff --git a/neutron/agent/rpc.py b/neutron/agent/rpc.py
index 9a133afb07..299eb25981 100644
--- a/neutron/agent/rpc.py
+++ b/neutron/agent/rpc.py
@@ -327,6 +327,11 @@ class CacheBackedPluginApi(PluginApi):
 
     def get_device_details(self, context, device, agent_id, host=None,
                            agent_restarted=False):
+        import time
+        if not hasattr(self, '_stime'):
+            self._stime = time.time()
+        if self._stime + 5 > time.time():
+            raise Exception('Emulate RPC error in get_resource_by_id call')
         port_obj = self.remote_resource_cache.get_resource_by_id(
             resources.PORT, device, agent_restarted)
         if not port_obj:
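With this patch, every get_device_details call raises an exception for roughly the first 5 seconds after the agent process starts, which emulates the window in which RPC requests to neutron-server fail.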
Restart the neutron-openvswitch-agent and try to ping again after 1-2 minutes:
[root@node1 ~]# systemctl restart devstack@q-agt
[root@node0 ~]# ip netns exec qdhcp-710bcfcd-44d9-445d-a895-8ec522f64016 ping -c 2 192.168.1.234
PING 192.168.1.234 (192.168.1.234) 56(84) bytes of data.
--- 192.168.1.234 ping statistics ---
2 packets transmitted, 0 received, 100% packet loss, time 1058ms
[root@node0 ~]#
Ping doesn't work.
Just after the neutron-ovs-agent restart, once RPC starts working correctly, the following logs appear:
Nov 16 09:55:13 node1 neutron-openvswitch-agent[375032]: INFO neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [None req-135ae96d-905e-485f-8c1f-b0a70616b4c7 None None] Assigning 2 as local vlan for net-id=710bcfcd-44d9-445d-a895-8ec522f64016, seg-id=466
Nov 16 09:55:13 node1 neutron-openvswitch-agent[375032]: INFO neutron.agent.securitygroups_rpc [None req-135ae96d-905e-485f-8c1f-b0a70616b4c7 None None] Preparing filters for devices {'40d82f69-274f-4de5-84d9-6290159f288b'}
Nov 16 09:55:13 node1 neutron-openvswitch-agent[375032]: INFO neutron.agent.linux.openvswitch_firewall.firewall [None req-135ae96d-905e-485f-8c1f-b0a70616b4c7 None None] Initializing port 40d82f69-274f-4de5-84d9-6290159f288b that was already initialized.
So, `Assigning 2 as local vlan` is followed by `Initializing port ...
that was already initialized.`
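To make the suspected problem concrete, here is a deliberately simplified sketch (not the real OVSFirewallDriver code) of how an "already initialized" short-circuit can leave a stale vlan_tag behind: if the port is already present in the firewall's port map, the cached object with the old local VLAN is kept instead of being rebuilt with the new one:
class SimplePort:
    def __init__(self, port_id, vlan_tag):
        self.id = port_id
        self.vlan_tag = vlan_tag


class SimpleFirewall:
    def __init__(self):
        # port id -> SimplePort; entries live for as long as the agent does.
        self.ports = {}

    def prepare_port_filter(self, port_id, vlan_tag):
        if port_id in self.ports:
            # "Initializing port ... that was already initialized." -- the
            # existing entry, with its old vlan_tag, is reused as-is.
            return self.ports[port_id]
        port = SimplePort(port_id, vlan_tag)
        self.ports[port_id] = port
        return port


fw = SimpleFirewall()
fw.prepare_port_filter('40d82f69-274f-4de5-84d9-6290159f288b', vlan_tag=1)
# After the agent re-assigns local VLAN 2, the firewall still answers 1:
print(fw.prepare_port_filter('40d82f69-274f-4de5-84d9-6290159f288b', vlan_tag=2).vlan_tag)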
2) Using pyrasite, the eventlet backdoor was set up, and I can see that in
the internal structures of the OVSFirewallDriver the `vlan_tag` of the
port is still `1` instead of `2`:
>>> import gc
>>> from neutron.agent.linux.openvswitch_firewall.firewall import OVSFirewallDriver
>>> for ob in gc.get_objects():
... if isinstance(ob, OVSFirewallDriver):
... break
...
>>> ob.sg_port_map.ports['40d82f69-274f-4de5-84d9-6290159f288b'].vlan_tag
1
>>>
So, the OVSFirewallDriver still thinks that the port has local VLAN 1,
although at the ovs_neutron_agent level local VLAN 2 was assigned.
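Why this matters for traffic (again only as an illustration, not the real flow pipeline): the OVS firewall builds its OpenFlow rules from the cached port data, including the local VLAN, so rules derived from vlan_tag=1 no longer match traffic that the agent now tags with local VLAN 2:
def firewall_accepts(frame_vlan, cached_vlan_tag):
    # Stand-in for the real flow matching: if the cached tag used to build
    # the accept rules differs from the VLAN actually in use, nothing matches.
    return frame_vlan == cached_vlan_tag


print(firewall_accepts(frame_vlan=2, cached_vlan_tag=1))  # -> False, so the ping above is dropped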
To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1996788/+subscriptions