[Bug 1996788] Re: The virtual network is broken on the node after neutron-openvswitch-agent is restarted if RPC requests return an error for a while.

 

** Tags added: ovs

** Changed in: neutron
       Status: New => Opinion

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1996788

Title:
  The virtual network is broken on the node after neutron-openvswitch-
  agent is restarted if RPC requests return an error for a while.

Status in neutron:
  Opinion

Bug description:
  We ran into a problem in our OpenStack cluster: traffic does not pass through the virtual network on the node where the neutron-openvswitch-agent was restarted.
  We were upgrading from one OpenStack version to another and, by coincidence, ended up with an inconsistency between the DB and neutron-server: any port select from the DB returned an error.
  For a while (just after its restart) neutron-openvswitch-agent could not get any information via RPC in its rpc_loop iterations because of this DB/neutron-server inconsistency.
  But even after the database was fixed, the virtual network remained broken on the node where the neutron-openvswitch-agent had been restarted.

  It seems to me that I have found the problematic place in the logic of neutron-ovs-agent.
  The easiest way to demonstrate it is to emulate a failing RPC request from neutron-ovs-agent to neutron-server.

  Here are the steps to reproduce on a devstack setup from the master branch.
  Two nodes: node0 is the controller, node1 is a compute node.

  0) Prepare a vxlan based network and a VM.
  [root@node0 ~]# openstack network create net1
  [root@node0 ~]# openstack subnet create sub1 --network net1 --subnet-range 192.168.1.0/24
  [root@node0 ~]# openstack server create vm1 --network net1 --flavor m1.tiny --image cirros-0.5.2-x86_64-disk --host node1

  Just after creating the VM, there is a message in the devstack@q-agt
  logs:

  Nov 16 09:53:35 node1 neutron-openvswitch-agent[374810]: INFO
  neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [None
  req-77753b72-cb23-4dae-b68a-7048b63faf8b None None] Assigning 1 as
  local vlan for net-id=710bcfcd-44d9-445d-a895-8ec522f64016, seg-id=466

  So, the local vlan used on node1 for this network is `1`.
  A ping from node0 to the VM on node1 works:

  [root@node0 ~]# ip netns exec qdhcp-710bcfcd-44d9-445d-a895-8ec522f64016 ping 192.168.1.211
  PING 192.168.1.211 (192.168.1.211) 56(84) bytes of data.
  64 bytes from 192.168.1.211: icmp_seq=1 ttl=64 time=1.86 ms
  64 bytes from 192.168.1.211: icmp_seq=2 ttl=64 time=0.891 ms

  1) Now, please don't read this as "he patched the code, so of course something broke";
  I just want to emulate a problem that is hard to reproduce in a normal way, but which can happen.
  So, let's emulate a failure where get_resource_by_id (an RPC-based method) returns an error for a short time just after the neutron-ovs-agent restart:

  [root@node1 neutron]# git diff
  diff --git a/neutron/agent/rpc.py b/neutron/agent/rpc.py
  index 9a133afb07..299eb25981 100644
  --- a/neutron/agent/rpc.py
  +++ b/neutron/agent/rpc.py
  @@ -327,6 +327,11 @@ class CacheBackedPluginApi(PluginApi):

       def get_device_details(self, context, device, agent_id, host=None,
                              agent_restarted=False):
  +        import time
  +        if not hasattr(self, '_stime'):
  +            self._stime = time.time()
  +        if self._stime + 5 > time.time():
  +            raise Exception('Emulate RPC error in get_resource_by_id call')
           port_obj = self.remote_resource_cache.get_resource_by_id(
               resources.PORT, device, agent_restarted)
           if not port_obj:
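
  For reference, here is the same failure-injection pattern in isolation, as a
  minimal self-contained sketch (the class and method names are illustrative
  and are not part of neutron):

  import time

  FAIL_WINDOW = 5  # seconds after start-up during which every call fails

  class FlakyRpcStub:
      """Pretend RPC client that raises for the first FAIL_WINDOW seconds."""

      def __init__(self):
          self._start = time.time()

      def get_device_details(self, device):
          # Emulate the transient DB/neutron-server inconsistency: calls made
          # shortly after "agent start-up" fail, later calls succeed.
          if time.time() - self._start < FAIL_WINDOW:
              raise Exception('Emulated RPC error')
          return {'device': device}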

  
  Restart the neutron-openvswitch-agent and try to ping after 1-2 minutes:

  [root@node1 ~]# systemctl restart devstack@q-agt

  [root@node0 ~]# ip netns exec qdhcp-710bcfcd-44d9-445d-a895-8ec522f64016 ping -c 2 192.168.1.234
  PING 192.168.1.234 (192.168.1.234) 56(84) bytes of data.

  --- 192.168.1.234 ping statistics ---
  2 packets transmitted, 0 received, 100% packet loss, time 1058ms

  [root@node0 ~]#

  Ping doesn't work.
  Just after the neutron-ovs-agent restart, once RPC starts working correctly again, the following logs appear:

  Nov 16 09:55:13 node1 neutron-openvswitch-agent[375032]: INFO neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [None req-135ae96d-905e-485f-8c1f-b0a70616b4c7 None None] Assigning 2 as local vlan for net-id=710bcfcd-44d9-445d-a895-8ec522f64016, seg-id=466
  Nov 16 09:55:13 node1 neutron-openvswitch-agent[375032]: INFO neutron.agent.securitygroups_rpc [None req-135ae96d-905e-485f-8c1f-b0a70616b4c7 None None] Preparing filters for devices {'40d82f69-274f-4de5-84d9-6290159f288b'}
  Nov 16 09:55:13 node1 neutron-openvswitch-agent[375032]: INFO neutron.agent.linux.openvswitch_firewall.firewall [None req-135ae96d-905e-485f-8c1f-b0a70616b4c7 None None] Initializing port 40d82f69-274f-4de5-84d9-6290159f288b that was already initialized.

  So, `Assigning 2 as local vlan` is followed by `Initializing port ...
  that was already initialized.`
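
  To double-check what the ovs agent actually programmed for the port, one can
  read the tag stored in OVSDB (a small sketch; it assumes the usual
  "tap" + first 11 chars of the port id naming on br-int, so adjust the port
  name if your deployment differs):

  import subprocess

  port_id = '40d82f69-274f-4de5-84d9-6290159f288b'
  tap_name = 'tap' + port_id[:11]  # e.g. tap40d82f69-27

  # 'ovs-vsctl get Port <name> tag' prints the VLAN tag set on the OVS port.
  tag = subprocess.check_output(
      ['ovs-vsctl', 'get', 'Port', tap_name, 'tag']).decode().strip()
  print('tag in OVSDB:', tag)  # per the "Assigning 2 ..." log, this should be 2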

  2) Using pyrasite, I set up the eventlet backdoor and can see that in the
  internal structure of the OVSFirewallDriver the `vlan_tag` of the port is
  still `1` instead of `2`:

  >>> import gc
  >>> from neutron.agent.linux.openvswitch_firewall.firewall import OVSFirewallDriver
  >>> for ob in gc.get_objects():
  ...     if isinstance(ob, OVSFirewallDriver):
  ...             break
  ...
  >>> ob.sg_port_map.ports['40d82f69-274f-4de5-84d9-6290159f288b'].vlan_tag
  1
  >>>

  So, the OVSFirewallDriver still thinks that the port has local vlan 1,
  although at the ovs_neutron_agent level local vlan 2 was assigned.
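
  To illustrate the kind of stale-cache behaviour these observations point to,
  here is a toy sketch (not neutron code): if a per-port entry is kept when the
  port is "initialized again" instead of being refreshed, the vlan recorded
  before the restart survives the reassignment.

  class PortCache:
      """Toy per-port cache that keeps the first vlan it ever saw."""

      def __init__(self):
          self.vlan_by_port = {}

      def init_port(self, port_id, vlan):
          # A port "that was already initialized" keeps its old entry.
          self.vlan_by_port.setdefault(port_id, vlan)

  cache = PortCache()
  cache.init_port('port-1', 1)  # before the restart: local vlan 1
  cache.init_port('port-1', 2)  # after the restart: agent assigns local vlan 2
  print(cache.vlan_by_port['port-1'])  # still 1 -> flows match the wrong vlan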

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1996788/+subscriptions


