
yahoo-eng-team team mailing list archive

[Bug 1708737] [NEW] ovs-fw fails to reinstate GRE connection in conntrack

 

Private bug reported:

High level description:

We have VMs running GRE tunnels between them successfully (with the GRE
conntrack helper module loaded), using the OVS firewall driver (OVSFW)
together with security groups.
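
For reference, this is roughly how the GRE tracking pieces get loaded on
the compute hosts (a sketch; the exact module names depend on the kernel
build):

# GRE protocol tracking plus the PPTP helper that keys GRE flows
modprobe nf_conntrack_proto_gre
modprobe nf_conntrack_pptp
# confirm both are present
lsmod | grep -E 'nf_conntrack_(proto_gre|pptp)'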

However, whenever an event raises exceptions in the OVS neutron agent
and pushes the agent out of sync with the plugin, the GRE tunnel
between the VMs breaks.

Examples of such events:

*RabbitMQ timeout*

2017-04-07 19:07:03.001 5275 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent MessagingTimeout: Timed out waiting for a reply to message ID 4035644808d24ce9aae65a6ee567021c
2017-04-07 19:07:03.001 5275 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent
2017-04-07 19:07:03.003 5275 WARNING oslo.service.loopingcall [-] Function 'neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent.OVSNeutronAgent._report_state' run outlasted interval by 120.01 sec
2017-04-07 19:07:03.041 5275 INFO neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [-] Agent has just been revived. Doing a full sync.
2017-04-07 19:07:06.747 5275 INFO neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-521c07b4-f53d-4665-b728-fc5f00191294 - - - - -] rpc_loop doing a full sync.
2017-04-07 19:07:06.841 5275 INFO neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-521c07b4-f53d-4665-b728-fc5f00191294 - - - - -] Agent out of sync with plugin!

*OVSFWPortNotFound*

2017-03-30 18:31:05.048 5160 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent     self.firewall.prepare_port_filter(device)
2017-03-30 18:31:05.048 5160 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent   File "/openstack/venvs/neutron-14.0.5/lib/python2.7/site-packages/neutron/agent/linux/openvswitch_firewall/firewall.py", line 272, in prepare_port_filter
2017-03-30 18:31:05.048 5160 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent     of_port = self.get_or_create_ofport(port)
2017-03-30 18:31:05.048 5160 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent   File "/openstack/venvs/neutron-14.0.5/lib/python2.7/site-packages/neutron/agent/linux/openvswitch_firewall/firewall.py", line 246, in get_or_create_ofport
2017-03-30 18:31:05.048 5160 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent     raise OVSFWPortNotFound(port_id=port_id)
2017-03-30 18:31:05.048 5160 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent OVSFWPortNotFound: Port 01f7c714-1828-4768-9810-a0ec25dd2b92 is not managed by this agent.
2017-03-30 18:31:05.048 5160 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent
2017-03-30 18:31:05.072 5160 INFO neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-db74f32b-5370-4a5f-86bf-935eba1490d0 - - - - -] Agent out of sync with plugin!


The OVS firewall and security group code then prepares filters for the
Neutron ports and re-initializes them, triggering a refresh of all
security group rules.

2017-04-07 19:07:07.110 5275 INFO neutron.agent.securitygroups_rpc [req-521c07b4-f53d-4665-b728-fc5f00191294 - - - - -] Preparing filters for devices set([u'4b14619f-3b9e-4103-b9d7-9c7e52c797d8'])
2017-04-07 19:07:07.215 5275 ERROR neutron.agent.linux.openvswitch_firewall.firewall [req-521c07b4-f53d-4665-b728-fc5f00191294 - - - - -] Initializing port 4b14619f-3b9e-4103-b9d7-9c7e52c797d8 that was already initialized.

During this process, the OVS-FW code marks the GRE connection state as
invalid and all egress GRE traffic is affected.


root@server:/var/log# conntrack -L -o extended -p gre -f ipv4
ipv4     2 gre      47 178 src=2.2.2.203 dst=3.3.3.66 srckey=0x0 dstkey=0x0 src=3.3.3.66 dst=2.2.2.203 srckey=0x0 dstkey=0x0 [ASSURED] mark=1 zone=5 use=1
ipv4     2 gre      47 179 src=4.4.4.104 dst=5.5.5.187 srckey=0x0 dstkey=0x0 src=5.5.5.187 dst=4.4.4.104 srckey=0x0 dstkey=0x0 [ASSURED] mark=0 zone=5 use=1
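
The mark=1 on the first entry looks like the conntrack mark ovs-fw uses
to flag connections as invalid (CT_MARK_INVALID in the firewall driver),
while the healthy tunnel keeps mark=0. A rough way to check this on the
host (br-int is assumed to be the integration bridge):

# flows on the integration bridge that set or match the invalid ct mark
ovs-ofctl dump-flows br-int | grep ct_mark
# conntrack entries already carrying that mark
conntrack -L -p gre -f ipv4 -m 1 -o extended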


This happens only intermittently, and only on ports carrying high-traffic connections. The broken state persists until someone reboots the VM or flushes the connection, either directly in conntrack or through OVS.
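
The manual workaround looks roughly like this (a sketch; the tunnel
endpoints below are taken from the conntrack entry above, and
dpctl/flush-conntrack requires a reasonably recent OVS):

# delete the stuck GRE entry directly in conntrack ...
conntrack -D -p gre -s 2.2.2.203 -d 3.3.3.66
# ... or flush the datapath conntrack table through OVS
ovs-appctl dpctl/flush-conntrack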

We have a blanket rule allowing any protocol on any port from any IP.
We also added a specific rule allowing IP protocol 47 (GRE) to a
particular block of IPs, but nothing fixes or reinstates the GRE tunnel
connections. All other connections work as expected; only GRE is
affected.
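
For reference, a rule expressing that "protocol 47 to a particular
block" policy would look roughly like this with the Newton-era neutron
CLI (SECGROUP and the remote CIDR are placeholders; numeric-protocol
support in this release is an assumption):

# allow IP protocol 47 (GRE) egress towards the tunnel endpoint block
neutron security-group-rule-create --direction egress --protocol 47 \
    --remote-ip-prefix <tunnel-endpoint-cidr> SECGROUP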

Is this a bug in the ovs-fw implementation, or is it an ovs-conntrack
bug? While checking known ovs-conntrack issues we came across the OVS
conntrack helpers bug (https://patchwork.ozlabs.org/patch/755615/). Is
that bug being triggered here?

** Affects: null-and-void
     Importance: Undecided
         Status: Invalid


** Tags: in-stable-newton ovs-fw sg-fw

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1708737

Title:
  ovs-fw fails to reinstate GRE connection in conntrack

Status in NULL Project:
  Invalid

To manage notifications about this bug go to:
https://bugs.launchpad.net/null-and-void/+bug/1708737/+subscriptions

