yahoo-eng-team team mailing list archive

[Bug 1719769] [NEW] Occasional network interruption with mark=1 in conntrack

 

Public bug reported:

When a VM port's security group rules are updated frequently and network traffic is heavy,
the OVS security group flows can wrongly set the conntrack mark to 1 and block the VM's network connectivity.

Suppose there are two VMs, VM A (192.168.111.234) and VM B (192.168.111.233), and B allows ping from A.
We ping B from A continuously.
There is one conntrack entry on VM B's compute host:
icmp     1 29 src=192.168.111.234 dst=192.168.111.233 type=8 code=0 id=29697 src=192.168.111.233 dst=192.168.111.234 type=0 code=0 id=29697 mark=0 zone=1 use=2

I simulated this issue because it is hard to reproduce under normal conditions.
There is one precondition to note:
When the SG rules change on a port, the SG flows for that port are recreated.
Although all SG flows for the port are handed to OVS in a single
'ovs-ofctl add-flows' invocation, the flows are actually installed
into the OVS flow tables one by one.
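
To make the window easier to picture, here is a toy model of the race. It is only a sketch and not the neutron OVS firewall code: the class, the action names, the priorities, and the order in which flows are installed (default drop first, allow rule second) are all assumptions made for illustration. The only things taken from this report are that flows are installed one by one and that a conntrack entry marked 1 keeps blocking traffic.

```python
# Toy model only. Names, priorities and install order are hypothetical.
ALLOW = "allow"
DROP_AND_MARK = "drop_and_mark"


class ToyFirewall:
    def __init__(self):
        self.flows = []       # list of (priority, match_fn, action)
        self.conntrack = {}   # connection key -> mark

    def add_flow(self, priority, match_fn, action):
        # Flows are installed one at a time, highest priority matched first.
        self.flows.append((priority, match_fn, action))
        self.flows.sort(key=lambda flow: flow[0], reverse=True)

    def clear_flows(self):
        self.flows = []

    def process(self, packet):
        key = (packet["src"], packet["dst"], packet["proto"])
        # A connection already marked 1 stays blocked.
        if self.conntrack.get(key) == 1:
            return "dropped (mark=1)"
        for _priority, match_fn, action in self.flows:
            if match_fn(packet):
                if action == ALLOW:
                    self.conntrack[key] = 0
                    return "accepted"
                self.conntrack[key] = 1
                return "dropped (mark=1)"
        return "dropped (no flow)"


def match_all(packet):
    return True


def icmp_from_vm_a(packet):
    return packet["proto"] == "icmp" and packet["src"] == "192.168.111.234"


fw = ToyFirewall()
ping = {"src": "192.168.111.234", "dst": "192.168.111.233", "proto": "icmp"}

# Steady state: default drop plus the rule that allows ping from VM A.
fw.add_flow(0, match_all, DROP_AND_MARK)
fw.add_flow(10, icmp_from_vm_a, ALLOW)
before = fw.process(ping)

# SG rules change: flows are recreated, one by one.
fw.clear_flows()
fw.add_flow(0, match_all, DROP_AND_MARK)
during = fw.process(ping)   # a ping arrives before the allow flow exists
fw.add_flow(10, icmp_from_vm_a, ALLOW)
after = fw.process(ping)    # allow flow is back, but the entry is marked

print(before, during, after, sep="\n")
```

In this model the ping that lands between the two add_flow calls is the one that flips the mark, and every later ping is dropped even though the allow flow has been restored.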

It is hard to reproduce this issue without modifying the code,
so I disabled security group defer in the code to simulate it. (The change goes here: https://github.com/openstack/neutron/blob/master/neutron/agent/securitygroups_rpc.py#L132)

Then I started neutron-openvswitch-agent with a breakpoint at
https://github.com/openstack/neutron/blob/master/neutron/agent/linux/openvswitch_firewall/firewall.py#L1004

Now we get a mark=1 conntrack entry on VM B's compute host:
icmp     1 29 src=192.168.111.234 dst=192.168.111.233 type=8 code=0 id=29697 src=192.168.111.233 dst=192.168.111.234 type=0 code=0 id=29697 mark=1 zone=1 use=1
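
For anyone who wants to check a compute host for this condition, here is a minimal sketch that parses conntrack entries in the format shown above and flags the ones with mark=1. The function names are my own, and the parser only assumes the key=value layout visible in the two sample lines of this report.

```python
import re

_KEY_VALUE = re.compile(r"(\w+)=(\S+)")

HEALTHY = ("icmp     1 29 src=192.168.111.234 dst=192.168.111.233 type=8 code=0 "
           "id=29697 src=192.168.111.233 dst=192.168.111.234 type=0 code=0 "
           "id=29697 mark=0 zone=1 use=2")
BROKEN = ("icmp     1 29 src=192.168.111.234 dst=192.168.111.233 type=8 code=0 "
          "id=29697 src=192.168.111.233 dst=192.168.111.234 type=0 code=0 "
          "id=29697 mark=1 zone=1 use=1")


def parse_conntrack_line(line):
    """Parse one conntrack entry line into a dict.

    Keys that appear twice (src, dst, type, code, id) are split by
    direction: the first occurrence goes to "orig", the second to "reply".
    """
    entry = {"proto": line.split()[0], "orig": {}, "reply": {}}
    for key, value in _KEY_VALUE.findall(line):
        if key in ("mark", "zone", "use"):
            entry[key] = int(value)
        elif key not in entry["orig"]:
            entry["orig"][key] = value
        else:
            entry["reply"][key] = value
    return entry


def is_marked(entry):
    """True when the entry carries mark=1, the state described in this bug."""
    return entry.get("mark", 0) == 1


for sample in (HEALTHY, BROKEN):
    parsed = parse_conntrack_line(sample)
    print(parsed["orig"]["src"], "->", parsed["orig"]["dst"],
          "zone", parsed["zone"], "marked" if is_marked(parsed) else "ok")
```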

Even after the port's security group flows are added later, this
mark=1 conntrack entry is not deleted; it only goes away when the entry times out.
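
As a possible manual workaround until the entry times out, the marked entries could be removed by hand with the conntrack tool. The sketch below only builds the command line. It is not the fix for this bug, the helper names are hypothetical, and the options used (-D, -f, -m, -w, -s, -d) are taken from the conntrack(8) man page, so please verify them against the conntrack-tools version on your hosts before running anything.

```python
import subprocess


def build_conntrack_delete_cmd(zone, mark=1, src=None, dst=None):
    """Build a `conntrack -D` command for entries with the given mark and zone.

    Hypothetical helper: optional src/dst narrow the match to one
    original-direction source or destination address.
    """
    cmd = ["conntrack", "-D", "-f", "ipv4", "-m", str(mark), "-w", str(zone)]
    if src:
        cmd += ["-s", src]
    if dst:
        cmd += ["-d", dst]
    return cmd


def delete_marked_entries(zone, **kwargs):
    """Run the command. Needs root; the caller should inspect returncode."""
    return subprocess.run(build_conntrack_delete_cmd(zone, **kwargs), check=False)


# Command for the entry shown in this report (zone=1, VM A -> VM B).
print(" ".join(build_conntrack_delete_cmd(
    1, src="192.168.111.234", dst="192.168.111.233")))
```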

We hit this issue in our OpenStack production environment, and a vital system lost its network connectivity.
The cause was that the VM port's security group rules changed frequently while the VM's network traffic was heavy.

** Affects: neutron
     Importance: Undecided
     Assignee: Jesse (jesse-5)
         Status: In Progress

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1719769

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1719769/+subscriptions