
yahoo-eng-team team mailing list archive

[Bug 1934917] Re: inconsistencies in OVS firewall on an agent restart

 

Reviewed:  https://review.opendev.org/c/openstack/neutron/+/806246
Committed: https://opendev.org/openstack/neutron/commit/ab84b7fb2b6febc9dfd9b0767be90fcb3277c192
Submitter: "Zuul (22348)"
Branch:    master

commit ab84b7fb2b6febc9dfd9b0767be90fcb3277c192
Author: Rodolfo Alonso Hernandez <ralonsoh@xxxxxxxxxx>
Date:   Thu Aug 26 16:54:13 2021 +0000

    Allow to process FW OF rules belonging to a port in a single operation
    
    This patch adds a new configuration variable to control the OVS
    OpenFlow rule processing operations:
    
    * ``openflow_processed_per_port``: "False" by default. If enabled,
      all OpenFlow rules associated with a port will be processed at once,
      in a single transaction. If disabled, the flows will be processed
      in batches of "AGENT_RES_PROCESSING_STEP=100" OpenFlow rules.
    
    With ``openflow_processed_per_port`` enabled, all firewall
    OpenFlow rules related to a port are processed in one transaction
    (executed in one single command). That ensures the rules are written
    atomically and all of them take effect at the same time.
    
    That means all the rules needed to handle the ingress and egress
    traffic of a port using the Open vSwitch firewall are committed to
    the OVS DB at the same time. That prevents partially applied OpenFlow
    rule sets in the firewall and inconsistencies when applying new SG
    rules or during an OVS agent restart.
    
    That overrides, if needed, the hard limit of
    "AGENT_RES_PROCESSING_STEP=100" OpenFlow rules that can be
    processed in OVS at once.
    
    If the default configuration values are not modified, the behaviour of
    the OVS library does not change.
    
    Closes-Bug: #1934917
    
    Change-Id: If4984dece266a789d607725f8497f1aac3d73d23
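
In rough terms, the new option lets the agent choose between installing a
port's OpenFlow rules in fixed-size batches (the default) and installing
them all in one transaction. A minimal conceptual sketch of the difference
(illustration only, not the actual Neutron/os-ken code; ``bridge.install_rules``
is a made-up placeholder):

    # Conceptual sketch only; "bridge.install_rules" is a made-up placeholder.
    AGENT_RES_PROCESSING_STEP = 100

    def apply_port_flows(bridge, port_flows, openflow_processed_per_port=False):
        if openflow_processed_per_port:
            # One transaction: all of the port's firewall rules become
            # effective together, so they are never half-applied.
            bridge.install_rules(port_flows)
        else:
            # Default behaviour: rules land in batches of 100, which can
            # leave the port's firewall partially applied for a short window.
            for i in range(0, len(port_flows), AGENT_RES_PROCESSING_STEP):
                bridge.install_rules(
                    port_flows[i:i + AGENT_RES_PROCESSING_STEP])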


** Changed in: neutron
       Status: In Progress => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1934917

Title:
  inconsistencies in OVS firewall on an agent restart

Status in neutron:
  Fix Released
Status in OpenStack Security Advisory:
  Won't Fix

Bug description:
  #### Summary

  On a pre-production OpenStack deployment, we observed the following
  during a restart of neutron-openvswitch-agent: some active flows that
  the OVS firewall was letting through based on SG rules before the
  restart became marked as CT_MARK(CT_MARK_INVALID); their traffic was
  then dropped for a period of time that extended beyond the restart.
  Clearing the rules with the previous cookie did not resolve the
  issue.

  Digging into this issue has led me to consider the hypothesis that,
  during a restart, where the neutron OVS agent adds new rules with a
  new cookie and ultimately removes the rules from the previous run not
  marked with the newer cookie, the assumption that the new rules do not
  interfere with the old ones was broken.

  Looking at how conjunction IDs are used has led me to see that:
  A) the code offers no guarantee that, on a restart, a conjunction ID used for some SG rule in the previous run does not end up being used for some other SG rule on the next run
  B) in a case where there is an unfortunate collision (same conj_id used for two different SGs across a restart), the way OVS rules are replaced leaves room for race conditions resulting in either legitimate traffic being dropped or illegitimate traffic being accepted

  Case (B), with "legitimate traffic being dropped", matches the issue
  as we saw it on the deployment, along with the restricted conditions
  under which (B) would occur.

  This bug report first provides details on the operational issue, but independently of the analysis of this particular case,
  the design issue in the neutron agent described in the second part is what this bug report is really about.

  Slawek and Rodolfo have already been exposed to the details explained
  here.

  ## Details on the issue observed on our deployment

  # Context:

  - Queens (RH OSP13 containers)
  - we focus on two compute nodes where VMs run to form a cluster (one VM per compute)
  - SG rules are in places to allow traffic to said VM (more on this below)
  - TCP traffic

  # Reproduction attempt

  neutron OVS agent restarted at 11:41:35 on hyp1001

      traffic is impacted (cluster healthcheck failure in the application that runs in the VM)
      (We hadn't taken many traces for this step; we were only checking that reloading the OVS agent with debug logs was working as intended)

  neutron OVS agent restarted at 12:28:35 on hyp1001

      no impact on cluster traffic

  neutron OVS agent restarted at 12:34:48 on hyp12003

      VM impacted starting from "12:35:12" (24s after start of new
  agent)

  What follows is about this second occurrence where traffic was
  impacted.

  extract from VM logs (redacted):

  2021-04-28 12:35:22,769   WARN  messages lost for 10.1s
  2021-04-28 12:35:32,775   WARN  messages lost for 20.0s

  When the neutron OVS agent restarts, it is supposed to (a) redefine all the OVS rules, and then (b) remove the ones that existed before its startup.
  The time at which the issue starts (12:35:12) is between (a) and (b).
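
  To illustrate the (a)/(b) sequence (a toy model only, not the agent's
  actual code):

      # Toy model of the restart sequence; not the agent's actual code.
      import uuid

      flows = [{"cookie": "old", "rule": r} for r in ("R1", "R2", "R3")]

      def restart(flows):
          new_cookie = uuid.uuid4().hex                  # (a) a new cookie per run
          for rule in ("R1'", "R2'", "R3'"):
              flows.append({"cookie": new_cookie, "rule": rule})
          # Window where old-cookie and new-cookie flows co-exist: this is
          # where the issue described below happens.
          flows[:] = [f for f in flows if f["cookie"] == new_cookie]   # (b)
          return flows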

  Time of (b) events for the different OVS bridges:

  2021-04-28T12:35:29	INFO	neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent	Cleaning stale br-int flows
  2021-04-28T12:35:32	INFO	neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent	Cleaning stale br-ovs-cp-bond2 flows
  2021-04-28T12:35:32	INFO	neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent	Cleaning stale br-tun flows

  Here is the ovs-dpctl dump-flows trace where we see the flows being
  marked as CT_MARK_INVALID

  > ---------- dump_flows_i1-hyp12003.mgmt.oaas_2021-04-28T12:35:16+0000_dpctl.dropped-flows ------------
  > --- dump_flows_i1-hyp12003.mgmt.oaas_2021-04-28T12:35:11+0000_dpctl.dropped-flows      2021-05-04 11:08:21.451676203 +0200
  > +++ dump_flows_i1-hyp12003.mgmt.oaas_2021-04-28T12:35:16+0000_dpctl.dropped-flows      2021-05-04 11:08:21.455676231 +0200
  > @@ -4,6 +4,7 @@
  >  recirc_id(0),tunnel(tun_id=0x40,src=172.11.156.21,dst=172.11.156.24,flags(-df-csum+key)),in_port(2),eth_type(0x0800),ipv4(frag=no), actions:drop
  >  recirc_id(0),tunnel(tun_id=0x43,src=172.11.156.21,dst=172.11.156.24,flags(-df-csum+key)),in_port(2),eth_type(0x0800),ipv4(frag=no), actions:drop
  >  recirc_id(0),tunnel(tun_id=0x44,src=172.11.156.20,dst=172.11.156.24,flags(-df-csum+key)),in_port(2),eth_type(0x0800),ipv4(frag=no), actions:drop
  >  recirc_id(0),tunnel(tun_id=0x4e,src=172.11.156.14,dst=172.11.156.24,flags(-df-csum+key)),in_port(2),eth_type(0x0800),ipv4(frag=no), actions:drop
  >  recirc_id(0),tunnel(tun_id=0x50,src=172.11.156.20,dst=172.11.156.24,flags(-df-csum+key)),in_port(2),eth_type(0x0800),ipv4(frag=no), actions:drop
  >  recirc_id(0),tunnel(tun_id=0x52,src=172.11.156.21,dst=172.11.156.24,flags(-df-csum+key)),in_port(2),eth_type(0x0800),ipv4(frag=no), actions:drop
  > @@ -18,3 +19,5 @@
  >  recirc_id(0),tunnel(tun_id=0xb,src=172.11.156.20,dst=172.11.156.24,flags(-df-csum+key)),in_port(2),eth_type(0x0800),ipv4(frag=no), actions:drop
  >  recirc_id(0x6d330),tunnel(tun_id=0x73,src=172.11.156.21,dst=172.11.156.24,flags(-df-csum+key)),in_port(2),ct_state(+new-est-rel-rpl-inv+trk),ct_zone(0x1),ct_mark(0),eth_type(0x0800),ipv4(src=10.92.0.0/255.254.0.0,proto=6,frag=no),tcp(dst=32768/0x8000), flags:S, actions:drop
  >  recirc_id(0x6d330),tunnel(tun_id=0x73,src=172.11.156.21,dst=172.11.156.24,flags(-df-csum+key)),in_port(2),ct_state(-new-est-rel-rpl+inv+trk),eth_type(0x0800),ipv4(frag=no), flags:R., actions:drop
  > +recirc_id(0xa5030),in_port(4),ct_state(-new+est-rel+rpl-inv+trk),ct_mark(0x1),eth_type(0x0800),ipv4(frag=no), flags:P., actions:drop
  > +recirc_id(0xa503e),in_port(5),ct_state(-new+est-rel+rpl-inv+trk),ct_mark(0x1),eth_type(0x0800),ipv4(frag=no), flags:P., actions:drop
  >
  >
  > See the two last dpctl dropped flows that appear between 12:35:11 and 12:35:16 ^^

  > Details:
  > grep a503[0e] dump_flows_i1-hyp12003.mgmt._2021-04-28T12:35:32+0000_dpctl
  > recirc_id(0),in_port(5),ct_state(-trk),eth(src=44:ec:ce:33:3f:46,dst=fa:16:3e:ba:2b:4e),eth_type(0x8100),vlan(vid=3103,pcp=0),encap(eth_type(0x0800),ipv4(proto=6,frag=no)), packets:13, bytes:1923, used:7.456s, flags:P., actions:pop_vlan,ct(zone=5),recirc(0xa503e)
  > recirc_id(0),in_port(4),ct_state(-trk),eth(src=fa:16:3e:53:6e:1c,dst=fa:16:3e:ba:2b:4e),eth_type(0x8100),vlan(vid=3103,pcp=0),encap(eth_type(0x0800),ipv4(proto=6,frag=no)), packets:5886, bytes:724376, used:7.112s, flags:P., actions:pop_vlan,ct(zone=5),recirc(0xa5030)
  > recirc_id(0xa5030),in_port(4),ct_state(-new+est-rel+rpl-inv+trk),ct_mark(0x1),eth_type(0x0800),ipv4(frag=no), packets:14, bytes:6045, used:7.112s, flags:P., actions:drop

  I have dug a bit more, and it appears that the conjunction used to
  match the source IP against our remote SG of interest changes
  precisely at the time when the packet drops start.

  **Before restart**, our remote group uses conjunction id 38 (this
  example focuses on egress direction):

      R1: table=72, priority=73,ct_state=+est-rel-rpl,ip,reg6=0x5,nw_dst=10.94.128.196 actions=conjunction(38,1/2) cookie=0x3321e58c437bf4aa

  (same for many other 10.94.x.y IPs)

  This conjunction 38  is used to accept traffic for the port:

      R2: table=72, priority=73,conj_id=38,ct_state=+est-rel-rpl,ip,reg5=0x8 actions=load:0x26->NXM_NX_REG7[],resubmit(,73) cookie=0x3321e58c437bf4aa

  (file is
  dump_flows_i1-hyp12003.mgmt._2021-04-28T12:35:11+0000_ofctl_br-int)

  **After the restart**, around the time when the drops start, we see
  the above rule replaced by an action to set conjunction 86:

      R3: table=72, priority=73,ct_state=+est-rel-rpl,ip,reg6=0x5,nw_dst=10.94.128.196 actions=conjunction(86,1/2) cookie=0xecd4cf28852563cf

  The action to accept traffic on the port is also present:

      R4: table=72, priority=73,conj_id=86,ct_state=+est-rel-rpl,ip,reg5=0x8 actions=load:0x56->NXM_NX_REG7[],resubmit(,73) cookie=0xecd4cf28852563cf

  (file is
  dump_flows_i1-hyp12003.mgmt._2021-04-28T12:35:16+0000_ofctl_br-int)

  ## Comments on this operational issue

  The setup we have uses a lot of conjunction rules for SGs, due to
  the use of a large range of TCP ports in a security group rule (a
  range that is broken into multiple masks in OVS flow rules, appearing
  with many OVS conjunctions).
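
  For context on why such rules multiply: a contiguous TCP port range has
  to be expressed as OpenFlow port/mask pairs. The function below is a
  simple greedy sketch (not necessarily the exact algorithm Neutron uses),
  but it shows why a well-aligned range collapses into a single match such
  as the tcp(dst=32768/0x8000) seen in the dpctl dump above, while a less
  aligned range explodes into many flow entries:

      # Greedy sketch: split an inclusive port range into OpenFlow-style
      # (port, mask) pairs. Not necessarily Neutron's exact algorithm.
      def port_range_to_masks(lo, hi, width=16):
          pairs = []
          while lo <= hi:
              size = (lo & -lo) or (1 << width)   # largest aligned block at lo
              while size > hi - lo + 1:
                  size >>= 1
              pairs.append((lo, (1 << width) - size))
              lo += size
          return pairs

      print([(p, hex(m)) for p, m in port_range_to_masks(32768, 65535)])
      # [(32768, '0x8000')] -- one pair covers the whole upper half
      print(len(port_range_to_masks(1025, 65535)))
      # 15 -- a less aligned range needs many port/mask pairs (hence many flows)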

  Given that the traffic is dropped based on ct_mark(0x1), I speculate
  that, because the transition of the rules from conjunction 38 to
  conjunction 86 is not atomic (if I'm right, nothing in the OVS firewall
  design guarantees that, and I see batches of 100 packets mentioned in
  ovs-vswitchd.log), there could be a time window where R3 has
  replaced R1 but R4 is not in place yet.  Any packet of our flow
  hitting the pipeline at this point would not meet R2, and would then
  hit the rule at the bottom of table 72 that marks the whole flow
  (including future packets) as invalid:

       table=72, priority=40,ct_state=+est,ip,reg5=0x8 actions=ct(commit,zone=NXM_NX_REG6[0..15],exec(load:0x1->NXM_NX_CT_MARK[])) cookie=0xecd4cf28852563cf

  From then on, of course, even if R4 is set up later, traffic is no
  longer accepted.
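
  A toy model of this argument (not OVS, just an illustration of why the
  drop persists even after R4 shows up):

      # Toy model of the non-atomic conj 38 -> 86 transition; illustration only.
      def verdict(member_conj, accept_conjs, conn_marked_invalid):
          if conn_marked_invalid:
              return "drop", True            # ct_mark(0x1) rules keep dropping
          if member_conj in accept_conjs:
              return "accept", False         # R2 (conj 38) or R4 (conj 86)
          return "drop", True                # priority=40 rule: commit ct_mark=0x1

      marked = False
      for member, accepts in [(38, {38}),        # before restart
                              (86, {38}),        # R3 installed, R4 not yet
                              (86, {38, 86})]:   # R4 finally installed
          result, marked = verdict(member, accepts, marked)
          print(result)
      # accept, drop, drop: once a packet hits the window, the whole
      # connection stays dropped.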

  The fact that the traffic interruption isn't observed on a second
  restart is compatible with the idea that a conjunction ID collision
  could be the cause of our issue: indeed, on a second restart done a
  short time after the first one, the conjunction ID for a given SG flow
  is likely to be the same as in the previous run.

  ## Rough analysis of the openvswitch firewall design and the
  conditions under which it may break

  Here are my personal notes from a quick analysis of what races may exist in the code in relation to conjunction ID reuse across restarts.
  My apologies for the quality of the writing; I opted for posting these notes rather than postponing this to an unknown future date.

  Note that I haven't dug into the code to see which of the possible
  issues above can actually happen, which depends on the order in which
  the openvswitch agent code updates flow rules.

  This analysis really isn't thorough and may be wrong in some places.
  Hopefully the existence of any mistake will serve as an illustration
  that the reuse of conjunction IDs across restarts raises difficult
  questions... ;)

  # Notations used:

  SGFlow = traffic allowed between two SGs in a given direction (on a given ethertype)
  (sg_id, remote_sg_id, direction, ethertype)   (tuples used by ConjIdMap)

  SGFlow1 = (SGA, SGB, ingress, icmp)
  SGFlow2 = (SGC, SGD, ingress, icmp)
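
  As an illustration of why such tuples can be remapped across restarts,
  here is a minimal sketch of a tuple-keyed allocator (illustration only,
  not the real ConjIdMap code): the map starts empty on every agent start,
  so the ID a given tuple gets depends only on the order in which flows
  happen to be processed in that run.

      # Illustration only; not the real neutron ConjIdMap implementation.
      class ToyConjIdMap(object):
          def __init__(self):
              self._ids = {}   # (sg, remote_sg, direction, ethertype) -> conj_id
              self._next = 0

          def get_conj_id(self, sg, remote_sg, direction, ethertype):
              key = (sg, remote_sg, direction, ethertype)
              if key not in self._ids:
                  self._ids[key] = self._next
                  self._next += 1  # the real allocation scheme is more involved
              return self._ids[key]

      # Run 1: SGFlow1 requested first -> id 0; SGFlow2 -> id 1.
      # Run 2 (state lost on restart): if SGFlow2 happens to be requested
      # first, it gets id 0 -- the id SGFlow1 had before the restart.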

  Looking at a single compute node...

  Compute has a local port N, with MAC = MAC(N), on network Net(N)
  N belongs to SGA

  conj_id X is allocated by ConjIdMap to SGFlow1 (SGA, SGB, ingress,
  icmp)

  OVS rules for this SGFlow1 are R1b-1, R1b-2 and R1b-3 (R1 for SGFlow1, 'b' as in 'before restart'):
  - R1b-1: reg_net = Net(N), ip_src in SGB  => set conjunction(X, 1/2)
  - R1b-2: reg_port = to_port(N), icmp   => set conjunction(X, 2/2)
  - R1b-3: conj(X), to_port(N)                => CT + output to N

  For instance (here conj_id X==11):
  R1b-1: cookie=0x5403be965377d469, duration=9682.923s, table=82, n_packets=0, n_bytes=0, idle_age=9797, priority=71,ct_state=+new-est,ip,reg6=0x1,nw_src=10.1.1.236 actions=conjunction(11,1/2)
  R1b-2: cookie=0x5403be965377d469, duration=9662.260s, table=82, n_packets=0, n_bytes=0, idle_age=9797, priority=71,ct_state=+new-est,icmp,reg5=0x6 actions=conjunction(11,2/2)
  R1b-3: cookie=0x5403be965377d469, duration=9662.260s, table=82, n_packets=2, n_bytes=196, idle_age=9612, priority=71,conj_id=11,ct_state=+new-est,ip,reg5=0x6 actions=load:0xb->NXM_NX_REG7[],ct(commit,zone=NXM_NX_REG6[0..15]),output:6,resubmit(,92)

  Compute also has a local port K with MAC = MAC(K), on network Net(K)
  K belongs to SGC

  conj_id Y allocated by ConjIdMap to SGFlow2 (SGC, SGD, ingress, icmp)

  OVS rules:
  R2b-1: reg_net = Net(K), ip_src in SGD  => set conjunction(Y, 1/2)
  R2b-2: reg_port = to_port(K), icmp   => set conjunction(Y, 2/2)
  R2b-3: conj(Y), to_port(K)                => CT + output to K

  ** Let's now assume that the OVS agent is now restarted **

  [Scenario 1]
  We assume that nothing has been done about port K and SGFlow2 at this point.

  SGFlow1 processing starts for port N:
  conj_id *Y* allocated by ConjIdMap to SGFlow1
  (case of a reuse of a given conjunction id for two different flows, here SGFlow1 gets assigned the conjunction id that was the conj_id of SGFlow2 before the restart)

  new OVS rules ('a' like 'after restart'):
  R1a-1: reg_net = Net(N), ip_src (in SGB)  => set conjunction(Y, 1/2)
  R1a-2: reg_port = to_port(N), icmp   => set conjunction(Y, 2/2)
  R1a-3: conj(Y), to_port(N)                => CT + output to N

  If R1a-1/2 are set up before R1a-3, we have:
  * R1a-1/R1a-2 mapping SGFlow1 to conjunction Y     _replacing_  R1b-1/R1b-2
  * R2b-3 mapping  conj(Y), to_port(K)  to K  (i.e. old, wrong port)  is still here
  * Hence no flow is mapping SGFlow1 to port N anymore  -> any packet processed now would be dropped and the flow marked invalid (unless port K == port N)
  * after R1a-3 is set up:  SGFlow1 -> port N, and now traffic can flow again (well, unless the whole connection has already been marked invalid)

  If R1a-3 is set up before R1a-1/2 are, then until R1a-1/2 are set up we would have SGFlow2 mapped to Y (old state prior to restart), and rules in place so that traffic matching "to_port(N) + conj Y" is sent to port N.
  In the general case, traffic matching SGFlow2 will not also match the to_port(N) conditions, but this is messy to analyze further and be sure that no illegitimate traffic can go through.

  After these potential races internal to the processing of SGFlow1,
  let's look at the intermediate state we have before the agent starts
  to add rules for SGFlow2.

  Do the matches of the rules overlap ?  (R1a-* + R2b-* + R1b-3 which hasn't been replaced)
  - R..-1: dest ports would have to be on same net (net(N)=net(K))+ same IP in remote SG => possible if remote SGB and SGD overlap
     => in that case R1a-1 has overwritten R2b-1 and conj(Y) 1/2 is computed based on SGFlow1
  - R..-2: same port (MAC) and same ethertype => unlikely but not impossible to have same MAC
  - R..-3: no collision seems possible unless K == N

  ... what is the effect of a collision between R1a-1 and R2b-1?

  SGFlow1 remote SG matching has replaced SGFlow2 remote SG match in the computation of Y.
  But mapping SGFlow2 to Y still exists for 2/2, so SGFlow1 remote IP (1/2) + SGFlow2 dest port (K) match + protocol K still works.
  Traffic that was selected thanks to R2b-1 (remote IP in SGD) is no longer accepted (for any remote IP present in SGD but not in SGB), if these flows apply to the same Neutron network.

  Here my analysis really is incomplete, due to the fact that when a
  remote SG IP is in multiple SGs, it is mapped to multiple conjunctions.

  Now, even with no overlap...

  New rules for conj(Y) select traffic for SGFlow1, and based on this
  R1a-3 forwards to N (ok) and R2b-3 forwards to K, which would be a
  firewall violation, although it seems that to leverage this to inject
  traffic the destination MAC would have to be tailored.

  The last questions, which I did not dig into at all, are:
  • does something interesting happen when flow rules for SGFlow2 are added?
  • are inconsistencies possible when flow rules for both SGFlow1 and SGFlow2 have been added, but before the old flow rules with old cookies have been purged?

  ##  Conclusion

  My impression is that the behavior of the OVS firewall in the neutron
  agent during a restart is very hard to predict, because nothing
  prevents a given conjunction ID from being reused for different SGs,
  creating interference between the old rules and the new rules during
  the window where they co-exist.  The inconsistency may exist only for
  a short duration, but due to how connection tracking is done, the
  effects of a packet hitting the pipeline at a moment where the rules
  are inconsistent can persist well beyond that short window.  It is
  also worth noting that we hit a case where the duration of the
  restart (30s) and the number of flows (6k) were large enough to
  create these conditions.

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1934917/+subscriptions