[Bug 1934917] Re: inconsistencies in OVS firewall on an agent restart
Reviewed: https://review.opendev.org/c/openstack/neutron/+/806246
Committed: https://opendev.org/openstack/neutron/commit/ab84b7fb2b6febc9dfd9b0767be90fcb3277c192
Submitter: "Zuul (22348)"
Branch: master
commit ab84b7fb2b6febc9dfd9b0767be90fcb3277c192
Author: Rodolfo Alonso Hernandez <ralonsoh@xxxxxxxxxx>
Date: Thu Aug 26 16:54:13 2021 +0000
Allow to process FW OF rules belonging to a port in a single operation
This patch adds a new configuration variable to control the OVS
OpenFlow rule processing operations:
* ``openflow_processed_per_port``: "False" by default. If enabled,
all OpenFlow rules associated with a port will be processed at once,
in one single transaction. If disabled, the flows will be processed
in batches of "AGENT_RES_PROCESSING_STEP=100" OpenFlow rules.
With ``openflow_processed_per_port`` enabled, all Firewall
OpenFlow rules related to a port are processed in one transaction
(executed in one single command). That ensures the rules are written
atomically and applied all at the same time.
That means all the rules needed to handle the ingress and egress traffic
of a port using the Open vSwitch Firewall are committed to the OVS
DB at the same time. That prevents partially applied OpenFlow rule
sets in the Firewall and inconsistencies when applying new SG rules or
during an OVS agent restart.
This overrides, if needed, the hard limit of
"AGENT_RES_PROCESSING_STEP=100" OpenFlow rules that can be
processed in OVS at once.
If the default configuration values are not modified, the behaviour of
the OVS library does not change.
Closes-Bug: #1934917
Change-Id: If4984dece266a789d607725f8497f1aac3d73d23
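To illustrate the behaviour the option controls, here is a minimal Python sketch; the names are hypothetical and this is not the actual neutron implementation, only the batching semantics described in the commit message above.

AGENT_RES_PROCESSING_STEP = 100   # batch size mentioned in the commit message

def flow_groups_for_port(flows, openflow_processed_per_port=False):
    """Yield the groups of flows that would be sent to OVS as single operations."""
    flows = list(flows)
    if openflow_processed_per_port:
        # One transaction per port: all of the port's rules are applied at once,
        # so the firewall pipeline is never observed half-updated for this port.
        yield flows
        return
    # Default behaviour: process the rules in fixed-size batches, which leaves
    # windows during which only part of the port's pipeline is installed.
    for i in range(0, len(flows), AGENT_RES_PROCESSING_STEP):
        yield flows[i:i + AGENT_RES_PROCESSING_STEP]

# Example: 250 rules -> one group of 250, or batches of 100 + 100 + 50.
print([len(g) for g in flow_groups_for_port(range(250), True)])    # [250]
print([len(g) for g in flow_groups_for_port(range(250), False)])   # [100, 100, 50]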
** Changed in: neutron
Status: In Progress => Fix Released
--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1934917
Title:
inconsistencies in OVS firewall on an agent restart
Status in neutron:
Fix Released
Status in OpenStack Security Advisory:
Won't Fix
Bug description:
#### Summary
On a pre-production OpenStack deployment, we observed the following
during a restart of neutron-openvswitch-agent: some active flows that
the OVS firewall was letting through based on SG rules before the
restart become marked as CT_MARK(CT_MARK_INVALID); their traffic is
then dropped for a period of time that extends beyond the restart.
The clearing of rules with the previous cookie does not resolve the
issue.
Digging into this issue led me to consider the hypothesis that during a
restart, where the neutron OVS agent adds new rules with a new
cookie and ultimately removes the rules from the previous run that are
not marked with the new cookie, the assumption that the new rules do not
interfere with the old ones was broken.
Looking at how conjunction IDs are used has led me to see that:
A) the code offers no guarantee that, on a restart, a conjunction ID used for some SG rule in the previous run does not end up being used for some other SG rule on the next run
B) in a case where there is an unfortunate collision (same conj_id used for two different SGs over a restart), the way OVS rules are replaced leaves room for race conditions resulting in either legitimate traffic being dropped or illegitimate traffic being accepted
(B), in its "legitimate traffic being dropped" variant, matches the
issue as we saw it on the deployment, as do the restricted conditions
under which (B) would occur.
This bug report first provides details on the operational issue, but independently of the analysis of this specific case,
the design issue in the neutron agent described in the second part is what this report is really about.
Slawek and Rodolfo have already been exposed to the details explained
here.
## Details on the issue observed on our deployment
# Context:
- Queens (RH OSP13 containers)
- we focus on two compute nodes where VMs run to form a cluster (one VM per compute)
- SG rules are in places to allow traffic to said VM (more on this below)
- TCP traffic
# Reproduction attempt
- neutron OVS agent restarted at 11:41:35 on hyp1001:
  traffic is impacted (cluster healthcheck failure in the application that runs on the VM)
  (We hadn't taken many traces for this step; we were only checking that reloading the OVS agent with debug logs was working as intended.)
- neutron OVS agent restarted at 12:28:35 on hyp1001:
  no impact on cluster traffic
- neutron OVS agent restarted at 12:34:48 on hyp12003:
  VM impacted starting from "12:35:12" (24s after the start of the new agent)
What follows is about this second occurrence where traffic was
impacted.
extract from VM logs (redacted):
2021-04-28 12:35:22,769 WARN messages lost for 10.1s
2021-04-28 12:35:32,775 WARN messages lost for 20.0s
When the neutron OVS agent restarts, it is supposed to (a) redefine all the OVS rules, and then (b) remove the ones that existed before its startup.
The time at which the issue starts (12:35:12) is between (a) and (b).
Time of (b) events for the different OVS bridges:
2021-04-28T12:35:29 INFO neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent Cleaning stale br-int flows
2021-04-28T12:35:32 INFO neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent Cleaning stale br-ovs-cp-bond2 flows
2021-04-28T12:35:32 INFO neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent Cleaning stale br-tun flows
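The following toy Python model (hypothetical, not the agent's actual code) restates this (a)/(b) sequence: flows are reinstalled under a freshly generated cookie, and only afterwards are the flows carrying old cookies purged, so the two rule generations coexist between (a) and (b).

import uuid

class ToyBridge:
    """Stand-in for an OVS bridge; flows are just (cookie, spec) records."""
    def __init__(self):
        self.flows = []

    def add_flow(self, cookie, spec):
        self.flows.append({"cookie": cookie, "spec": spec})

    def cleanup_stale(self, keep_cookie):
        # (b) "Cleaning stale ... flows": drop everything not tagged with the new cookie.
        self.flows = [f for f in self.flows if f["cookie"] == keep_cookie]

def restart_sync(bridge, per_port_flow_specs):
    new_cookie = uuid.uuid4().int & 0xFFFFFFFFFFFFFFFF   # regenerated on every run
    # (a) redefine all OVS rules under the new cookie; old rules are still present.
    for specs in per_port_flow_specs:
        for spec in specs:
            bridge.add_flow(new_cookie, spec)
    # <-- the window discussed in this report sits here: old and new rules coexist.
    # (b) purge whatever is left over from the previous run.
    bridge.cleanup_stale(new_cookie)

bridge = ToyBridge()
restart_sync(bridge, [[{"table": 72, "priority": 73}], [{"table": 72, "priority": 40}]])
print(len(bridge.flows))   # only the flows from the latest run remain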
Here is the ovs-dpctl dump-flows trace where we see the flows being
marked as CT_MARK_INVALID
> ---------- dump_flows_i1-hyp12003.mgmt.oaas_2021-04-28T12:35:16+0000_dpctl.dropped-flows ------------
> --- dump_flows_i1-hyp12003.mgmt.oaas_2021-04-28T12:35:11+0000_dpctl.dropped-flows 2021-05-04 11:08:21.451676203 +0200
> +++ dump_flows_i1-hyp12003.mgmt.oaas_2021-04-28T12:35:16+0000_dpctl.dropped-flows 2021-05-04 11:08:21.455676231 +0200
> @@ -4,6 +4,7 @@
> recirc_id(0),tunnel(tun_id=0x40,src=172.11.156.21,dst=172.11.156.24,flags(-df-csum+key)),in_port(2),eth_type(0x0800),ipv4(frag=no), actions:drop
> recirc_id(0),tunnel(tun_id=0x43,src=172.11.156.21,dst=172.11.156.24,flags(-df-csum+key)),in_port(2),eth_type(0x0800),ipv4(frag=no), actions:drop
> recirc_id(0),tunnel(tun_id=0x44,src=172.11.156.20,dst=172.11.156.24,flags(-df-csum+key)),in_port(2),eth_type(0x0800),ipv4(frag=no), actions:drop
> recirc_id(0),tunnel(tun_id=0x4e,src=172.11.156.14,dst=172.11.156.24,flags(-df-csum+key)),in_port(2),eth_type(0x0800),ipv4(frag=no), actions:drop
> recirc_id(0),tunnel(tun_id=0x50,src=172.11.156.20,dst=172.11.156.24,flags(-df-csum+key)),in_port(2),eth_type(0x0800),ipv4(frag=no), actions:drop
> recirc_id(0),tunnel(tun_id=0x52,src=172.11.156.21,dst=172.11.156.24,flags(-df-csum+key)),in_port(2),eth_type(0x0800),ipv4(frag=no), actions:drop
> @@ -18,3 +19,5 @@
> recirc_id(0),tunnel(tun_id=0xb,src=172.11.156.20,dst=172.11.156.24,flags(-df-csum+key)),in_port(2),eth_type(0x0800),ipv4(frag=no), actions:drop
> recirc_id(0x6d330),tunnel(tun_id=0x73,src=172.11.156.21,dst=172.11.156.24,flags(-df-csum+key)),in_port(2),ct_state(+new-est-rel-rpl-inv+trk),ct_zone(0x1),ct_mark(0),eth_type(0x0800),ipv4(src=10.92.0.0/255.254.0.0,proto=6,frag=no),tcp(dst=32768/0x8000), flags:S, actions:drop
> recirc_id(0x6d330),tunnel(tun_id=0x73,src=172.11.156.21,dst=172.11.156.24,flags(-df-csum+key)),in_port(2),ct_state(-new-est-rel-rpl+inv+trk),eth_type(0x0800),ipv4(frag=no), flags:R., actions:drop
> +recirc_id(0xa5030),in_port(4),ct_state(-new+est-rel+rpl-inv+trk),ct_mark(0x1),eth_type(0x0800),ipv4(frag=no), flags:P., actions:drop
> +recirc_id(0xa503e),in_port(5),ct_state(-new+est-rel+rpl-inv+trk),ct_mark(0x1),eth_type(0x0800),ipv4(frag=no), flags:P., actions:drop
>
>
> See the last two dpctl dropped flows that appear between 12:35:11 and 12:35:16 ^^
> Details:
> grep a503[0e] dump_flows_i1-hyp12003.mgmt._2021-04-28T12:35:32+0000_dpctl
> recirc_id(0),in_port(5),ct_state(-trk),eth(src=44:ec:ce:33:3f:46,dst=fa:16:3e:ba:2b:4e),eth_type(0x8100),vlan(vid=3103,pcp=0),encap(eth_type(0x0800),ipv4(proto=6,frag=no)), packets:13, bytes:1923, used:7.456s, flags:P., actions:pop_vlan,ct(zone=5),recirc(0xa503e)
> recirc_id(0),in_port(4),ct_state(-trk),eth(src=fa:16:3e:53:6e:1c,dst=fa:16:3e:ba:2b:4e),eth_type(0x8100),vlan(vid=3103,pcp=0),encap(eth_type(0x0800),ipv4(proto=6,frag=no)), packets:5886, bytes:724376, used:7.112s, flags:P., actions:pop_vlan,ct(zone=5),recirc(0xa5030)
> recirc_id(0xa5030),in_port(4),ct_state(-new+est-rel+rpl-inv+trk),ct_mark(0x1),eth_type(0x0800),ipv4(frag=no), packets:14, bytes:6045, used:7.112s, flags:P., actions:drop
I dug a bit more, and it appears that the conjunction used to match
the source IP against our remote SG of interest changes precisely at
the time the packet drops happen.
**Before restart**, our remote group uses conjunction id 38 (this
example focuses on egress direction):
R1: table=72, priority=73,ct_state=+est-rel-rpl,ip,reg6=0x5,nw_dst=10.94.128.196 actions=conjunction(38,1/2) cookie=0x3321e58c437bf4aa
(same for many other 10.94.x.y IPs)
This conjunction 38 is used to accept traffic for the port:
R2: table=72, priority=73,conj_id=38,ct_state=+est-rel-rpl,ip,reg5=0x8 actions=load:0x26->NXM_NX_REG7[],resubmit(,73) cookie=0x3321e58c437bf4aa
(file is
dump_flows_i1-hyp12003.mgmt._2021-04-28T12:35:11+0000_ofctl_br-int)
**After the restart**, around the time the drops start, we see the
above rule replaced by one that sets conjunction 86:
R3: table=72, priority=73,ct_state=+est-rel-rpl,ip,reg6=0x5,nw_dst=10.94.128.196 actions=conjunction(86,1/2) cookie=0xecd4cf28852563cf
The action to accept traffic on the port is also present:
R4: table=72, priority=73,conj_id=86,ct_state=+est-rel-rpl,ip,reg5=0x8 actions=load:0x56->NXM_NX_REG7[],resubmit(,73) cookie=0xecd4cf28852563cf
(file is
dump_flows_i1-hyp12003.mgmt._2021-04-28T12:35:16+0000_ofctl_br-int)
## Comments on this operational issue
Our setup uses a lot of conjunction-based rules for SGs, due to the
use of large TCP port ranges in security group rules (each range is
broken into multiple masks in the OVS flow rules, which appear with
many OVS conjunctions).
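For reference, here is a small Python sketch of the kind of port-range-to-mask decomposition involved (a generic algorithm, not necessarily the exact one neutron uses): a range of TCP ports is covered by several value/mask pairs, each of which becomes a separate flow match such as the tcp(dst=32768/0x8000) seen in the dpctl dump above.

def port_range_to_masks(lo, hi, width=16):
    """Cover the inclusive port range [lo, hi] with (value, mask) pairs."""
    pairs = []
    while lo <= hi:
        # Largest power-of-two block aligned on lo that still fits in [lo, hi].
        size = lo & -lo if lo else 1 << width
        while size > hi - lo + 1:
            size //= 2
        mask = ((1 << width) - 1) ^ (size - 1)
        pairs.append((lo, mask))
        lo += size
    return pairs

for value, mask in port_range_to_masks(32768, 65535):
    print("tcp(dst=%d/0x%04x)" % (value, mask))    # -> tcp(dst=32768/0x8000)
print(len(port_range_to_masks(1000, 1999)))        # a 1000-port range needs 7 masks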
Given that the traffic is dropped based on ct_mark(0x1), I speculate
that, because the transition of the rules from conjunction 38 to
conjunction 86 is not atomic (if I'm right, nothing in the OVS firewall
design guarantees atomicity, and I see batches of 100 mentioned in
ovs-vswitchd.log), there could be a time window where R3 has
replaced R1 but R4 is not in place yet. Any packet of our flow
hitting the pipeline at this point would not meet R2, and would then hit
the rule at the bottom of table 72 that marks the whole flow
(including future packets) as invalid:
table=72, priority=40,ct_state=+est,ip,reg5=0x8 actions=ct(commit,zone=NXM_NX_REG6[0..15],exec(load:0x1->NXM_NX_CT_MARK[])) cookie=0xecd4cf28852563cf
From then on, of course, even if R4 is set up later, the traffic would
not be accepted anymore.
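The following toy Python model (purely illustrative, neither neutron nor OVS code) captures why such a window is not just transient: once an established packet falls through to the ct_mark-setting rule, the connection stays marked and keeps being dropped even after R4 is finally installed.

conntrack_mark = {}   # connection key -> ct_mark, persists across packets

def table_72(pkt, rules):
    """Evaluate a packet against a simplified table 72 (highest priority first)."""
    for rule in sorted(rules, key=lambda r: -r["priority"]):
        if rule["match"](pkt):
            return rule["action"](pkt)
    return "drop"

def accept(pkt):
    # In the real pipeline the ct_mark(0x1) check is a separate drop rule; this
    # toy model folds it into the accept action for brevity.
    return "accept" if conntrack_mark.get(pkt, 0) == 0 else "drop (ct_mark invalid)"

def mark_invalid(pkt):
    conntrack_mark[pkt] = 1    # exec(load:0x1->NXM_NX_CT_MARK[])
    return "drop"

flow = ("10.94.128.196", "established connection")

during_window = [
    # R3 (conjunction(86,1/2)) is installed but R4 (conj_id=86 accept) is not yet,
    # so nothing accepts the packet before the priority-40 catch-all.
    {"priority": 40, "match": lambda p: True, "action": mark_invalid},
]
after_window = during_window + [
    {"priority": 73, "match": lambda p: True, "action": accept},   # R4 now present
]

print(table_72(flow, during_window))   # "drop", and ct_mark is now 1
print(table_72(flow, after_window))    # still dropped: the connection was poisoned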
The fact that the traffic interruption isn't observed on a second
restart is compatible with the idea that a "conjunction ID collision"
could be the cause of our issue: indeed, on a second restart done a
short time after the first one, the conjunction ID for a given SG flow
is likely to be the same as in the previous run.
## Rough analysis of the openvswitch firewall design and the
conditions under which it may break
Here are my personal notes from a quick analysis of the races that may exist in the code in relation to conjunction ID reuse across restarts.
My apologies for the quality of the write-up; I opted for posting these notes rather than postponing this to an unknown future date.
Note that I haven't dug into the code to check which of the possible
issues above can actually happen; this depends on the order in which
the openvswitch agent code updates flow rules.
This analysis really isn't thorough and may be wrong in some places.
Hopefully the existence of any mistake will serve as an illustration
that the reuse of conjunction IDs across restarts raises difficult
questions... ;)
# Notations used:
SGFlow = traffic allowed between two SGs in a given direction (for a given ethertype),
i.e. an (sg_id, remote_sg_id, direction, ethertype) tuple (the tuples used by ConjIdMap; see the sketch below)
SGFlow1 = (SGA, SGB, ingress, icmp)
SGFlow2 = (SGC, SGD, ingress, icmp)
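To make finding (A) above concrete, here is a toy allocator in Python (not neutron's actual ConjIdMap, and the numbering is arbitrary): the mapping from these tuples to conjunction IDs lives only in agent memory and depends on the order in which flows are registered, so after a restart the same tuple can end up with an ID that previously belonged to a different one.

import itertools

class ToyConjIdAllocator:
    """Hands out conjunction IDs in registration order; nothing is persisted."""
    def __init__(self):
        self._ids = itertools.count(start=8, step=8)   # arbitrary numbering
        self._map = {}

    def get(self, sg_id, remote_sg_id, direction, ethertype):
        key = (sg_id, remote_sg_id, direction, ethertype)
        if key not in self._map:
            self._map[key] = next(self._ids)
        return self._map[key]

before = ToyConjIdAllocator()
print(before.get("SGA", "SGB", "ingress", "icmp"))   # SGFlow1 -> 8
print(before.get("SGC", "SGD", "ingress", "icmp"))   # SGFlow2 -> 16

after = ToyConjIdAllocator()                         # agent restarted, state rebuilt
print(after.get("SGC", "SGD", "ingress", "icmp"))    # processed first this time -> 8
print(after.get("SGA", "SGB", "ingress", "icmp"))    # SGFlow1 now gets SGFlow2's old ID (16)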
Looking at a single compute node...
Compute has a local port N, with MAC = MAC(N), on network Net(N)
N belongs to SGA
conj_id X is allocated by ConjIdMap to SGFlow1 (SGA, SGB, ingress,
icmp)
OVS rules for this SGFlow1 are R1b-1, R1b-2 and R1b-3 (R1 for SGFlow1, 'b' as in 'before restart'):
- R1b-1: reg_net = Net(N), ip_src in SGB => set conjunction(X, 1/2)
- R1b-2: reg_port = to_port(N), icmp => set conjunction(X, 2/2)
- R1b-3: conj(X), to_port(N) => CT + output to N
For instance (here conj_id X==11):
R1b-1: cookie=0x5403be965377d469, duration=9682.923s, table=82, n_packets=0, n_bytes=0, idle_age=9797, priority=71,ct_state=+new-est,ip,reg6=0x1,nw_src=10.1.1.236 actions=conjunction(11,1/2)
R1b-2: cookie=0x5403be965377d469, duration=9662.260s, table=82, n_packets=0, n_bytes=0, idle_age=9797, priority=71,ct_state=+new-est,icmp,reg5=0x6 actions=conjunction(11,2/2)
R1b-3: cookie=0x5403be965377d469, duration=9662.260s, table=82, n_packets=2, n_bytes=196, idle_age=9612, priority=71,conj_id=11,ct_state=+new-est,ip,reg5=0x6 actions=load:0xb->NXM_NX_REG7[],ct(commit,zone=NXM_NX_REG6[0..15]),output:6,resubmit(,92)
Compute also has a local port K with MAC = MAC(K), on network Net(K)
K belongs to SGC
conj_id Y allocated by ConjIdMap to SGFlow2 (SGC, SGD, ingress, icmp)
OVS rules:
- R2b-1: reg_net = Net(K), ip_src in SGD => set conjunction(Y, 1/2)
- R2b-2: reg_port = to_port(K), icmp => set conjunction(Y, 2/2)
- R2b-3: conj(Y), to_port(K) => CT + output to K
** Let's now assume that the OVS agent is now restarted **
[Scenario 1]
We assume that nothing has been done about port K and SGFlow2 at this point.
SGFlow1 processing starts for port N:
conj_id *Y* allocated by ConjIdMap to SGFlow1
(this is the case of a given conjunction ID being reused for two different flows: here SGFlow1 gets assigned the conjunction ID that was the conj_id of SGFlow2 before the restart)
new OVS rules ('a' like 'after restart'):
R1a-1: reg_net = Net(N), ip_src (in SGB) => set conjunction(Y, 1/2)
R1a-2: reg_port = to_port(N), icmp => set conjunction(Y, 2/2)
R1a-3: conj(Y), to_port(N) => CT + output to N
If R1a-1/2 are set up before R1a-3, we have:
* R1a-1/R1a-2 mapping SGFlow1 to conjunction Y, _replacing_ R1b-1/R1b-2
* R2b-3 mapping conj(Y), to_port(K) to K (i.e. the old, wrong port) still in place
* hence no flow maps SGFlow1 to port N anymore -> any packet processed now would be dropped and the flow marked invalid (unless port K == port N)
* after R1a-3 is set up: SGFlow1 -> to port N, and traffic can flow again (well, unless the whole connection has already been marked invalid)
If R1a-3 is set up before R1a-1/2 are, then until R1a-1/2 are set up we would have SGFlow2 still mapped to Y (the old state prior to the restart), and rules in place so that traffic matching "to_port(N) + conj Y" is sent to port N.
In the general case, traffic matching SGFlow2 will not also match the to_port(N) conditions, but this is messy to analyze further and be sure that no illegitimate traffic can go through.
After these potential races internal to the processing of SGFlow1,
let's look at the intermediate state we have before the agent starts
to add rules for SGFlow2.
Do the matches of the rules overlap? (R1a-* + R2b-* + R1b-3, which hasn't been replaced)
- R..-1: destination ports would have to be on the same net (Net(N) == Net(K)) + same IP in the remote SG => possible if remote SGB and SGD overlap
=> in that case R1a-1 has overwritten R2b-1 and conj(Y) 1/2 is computed based on SGFlow1
- R..-2: same port (MAC) and same ethertype => unlikely but not impossible to have same MAC
- R..-3: no collision seems possible unless K == N
... what is the effect of a collision between R1a-1 and R2b-1?
SGFlow1 remote-SG matching has replaced SGFlow2 remote-SG matching in the computation of conjunction Y.
But the mapping of SGFlow2 to Y still exists for 2/2, so SGFlow1 remote IP (1/2) + SGFlow2 destination port (K) match + protocol of K still works.
Traffic that was selected thanks to R2b-1 (remote IP in SGD) is not accepted anymore (for any remote IP present in SGD but not in SGB), if these flows apply to the same Neutron network.
Here my analysis really is incomplete, because when a remote SG IP
belongs to multiple SGs it is mapped to multiple conjunctions.
Now, even with no overlap...
New rules for conj(Y) select traffic for SGFlow1, and based on this
R1a-3 forwards to N (ok) and R2b-3 forwards to K, which would be a
firewall violation, although it seems that to leverage this to inject
traffic the destination MAC would have to be tailored.
The last questions, which I did not dig into at all, are:
• does something interesting happen when the flow rules for SGFlow2 are added?
• are inconsistencies possible once the flow rules for both SGFlow1 and SGFlow2 have been added, but before the old flow rules with old cookies have been purged?
## Conclusion
My impression is that the behavior of OVS firewall in neutron agent
during a restart is very hard to predict because nothing prevents a
given conjunction IDs to be reused for different SGs creating
interferences between the old rules and the new rules during the
window where they co-exist. The inconsistency may perhaps exist only
for a short duration, but due to how connection tracking is done, the
effects of a packet hitting the pipeline at a moment where rules are
inconsistent, can perdure beyond the short time where rules are
inconsistent. It is also worth noting that we've hit a case where
the duration of the restart (30s) and the amount of flows (6k) is
large enough to have created these conditions.
To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1934917/+subscriptions