[Bug 1934666] [NEW] IPv6 not set on VMs when using DVR HA + SLAAC/SLAAC (or DHCPv6 stateless)
Public bug reported:
Hi, I think I've encountered a bug in DVR HA + IPv6 SLAAC, but I'd
greatly appreciate it if someone could confirm it on their infra (in
case I've made a mistake in my Devstack configs). I also have a
suggestion regarding Tempest blacklist/whitelist, as tests that could
catch this issue are not running on the affected L3 config.
My setup (the rough commands I used to create the router and subnet are shown below):
- multinode Devstack Stein (OVS) - controller + 2 computes
- DVR HA router
- private subnet SLAAC/SLAAC (note: DHCPv6 stateless is also affected, since it also uses radvd), network type vxlan
- 2 instances on separate nodes
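For reference, the router and the v6 subnet were created roughly like this (the names and the subnet range are just examples from my environment):

openstack router create --distributed --ha router-dvr-ha
openstack subnet create --network private --ip-version 6 \
    --ipv6-ra-mode slaac --ipv6-address-mode slaac \
    --subnet-range fd00:1234::/64 private-v6-slaac
openstack router add subnet router-dvr-ha private-v6-slaac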
I'm using
tempest.scenario.test_network_v6.TestGettingAddress.test_slaac_from_os
to test this.
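If it helps, that single test can be run on its own with something like this (assuming a standard Devstack Tempest setup; adjust the path and regex to your install):

cd /opt/stack/tempest
tempest run --regex 'tempest.scenario.test_network_v6.TestGettingAddress.test_slaac_from_os'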
There seems to be an issue with setting the IPv6 address on the instance when using a SLAAC/SLAAC subnet (and, by extension, a DHCPv6 stateless subnet) with a DVR HA router. The IPv6 address is set only on the instance placed on the same node as the Master router. Instances placed on nodes with a Backup router get no IPv6 address on their interfaces.
This issue happens only with DVR HA routers. Legacy, legacy HA and DVR no-HA routers work correctly.
radvd is running only on the node with the Master router (as it should), in the qrouter namespace. RAs reach the local VM via the qr interface -> tap interface (I think?). RAs also manage to reach the other nodes via br-tun, but I think I see them dropped by a flow in br-int on the destination nodes. They never reach the tap interfaces there.
So the traffic presumably goes like this:
qr interface in the qrouter ns -> br-int -> br-tun -> RAs reach the other node -> br-tun -> br-int (and get dropped here, I think in table 1)
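In case someone wants to reproduce the observation, this is roughly how I was watching the RAs (the router UUID and the interface names are placeholders from my env):

# on the node with the Master router - RAs leaving the router
sudo ip netns exec qrouter-<router-uuid> tcpdump -n -v -i <qr-interface> icmp6
# on the node with a Backup router - no "router advertisement" packets show up on the VM port
sudo tcpdump -n -v -i <tap-interface> icmp6
# watch the packet counters of the suspected drop flow
sudo ovs-ofctl dump-flows br-int table=1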
The path these RAs take in br-tun (on the destination node) seems to be slightly different depending on the router type (legacy HA vs DVR HA). I have no idea if this is important, but I'm dropping it here just in case:
Legacy HA:
RAs enter table 9 (path from table 0 to 9 is identical in both cases). They are resubmitted to table 10, then go through a learn flow that also outputs traffic to br-int:
cookie=0xe3d0b08836a75945, duration=88239.681s, table=10, n_packets=8649, n_bytes=528332, priority=1 actions=learn(table=20,hard_timeout=300,priority=1,cookie=0xe3d0b08836a75945,NXM_OF_VLAN_TCI[0..11],NXM_OF_ETH_DST[]=NXM_OF_ETH_SRC[],load:0->NXM_OF_VLAN_TCI[],load:NXM_NX_TUN_ID[]->NXM_NX_TUN_ID[],output:OXM_OF_IN_PORT[]),output:"patch-int"
DVR HA:
RAs enter table 9. They go through a flow that outputs traffic to br-int:
cookie=0xe3d0b08836a75945, duration=88218.262s, table=9, n_packets=600, n_bytes=56004, priority=1,dl_src=fa:16:3f:68:d6:f0 actions=output:"patch-int"
One note - radvd complains that the forwarding flag in the qrouter ns is set to 0 with DVR HA (no such complaints in the logs for the other router types). I've checked the forwarding and accept_ra flags and set some of them to 2 as per the Pike docs (I couldn't find anything more recent), but it didn't help.
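For completeness, this is how I was checking/tweaking the flags (the namespace UUIDs and the interface name are placeholders):

sudo ip netns exec qrouter-<router-uuid> sysctl net.ipv6.conf.all.forwarding
sudo ip netns exec snat-<router-uuid> sysctl net.ipv6.conf.all.forwarding
sudo ip netns exec qrouter-<router-uuid> sysctl -w net.ipv6.conf.<qr-interface>.accept_ra=2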
Another note - the code seems to be setting these flags in the namespace where the HA interface resides. This means they're set in the snat namespace when using DVR HA, while radvd is still running in the qrouter ns. Is it intended to be like this?
As for my suggestion:
I've noticed that the Zuul job responsible for multinode-dvr-ha-full is not running the tests that could catch this issue, namely:
tempest.scenario.test_network_v6.TestGettingAddress
These tests are run on the ipv6-only job, but since that job uses legacy routers by default, everything works fine there.
I propose adding these tests to the multinode-dvr-ha-full job to cover IPv6 + SLAAC with DVR HA routers. (I have no idea where these test lists are generated for Zuul jobs, so I don't know in which repo the patch should be made.)
Since Tempest uses tags, it'd be done for Master only.
I've tested this on Stein, but I quickly stacked a multinode master environment (using OVS) to check, and I think it's happening there as well.
** Affects: neutron
Importance: Undecided
Status: New
** Tags: dvr ha ipv6 ra radvd slaac