Message #57298
[Bug 1629539] [NEW] Broken distributed virtual router
Public bug reported:
I wish I could come up with a smarter, more descriptive title for this;
if someone can after reading my report, feel free to update it.
I installed my second controller the other day (because of resource
constraints, I run ALL my OpenStack control services - APIs, Engines,
Servers etc. - _everything_ but 'nova-compute' and 'nova-console' -
on one physical host) and then one of my LBaaSv1 load balancers (I
haven't gotten around to enabling v2 again; last time I hit some issues,
which were reported elsewhere in the tracker) stopped working.
After almost a day of trying to figure out why only one broke and how to
fix it, I realized it must be the _router_, not the load balancer, that's
at fault (see below).
Broken LBaaSv1 VIP: 10.100.0.16/24
Broken LBaaSv1 Floating IP: 10.0.5.90/24
Working LBaaSv1 Floating IP: 10.0.4.190/24
Router VIF namespace: 10.0.5.100 (not sure exactly what this is, but for some reason it has 'stolen' the incoming "GW functionality" on the router from the .253 interfaces)
Router qrouter namespace: 10.0.4.253 + 10.0.5.253 (these are on the 'External Gateway' of the router and are supposed to be the router's GW)
Primary GW/FW/NAT: eth1:192.168.69.1/24, eth2:10.0.4.254/24, eth2:10.0.5.254/24
=> ==========================================
=> From a physical host outside the OS network(s) (i.e. from the 192.168.69.0/24 network):
traceroute to 10.100.0.16 (10.100.0.16), 30 hops max, 60 byte packets <= CORRECT
1 192.168.69.1 0.088 ms 0.077 ms 0.064 ms
2 10.0.4.253 0.262 ms 0.246 ms 0.258 ms
3 10.100.0.16 2.365 ms 2.348 ms 2.310 ms
traceroute to 10.0.5.90 (10.0.5.90), 30 hops max, 60 byte packets <= WRONG, LBaaSv1 doesn't work
1 192.168.69.1 0.156 ms 0.138 ms 0.123 ms
2 10.0.5.100 0.834 ms 0.863 ms 0.851 ms
3 * * *
4 10.0.5.90 1.487 ms 1.564 ms 1.561 ms
traceroute to 10.0.4.190 (10.0.4.190), 30 hops max, 60 byte packets <= WRONG, but LBaaSv1 works
1 192.168.69.1 0.130 ms 0.112 ms 0.097 ms
2 10.0.5.100 1.595 ms 1.581 ms 1.568 ms
3 * * *
4 10.0.4.190 2.265 ms 2.262 ms 2.251 ms
=> ==========================================
=> From an instance (inside the 10.100.0.0/24 subnet - all ICMP open)
traceroute to 10.100.0.16 (10.100.0.16), 30 hops max, 60 byte packets
1 * * *
2 * * *
3 *^C
PING 10.100.0.16 (10.100.0.16) 56(84) bytes of data.
64 bytes from 10.100.0.16: icmp_seq=1 ttl=64 time=1.32 ms
64 bytes from 10.100.0.16: icmp_seq=2 ttl=64 time=0.548 ms
64 bytes from 10.100.0.16: icmp_seq=3 ttl=64 time=0.589 ms
^C
PING 10.0.5.90 (10.0.5.90) 56(84) bytes of data.
64 bytes from 10.100.0.16: icmp_seq=1 ttl=64 time=1.02 ms
64 bytes from 10.0.5.90: icmp_seq=1 ttl=60 time=1.68 ms (DUP!)
^C
PING 10.0.4.190 (10.0.4.190) 56(84) bytes of data.
64 bytes from 10.100.0.4: icmp_seq=1 ttl=64 time=0.925 ms
64 bytes from 10.0.4.190: icmp_seq=1 ttl=60 time=467 ms (DUP!)
^C
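The odd part of the ping output above is that the first reply for a floating IP comes back from the fixed address behind it, followed by a duplicate from the floating IP itself. A minimal sketch of a filter that flags such replies, assuming `ping` was given an IP address rather than a host name (the function name `check_ping_sources` is made up for illustration):

```shell
# check_ping_sources: read `ping` output on stdin and report every reply
# whose source address differs from the address that was pinged.
# Assumes the target on the "PING" line is an IP address, not a host name.
check_ping_sources() {
    awk '
        /^PING / { target = $2; next }      # remember the pinged address
        /bytes from / {
            src = $4
            sub(/:$/, "", src)              # strip the trailing colon
            if (src != target)
                printf "reply for %s came from %s\n", target, src
        }
    '
}
```

It could be used as `ping -c 3 10.0.5.90 | check_ping_sources`; a healthy path should print nothing.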
=> ==========================================
=> The 'actual' problem
=> From a host on the 192.168.69.0/24 network
$ curl --insecure https://10.100.0.16:8140/
curl: (35) Unknown SSL protocol error in connection to 10.100.0.16:8140 <= FAIL, never reaches backend server
$ curl --insecure https://10.0.5.90:8140/
The environment must be purely alphanumeric, not '' <= Actually working
=> From an instance
$ curl --insecure https://10.100.0.16:8140/
The environment must be purely alphanumeric, not '' <= Actually working
$ curl --insecure https://10.0.5.90:8140/
curl: (35) Unknown SSL protocol error in connection to 10.0.5.90:8140 <= FAIL, never reaches backend server
Testing the connection to 10.0.4.190 with curl won't work, since it's
"ldaps" on port 636. An ldapsearch against it from 192.168.69.0/24 works,
but not from an instance. So that one is broken as well, even though I
labeled it 'working' above :(. It's just "broken" in a different way.
=> ==========================================
=> Relevant name spaces on the controllers:
=>
=> Primary Controller
=>
=> ip netns | sort
fip-cd30c1bb-3db6-488c-b448-6cb4454783be
qrouter-4b3639a1-880f-4b55-989f-c6f654e562a7
=> fip-cd30c1bb-3db6-488c-b448-6cb4454783be
66: fg-38e452be-d4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default
inet 10.0.5.100/24 brd 10.0.5.255 scope global fg-38e452be-d4
Kernel IP routing table
Destination Gateway Genmask Flags Metric Ref Use Iface
0.0.0.0 10.0.5.254 0.0.0.0 UG 0 0 0 fg-38e452be-d4
10.0.4.189 169.254.106.114 255.255.255.255 UGH 0 0 0 fpr-4b3639a1-8
10.0.4.190 169.254.106.114 255.255.255.255 UGH 0 0 0 fpr-4b3639a1-8
10.0.4.195 169.254.106.114 255.255.255.255 UGH 0 0 0 fpr-4b3639a1-8
10.0.5.0 0.0.0.0 255.255.255.0 U 0 0 0 fg-38e452be-d4
10.0.5.90 169.254.106.114 255.255.255.255 UGH 0 0 0 fpr-4b3639a1-8
10.0.5.92 169.254.106.114 255.255.255.255 UGH 0 0 0 fpr-4b3639a1-8
10.0.5.99 169.254.106.114 255.255.255.255 UGH 0 0 0 fpr-4b3639a1-8
169.254.106.114 0.0.0.0 255.255.255.254 U 0 0 0 fpr-4b3639a1-8
=> qrouter-4b3639a1-880f-4b55-989f-c6f654e562a7
2: rfp-4b3639a1-8: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
inet 10.0.5.90/32 brd 10.0.5.90 scope global rfp-4b3639a1-8
inet 10.0.4.190/32 brd 10.0.4.190 scope global rfp-4b3639a1-8
71: qr-a2293a4c-51: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1458 qdisc noqueue state UNKNOWN group default
inet 10.100.0.1/24 brd 10.100.0.255 scope global qr-a2293a4c-51
Kernel IP routing table
Destination Gateway Genmask Flags Metric Ref Use Iface
10.100.0.0 0.0.0.0 255.255.255.0 U 0 0 0 qr-a2293a4c-51
169.254.106.114 0.0.0.0 255.255.255.254 U 0 0 0 rfp-4b3639a1-8
=>
=> Secondary Controller
=>
=> ip netns
snat-4b3639a1-880f-4b55-989f-c6f654e562a7
qrouter-4b3639a1-880f-4b55-989f-c6f654e562a7
=> snat-4b3639a1-880f-4b55-989f-c6f654e562a7
62: qg-1d52c5b9-4b: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
inet 10.0.4.253/24 brd 10.0.4.255 scope global qg-1d52c5b9-4b
inet 10.0.5.253/24 brd 10.0.5.255 scope global qg-1d52c5b9-4b
Kernel IP routing table
Destination Gateway Genmask Flags Metric Ref Use Iface
0.0.0.0 10.0.4.254 0.0.0.0 UG 0 0 0 qg-1d52c5b9-4b
10.0.4.0 0.0.0.0 255.255.255.0 U 0 0 0 qg-1d52c5b9-4b
10.0.5.0 0.0.0.0 255.255.255.0 U 0 0 0 qg-1d52c5b9-4b
10.100.0.0 0.0.0.0 255.255.255.0 U 0 0 0 sg-ed603ce2-fe
=> qrouter-4b3639a1-880f-4b55-989f-c6f654e562a7
51: qr-a2293a4c-51: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1458 qdisc noqueue state UNKNOWN group default qlen 1000
inet 10.100.0.1/24 brd 10.100.0.255 scope global qr-a2293a4c-51
Kernel IP routing table
Destination Gateway Genmask Flags Metric Ref Use Iface
10.100.0.0 0.0.0.0 255.255.255.0 U 0 0 0 qr-a2293a4c-51
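For anyone wanting to compare against their own setup, listings like the ones above could be gathered per namespace with a loop along these lines. This is only a sketch, not necessarily the exact commands used for this report; the helper name `dump_netns_cmds` is made up, and it only prints the commands so they can be reviewed before being run as root:

```shell
# dump_netns_cmds: read namespace names (one per line, as printed by
# `ip netns`) on stdin and print the inspection commands for each one.
# Nothing is executed here; pipe the result to `sh` as root to run it.
dump_netns_cmds() {
    # Newer iproute2 versions append "(id: N)" after the name; keep field 1 only.
    while read -r ns rest; do
        [ -n "$ns" ] || continue
        printf 'ip netns exec %s ip -4 addr show\n' "$ns"
        printf 'ip netns exec %s route -n\n' "$ns"
    done
}
```

For example, `ip netns | sort | dump_netns_cmds` prints two commands per namespace, and appending `| sh` runs them.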
=> ==========================================
=> The iptables rules in the name spaces
=>
=> Primary Controller
=>
=> fip-cd30c1bb-3db6-488c-b448-6cb4454783be
neutron-fwaas-l3-INPUT all -- 0.0.0.0/0 0.0.0.0/0
neutron-filter-top all -- 0.0.0.0/0 0.0.0.0/0
neutron-fwaas-l3-FORWARD all -- 0.0.0.0/0 0.0.0.0/0
neutron-filter-top all -- 0.0.0.0/0 0.0.0.0/0
neutron-fwaas-l3-OUTPUT all -- 0.0.0.0/0 0.0.0.0/0
neutron-fwaas-l3-local all -- 0.0.0.0/0 0.0.0.0/0
neutron-fwaas-l3-scope all -- 0.0.0.0/0 0.0.0.0/0
neutron-fwaas-l3-PREROUTING all -- 0.0.0.0/0 0.0.0.0/0
neutron-fwaas-l3-OUTPUT all -- 0.0.0.0/0 0.0.0.0/0
neutron-fwaas-l3-POSTROUTING all -- 0.0.0.0/0 0.0.0.0/0
neutron-postrouting-bottom all -- 0.0.0.0/0 0.0.0.0/0
neutron-fwaas-l3-float-snat all -- 0.0.0.0/0 0.0.0.0/0
neutron-fwaas-l3-snat all -- 0.0.0.0/0 0.0.0.0/0 /* Perform source NAT on outgoing traffic. */
=> qrouter-4b3639a1-880f-4b55-989f-c6f654e562a7
neutron-fwaas-l3-INPUT all -- 0.0.0.0/0 0.0.0.0/0
neutron-filter-top all -- 0.0.0.0/0 0.0.0.0/0
neutron-fwaas-l3-FORWARD all -- 0.0.0.0/0 0.0.0.0/0
neutron-filter-top all -- 0.0.0.0/0 0.0.0.0/0
neutron-fwaas-l3-OUTPUT all -- 0.0.0.0/0 0.0.0.0/0
neutron-fwaas-l3-local all -- 0.0.0.0/0 0.0.0.0/0
neutron-fwaas-l3-scope all -- 0.0.0.0/0 0.0.0.0/0
ACCEPT all -- 0.0.0.0/0 0.0.0.0/0 mark match 0x1/0xffff
DROP tcp -- 0.0.0.0/0 0.0.0.0/0 tcp dpt:9697
DROP all -- 0.0.0.0/0 0.0.0.0/0 mark match ! 0x4000000/0xffff0000
DROP all -- 0.0.0.0/0 0.0.0.0/0 mark match ! 0x4000000/0xffff0000
DROP all -- 0.0.0.0/0 0.0.0.0/0 mark match ! 0x4000000/0xffff0000
DROP all -- 0.0.0.0/0 0.0.0.0/0 mark match ! 0x4000000/0xffff0000
DROP all -- 0.0.0.0/0 0.0.0.0/0 mark match ! 0x4000000/0xffff0000
DROP all -- 0.0.0.0/0 0.0.0.0/0 mark match ! 0x4000000/0xffff0000
DROP all -- 0.0.0.0/0 0.0.0.0/0 mark match ! 0x4000000/0xffff0000
DROP all -- 0.0.0.0/0 0.0.0.0/0 mark match ! 0x4000000/0xffff0000
DROP all -- 0.0.0.0/0 0.0.0.0/0 mark match ! 0x4000000/0xffff0000
=>
=> Secondary Controller
=>
neutron-fwaas-l3-PREROUTING all -- 0.0.0.0/0 0.0.0.0/0
neutron-fwaas-l3-OUTPUT all -- 0.0.0.0/0 0.0.0.0/0
neutron-fwaas-l3-POSTROUTING all -- 0.0.0.0/0 0.0.0.0/0
neutron-postrouting-bottom all -- 0.0.0.0/0 0.0.0.0/0
DNAT all -- 0.0.0.0/0 10.0.5.92 to:10.100.0.25
DNAT all -- 0.0.0.0/0 10.0.5.90 to:10.100.0.16
DNAT all -- 0.0.0.0/0 10.0.5.99 to:10.104.0.44
DNAT all -- 0.0.0.0/0 10.0.4.189 to:10.100.0.3
DNAT all -- 0.0.0.0/0 10.0.4.195 to:10.104.0.27
DNAT all -- 0.0.0.0/0 10.0.4.190 to:10.100.0.4
DNAT all -- 0.0.0.0/0 10.0.5.92 to:10.100.0.25
DNAT all -- 0.0.0.0/0 10.0.5.90 to:10.100.0.16
DNAT all -- 0.0.0.0/0 10.0.5.99 to:10.104.0.44
DNAT all -- 0.0.0.0/0 10.0.4.189 to:10.100.0.3
DNAT all -- 0.0.0.0/0 10.0.4.195 to:10.104.0.27
DNAT all -- 0.0.0.0/0 10.0.4.190 to:10.100.0.4
REDIRECT tcp -- 0.0.0.0/0 169.254.169.254 tcp dpt:80 redir ports 9697
SNAT all -- 10.100.0.25 0.0.0.0/0 to:10.0.5.92
SNAT all -- 10.100.0.16 0.0.0.0/0 to:10.0.5.90
SNAT all -- 10.104.0.44 0.0.0.0/0 to:10.0.5.99
SNAT all -- 10.100.0.3 0.0.0.0/0 to:10.0.4.189
SNAT all -- 10.104.0.27 0.0.0.0/0 to:10.0.4.195
SNAT all -- 10.100.0.4 0.0.0.0/0 to:10.0.4.190
neutron-fwaas-l3-float-snat all -- 0.0.0.0/0 0.0.0.0/0
neutron-fwaas-l3-snat all -- 0.0.0.0/0 0.0.0.0/0 /* Perform source NAT on outgoing traffic. */
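One thing that can be checked mechanically in the listing above is that every floating IP DNAT rule has a matching SNAT rule in the other direction. A minimal sketch, assuming the column layout shown above (target, prot, opt, source, destination, `to:` address); other iptables versions may print extra columns, and the function name `check_nat_pairs` is made up:

```shell
# check_nat_pairs: read an `iptables -L -n` style NAT listing on stdin and
# verify that each "DNAT <fip> to:<fixed>" rule has a matching
# "SNAT <fixed> to:<fip>" rule. Prints one line per mismatch and returns
# non-zero if any mismatch was found.
check_nat_pairs() {
    awk '
        $1 == "DNAT" { fixed = $6; sub(/^to:/, "", fixed); dnat[$5] = fixed }
        $1 == "SNAT" { fip = $6; sub(/^to:/, "", fip); snat[$4] = fip }
        END {
            for (fip in dnat)
                if (snat[dnat[fip]] != fip) {
                    printf "no matching SNAT for %s -> %s\n", fip, dnat[fip]
                    bad = 1
                }
            exit bad + 0
        }
    '
}
```

Fed the Secondary Controller rules above, this should print nothing, since all six DNAT/SNAT pairs line up. That is consistent with the rules looking fine on paper, so the fault may well lie elsewhere.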
Because the LBaaSv1 worked just fine before I distributed the router
(and the vif and snat name spaces were created), and from what I can
see all interfaces, routes and iptables rules seem fine, I can only
deduce that something is wrong with some of this, and I'm guessing it's
the iptables rules somehow.
But because I don't know how they (the vif and snat name spaces) are
supposed to work, I'm unsure how to proceed from here.
** Affects: neutron
Importance: Undecided
Status: New
** Tags: distributed name router snat spaces vif
--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1629539
Title:
Broken distributed virtual router
Status in neutron:
New
To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1629539/+subscriptions