openstack team mailing list archive

Weird nova-network bridging problem with precise/essex

We're running into what looks like a Linux bridging bug, which causes
both substantial (20-40%) packet loss and DNS failures about that same
fraction of the time. We're running Essex on Precise, with dedicated
nova-network servers and VlanManager. We see the same behavior on
either of our nova-network servers. While tracking this down, I found
the following when running tcpdump along the path between a VM instance
and the nova-network gateway.
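
For anyone wanting to reproduce the failure fraction from inside an instance, a loop along these lines may help. This is only a sketch, not the procedure used for the numbers above: loss_rate is a hypothetical helper name, and the dig options shown in the usage line are an assumption about how one might probe the gateway at 10.0.0.1 (the address seen in the captures in this post).

```shell
#!/bin/sh
# loss_rate N CMD [ARGS...]
# Run CMD N times and print the percentage of runs that exited non-zero.
# (Hypothetical helper; integer arithmetic, so the result is rounded down.)
loss_rate() {
    n=$1
    shift
    fail=0
    i=0
    while [ "$i" -lt "$n" ]; do
        # Count a run as lost if the command reports failure.
        "$@" >/dev/null 2>&1 || fail=$((fail + 1))
        i=$((i + 1))
    done
    echo $((100 * fail / n))
}

# Example (assumed setup): one try per query, 2 second timeout.
# loss_rate 100 dig +tries=1 +time=2 @10.0.0.1 www.google.com
```

Passing the probe as a command keeps the loop usable for other UDP or TCP checks as well, which helps when comparing against the ICMP behavior.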

The packets appear to make it to the nova-network server, and are
properly pulled out of dot1q tagging:
root@m5-p:~# tcpdump -K -p -i vlan200 -v -vv udp port 53
tcpdump: WARNING: vlan200: no IPv4 address assigned
tcpdump: listening on vlan200, link-type EN10MB (Ethernet), capture
size 65535 bytes
20:34:02.377711 IP (tos 0x0, ttl 64, id 59761, offset 0, flags [none],
proto UDP (17), length 60)
    10.0.0.3.54937 > 10.0.0.1.domain: 52874+ A? www.google.com. (32)
20:34:07.377942 IP (tos 0x0, ttl 64, id 59762, offset 0, flags [none],
proto UDP (17), length 60)
    10.0.0.3.54937 > 10.0.0.1.domain: 52874+ A? www.google.com. (32)
20:34:12.378248 IP (tos 0x0, ttl 64, id 59763, offset 0, flags [none],
proto UDP (17), length 60)
    10.0.0.3.54937 > 10.0.0.1.domain: 52874+ A? www.google.com. (32)
20:34:12.378428 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto
UDP (17), length 170)
    10.0.0.1.domain > 10.0.0.3.54937: 52874 q: A? www.google.com.
6/0/0 www.google.com. [1d3h55m19s] CNAME www.l.google.com.,
www.l.google.com. [1m33s] A 74.125.225.209, www.l.google.com. [1m33s]
A 74.125.225.208, www.l.google.com. [1m33s] A 74.125.225.212,
www.l.google.com. [1m33s] A 74.125.225.211, www.l.google.com. [1m33s]
A 74.125.225.210 (142)

But some packets don't make it all of the way to the bridged interface:
root@m5-p:~# brctl show
bridge name     bridge id               STP enabled     interfaces
br200           8000.fa163e18927b       no              vlan200

root@m5-p:~# tcpdump -K -p -i br200 -v -vv udp port 53
tcpdump: listening on br200, link-type EN10MB (Ethernet), capture size
65535 bytes
20:34:12.378264 IP (tos 0x0, ttl 64, id 59763, offset 0, flags [none],
proto UDP (17), length 60)
    10.0.0.3.54937 > 10.0.0.1.domain: 52874+ A? www.google.com. (32)
20:34:12.378424 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto
UDP (17), length 170)
    10.0.0.1.domain > 10.0.0.3.54937: 52874 q: A? www.google.com.
6/0/0 www.google.com. [1d3h55m19s] CNAME www.l.google.com.,
www.l.google.com. [1m33s] A 74.125.225.209, www.l.google.com. [1m33s]
A 74.125.225.208, www.l.google.com. [1m33s] A 74.125.225.212,
www.l.google.com. [1m33s] A 74.125.225.211, www.l.google.com. [1m33s]
A 74.125.225.210 (142)
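
The two captures can be compared mechanically by IP id to see exactly which packets were seen on vlan200 but never reached br200. A minimal sketch, assuming each tcpdump -v output was saved to a text file; the function names and the file names vlan200.txt and br200.txt are hypothetical.

```shell
#!/bin/sh
# Print the IP id of every packet header in a "tcpdump -v" text capture,
# one per line, sorted and de-duplicated.
ip_ids() {
    sed -n 's/.*[ (]id \([0-9][0-9]*\),.*/\1/p' "$1" | sort -u
}

# missing_ids FIRST SECOND
# List the ids that appear in FIRST but not in SECOND.
missing_ids() {
    tmp_a=$(mktemp)
    tmp_b=$(mktemp)
    ip_ids "$1" > "$tmp_a"
    ip_ids "$2" > "$tmp_b"
    # comm -23 keeps only the lines unique to the first file.
    comm -23 "$tmp_a" "$tmp_b"
    rm -f "$tmp_a" "$tmp_b"
}

# Example (hypothetical file names):
# missing_ids vlan200.txt br200.txt
```

Run against the two captures above, this would list ids 59761 and 59762, the two queries that were retried. IP ids wrap and replies sent with DF all carry id 0 here, so this is only suitable for short captures of a single flow.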

I can't find any way that packet filtering (iptables) could be
implicated in this; no deny rules are being hit.
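
One way to confirm that no deny rule is matching is to scan the per-rule counters rather than reading the rule listing by eye. A sketch, assuming the filter in question is iptables and that iptables-save -c is run as root on the nova-network server; hit_drops is a hypothetical helper name.

```shell
#!/bin/sh
# Read "iptables-save -c" output on stdin and print every DROP or REJECT
# rule whose packet counter is non-zero. Rule lines look like:
#   [packets:bytes] -A CHAIN ... -j TARGET
hit_drops() {
    awk '/^\[[0-9]+:[0-9]+\]/ && /-j (DROP|REJECT)/ {
        # Strip the leading "[" and split "packets:bytes]" on the colon.
        split(substr($1, 2), counts, ":")
        if (counts[1] + 0 > 0) print
    }'
}

# Example (needs root):
# iptables-save -c | hit_drops
```

This only looks at explicit rules; chain policies (the lines starting with ":") carry their own counters and are not checked here. Whether bridged frames pass through iptables at all depends on the net.bridge.bridge-nf-call-iptables sysctl, so that setting is worth noting alongside the counters.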

Oddly enough, this seems to cause no loss of ICMP traffic, even with ping -f.

So far, searching hasn't turned up very much. I've found this
similar-sounding Ubuntu bug report, but it looks like no one is working on it:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/986043

We're on kernel 3.2.0-24. There is a 3.2.0-25, but it is reported not
to fix this issue, and neither do the 3.4 kernels.

It seems sad to try reverting to an Oneiric kernel, but that is on
my list for tomorrow. If this is a kernel bug, it will make the
Precise default kernel unusable for nova-network servers with dot1q
(or whatever the appropriate feature interaction is).

Does this ring any bells, or is there another course of action I should attempt?
Thanks in advance for any suggestions.
 -nld