Re: A Grizzly GRE failure [SOLVED]
I've had a terrible time getting the community to help me with this
problem. So special thanks to Darragh O'Reilly, and to rkeene on
#openstack, who was mean and a bit of a wisenheimer (I'd use different
words elsewhere) but at least talked to me and got me to think twice
about my GRE setup.
But enough of that: problem solved, and a bug report has been
submitted: https://bugs.launchpad.net/quantum/+bug/1179223. I added
an "s" to the front of "persists" in the subject, but whatever. I
always leave one thing in the hotel room, and I always leave one
embarrassing typo.
Here's the part explaining how it was fixed:
SOLUTION:
mysql> delete from ovs_tunnel_endpoints where id = 1;
Query OK, 1 row affected (0.00 sec)
mysql> select * from ovs_tunnel_endpoints;
+-----------------+----+
| ip_address | id |
+-----------------+----+
| 192.168.239.110 | 3 |
| 192.168.239.114 | 4 |
| 192.168.239.115 | 5 |
| 192.168.239.99 | 2 |
+-----------------+----+
4 rows in set (0.00 sec)
* After doing that, I simply restarted the quantum OVS agents on the
network and compute nodes (restart commands sketched below). The old
GRE tunnel was not re-created, and VM network traffic to and from the
external network now proceeds without incident.
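* For the record, the agent restart was just the usual service
commands, something like this on each node (the service name is the
one from the Ubuntu packages, so adjust for your distro):

  # on the network node and every compute node
  sudo service quantum-plugin-openvswitch-agent restart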
* Should these tables be cleaned up as well, I wonder:
mysql> select * from ovs_network_bindings;
+--------------------------------------+--------------+------------------+-----------------+
| network_id | network_type | physical_network | segmentation_id |
+--------------------------------------+--------------+------------------+-----------------+
| 4e8aacca-8b38-40ac-a628-18cac3168fe6 | gre | NULL | 2 |
| af224f3f-8de6-4e0d-b043-6bcd5cb014c5 | gre | NULL | 1 |
+--------------------------------------+--------------+------------------+-----------------+
2 rows in set (0.00 sec)
mysql> select * from ovs_tunnel_allocations where allocated != 0;
+-----------+-----------+
| tunnel_id | allocated |
+-----------+-----------+
| 1 | 1 |
| 2 | 1 |
+-----------+-----------+
2 rows in set (0.00 sec)
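* My hunch is no: the two allocated tunnel IDs line up with the
segmentation_ids of the two GRE networks above, so they look like
they're still in use. A cross-check, assuming segmentation_id and
tunnel_id really are the same value for a GRE network:

mysql> select b.network_id, b.segmentation_id
    ->   from ovs_network_bindings b
    ->   join ovs_tunnel_allocations a on a.tunnel_id = b.segmentation_id
    ->  where b.network_type = 'gre' and a.allocated != 0;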
Cheers, and happy openstacking. Even you, rkeene!
--Greg Chavez
On Sat, May 11, 2013 at 2:28 PM, Greg Chavez <greg.chavez@xxxxxxxxx> wrote:
> So to be clear:
>
> * I have three NICs on my network node. VM traffic goes out the
> 1st NIC on 192.168.239.99/24 to the other compute nodes, while
> management traffic goes out the 2nd NIC on 192.168.241.99. The 3rd
> NIC is external and has no IP.
>
> * I have four GRE endpoints on the VM network, one at the network node
> (192.168.239.99) and three on compute nodes
> (192.168.239.{110,114,115}), all with IDs 2-5.
>
> * I have a fifth GRE endpoint with id 1 to 192.168.241.99, the network
> node's management interface. This was the first tunnel created when I
> deployed the network node because that is how I set the remote_ip in
> the ovs plugin ini. I corrected the setting later, but the
> 192.168.241.99 endpoint persists and, as your response implies, *this
> extraneous endpoint is the cause of my troubles*.
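>
> * Reconstructing from the IDs above, ovs_tunnel_endpoints presumably
> looked like this before the fix (the id-1 row being the stale one):
>
> +-----------------+----+
> | ip_address      | id |
> +-----------------+----+
> | 192.168.241.99  | 1  |
> | 192.168.239.99  | 2  |
> | 192.168.239.110 | 3  |
> | 192.168.239.114 | 4  |
> | 192.168.239.115 | 5  |
> +-----------------+----+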
>
> My next question, then, is what exactly is happening. My guess:
>
> * I ping a guest from the external network using its floater (10.21.166.4).
>
> * It gets NAT'd at the tenant router on the network node to
> 192.168.252.3, at which point an ARP request is sent over the unified
> GRE broadcast domain.
>
> * On a compute node, the ARP request is received by the VM, which then
> sends a reply to the tenant router's MAC (which I verified with
> tcpdump).
>
> * There are four endpoints for the packet to go down:
>
> Bridge br-tun
> Port br-tun
> Interface br-tun
> type: internal
> Port "gre-1"
> Interface "gre-1"
> type: gre
> options: {in_key=flow, out_key=flow, remote_ip="192.168.241.99"}
> Port "gre-4"
> Interface "gre-4"
> type: gre
> options: {in_key=flow, out_key=flow, remote_ip="192.168.239.114"}
> Port "gre-3"
> Interface "gre-3"
> type: gre
> options: {in_key=flow, out_key=flow, remote_ip="192.168.239.110"}
> Port patch-int
> Interface patch-int
> type: patch
> options: {peer=patch-tun}
> Port "gre-2"
> Interface "gre-2"
> type: gre
> options: {in_key=flow, out_key=flow, remote_ip="192.168.239.99"}
>
> Here's where I get confused. Does it know that gre-1 is a different
> broadcast domain than the others, or does it see all the endpoints as
> part of the same domain?
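>
> One way I could probably check this (a sketch, assuming the stock
> Open vSwitch tools are installed) is to dump the flow table on br-tun
> and see which tunnel ports the flood rules output to:
>
> # on the network node
> sudo ovs-ofctl dump-flows br-tun
>
> If gre-1 turns up in the same flood actions as gre-2 through gre-4,
> then the agent is treating all five endpoints as one broadcast domain.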
>
> What happens here? Is this the cause of my network timeouts on
> external connections to the VMs? Does this also explain the sporadic
> nature of the timeouts, why they aren't consistent in frequency or
> duration?
>
> Finally, what happens when I remove the oddball endpoint from the DB?
> Sounds risky!
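>
> (Before deleting anything I'd at least back the table up, something
> like:
>
> mysqldump quantum ovs_tunnel_endpoints > ovs_tunnel_endpoints.sql
>
> so the row could be restored if the surgery goes wrong.)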
>
> Thanks for your help
> --Greg Chavez
>
> On Fri, May 10, 2013 at 7:17 PM, Darragh O'Reilly
> <dara2002-openstack@xxxxxxxxx> wrote:
>> I'm not sure how to rectify that. You may have to delete the bad row from the DB and restart the agents:
>>
>> mysql> use quantum;
>> mysql> select * from ovs_tunnel_endpoints;
>> ...
>>
>> On Fri, May 10, 2013 at 6:43 PM, Greg Chavez <greg.chavez@xxxxxxxxx> wrote:
>>> I'm refactoring my question once again (see "A Grizzly arping
>>> failure" and "Failure to arp by quantum router").
>>>
>>> Quickly, the problem is in a multi-node Grizzly+Raring setup with a
>>> separate network node and a dedicated VLAN for VM traffic. External
>>> connections time out within a minute and don't resume until traffic is
>>> initiated from the VM.
>>>
>>> I got some rather annoying and hostile assistance just now on IRC,
>>> and while it didn't result in a fix, it got me to realize that the
>>> problem is possibly with my GRE setup.
>>>
>>> I made a mistake when I originally set this up, assigning the mgmt
>>> interface of the network node (192.168.241.99) as its GRE remote_ip
>>> instead of the vm_config network interface (192.168.239.99). I
>>> realized my mistake, reconfigured the OVS plugin on the network
>>> node, and moved on. But now, taking a look at my OVS bridges on the
>>> network node, I see that the old remote IP is still there!
>>>
>>> Bridge br-tun
>>> <snip>
>>> Port "gre-1"
>>> Interface "gre-1"
>>> type: gre
>>> options: {in_key=flow, out_key=flow, remote_ip="192.168.241.99"}
>>> <snip>
>>>
>>> This is also the case on all the compute nodes.
>>>
>>> (Full ovs-vsctl show output here: http://pastebin.com/xbre1fNV)
>>>
>>> What's more, I have this error every time I restart OVS:
>>>
>>> 2013-05-10 18:21:24 ERROR [quantum.agent.linux.ovs_lib] Unable to
>>> execute ['ovs-vsctl', '--timeout=2', 'add-port', 'br-tun', 'gre-5'].
>>> Exception:
>>> Command: ['sudo', 'quantum-rootwrap', '/etc/quantum/rootwrap.conf',
>>> 'ovs-vsctl', '--timeout=2', 'add-port', 'br-tun', 'gre-5']
>>> Exit code: 1
>>> Stdout: ''
>>> Stderr: 'ovs-vsctl: cannot create a port named gre-5 because a port
>>> named gre-5 already exists on bridge br-tun\n'
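>>>
>>> (That port is easy to confirm with something like:
>>>
>>> ovs-vsctl list-ports br-tun | grep gre-5
>>>
>>> since the error says it already exists on br-tun.)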
>>>
>>> Could that be because gre-1 is vestigial and possibly fouling up the
>>> works by creating two possible paths for VM traffic?
>>>
>>> Is it as simple as removing it with ovs-vsctl or is something else required?
>>>
>>> Or is this actually needed for some reason? Argh... help!
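>>>
>>> (For what it's worth, the removal I have in mind would just be
>>> something like:
>>>
>>> sudo ovs-vsctl del-port br-tun gre-1
>>>
>>> on each node, but I don't know whether the agent would simply
>>> re-create the port from whatever it has recorded in its database.)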