
fuel-dev team mailing list archive

Re: Bonding problems

 

Some interesting statistics captured today:

LACP, rebalance 10s. The test ran for 3 hours
38416712 packets transmitted, 38415885 received, time 11076175ms
827 pkts or 0.00215% lost
Loss rate: 0.0747 pkt/sec or 1 pkt every 13.4 sec

LACP, rebalance 100s. The test ran for two and a half hours
30633241 packets transmitted, 30632595 received, time 8822302ms
646 pkts or 0.00211% lost
Loss rate: 0.0732 pkt/sec or 1 pkt every 13.7 sec

LACP, rebalance off. The test ran for half an hour
5604637 packets transmitted, 5604486 received, time 1615652ms
151 pkts or 0.0027% lost
Loss rate: 0.0935 pkt/sec or 1 pkt every 10.7 sec

Non-LACP balance-slb, rebalance off. The test ran for 47 minutes
9847707 packets transmitted, 9847471 received, time 2872696ms
236 pkts or 0.0024% lost
Loss rate: 0.0822 pkt/sec or 1 pkt every 12.2 sec
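
The loss figures above can be re-derived from the ping summary lines. Here is a minimal Python sketch of that arithmetic. It is not part of the original test setup: the `loss_stats` helper and its field names are hypothetical, and it assumes the summary format shown above.

```python
import re
from dataclasses import dataclass

# Matches a ping summary such as:
#   "38416712 packets transmitted, 38415885 received, time 11076175ms"
# Anything between "received," and "time" (e.g. "0% packet loss,") is skipped.
SUMMARY_RE = re.compile(
    r"(?P<tx>\d+) packets transmitted, (?P<rx>\d+) received,.*?time (?P<ms>\d+)ms"
)


@dataclass
class LossStats:
    lost: int             # packets sent but never answered
    loss_percent: float   # lost / transmitted, in percent
    loss_per_sec: float   # lost packets per second of test time
    secs_per_loss: float  # average seconds between two lost packets


def loss_stats(summary: str) -> LossStats:
    """Derive loss figures from one ping summary line (hypothetical helper)."""
    m = SUMMARY_RE.search(summary)
    if m is None:
        raise ValueError("unrecognised ping summary: %r" % summary)
    tx, rx, ms = int(m["tx"]), int(m["rx"]), int(m["ms"])
    if tx == 0 or ms == 0:
        raise ValueError("summary reports no packets or zero duration")
    lost = tx - rx
    seconds = ms / 1000.0
    return LossStats(
        lost=lost,
        loss_percent=100.0 * lost / tx,
        loss_per_sec=lost / seconds,
        secs_per_loss=seconds / lost if lost else float("inf"),
    )


if __name__ == "__main__":
    runs = {
        "LACP, rebalance 10s": "38416712 packets transmitted, 38415885 received, time 11076175ms",
        "balance-slb, rebalance off": "9847707 packets transmitted, 9847471 received, time 2872696ms",
    }
    for name, line in runs.items():
        s = loss_stats(line)
        print(f"{name}: {s.lost} lost, {s.loss_percent:.5f}%, "
              f"{s.loss_per_sec:.4f} pkt/sec, 1 pkt every {s.secs_per_loss:.1f} sec")
```

Normalising by test duration rather than by packet count makes runs of different lengths comparable, which is why the per-second rate is the figure to look at across the four runs.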

Environment description:
Two hardware nodes, a controller and a compute node, are each connected with
4 interfaces to the Procurve 2510G switch.
A CentOS Neutron VLAN cluster is deployed. The Public and Private networks are
assigned to 2-interface bonds on each node.
I ran a flood ping from an external physical node to the floating IP
assigned to the VM. So the traffic flows through one bond to the
controller, then through the same bond to the switch, and finally through
another bond to the VM.
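
For reference, bond settings like the ones compared above might be expressed as `ovs-vsctl add-bond` invocations. The Python sketch below only builds the argument lists and runs nothing. The `bond_command` helper and the bridge, bond and NIC names are placeholders, the exact settings used in the lab are not recorded in this message, and the check that balance-tcp needs LACP reflects Ryan's note quoted further down. Treating an interval of 0 as "rebalance off" is also an assumption taken from the discussion below, so check it against the documentation of the OVS version in use.

```python
from typing import List, Optional, Sequence


def bond_command(
    bridge: str,
    bond: str,
    nics: Sequence[str],
    mode: str,
    lacp: str = "off",
    rebalance_ms: Optional[int] = None,
) -> List[str]:
    """Build (but do not run) an ovs-vsctl add-bond command line.

    Hypothetical helper for illustration; all names passed in are placeholders.
    """
    if len(nics) < 2:
        raise ValueError("a bond needs at least two interfaces")
    if mode not in ("balance-tcp", "balance-slb", "active-backup"):
        raise ValueError("unknown bond_mode: %s" % mode)
    if lacp not in ("active", "passive", "off"):
        raise ValueError("unknown lacp setting: %s" % lacp)
    # Per the thread: balance-tcp only passes traffic after LACP negotiation.
    if mode == "balance-tcp" and lacp == "off":
        raise ValueError("balance-tcp requires LACP to be configured")

    cmd = ["ovs-vsctl", "add-bond", bridge, bond, *nics,
           "bond_mode=%s" % mode, "lacp=%s" % lacp]
    if rebalance_ms is not None:
        # bond-rebalance-interval is given in milliseconds.
        cmd.append("other_config:bond-rebalance-interval=%d" % rebalance_ms)
    return cmd


if __name__ == "__main__":
    # Placeholder names; balance-tcp for the LACP case is assumed for illustration.
    lacp_10s = bond_command("br-ex", "bond0", ["eth0", "eth1"],
                            mode="balance-tcp", lacp="active", rebalance_ms=10000)
    slb_off = bond_command("br-ex", "bond0", ["eth0", "eth1"],
                           mode="balance-slb", lacp="off", rebalance_ms=0)
    print(" ".join(lacp_10s))
    print(" ".join(slb_off))
```

Keeping the command as a list of arguments lets it be passed to `subprocess.run` without shell quoting, should someone want to apply it on a test node.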

When I ping a Public IP of the Controller itself, there are no dropped
packets at all. So it seems the drops occur somewhere inside the
Neutron-related part of OVS, and the different types of bonding don't cause
any packet drops.



On Wed, Feb 26, 2014 at 2:36 PM, Andrey Danin <adanin@xxxxxxxxxxxx> wrote:

> The overnight flood ping through LACP didn't lose any packets.
>
>
> On Wed, Feb 26, 2014 at 12:31 AM, Vladimir Kuklin <vkuklin@xxxxxxxxxxxx> wrote:
>
>> Guys, the suggested https://review.openstack.org/76345 fix works OK, though
>> it makes patch names impossible to understand :-) So we are waiting for
>> Sergey to provide a more human-readable workaround. But we can continue
>> testing with this patch applied to ensure that the 1.9.3 downgrade does not
>> introduce any regressions.
>>
>>
>> On Wed, Feb 26, 2014 at 12:11 AM, Vladimir Kuklin <vkuklin@xxxxxxxxxxxx> wrote:
>>
>>> Guys, we are testing OVS 1.9.3 on Ubuntu right now. It seems we have
>>> some problems with the l23network module:
>>> https://bugs.launchpad.net/fuel/+bug/1284801
>>> We are going to apply a workaround for it. If everything else goes fine,
>>> we are going to move to 1.9.3, as it is the OVS LTS version for both CentOS
>>> and Ubuntu.
>>>
>>>
>>> On Tue, Feb 25, 2014 at 11:27 PM, Mike Scherbakov <
>>> mscherbakov@xxxxxxxxxxxx> wrote:
>>>
>>>> Great news!!!
>>>> Andrey, thanks for staying late and waking up early these days in order
>>>> to resolve this. You deserve a good rest. Przemek, thanks for the help!
>>>> Documentation is really needed; otherwise users will be coming back
>>>> to us and complaining that something doesn't work.
>>>>
>>>>
>>>>
>>>> On Tue, Feb 25, 2014 at 11:04 PM, Andrey Danin <adanin@xxxxxxxxxxxx> wrote:
>>>>
>>>>> Okay. I have finally learned how to set up LACP between OVS and the Procurve
>>>>> 2510G. It works fine, just as balance-slb does. I am leaving my flood ping
>>>>> running overnight and will tell you the results tomorrow. It seems we can go
>>>>> to production with the current versions of openvswitch. But here in Moscow
>>>>> we are still trying to build a fully OVS-1.9.3 ISO and test it.
>>>>>
>>>>> Of course we need to document all the issues properly. As far as I know,
>>>>> Przemek wants to publish well-written examples of OVS, Cisco, Juniper and
>>>>> Arista configs for enabling LACP.
>>>>>
>>>>>
>>>>> On Tue, Feb 25, 2014 at 3:12 PM, Mike Scherbakov <
>>>>> mscherbakov@xxxxxxxxxxxx> wrote:
>>>>>
>>>>>> Good news.
>>>>>> Thanks Andrey, keep going!
>>>>>>
>>>>>>
>>>>>> On Tue, Feb 25, 2014 at 2:28 PM, Andrey Danin <adanin@xxxxxxxxxxxx> wrote:
>>>>>>
>>>>>>> After 14 hours of flood ping, the hardware lab lost a few packets and
>>>>>>> the virtual env lost hundreds of packets. Mode: balance-slb.
>>>>>>>
>>>>>>> I'm going to test LACP behaviour today.
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Feb 25, 2014 at 3:50 AM, Andrey Danin <adanin@xxxxxxxxxxxx> wrote:
>>>>>>>
>>>>>>>> Fine. They wrote about that in the documentation too:
>>>>>>>> http://openvswitch.org/ovs-vswitchd.conf.db.5.pdf page 14. It was
>>>>>>>> introduced two years ago, in version 1.5.0. One problem less!
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, Feb 25, 2014 at 3:37 AM, Ryan Moe <rmoe@xxxxxxxxxxxx> wrote:
>>>>>>>>
>>>>>>>>> Andrey is correct. It appears that balance-tcp requires successful
>>>>>>>>> LACP negotiation. See here:
>>>>>>>>> https://github.com/osrg/openvswitch/blob/master/lib/bond.c#L610 and here:
>>>>>>>>> https://github.com/osrg/openvswitch/blob/master/lib/bond.c#L1438.
>>>>>>>>> This also means that when we create bonds with balance-tcp we need to
>>>>>>>>> configure lacp as well.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Mon, Feb 24, 2014 at 3:14 PM, Andrey Danin <adanin@xxxxxxxxxxxx>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> And yes, the bug https://bugs.launchpad.net/fuel/+bug/1272842 and the
>>>>>>>>>> current problem may be unrelated, but they have similar error messages
>>>>>>>>>> in the OVS logs.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Tue, Feb 25, 2014 at 2:55 AM, Andrey Danin <
>>>>>>>>>> adanin@xxxxxxxxxxxx> wrote:
>>>>>>>>>>
>>>>>>>>>>> Guys, I set up hardware (2 nodes) and software (3 nodes) labs
>>>>>>>>>>> today with ISO #181 to test bonding. Unfortunately, balance-tcp mode is
>>>>>>>>>>> totally broken. When I use it during deployment or switch to it in a
>>>>>>>>>>> working cluster, all traffic stops. Playing with the rebalance interval
>>>>>>>>>>> doesn't help.
>>>>>>>>>>> By contrast, balance-slb works fine. Both Ubuntu (hardware
>>>>>>>>>>> nodes) and CentOS (virtual env) work without any traffic loss. I'm running
>>>>>>>>>>> a flood ping between virtual instances inside the clouds overnight and
>>>>>>>>>>> will check the number of lost packets. I also want to play with iperf.
>>>>>>>>>>>
>>>>>>>>>>> Next things we can do:
>>>>>>>>>>> * Build an ISO with the stable (1.9.3) or newest (2.0.x) version of
>>>>>>>>>>> OVS and play with it. Yesterday we decided to build Ubuntu 12.04 with
>>>>>>>>>>> the Debian Sid 1.9.3 version of OVS. There is a ticket about that:
>>>>>>>>>>> https://mirantis.jira.com/browse/OSCI-1089 Igor has also built his
>>>>>>>>>>> own version of an ISO with the Sid package.
>>>>>>>>>>> * Dump the OpenFlow rules in balance-tcp mode and try to fix them.
>>>>>>>>>>> That's hard to do because aliens developed their syntax.
>>>>>>>>>>> * Run Igor's tests again and again until balance-slb starts
>>>>>>>>>>> blocking traffic. Then dig into the OpenFlow rules.
>>>>>>>>>>> * Play with LACP on real hardware. Maybe balance-tcp can be
>>>>>>>>>>> used only with lacp=active.
>>>>>>>>>>> * Ask the openvswitch community about our problems.
>>>>>>>>>>>
>>>>>>>>>>> Andrew, yes, the PXE network is still nailed to an interface. I
>>>>>>>>>>> hope we will fix it in 5.0.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Feb 25, 2014 at 12:20 AM, Igor Shishkin <
>>>>>>>>>>> ishishkin@xxxxxxxxxxxx> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hello, Dmitry.
>>>>>>>>>>>>
>>>>>>>>>>>> It's 100% reproducible on a virtual environment when we try
>>>>>>>>>>>> to deploy bonding in balance-tcp or balance-slb mode.
>>>>>>>>>>>> The tests serve as a way to reproduce it and as a warning of why these
>>>>>>>>>>>> tests will fail once they are merged.
>>>>>>>>>>>>
>>>>>>>>>>>> As far as we can see, the problem is in the rebalance procedure that
>>>>>>>>>>>> openvswitch runs after it starts a bonded interface. During this time the
>>>>>>>>>>>> bonded interface stops accepting ARPs.
>>>>>>>>>>>>
>>>>>>>>>>>> I just built openvswitch=1.9.3, which is LTS, and want to try it in
>>>>>>>>>>>> the same case, and also try to decrease bond-rebalance-interval to 0 (as
>>>>>>>>>>>> Andrey K. suggested). If either of these helps, this could be the solution
>>>>>>>>>>>> (but I'm really not sure bond-rebalance-interval=0 is a good way).
>>>>>>>>>>>> —
>>>>>>>>>>>> Igor Shishkin
>>>>>>>>>>>> QA Engineer
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On 24 Feb 2014, at 23:59, Dmitry Borodaenko <
>>>>>>>>>>>> dborodaenko@xxxxxxxxxxxx> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> > Mike, Igor,
>>>>>>>>>>>> >
>>>>>>>>>>>> > Can you provide more details on how the integration test in review
>>>>>>>>>>>> > #75161 helps to reproduce bug #1272842?
>>>>>>>>>>>> >
>>>>>>>>>>>> > As far as I understand, the bug is a highly intermittent problem with
>>>>>>>>>>>> > ARP that was only showing up after an environment with LACP bonding
>>>>>>>>>>>> > was operational for at least a few hours.
>>>>>>>>>>>> >
>>>>>>>>>>>> > On the other hand, the problem Igor is reporting based on the
>>>>>>>>>>>> > integration test sounds like something 100% reproducible that doesn't
>>>>>>>>>>>> > require real hardware or LACP and is not necessarily related to ARP.
>>>>>>>>>>>> >
>>>>>>>>>>>> > Are you sure you're not confusing two unrelated problems?
>>>>>>>>>>>> >
>>>>>>>>>>>> > Thanks,
>>>>>>>>>>>> > -DmitryB
>>>>>>>>>>>> >
>>>>>>>>>>>> >
>>>>>>>>>>>> > On Mon, Feb 24, 2014 at 9:18 AM, Mike Scherbakov
>>>>>>>>>>>> > <mscherbakov@xxxxxxxxxxxx> wrote:
>>>>>>>>>>>> >> The issue is here: https://bugs.launchpad.net/fuel/+bug/1272842.
>>>>>>>>>>>> >> Those who know what can be wrong with our openvswitch/kernel, please
>>>>>>>>>>>> >> provide your input.
>>>>>>>>>>>> >>
>>>>>>>>>>>> >>
>>>>>>>>>>>> >> On Mon, Feb 24, 2014 at 9:04 PM, Igor Shishkin <ishishkin@xxxxxxxxxxxx>
>>>>>>>>>>>> >> wrote:
>>>>>>>>>>>> >>>
>>>>>>>>>>>> >>> Hello,
>>>>>>>>>>>> >>>
>>>>>>>>>>>> >>> Currently we have this review https://review.openstack.org/#/c/75161
>>>>>>>>>>>> >>> with test cases for our brand new shiny bonding feature, but
>>>>>>>>>>>> >>> balance-tcp/balance-slb modes are not working for now.
>>>>>>>>>>>> >>>
>>>>>>>>>>>> >>> Steps to reproduce are very simple:
>>>>>>>>>>>> >>> Create a cluster with a simple or HA configuration, select the
>>>>>>>>>>>> >>> balance-tcp or balance-slb bonding mode and start deployment.
>>>>>>>>>>>> >>>
>>>>>>>>>>>> >>> Deployment will not finish successfully because of problems with
>>>>>>>>>>>> >>> the rebalance procedure.
>>>>>>>>>>>> >>> --
>>>>>>>>>>>> >>> Igor Shishkin
>>>>>>>>>>>> >>> QA Engineer
>>>>>>>>>>>> >>>
>>>>>>>>>>>> >>>
>>>>>>>>>>>> >>>
>>>>>>>>>>>> >>
>>>>>>>>>>>> >>
>>>>>>>>>>>> >>
>>>>>>>>>>>> >> --
>>>>>>>>>>>> >> Mike Scherbakov
>>>>>>>>>>>> >> #mihgen
>>>>>>>>>>>> >>
>>>>>>>>>>>> >> --
>>>>>>>>>>>> >> Mailing list: https://launchpad.net/~fuel-dev
>>>>>>>>>>>> >> Post to     : fuel-dev@xxxxxxxxxxxxxxxxxxx
>>>>>>>>>>>> >> Unsubscribe : https://launchpad.net/~fuel-dev
>>>>>>>>>>>> >> More help   : https://help.launchpad.net/ListHelp
>>>>>>>>>>>> >>
>>>>>>>>>>>> >
>>>>>>>>>>>> >
>>>>>>>>>>>> >
>>>>>>>>>>>> > --
>>>>>>>>>>>> > Dmitry Borodaenko
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Andrey Danin
>>>>>>>>>>> adanin@xxxxxxxxxxxx
>>>>>>>>>>> skype: gcon.monolake
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Andrey Danin
>>>>>>>>>> adanin@xxxxxxxxxxxx
>>>>>>>>>> skype: gcon.monolake
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Andrey Danin
>>>>>>>> adanin@xxxxxxxxxxxx
>>>>>>>> skype: gcon.monolake
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Andrey Danin
>>>>>>> adanin@xxxxxxxxxxxx
>>>>>>> skype: gcon.monolake
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Mike Scherbakov
>>>>>> #mihgen
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Andrey Danin
>>>>> adanin@xxxxxxxxxxxx
>>>>> skype: gcon.monolake
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Mike Scherbakov
>>>> #mihgen
>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> Yours Faithfully,
>>> Vladimir Kuklin,
>>> Senior Deployment Engineer,
>>> Mirantis, Inc.
>>> +7 (495) 640-49-04
>>> +7 (926) 702-39-68
>>> Skype kuklinvv
>>> 45bk3, Vorontsovskaya Str.
>>> Moscow, Russia,
>>> www.mirantis.com <http://www.mirantis.ru/>
>>> www.mirantis.ru
>>> vkuklin@xxxxxxxxxxxx
>>>
>>
>>
>>
>> --
>> Yours Faithfully,
>> Vladimir Kuklin,
>> Senior Deployment Engineer,
>> Mirantis, Inc.
>> +7 (495) 640-49-04
>> +7 (926) 702-39-68
>> Skype kuklinvv
>> 45bk3, Vorontsovskaya Str.
>> Moscow, Russia,
>> www.mirantis.com <http://www.mirantis.ru/>
>> www.mirantis.ru
>> vkuklin@xxxxxxxxxxxx
>>
>
>
>
> --
> Andrey Danin
> adanin@xxxxxxxxxxxx
> skype: gcon.monolake
>



-- 
Andrey Danin
adanin@xxxxxxxxxxxx
skype: gcon.monolake
