fuel-dev team mailing list archive

Re: Release blocker: Moving management vip breaks rabbitmq sessions

To: Mike Scherbakov <mscherbakov@xxxxxxxxxxxx>
From: Dmitry Borodaenko <dborodaenko@xxxxxxxxxxxx>
Date: Sun, 2 Mar 2014 00:46:09 -0800
Cc: "fuel-dev@xxxxxxxxxxxxxxxxxxx" <fuel-dev@xxxxxxxxxxxxxxxxxxx>, Dina Belova <dbelova@xxxxxxxxxxxx>
In-reply-to: <CAM0pNLPnNvjpXG-+RD0T53fPqwgKZpiYCgtfe8u_JF-R1oWrRQ@mail.gmail.com>

I have finally pushed the first version of the RabbitMQ fix to gerrit:
https://review.openstack.org/77409

I tried to keep changes to a minimum and avoid refactoring, but due to
the high amount of code duplication and inconsistencies in the RabbitMQ
configuration for different OpenStack components, the fix turned out
more intrusive than I expected. Please review and test with care.

Please note that the current version of the fix doesn't even fully
cover the scope of part (1) from my plan quoted below:

1a) It doesn't change the Neutron configuration; I need Sergey's help
with this. Sergey, you already have a TODO item in
sanitize_neutron_config() that is supposed to do exactly what's needed
here: put a list of controller IPs with port 5673 into
neutron_config[amqp][hosts].

1b) It doesn't change the Murano configuration. Murano seems to be using
its own implementation of a RabbitMQ-based RPC backend instead of the
almost homogeneous zoo of impl_kombu implementations used by the rest
of OpenStack. I'm not even sure it has the same reconnect mechanism as
impl_kombu. Can anyone from the Murano team comment?
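
To illustrate what (1a) asks for, here is a minimal Python sketch of
building such a host list. The helper name build_amqp_hosts is
hypothetical, and the IPs are the backend addresses from the HAProxy
snippet quoted further down; only port 5673 and the comma-separated
host:port form come from this thread.

```python
def build_amqp_hosts(controller_ips, port=5673):
    """Join controller management IPs into a comma-separated host:port list."""
    return ",".join("{0}:{1}".format(ip, port) for ip in controller_ips)


# Illustrative addresses only (hypothetical deployment values).
controllers = ["192.168.0.3", "192.168.0.4", "192.168.0.5"]
hosts = build_amqp_hosts(controllers)
print(hosts)
```

The same string shape would apply to rabbit_hosts for the other
services, with whichever port RabbitMQ itself listens on.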

I've also made no progress on parts (2) and (3) today (flush_routes
and read_timeout). If there are people willing to work on this on
Sunday in EU timezones, your help would be most welcome.
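
For part (3), the connection string change might look roughly like the
sketch below. It only shows how a read_timeout parameter could be
appended to a SQLAlchemy-style URL; the helper name, credentials and
database name are made up, 192.168.0.2 is the VIP from the HAProxy
snippet quoted below, and whether the driver honors the option depends
on the python-mysqldb version installed.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_read_timeout(url, seconds=90):
    """Return url with read_timeout set, keeping any existing query options."""
    parts = urlsplit(url)
    query = dict(parse_qsl(parts.query))
    query["read_timeout"] = str(seconds)
    return urlunsplit(parts._replace(query=urlencode(query)))


# Hypothetical connection string, for illustration only.
example = "mysql://nova:secret@192.168.0.2/nova?charset=utf8"
patched = with_read_timeout(example)
print(patched)
```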

Thanks,
-DmitryB

On Sat, Mar 1, 2014 at 2:46 AM, Dmitry Borodaenko
<dborodaenko@xxxxxxxxxxxx> wrote:
> The solution we have consists of 3 parts:
>
> 1) Reconfigure OpenStack services to bypass HAProxy and connect to
> RabbitMQ directly on the controllers. Our testing shows that this
> actually resolves the RabbitMQ side of the problem.
>
> I'm working on a fuel-library patch that will do that; it should be
> mostly straightforward except for working around all the code
> duplication and hardcoded values in different puppet modules. An
> action item for the EU timezone that I think would be most helpful is
> to test the proposed configuration (rabbit_hosts=<controller-1 mgmt
> ip>:5672,<controller-2 mgmt ip>:5672, etc.) in as many different
> failover and vip move scenarios as possible.
>
> One more thing I'm considering that would be worth testing is to see
> if it would be even better to point the controller services to
> rabbit_hosts=127.0.0.1:5672, and leave only compute and other
> non-controller nodes with the enumeration of controller management IPs
> in rabbit_hosts.
>
> 2) Enable flush_routes option for management and public VIPs in crm
> configuration, and restart HAProxy via crm when vip moves (including
> after failover). Our testing shows that these two actions reduce the
> probability of services locking up waiting for a read syscall to time
> out on a hung MySQL connection.
>
> 3) Upgrade python-mysqldb to version 1.2.5 as requested in OSCI-1105,
> and modify mysql connection strings to include read_timeout=90. I'm
> open to suggestions for the timeout value: since it drives the
> duration of a possible service outage after failover, it should
> definitely be lower than the Linux kernel default of 10 minutes, but
> long enough not to drop connections due to slow SQL queries. This is
> something I can't do without help from the OSCI team: we need deb and
> rpm packages built and tested so that we know they're safe to include
> in 4.1, and can test them in combination with the other fixes.
>
> Thanks,
> -DmitryB
>
>
> On Sat, Mar 1, 2014 at 12:47 AM, Mike Scherbakov
> <mscherbakov@xxxxxxxxxxxx> wrote:
>> Folks,
>> what is the current status on this? I saw a few comments in the bug,
>> but I'm wondering which action items the European timezone can take
>> on Monday to keep this moving.
>>
>> Thanks,
>>
>>
>> On Fri, Feb 28, 2014 at 9:58 PM, Dmitry Borodaenko
>> <dborodaenko@xxxxxxxxxxxx> wrote:
>>>
>>> Dear all,
>>>
>>> Please make sure that all discussions that occur elsewhere (this ML
>>> thread, chats, etc.) end up reflected in the LaunchPad bug (even if a
>>> theory is discussed and then eliminated, it's useful to have it
>>> mentioned in the bug so that other people don't repeat the same line
>>> of investigation). I originally emailed fuel-dev@ only to attract
>>> attention to the problem; I did not intend to split the discussion.
>>>
>>> Thanks,
>>>
>>> On Fri, Feb 28, 2014 at 8:35 AM, Matthew Mosesohn
>>> <mmosesohn@xxxxxxxxxxxx> wrote:
>>> > I started reaching out to our community folks, Dina and Dmitry.
>>> >
>>> > We tried a few variations, but got the same result: nova and cinder
>>> > dislike having the AMQP backend shifted from underneath them.
>>> >
>>> > If we remove haproxy and connect directly to RabbitMQ on a virtual IP,
>>> > all nova and cinder services die when we shift the virtual IP to
>>> > another node. Neutron somehow survives and reconnects in about 25
>>> > seconds and picks up where it left off.
>>> >
>>> > For the record, we're running on 2013.2.2 code. Dmitry Mescheryakov
>>> > asked me to provide a diff of the RPC code between Neutron and
>>> > Cinder, to help determine why Neutron can resume connections but
>>> > Cinder cannot. Here is the diff:
>>> > http://paste.openstack.org/show/uXyeYUGxMiAhmcGlK8VZ/
>>> >
>>> > For more info:
>>> > Errors we see in Cinder logs:
>>> > http://pastie.org/private/w8iigjzijfczvsw5ddelwq
>>> > Errors we see in Neutron logs:
>>> > http://pastie.org/private/uelxryhbr42jijip0loe2w
>>> >
>>> > In the bug, mentioned earlier in this thread, we have a diagnostic
>>> > snapshot.
>>> >
>>> > We're still digging for leads to fix this HA failover issue.
>>> >
>>> > -Matthew
>>> >
>>> > On Fri, Feb 28, 2014 at 1:12 PM, Vladimir Kuklin <vkuklin@xxxxxxxxxxxx> wrote:
>>> >> It will not help if you shut down the controller. The problem is that
>>> >> you have hung AMQP sessions, which the kombu driver does not seem to
>>> >> handle correctly.
>>> >>
>>> >>
>>> >> On Fri, Feb 28, 2014 at 1:09 PM, Bogdan Dobrelya <bdobrelia@xxxxxxxxxxxx> wrote:
>>> >>>
>>> >>> On 02/28/2014 05:44 AM, Dmitry Borodaenko wrote:
>>> >>> > Team,
>>> >>> >
>>> >>> > Ryan and I have spent all day investigating
>>> >>> > https://bugs.launchpad.net/fuel/+bug/1285449
>>> >>> >
>>> >>> > What we have found so far confirms that this is a critical bug that
>>> >>> > absolutely must be resolved before 4.1 is released. I have documented
>>> >>> > our findings in the bug comments; someone please take over the
>>> >>> > investigation when you come to the office tomorrow morning MSK time.
>>> >>> >
>>> >>> > I have a feeling that once the root cause is found, the fix will be
>>> >>> > low-impact and will involve either a change in HAProxy configuration for
>>> >>> > RabbitMQ, a patch/upgrade of HAProxy or kombu, or something similar.
>>> >>> > But first we need to understand what exactly breaks, and why this only
>>> >>> > affects some services and not all of them.
>>> >>> >
>>> >>> > Thanks,
>>> >>> >
>>> >>>
>>> >>> Here is a recent RabbitMQ discussion quoted from the
>>> >>> Fuel-conductors-support team Skype chat (RU + translation):
>>> >>>
>>> >>> Wednesday, February 26, 2014
>>> >>> [4:00:10 PM] Maxim Yefimov: Коллеги, вопрос есть:
>>> >>> (I have a question)
>>> >>>
>>> >>> listen rabbitmq-openstack
>>> >>>   bind 192.168.0.2:5672
>>> >>>   balance  roundrobin
>>> >>>
>>> >>>   server  node-1 192.168.0.3:5673   check inter 5000 rise 2 fall 3
>>> >>>   server  node-2 192.168.0.4:5673   check inter 5000 rise 2 fall 3 backup
>>> >>>   server  node-3 192.168.0.5:5673   check inter 5000 rise 2 fall 3 backup
>>> >>>
>>> >>> [4:01:01 PM] Maxim Yefimov: Зачем одновременно roundrobin и active-passive?
>>> >>> (Why do we use roundrobin and active-passive at once for RabbitMQ?)
>>> >>>
>>> >>> [4:01:39 PM] Miroslav Anashkin: Чтобы коннект не рвался
>>> >>> (To make sure the connection wouldn't break)
>>> >>>
>>> >>> [4:02:01 PM] Miroslav Anashkin: У кролика кластер существует строго в виде мастер-слейв
>>> >>> (RabbitMQ clustering is restricted to master-slave only)
>>> >>>
>>> >>> [4:02:23 PM] Miroslav Anashkin: Соответственно даже если какая-то нода с запросом к слейву придет - та его на мастер отправит
>>> >>> (Hence, any node's query to the RabbitMQ slave would have been re-sent
>>> >>> to the master)
>>> >>>
>>> >>> [4:02:52 PM] Miroslav Anashkin: Поэтому сделали так чтобы ХАПрокси всегда всех посылал на одну ноду
>>> >>> (That's why HAProxy always redirects all queries to a single RabbitMQ
>>> >>> node)
>>> >>>
>>> >>> Honestly, this explanation isn't clear to me. Why couldn't we make
>>> >>> OpenStack services establish direct connections to arbitrarily
>>> >>> chosen (load-balanced) RabbitMQ nodes, skipping HAProxy altogether?
>>> >>> (Because of this: "any node's query to the RabbitMQ slave would have
>>> >>> been re-sent to the master".)
>>> >>>
>>> >>> Could that resolve the issue? I think I will investigate this option
>>> >>> as well.
>>> >>>
>>> >>>
>>> >>> --
>>> >>> Best regards,
>>> >>> Bogdan Dobrelya,
>>> >>> Skype #bogdando_at_yahoo.com
>>> >>> Irc #bogdando
>>> >>>
>>> >>> --
>>> >>> Mailing list: https://launchpad.net/~fuel-dev
>>> >>> Post to     : fuel-dev@xxxxxxxxxxxxxxxxxxx
>>> >>> Unsubscribe : https://launchpad.net/~fuel-dev
>>> >>> More help   : https://help.launchpad.net/ListHelp
>>> >>
>>> >>
>>> >>
>>> >>
>>> >> --
>>> >> Yours Faithfully,
>>> >> Vladimir Kuklin,
>>> >> Senior Deployment Engineer,
>>> >> Mirantis, Inc.
>>> >> +7 (495) 640-49-04
>>> >> +7 (926) 702-39-68
>>> >> Skype kuklinvv
>>> >> 45bk3, Vorontsovskaya Str.
>>> >> Moscow, Russia,
>>> >> www.mirantis.com
>>> >> www.mirantis.ru
>>> >> vkuklin@xxxxxxxxxxxx
>>> >>
>>> >
>>>
>>>
>>>
>>> --
>>> Dmitry Borodaenko
>>
>>
>>
>>
>> --
>> Mike Scherbakov
>> #mihgen
>
>
>
> --
> Dmitry Borodaenko



-- 
Dmitry Borodaenko

References

Release blocker: Moving management vip breaks rabbitmq sessions
From: Dmitry Borodaenko, 2014-02-28
Re: Release blocker: Moving management vip breaks rabbitmq sessions
From: Bogdan Dobrelya, 2014-02-28
Re: Release blocker: Moving management vip breaks rabbitmq sessions
From: Vladimir Kuklin, 2014-02-28
Re: Release blocker: Moving management vip breaks rabbitmq sessions
From: Matthew Mosesohn, 2014-02-28
Re: Release blocker: Moving management vip breaks rabbitmq sessions
From: Dmitry Borodaenko, 2014-02-28
Re: Release blocker: Moving management vip breaks rabbitmq sessions
From: Mike Scherbakov, 2014-03-01
Re: Release blocker: Moving management vip breaks rabbitmq sessions
From: Dmitry Borodaenko, 2014-03-01