fuel-dev team mailing list archive

Re: Release blocker: Moving management vip breaks rabbitmq sessions

To: Mike Scherbakov <mscherbakov@xxxxxxxxxxxx>
From: Dmitry Borodaenko <dborodaenko@xxxxxxxxxxxx>
Date: Sat, 1 Mar 2014 02:46:39 -0800
Cc: "fuel-dev@xxxxxxxxxxxxxxxxxxx" <fuel-dev@xxxxxxxxxxxxxxxxxxx>, Dina Belova <dbelova@xxxxxxxxxxxx>
In-reply-to: <CAKYN3rOVMDO2QverwyaAg3AqWu3TwmYWJ9bnSLnTHO+4v+fphQ@mail.gmail.com>

The solution we have consists of 3 parts:

1) Reconfigure OpenStack services to bypass HAProxy and connect to
RabbitMQ directly on the controllers. Our testing shows that this
actually resolves the RabbitMQ side of the problem.

I'm working on a fuel-library patch that will do that. It should be
mostly straightforward, apart from working around all the code
duplication and hardcoded values in different puppet modules. The
action item for the EU timezone that I think would be most helpful is
to test the proposed configuration (rabbit_hosts=<controller-1 mgmt
ip>:5672,<controller-2 mgmt ip>:5672, etc.) in as many different
failover and vip move scenarios as possible.

One more variation I'm considering that would be worth testing: point
the controller services at rabbit_hosts=127.0.0.1:5672, and leave only
compute and other non-controller nodes with the full list of
controller management IPs in rabbit_hosts.

2) Enable the flush_routes option for the management and public VIPs
in the crm configuration, and restart HAProxy via crm whenever a vip
moves (including after failover). Our testing shows that these two
actions reduce the probability of services locking up while waiting
for a read syscall to time out on a hung MySQL connection.

3) Upgrade python-mysqldb to version 1.2.5 as requested in OSCI-1105,
and modify the mysql connection strings to include read_timeout=90.
I'm open to suggestions for the timeout value: since it drives the
duration of a possible service outage after failover, it should
definitely be lower than the Linux kernel default of 10 minutes, but
long enough not to drop connections due to slow SQL queries. This is
something I can't do without help from the OSCI team: we need deb and
rpm packages built and tested so that we know they're safe to include
in 4.1, and can test them in combination with the other fixes.
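
For anyone testing parts 1 and 3, the two settings could be generated
along the lines below. This is only a sketch under stated assumptions:
the helper names rabbit_hosts() and with_read_timeout() are made up for
illustration and are not part of fuel-library, and it assumes that
read_timeout is passed as a query parameter on the SQL connection URL.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

AMQP_PORT = 5672  # port used in the rabbit_hosts examples in part 1


def rabbit_hosts(controller_ips, on_controller=False, port=AMQP_PORT):
    """Build a rabbit_hosts value (hypothetical helper).

    on_controller=True gives the 127.0.0.1 variant from part 1;
    otherwise every controller management IP is listed.
    """
    hosts = ["127.0.0.1"] if on_controller else list(controller_ips)
    return ",".join("%s:%d" % (ip, port) for ip in hosts)


def with_read_timeout(connection_url, seconds=90):
    """Return connection_url with read_timeout set (hypothetical helper).

    An existing read_timeout parameter is replaced; other query
    parameters are kept in their original order.
    """
    parts = urlsplit(connection_url)
    query = [(key, value)
             for key, value in parse_qsl(parts.query, keep_blank_values=True)
             if key != "read_timeout"]
    query.append(("read_timeout", str(seconds)))
    return urlunsplit(parts._replace(query=urlencode(query)))
```

With made-up sample values, rabbit_hosts(["192.168.0.3",
"192.168.0.4"]) gives 192.168.0.3:5672,192.168.0.4:5672, and
with_read_timeout("mysql://nova:secret@192.168.0.2/nova?charset=utf8")
appends &read_timeout=90 to the URL. The 90 second default is only the
value proposed above and is still open for discussion.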

Thanks,
-DmitryB


On Sat, Mar 1, 2014 at 12:47 AM, Mike Scherbakov
<mscherbakov@xxxxxxxxxxxx> wrote:
> Folks,
> what is the current status on this? I saw a few comments in the bug, but
> I'm wondering what action items the European timezone can take on Monday
> to keep this moving.
>
> Thanks,
>
>
> On Fri, Feb 28, 2014 at 9:58 PM, Dmitry Borodaenko
> <dborodaenko@xxxxxxxxxxxx> wrote:
>>
>> Dear all,
>>
>> Please make sure that all discussions that occur elsewhere (this ML
>> thread, chats, etc.) end up reflected in the LaunchPad bug (even if a
>> theory is discussed and then eliminated, it's useful to have it
>> mentioned in the bug so that other people don't repeat the same line
>> of investigation). I originally emailed fuel-dev@ only to attract
>> attention to the problem; I did not intend to split the discussion.
>>
>> Thanks,
>>
>> On Fri, Feb 28, 2014 at 8:35 AM, Matthew Mosesohn
>> <mmosesohn@xxxxxxxxxxxx> wrote:
>> > I started reaching out to our community folks, Dina and Dmitry.
>> >
>> > We tried a few variations, but got the same result: nova and cinder
>> > dislike having the AMQP backend shifted from underneath them.
>> >
>> > If we remove haproxy and connect directly to RabbitMQ on a virtual IP,
>> > all nova and cinder services die when we shift the virtual IP to
>> > another node. Neutron somehow survives and reconnects in about 25
>> > seconds and picks up where it left off.
>> >
>> > For the record, we're running on 2013.2.2 code. Dmitry Mescheryakov
>> > asked me to provide a diff of the RPC code between Neutron and Cinder,
>> > which might help determine why Neutron can resume connections but
>> > Cinder doesn't. Here is the diff:
>> > http://paste.openstack.org/show/uXyeYUGxMiAhmcGlK8VZ/
>> >
>> > For more info:
>> > Errors we see in Cinder logs:
>> > http://pastie.org/private/w8iigjzijfczvsw5ddelwq
>> > Errors we see in Neutron logs:
>> > http://pastie.org/private/uelxryhbr42jijip0loe2w
>> >
>> > In the bug, mentioned earlier in this thread, we have a diagnostic
>> > snapshot.
>> >
>> > We're still digging for leads to fix this HA failover issue.
>> >
>> > -Matthew
>> >
>> > On Fri, Feb 28, 2014 at 1:12 PM, Vladimir Kuklin <vkuklin@xxxxxxxxxxxx>
>> > wrote:
>> >> It will not help if you shut down the controller. The problem is that
>> >> you have hung AMQP sessions, which the kombu driver does not seem to
>> >> handle correctly.
>> >>
>> >>
>> >> On Fri, Feb 28, 2014 at 1:09 PM, Bogdan Dobrelya
>> >> <bdobrelia@xxxxxxxxxxxx>
>> >> wrote:
>> >>>
>> >>> On 02/28/2014 05:44 AM, Dmitry Borodaenko wrote:
>> >>> > Team,
>> >>> >
>> >>> > Ryan and I have spent all day investigating
>> >>> > https://bugs.launchpad.net/fuel/+bug/1285449
>> >>> >
>> >>> > What we have found so far confirms that this is a critical bug that
>> >>> > absolutely must be resolved before 4.1 is released.  I have
>> >>> > documented our findings in the bug comments; someone please take
>> >>> > over the investigation when you come to the office tomorrow morning
>> >>> > MSK time.
>> >>> >
>> >>> > I have a feeling that once the root cause is found, the fix will be
>> >>> > low-impact and will involve either a change in the HAProxy
>> >>> > configuration for RabbitMQ, a patch/upgrade of HAProxy or kombu, or
>> >>> > something similar. But first we need to understand what exactly
>> >>> > breaks, and why this only affects some services and not all of them.
>> >>> >
>> >>> > Thanks,
>> >>> >
>> >>>
>> >>> Here is a quote from a recent RabbitMQ discussion in the
>> >>> Fuel-conductors-support team Skype chat (translated from Russian):
>> >>>
>> >>> Wednesday, February 26, 2014
>> >>> [4:00:10 PM] Maxim Yefimov: Colleagues, I have a question:
>> >>>
>> >>> listen rabbitmq-openstack
>> >>>   bind 192.168.0.2:5672
>> >>>   balance  roundrobin
>> >>>
>> >>>   server  node-1 192.168.0.3:5673   check inter 5000 rise 2 fall 3
>> >>>   server  node-2 192.168.0.4:5673   check inter 5000 rise 2 fall 3 backup
>> >>>   server  node-3 192.168.0.5:5673   check inter 5000 rise 2 fall 3 backup
>> >>>
>> >>> [4:01:01 PM] Maxim Yefimov: Why do we use roundrobin and
>> >>> active-passive at once for RabbitMQ?
>> >>>
>> >>> [4:01:39 PM] Miroslav Anashkin: To make sure the connection wouldn't
>> >>> break.
>> >>>
>> >>> [4:02:01 PM] Miroslav Anashkin: RabbitMQ clustering is restricted to
>> >>> master-slave only.
>> >>>
>> >>> [4:02:23 PM] Miroslav Anashkin: Hence, any node's query to the
>> >>> RabbitMQ slave would have been re-sent to the master.
>> >>>
>> >>> [4:02:52 PM] Miroslav Anashkin: That's why HAProxy always redirects
>> >>> all queries to a single RabbitMQ node.
>> >>>
>> >>> Honestly, this explanation is not clear to me. Why couldn't we make
>> >>> OpenStack services establish direct connections to arbitrarily
>> >>> chosen (load-balanced) RabbitMQ nodes, skipping HAProxy altogether?
>> >>> (because of this: "any node's query to the RabbitMQ slave would have
>> >>> been re-sent to the master")
>> >>>
>> >>> Could that resolve the issue? I think I will investigate this option
>> >>> as well.
>> >>>
>> >>>
>> >>> --
>> >>> Best regards,
>> >>> Bogdan Dobrelya,
>> >>> Skype #bogdando_at_yahoo.com
>> >>> Irc #bogdando
>> >>>
>> >>> --
>> >>> Mailing list: https://launchpad.net/~fuel-dev
>> >>> Post to     : fuel-dev@xxxxxxxxxxxxxxxxxxx
>> >>> Unsubscribe : https://launchpad.net/~fuel-dev
>> >>> More help   : https://help.launchpad.net/ListHelp
>> >>
>> >>
>> >>
>> >>
>> >> --
>> >> Yours Faithfully,
>> >> Vladimir Kuklin,
>> >> Senior Deployment Engineer,
>> >> Mirantis, Inc.
>> >> +7 (495) 640-49-04
>> >> +7 (926) 702-39-68
>> >> Skype kuklinvv
>> >> 45bk3, Vorontsovskaya Str.
>> >> Moscow, Russia,
>> >> www.mirantis.com
>> >> www.mirantis.ru
>> >> vkuklin@xxxxxxxxxxxx
>> >>
>> >>
>> >
>>
>>
>>
>> --
>> Dmitry Borodaenko
>>
>
>
>
>
> --
> Mike Scherbakov
> #mihgen



-- 
Dmitry Borodaenko

Follow ups

Re: Release blocker: Moving management vip breaks rabbitmq sessions
From: Roman Alekseenkov, 2014-03-03
Re: Release blocker: Moving management vip breaks rabbitmq sessions
From: Dmitry Borodaenko, 2014-03-02

References

Release blocker: Moving management vip breaks rabbitmq sessions
From: Dmitry Borodaenko, 2014-02-28
Re: Release blocker: Moving management vip breaks rabbitmq sessions
From: Bogdan Dobrelya, 2014-02-28
Re: Release blocker: Moving management vip breaks rabbitmq sessions
From: Vladimir Kuklin, 2014-02-28
Re: Release blocker: Moving management vip breaks rabbitmq sessions
From: Matthew Mosesohn, 2014-02-28
Re: Release blocker: Moving management vip breaks rabbitmq sessions
From: Dmitry Borodaenko, 2014-02-28
Re: Release blocker: Moving management vip breaks rabbitmq sessions
From: Mike Scherbakov, 2014-03-01