fuel-dev team mailing list archive

Re: Release blocker: Moving management vip breaks rabbitmq sessions

 

Dear all,

Please make sure that all discussions that occur elsewhere (this ML
thread, chats, etc.) end up reflected in the Launchpad bug. Even if a
theory is discussed and then eliminated, it is useful to have it
mentioned in the bug so that other people don't repeat the same line
of investigation. I originally emailed fuel-dev@ only to draw
attention to the problem; I did not intend to split the discussion.

Thanks,

On Fri, Feb 28, 2014 at 8:35 AM, Matthew Mosesohn
<mmosesohn@xxxxxxxxxxxx> wrote:
> I started reaching out to our community folks, Dina and Dmitry.
>
> We tried a few variations, but got the same result: nova and cinder
> dislike having the AMQP backend shifted from underneath them.
>
> If we remove haproxy and connect directly to RabbitMQ on a virtual IP,
> all nova and cinder services die when we shift the virtual IP to
> another node. Neutron somehow survives and reconnects in about 25
> seconds and picks up where it left off.
>
> For the record, we're running 2013.2.2 code. Dmitry Mescheryakov
> asked me to provide a diff of the RPC code between neutron and
> cinder, to help determine why Neutron can resume connections but
> Cinder cannot. Here is the diff:
> http://paste.openstack.org/show/uXyeYUGxMiAhmcGlK8VZ/
>
> For more info:
> Errors we see in Cinder logs: http://pastie.org/private/w8iigjzijfczvsw5ddelwq
> Errors we see in Neutron logs: http://pastie.org/private/uelxryhbr42jijip0loe2w
>
> The bug mentioned earlier in this thread includes a diagnostic snapshot.
>
> We're still digging for leads to fix this HA failover issue.
>
> -Matthew
>
> On Fri, Feb 28, 2014 at 1:12 PM, Vladimir Kuklin <vkuklin@xxxxxxxxxxxx> wrote:
>> It will not help if you shut down the controller. The problem is that you
>> have hung AMQP sessions, which the kombu driver does not appear to handle
>> correctly.
>>
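
A minimal sketch of one general way a client can notice a hung TCP
session: enabling TCP keepalive on the socket. This is not taken from
kombu or from the bug; the helper name enable_keepalive and the timing
values are assumptions chosen for illustration, and the per-socket
timing options are platform specific (Linux names are used here).

```python
import socket


def enable_keepalive(sock, idle=10, interval=5, probes=3):
    """Ask the kernel to probe an idle TCP connection so a dead peer is noticed.

    The timing values are illustrative assumptions, not tuned recommendations.
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # The per-socket timing knobs are platform specific (Linux names shown),
    # so only set the ones this platform exposes.
    for name, value in (("TCP_KEEPIDLE", idle),
                        ("TCP_KEEPINTVL", interval),
                        ("TCP_KEEPCNT", probes)):
        option = getattr(socket, name, None)
        if option is not None:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)
    return sock


if __name__ == "__main__":
    s = enable_keepalive(socket.socket(socket.AF_INET, socket.SOCK_STREAM))
    enabled = bool(s.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE))
    print("keepalive enabled:", enabled)
    s.close()
```

With settings like these, a read blocked on a peer that silently
disappeared would eventually fail with an error instead of waiting
indefinitely, which gives the client a chance to reconnect. Whether that
applies to the sessions in this bug is not established here.
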
>>
>> On Fri, Feb 28, 2014 at 1:09 PM, Bogdan Dobrelya <bdobrelia@xxxxxxxxxxxx>
>> wrote:
>>>
>>> On 02/28/2014 05:44 AM, Dmitry Borodaenko wrote:
>>> > Team,
>>> >
>>> > Ryan and I have spent all day investigating
>>> > https://bugs.launchpad.net/fuel/+bug/1285449
>>> >
>>> > What we have found so far confirms that this is a critical bug that
>>> > absolutely must be resolved before 4.1 is released. I have documented
>>> > our findings in the bug comments. Someone please take over the
>>> > investigation when you come to the office tomorrow morning MSK time.
>>> >
>>> > I have a feeling that once the root cause is found, the fix will be
>>> > low-impact and will involve either a change in the HAProxy configuration
>>> > for RabbitMQ, a patch/upgrade of HAProxy or kombu, or something similar.
>>> > But first we need to understand what exactly breaks, and why this only
>>> > affects some services and not all of them.
>>> >
>>> > Thanks,
>>> >
>>>
>>> Here is a quote from a recent RabbitMQ discussion in the
>>> Fuel-conductors-support team Skype chat (RU + translation):
>>>
>>> Wednesday, February 26, 2014
>>> [4:00:10 PM] Maxim Yefimov: Коллеги, вопрос есть:
>>> (I have a question)
>>>
>>> listen rabbitmq-openstack
>>>   bind 192.168.0.2:5672
>>>   balance  roundrobin
>>>
>>>   server  node-1 192.168.0.3:5673   check inter 5000 rise 2 fall 3
>>>   server  node-2 192.168.0.4:5673   check inter 5000 rise 2 fall 3  backup
>>>   server  node-3 192.168.0.5:5673   check inter 5000 rise 2 fall 3  backup
>>>
>>> [4:01:01 PM] Maxim Yefimov: Зачем одновременно roundrobin и
>>> active-passive?
>>> (Why do we use roundrobin and active-passive at once for RabbitMQ?)
>>>
>>> [4:01:39 PM] Miroslav Anashkin: Чтобы коннект не рвался
>>> (To make sure the connection wouldn't break)
>>>
>>> [4:02:01 PM] Miroslav Anashkin: У кролика кластер существует строго в
>>> виде мастер-слейв
>>> (RabbitMQ clustering is restricted to master-slave only)
>>>
>>> [4:02:23 PM] Miroslav Anashkin: Соответственно даже если какая-то нода с
>>> запросом к слейву придет - та его на мастер отправит
>>> (Hence, any node's query to the RabbitMQ slave would have been re-sent
>>> to the master)
>>>
>>> [4:02:52 PM] Miroslav Anashkin: Поэтому сделали так чтобы ХАПрокси
>>> всегда всех посылал на одну ноду
>>> (That's why HAProxy always sends all queries to a single RabbitMQ
>>> node)
>>>
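
To illustrate the chat explanation above: in HAProxy, servers marked
"backup" receive traffic only when every non-backup server is down, and
by default only the first available backup is used. The toy model below
sketches that selection rule for the quoted listen block. The Server and
Listener classes are hypothetical names for this illustration and are not
HAProxy code.

```python
from dataclasses import dataclass
from itertools import count


@dataclass
class Server:
    name: str
    healthy: bool = True
    backup: bool = False


class Listener:
    """Toy model of 'balance roundrobin' combined with 'backup' servers."""

    def __init__(self, servers):
        self.servers = servers
        self._turn = count()

    def pick(self):
        active = [s for s in self.servers if s.healthy and not s.backup]
        if active:
            # Round-robin only rotates among healthy non-backup servers.
            return active[next(self._turn) % len(active)]
        # With all normal servers down, the first healthy backup takes over.
        for s in self.servers:
            if s.backup and s.healthy:
                return s
        return None


if __name__ == "__main__":
    nodes = [Server("node-1"),
             Server("node-2", backup=True),
             Server("node-3", backup=True)]
    lb = Listener(nodes)
    print([lb.pick().name for _ in range(3)])  # ['node-1', 'node-1', 'node-1']
    nodes[0].healthy = False
    print([lb.pick().name for _ in range(3)])  # ['node-2', 'node-2', 'node-2']
```

With a single non-backup server, roundrobin has nothing to rotate over,
so the configuration behaves as active-passive.
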
>>> Honestly, this explanation is not clear to me. Why couldn't we have
>>> the OpenStack services establish direct connections to arbitrarily
>>> chosen (load-balanced) RabbitMQ nodes, skipping HAProxy altogether?
>>> (because of this: "any node's query to the RabbitMQ slave would have
>>> been re-sent to the master")
>>>
>>> Could that resolve the issue? I think I will investigate this option as
>>> well.
>>>
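
A rough sketch of the "skip HAProxy" idea at the TCP level: the client
holds the list of broker endpoints and connects to the first one that
accepts a connection. The function name connect_first_available is
hypothetical, the endpoints are copied from the listen block quoted
above, and the sketch does not speak AMQP or deal with sessions that hang
after they are established.

```python
import socket


def connect_first_available(endpoints, timeout=2.0):
    """Return (socket, endpoint) for the first endpoint that accepts TCP."""
    last_error = None
    for host, port in endpoints:
        try:
            sock = socket.create_connection((host, port), timeout=timeout)
            return sock, (host, port)
        except OSError as exc:
            # Refused, unreachable, or timed out: remember why, try the next.
            last_error = exc
    raise ConnectionError(f"no broker reachable: {last_error}")


if __name__ == "__main__":
    # Addresses taken from the HAProxy listen block quoted in this thread.
    brokers = [("192.168.0.3", 5673), ("192.168.0.4", 5673), ("192.168.0.5", 5673)]
    try:
        conn, chosen = connect_first_available(brokers)
        print("connected to", chosen)
        conn.close()
    except ConnectionError as err:
        print(err)
```

Calling the same function again after a connection error would give a
simple client-side failover across the nodes.
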
>>>
>>> --
>>> Best regards,
>>> Bogdan Dobrelya,
>>> Skype #bogdando_at_yahoo.com
>>> Irc #bogdando
>>>
>>> --
>>> Mailing list: https://launchpad.net/~fuel-dev
>>> Post to     : fuel-dev@xxxxxxxxxxxxxxxxxxx
>>> Unsubscribe : https://launchpad.net/~fuel-dev
>>> More help   : https://help.launchpad.net/ListHelp
>>
>>
>>
>>
>> --
>> Yours Faithfully,
>> Vladimir Kuklin,
>> Senior Deployment Engineer,
>> Mirantis, Inc.
>> +7 (495) 640-49-04
>> +7 (926) 702-39-68
>> Skype kuklinvv
>> 45bk3, Vorontsovskaya Str.
>> Moscow, Russia,
>> www.mirantis.com
>> www.mirantis.ru
>> vkuklin@xxxxxxxxxxxx
>>



-- 
Dmitry Borodaenko

