fuel-dev team mailing list archive

Re: Release blocker: Moving management vip breaks rabbitmq sessions

 

Folks,
what is the current status on this? I saw a few comments in the
bug <https://bugs.launchpad.net/fuel/+bug/1285449>, but I am wondering
which action items the European timezone can pick up on Monday to keep
this moving.

Thanks,


On Fri, Feb 28, 2014 at 9:58 PM, Dmitry Borodaenko <dborodaenko@xxxxxxxxxxxx> wrote:

> Dear all,
>
> Please make sure that all discussions that occur elsewhere (this ML
> thread, chats, etc.) end up reflected in the LaunchPad bug (even if a
> theory is discussed and then eliminated, it's useful to have it
> mentioned in the bug so that other people don't repeat the same line
> of investigation). I originally emailed fuel-dev@ only to attract
> attention to the problem; I did not intend to split the discussion.
>
> Thanks,
>
> On Fri, Feb 28, 2014 at 8:35 AM, Matthew Mosesohn
> <mmosesohn@xxxxxxxxxxxx> wrote:
> > I started reaching out to our community folks, Dina and Dmitry.
> >
> > We tried a few variations, with the same result: nova and cinder
> > dislike having the AMQP backend shifted out from underneath them.
> >
> > If we remove haproxy and connect directly to RabbitMQ on a virtual IP,
> > all nova and cinder services die when we shift the virtual IP to
> > another node. Neutron somehow survives and reconnects in about 25
> > seconds and picks up where it left off.
> >
> > For the record, we're running on 2013.2.2 code. Dmitry Mescheryakov
> > asked me to provide a diff of the RPC code between Neutron and
> > Cinder, to help determine why Neutron can resume connections but
> > Cinder cannot. Here is the diff:
> > http://paste.openstack.org/show/uXyeYUGxMiAhmcGlK8VZ/
> >
> > For more info:
> > Errors we see in Cinder logs: http://pastie.org/private/w8iigjzijfczvsw5ddelwq
> > Errors we see in Neutron logs: http://pastie.org/private/uelxryhbr42jijip0loe2w
> >
> > In the bug mentioned earlier in this thread, we have a diagnostic snapshot.
> >
> > We're still digging for leads to fix this HA failover issue.
> >
> > -Matthew
> >
> > On Fri, Feb 28, 2014 at 1:12 PM, Vladimir Kuklin <vkuklin@xxxxxxxxxxxx> wrote:
> >> It will not help if you shut down the controller. The problem is that
> >> you have hung AMQP sessions, which the kombu driver does not appear
> >> to handle correctly.
> >>
> >>
> >> On Fri, Feb 28, 2014 at 1:09 PM, Bogdan Dobrelya <bdobrelia@xxxxxxxxxxxx> wrote:
> >>>
> >>> On 02/28/2014 05:44 AM, Dmitry Borodaenko wrote:
> >>> > Team,
> >>> >
> >>> > Ryan and I have spent all day investigating
> >>> > https://bugs.launchpad.net/fuel/+bug/1285449
> >>> >
> >>> > What we have found so far confirms that this is a critical bug that
> >>> > absolutely must be resolved before 4.1 is released. I have documented
> >>> > our findings in the bug comments; someone please take over the
> >>> > investigation when you come to the office tomorrow morning MSK time.
> >>> >
> >>> > I have a feeling that once the root cause is found, the fix will be
> >>> > low-impact and will involve a change in the HAProxy configuration for
> >>> > RabbitMQ, a patch/upgrade of HAProxy or kombu, or something similar.
> >>> > But first we need to understand what exactly breaks, and why this
> >>> > affects only some services and not all of them.
> >>> >
> >>> > Thanks,
> >>> >
> >>>
> >>> Here is a quote from a recent RabbitMQ discussion in the
> >>> Fuel-conductors-support team Skype chat (translated from Russian):
> >>>
> >>> Wednesday, February 26, 2014
> >>> [4:00:10 PM] Maxim Yefimov: Colleagues, I have a question:
> >>>
> >>> listen rabbitmq-openstack
> >>>   bind 192.168.0.2:5672
> >>>   balance  roundrobin
> >>>
> >>>   server  node-1 192.168.0.3:5673   check inter 5000 rise 2 fall 3
> >>>   server  node-2 192.168.0.4:5673   check inter 5000 rise 2 fall 3 backup
> >>>   server  node-3 192.168.0.5:5673   check inter 5000 rise 2 fall 3 backup
> >>>
> >>> [4:01:01 PM] Maxim Yefimov: Why do we use roundrobin and
> >>> active-passive at once for RabbitMQ?
> >>>
> >>> [4:01:39 PM] Miroslav Anashkin: To make sure the connection wouldn't
> >>> break.
> >>>
> >>> [4:02:01 PM] Miroslav Anashkin: RabbitMQ clustering is restricted to
> >>> master-slave only.
> >>>
> >>> [4:02:23 PM] Miroslav Anashkin: Hence, any node's query to the
> >>> RabbitMQ slave would have been re-sent to the master.
> >>>
> >>> [4:02:52 PM] Miroslav Anashkin: That's why HAProxy always redirects
> >>> all queries to a single RabbitMQ node.
> >>>
> >>> Honestly, this explanation is not clear to me. Why couldn't we have
> >>> the OpenStack services connect directly to arbitrarily chosen
> >>> (load-balanced) RabbitMQ nodes, skipping HAProxy altogether? (Given
> >>> that "any node's query to the RabbitMQ slave would have been re-sent
> >>> to the master".)
> >>>
> >>> Could that resolve the issue? I think I will investigate this option
> >>> as well.
> >>>
> >>>
> >>> --
> >>> Best regards,
> >>> Bogdan Dobrelya,
> >>> Skype #bogdando_at_yahoo.com
> >>> Irc #bogdando
> >>>
> >>> --
> >>> Mailing list: https://launchpad.net/~fuel-dev
> >>> Post to     : fuel-dev@xxxxxxxxxxxxxxxxxxx
> >>> Unsubscribe : https://launchpad.net/~fuel-dev
> >>> More help   : https://help.launchpad.net/ListHelp
> >>
> >>
> >>
> >>
> >> --
> >> Yours Faithfully,
> >> Vladimir Kuklin,
> >> Senior Deployment Engineer,
> >> Mirantis, Inc.
> >> +7 (495) 640-49-04
> >> +7 (926) 702-39-68
> >> Skype kuklinvv
> >> 45bk3, Vorontsovskaya Str.
> >> Moscow, Russia,
> >> www.mirantis.com
> >> www.mirantis.ru
> >> vkuklin@xxxxxxxxxxxx
> >>
> >
>
>
>
> --
> Dmitry Borodaenko
>



-- 
Mike Scherbakov
#mihgen
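
A note on the HAProxy block quoted in the chat above, for anyone puzzling
over the same "roundrobin plus backup" question. As far as I understand
HAProxy's default behaviour (without "option allbackups"), servers marked
"backup" get no traffic while any non-backup server is up, and when none is
up only the first available backup is used. The Python sketch below models
just that selection rule. The names Server and eligible_servers are
hypothetical and made up for illustration; they are not part of HAProxy or
Fuel, and health-check timing (inter/rise/fall) is not modeled.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Server:
    """One 'server' line from the quoted listen block (hypothetical model)."""
    name: str
    address: str
    backup: bool = False
    up: bool = True  # result of the health check, set by hand here


def eligible_servers(servers: List[Server]) -> List[Server]:
    """Return the servers that would receive new connections.

    Assumption: default HAProxy behaviour without 'option allbackups'.
    Non-backup servers that are up share the load; if none is up,
    only the first backup that is up takes over.
    """
    active = [s for s in servers if s.up and not s.backup]
    if active:
        return active
    backups = [s for s in servers if s.up and s.backup]
    return backups[:1]


# The three servers from the quoted configuration.
servers = [
    Server("node-1", "192.168.0.3:5673"),
    Server("node-2", "192.168.0.4:5673", backup=True),
    Server("node-3", "192.168.0.5:5673", backup=True),
]

if __name__ == "__main__":
    print([s.name for s in eligible_servers(servers)])
    servers[0].up = False  # simulate node-1 failing its health checks
    print([s.name for s in eligible_servers(servers)])
```

Under this model, "balance roundrobin" only ever rotates over a single
server, which matches the explanation in the chat that HAProxy sends all
clients to one RabbitMQ node. It says nothing about why nova and cinder
fail to recover their AMQP sessions after the VIP moves; that is still the
open question in the bug.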
