
fuel-dev team mailing list archive

Re: Release blocker: Moving management vip breaks rabbitmq sessions

 

Mike & Team,

It's quite a few changes, more than I would like to see at the end of 4.1.
But on the other hand, the three parts Dmitry mentioned have been tested in
the lab and seem to solve most of the issues with HA/failover, so I'm all
for including them in 4.1.

It would be good to know if there is any impact on the release date. We
moved it last week from Friday 2/28 to Tuesday 3/4, which means we have
Mon MSK, Mon PT, and Tue MSK left to do the job.

Since it's not a regression (it doesn't work in 4.0 either), another option
would be not to include the fix and release as is. I don't think it's a
good idea though.

Thanks,
Roman

On Sat, Mar 1, 2014 at 2:46 AM, Dmitry Borodaenko
<dborodaenko@xxxxxxxxxxxx> wrote:

> The solution we have consists of 3 parts:
>
> 1) Reconfigure OpenStack services to bypass HAProxy and connect to
> RabbitMQ directly on the controllers. Our testing shows that this
> actually resolves the RabbitMQ side of the problem.
>
> I'm working on a fuel-library patch that will do that; it should be
> mostly straightforward except for working around all the code
> duplication and hardcoded values in different puppet modules. An
> action item for the EU timezone that I think would be most helpful is
> to test the proposed configuration (rabbit_hosts=<controller-1 mgmt
> ip>:5672,<controller-2 mgmt ip>:5672, etc.) in as many different
> failover and VIP move scenarios as possible.
>
> One more thing I'm considering that would be worth testing is to see
> if it would be even better to point the controller services to
> rabbit_hosts=127.0.0.1:5672, and leave only compute and other
> non-controller nodes with the enumeration of controller management IPs
> in rabbit_hosts.
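>
> For illustration, with three controllers whose management IPs were,
> say, 192.168.0.3-5, each service's config would end up with roughly:
>
>   [DEFAULT]
>   rabbit_hosts=192.168.0.3:5672,192.168.0.4:5672,192.168.0.5:5672
>
> (The IPs here are placeholders for the actual controller management
> addresses in a given deployment.)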
>
> 2) Enable flush_routes option for management and public VIPs in crm
> configuration, and restart HAProxy via crm when vip moves (including
> after failover). Our testing shows that these two actions reduce the
> probability of services locking up waiting for a read syscall to time
> out on a hung MySQL connection.
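>
> As a sketch only: assuming our VIP resource agents expose a
> flush_routes parameter and the resources are named vip__management
> and vip__public (names may differ in the actual crm configuration),
> enabling it via crmsh would look something like:
>
>   crm resource param vip__management set flush_routes true
>   crm resource param vip__public set flush_routes true
>
> The HAProxy restart on VIP moves would be handled separately in the
> resource agent or crm configuration.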
>
> 3) Upgrade python-mysqldb to version 1.2.5 as requested in OSCI-1105,
> and modify MySQL connection strings to include read_timeout=90. (I'm
> open to suggestions for the timeout value: since it drives the
> duration of a possible service outage after failover, it should
> definitely be lower than the Linux kernel default of 10 minutes, but
> long enough not to drop connections due to slow SQL queries.) This is
> something I can't do without help from the OSCI team: we need deb and
> rpm packages built and tested so that we know they're safe to include
> in 4.1, and can test them in combination with the other fixes.
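>
> For reference, MySQLdb 1.2.5 accepts read_timeout as a connect
> option, so it can be passed through the SQLAlchemy URL. A service's
> connection string would then look roughly like this (database name
> and credentials are placeholders):
>
>   sql_connection=mysql://nova:password@192.168.0.2/nova?read_timeout=90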
>
> Thanks,
> -DmitryB
>
>
> On Sat, Mar 1, 2014 at 12:47 AM, Mike Scherbakov
> <mscherbakov@xxxxxxxxxxxx> wrote:
> > Folks,
> > what is the current status on this? I saw a few comments in the bug,
> > but I'm wondering about action items the European timezone can take on
> > Monday to continue the work.
> >
> > Thanks,
> >
> >
> > On Fri, Feb 28, 2014 at 9:58 PM, Dmitry Borodaenko
> > <dborodaenko@xxxxxxxxxxxx> wrote:
> >>
> >> Dear all,
> >>
> >> Please make sure that all discussions that occur elsewhere (this ML
> >> thread, chats, etc.) end up reflected in the LaunchPad bug (even if a
> >> theory is discussed and then eliminated, it's useful to have it
> >> mentioned in the bug so that other people don't repeat the same line
> >> of investigation). I originally emailed fuel-dev@ to only attract
> >> attention to the problem, I did not intend to split the discussion.
> >>
> >> Thanks,
> >>
> >> On Fri, Feb 28, 2014 at 8:35 AM, Matthew Mosesohn
> >> <mmosesohn@xxxxxxxxxxxx> wrote:
> >> > I started reaching out to our community folks, Dina and Dmitry.
> >> >
> >> > We tried a few variations, but got the same result: nova and cinder
> >> > dislike having the AMQP backend shifted from underneath them.
> >> >
> >> > If we remove haproxy and connect directly to RabbitMQ on a virtual IP,
> >> > all nova and cinder services die when we shift the virtual IP to
> >> > another node. Neutron somehow survives and reconnects in about 25
> >> > seconds and picks up where it left off.
> >> >
> >> > For the record, we're running on 2013.2.2 code. Dmitry Mescheryakov
> >> > asked me to provide a diff of the RPC code between neutron and
> >> > cinder, to maybe determine why Neutron can resume connections but
> >> > Cinder surely can't. Here is the diff:
> >> > http://paste.openstack.org/show/uXyeYUGxMiAhmcGlK8VZ/
> >> >
> >> > For more info:
> >> > Errors we see in Cinder logs:
> >> > http://pastie.org/private/w8iigjzijfczvsw5ddelwq
> >> > Errors we see in Neutron logs:
> >> > http://pastie.org/private/uelxryhbr42jijip0loe2w
> >> >
> >> > In the bug mentioned earlier in this thread, we have a diagnostic
> >> > snapshot.
> >> >
> >> > We're still digging for leads to fix this HA failover issue.
> >> >
> >> > -Matthew
> >> >
> >> > On Fri, Feb 28, 2014 at 1:12 PM, Vladimir Kuklin
> >> > <vkuklin@xxxxxxxxxxxx> wrote:
> >> >> It will not help if you shut down the controller. The problem is
> >> >> that you have hung AMQP sessions, which the kombu driver does not
> >> >> seem to handle correctly.
> >> >>
> >> >>
> >> >> On Fri, Feb 28, 2014 at 1:09 PM, Bogdan Dobrelya
> >> >> <bdobrelia@xxxxxxxxxxxx>
> >> >> wrote:
> >> >>>
> >> >>> On 02/28/2014 05:44 AM, Dmitry Borodaenko wrote:
> >> >>> > Team,
> >> >>> >
> >> >>> > Ryan and I have spent all day investigating
> >> >>> > https://bugs.launchpad.net/fuel/+bug/1285449
> >> >>> >
> >> >>> > What we have found so far confirms that this is a critical bug
> >> >>> > that absolutely must be resolved before 4.1 is released. I have
> >> >>> > documented our findings in the bug comments; someone please take
> >> >>> > over the investigation when you come to the office tomorrow
> >> >>> > morning MSK time.
> >> >>> >
> >> >>> > I have a feeling that once the root cause is found, the fix will
> >> >>> > be low-impact and will involve either a change in the HAProxy
> >> >>> > configuration for RabbitMQ, a patch/upgrade of HAProxy or kombu,
> >> >>> > or something similar. But first we need to understand what
> >> >>> > exactly breaks, and why this only affects some services and not
> >> >>> > all of them.
> >> >>> >
> >> >>> > Thanks,
> >> >>> >
> >> >>>
> >> >>> Here is a recent RabbitMQ discussion quote from the
> >> >>> Fuel-conductors-support team Skype chat (RU + translation):
> >> >>>
> >> >>> Wednesday, February 26, 2014
> >> >>> [4:00:10 PM] Maxim Yefimov: Коллеги, вопрос есть:
> >> >>> (I have a question)
> >> >>>
> >> >>> listen rabbitmq-openstack
> >> >>>   bind 192.168.0.2:5672
> >> >>>   balance  roundrobin
> >> >>>
> >> >>>   server  node-1 192.168.0.3:5673   check inter 5000 rise 2 fall 3
> >> >>>   server  node-2 192.168.0.4:5673   check inter 5000 rise 2 fall 3 backup
> >> >>>   server  node-3 192.168.0.5:5673   check inter 5000 rise 2 fall 3 backup
> >> >>>
> >> >>> [4:01:01 PM] Maxim Yefimov: Зачем одновременно roundrobin и
> >> >>> active-passive?
> >> >>> (Why do we use roundrobin and active-passive at once for RabbitMQ?)
> >> >>>
> >> >>> [4:01:39 PM] Miroslav Anashkin: Чтобы коннект не рвался
> >> >>> (To make sure the connection wouldn't break)
> >> >>>
> >> >>> [4:02:01 PM] Miroslav Anashkin: У кролика кластер существует
> >> >>> строго в виде мастер-слейв
> >> >>> (RabbitMQ clustering is restricted to master-slave only)
> >> >>>
> >> >>> [4:02:23 PM] Miroslav Anashkin: Соответственно даже если какая-то
> >> >>> нода с запросом к слейву придет - та его на мастер отправит
> >> >>> (Hence, any node's query to the RabbitMQ slave would have been
> >> >>> re-sent to the master)
> >> >>>
> >> >>> [4:02:52 PM] Miroslav Anashkin: Поэтому сделали так чтобы ХАПрокси
> >> >>> всегда всех посылал на одну ноду
> >> >>> (That's why HAProxy always redirects all queries to a single
> >> >>> RabbitMQ node)
> >> >>>
> >> >>> Honestly, this explanation isn't clear to me. Why couldn't we
> >> >>> make the OpenStack services establish direct connections to
> >> >>> arbitrary (LB-chosen) RabbitMQ nodes, skipping HAProxy entirely,
> >> >>> given that "any node's query to the RabbitMQ slave would have
> >> >>> been re-sent to the master"?
> >> >>>
> >> >>> Could that resolve the issue? I think I will investigate this
> >> >>> option as well.
> >> >>>
> >> >>>
> >> >>> --
> >> >>> Best regards,
> >> >>> Bogdan Dobrelya,
> >> >>> Skype #bogdando_at_yahoo.com
> >> >>> Irc #bogdando
> >> >>>
> >> >>> --
> >> >>> Mailing list: https://launchpad.net/~fuel-dev
> >> >>> Post to     : fuel-dev@xxxxxxxxxxxxxxxxxxx
> >> >>> Unsubscribe : https://launchpad.net/~fuel-dev
> >> >>> More help   : https://help.launchpad.net/ListHelp
> >> >>
> >> >>
> >> >>
> >> >>
> >> >> --
> >> >> Yours Faithfully,
> >> >> Vladimir Kuklin,
> >> >> Senior Deployment Engineer,
> >> >> Mirantis, Inc.
> >> >> +7 (495) 640-49-04
> >> >> +7 (926) 702-39-68
> >> >> Skype kuklinvv
> >> >> 45bk3, Vorontsovskaya Str.
> >> >> Moscow, Russia,
> >> >> www.mirantis.com
> >> >> www.mirantis.ru
> >> >> vkuklin@xxxxxxxxxxxx
> >> >>
> >> >>
> >> >
> >>
> >>
> >>
> >> --
> >> Dmitry Borodaenko
> >>
> >
> >
> >
> >
> > --
> > Mike Scherbakov
> > #mihgen
>
>
>
> --
> Dmitry Borodaenko
>
>
