yahoo-eng-team team mailing list archive

[Bug 1940084] [NEW] neutron-agent causes 10m delay on start-up

To: yahoo-eng-team@xxxxxxxxxxxxxxxxxxx
From: Sergii Golovatiuk <1940084@xxxxxxxxxxxxxxxxxx>
Date: Mon, 16 Aug 2021 13:21:18 -0000
Reply-to: Bug 1940084 <1940084@xxxxxxxxxxxxxxxxxx>
Sender: noreply@xxxxxxxxxxxxx

Public bug reported:

When the environment starts (TripleO deployment), we wait until
pacemaker has started everything. In the meantime, neutron-agent
services that systemd has already started are waiting, with a 10-minute
timeout, for a RabbitMQ connection. The resilience code for neutron
agent to RabbitMQ communication [1] does not take into account the
start-up case, in which the connection to RabbitMQ was never
established, and this causes the 10-minute delay. To solve the problem
we should distinguish the following cases for resilience:

1. Initial connection establishment. The connection to RabbitMQ was never established and the agent is trying to establish it (initial startup of the whole OpenStack cluster after a power outage or planned reboot, or the reboot of one compute node).
2. The connection to RabbitMQ was established but then lost. In this case [1] does its job perfectly, reducing the load on RabbitMQ.
3. The connection was established but there is no reply from RabbitMQ (RabbitMQ is overloaded). In this case [1] does its job as well.

To resolve case 1 we should introduce a variable,
is_connection_ever_established. While it is not set, we should try to
connect every 20-30 seconds, and set is_connection_ever_established to
true once the connection is established. When
is_connection_ever_established is true but there is no reply or the
connection is lost, we should use the algorithm in [1]. This change
should speed up initial cluster startup and compute node reboot.
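
A minimal sketch of the proposed behaviour, for illustration only. The
class ReconnectPolicy, its method names, and the backoff parameters are
hypothetical and are not part of neutron-lib. The doubling backoff is
only a stand-in for the algorithm in [1], and the 600-second cap simply
mirrors the 10-minute delay described above.

```python
import random


class ReconnectPolicy:
    """Hypothetical helper that picks the delay before the next attempt."""

    def __init__(self, initial_min=20.0, initial_max=30.0,
                 backoff_start=1.0, backoff_cap=600.0, rng=None):
        # Flag proposed in this report: False until the first success.
        self.is_connection_ever_established = False
        self._initial_min = initial_min
        self._initial_max = initial_max
        self._backoff_start = backoff_start
        self._backoff_cap = backoff_cap      # assumed cap, in seconds
        self._backoff = backoff_start
        self._rng = rng or random.Random()

    def on_connected(self):
        """Record a successful connection and reset the backoff."""
        self.is_connection_ever_established = True
        self._backoff = self._backoff_start

    def next_delay(self):
        """Return the number of seconds to wait before retrying."""
        if not self.is_connection_ever_established:
            # Case 1: never connected, so retry every 20-30 seconds.
            return self._rng.uniform(self._initial_min, self._initial_max)
        # Cases 2 and 3: stand-in for [1], doubling up to the cap.
        delay = self._backoff
        self._backoff = min(self._backoff * 2, self._backoff_cap)
        return delay
```

An agent's connection loop would call next_delay() after each failed
attempt and on_connected() after each success, so the short fixed
interval applies only until the first connection succeeds.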

[1] https://opendev.org/openstack/neutron-lib/src/branch/master/neutron_lib/rpc.py#L159-L180

** Affects: neutron
Importance: Wishlist
Assignee: Slawek Kaplonski (slaweq)
Status: Confirmed

--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1940084

Title:
neutron-agent causes 10m delay on start-up

Status in neutron:
Confirmed

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1940084/+subscriptions