
yahoo-eng-team team mailing list archive

[Bug 1940084] [NEW] neutron-agent causes 10m delay on start-up

 

Public bug reported:

When the environment starts (TripleO deployment), we wait until
pacemaker has started everything. While that is happening, neutron-agent
services that systemd has already started sit waiting for a RabbitMQ
connection with a 10 minute timeout. The resilience code for neutron
agent to RabbitMQ communication [1] does not take into account the
start-up case, where the connection to RabbitMQ was never established,
and this causes the 10 minute delay. To solve the problem we should
distinguish the following resilience cases:

1. Initial connection establishment. The connection to RabbitMQ was never established and the agent is trying to establish it (initial start-up of the whole OpenStack cluster after a power outage or planned reboot, or a reboot of a single compute node).
2. The connection to RabbitMQ was established but then lost. In this case [1] does its job perfectly, reducing the load on RabbitMQ.
3. The connection was established but there is no reply from RabbitMQ (RabbitMQ is overloaded). In this case [1] does its job as well.

To resolve case 1 we should introduce a variable,
is_connection_ever_established. While it is not set, we should try to
connect every 20-30 seconds, and set is_connection_ever_established to
true once the connection is established. When
is_connection_ever_established is true but there is no reply or the
connection is lost, we should use the algorithm in [1]. This change will
speed up initial cluster start-up and compute node reboot.
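
A rough sketch of the proposed logic, for illustration only. The class
ConnectRetryPolicy, the function connect_until_ready and all parameter
names are hypothetical and do not exist in neutron-lib. The capped
exponential backoff is a simplified stand-in for the algorithm in [1],
not a copy of it, and the 600 second cap is an assumption taken from the
10 minute delay reported above.

```python
import time


class ConnectRetryPolicy:
    """Hypothetical helper that picks the wait before the next attempt."""

    def __init__(self, initial_interval=20, backoff_start=1, backoff_max=600):
        # Fixed retry period (seconds) used while no connection ever existed.
        self.initial_interval = initial_interval
        # Simplified stand-in for the backoff in [1] (assumed values).
        self.backoff_start = backoff_start
        self.backoff_max = backoff_max
        self.is_connection_ever_established = False
        self._failures = 0

    def on_success(self):
        """Record a successful connection or reply."""
        self.is_connection_ever_established = True
        self._failures = 0

    def on_failure(self):
        """Record a failure and return the seconds to wait before retrying."""
        if not self.is_connection_ever_established:
            # Case 1: never connected, retry at a short fixed interval.
            return self.initial_interval
        # Cases 2 and 3: back off exponentially to reduce load on RabbitMQ.
        self._failures += 1
        delay = self.backoff_start * 2 ** (self._failures - 1)
        return min(delay, self.backoff_max)


def connect_until_ready(connect, policy, sleep=time.sleep):
    """Call connect() until it succeeds, waiting as the policy dictates."""
    while True:
        try:
            conn = connect()
        except ConnectionError:
            sleep(policy.on_failure())
        else:
            policy.on_success()
            return conn
```

With this split, an agent started before RabbitMQ is up retries every
initial_interval seconds rather than entering the long backoff, while
the behaviour after a lost connection or a missing reply stays as
conservative as before.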





[1] https://opendev.org/openstack/neutron-lib/src/branch/master/neutron_lib/rpc.py#L159-L180

** Affects: neutron
     Importance: Wishlist
     Assignee: Slawek Kaplonski (slaweq)
         Status: Confirmed

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1940084

Title:
  neutron-agent causes 10m delay on start-up

Status in neutron:
  Confirmed

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1940084/+subscriptions