yahoo-eng-team team mailing list archive

[Bug 2096802] [NEW] keepalived spawn failures in neutron-l3-agent

To: yahoo-eng-team@xxxxxxxxxxxxxxxxxxx
From: Pierre Riteau <2096802@xxxxxxxxxxxxxxxxxx>
Date: Mon, 27 Jan 2025 20:20:46 -0000
Reply-to: Bug 2096802 <2096802@xxxxxxxxxxxxxxxxxx>
Sender: noreply@xxxxxxxxxxxxx

Public bug reported:

On ML2/OVS deployments with many Neutron L3 routers in HA mode, we can
see the following kind of errors in neutron-l3-agent logs:

2024-11-07 03:14:58.109 1289 ERROR neutron.agent.linux.external_process [-] keepalived for router with uuid d34b2bf3-878c-431d-946f-b8766555f5dc not found. The process should not have died
2024-11-07 03:14:58.110 1289 WARNING neutron.agent.linux.external_process [-] Respawning keepalived for uuid d34b2bf3-878c-431d-946f-b8766555f5dc

Only a small, random subset of all the routers is affected.

This appears to be due to the presence of old PID files for keepalived,
which can make neutron-l3-agent fail to properly detect that keepalived
has not yet been started for a specific router, if another keepalived
process (for another router) has already been started using the same
PID.
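
To illustrate the failure mode, here is a minimal sketch of a PID file
check that also verifies which process owns the PID. It is not the
actual neutron code: the function names, the proc_root parameter and
the idea of matching the router UUID against /proc/<pid>/cmdline are
assumptions made for this example. A check that only asks "is a process
with this PID alive?" would wrongly treat a stale file as valid when
the PID has been reused by keepalived for another router.

```python
from pathlib import Path
from typing import Optional


def read_pid(pid_file: Path) -> Optional[int]:
    """Return the PID stored in pid_file, or None if missing or malformed."""
    try:
        return int(pid_file.read_text().strip())
    except (OSError, ValueError):
        return None


def pid_belongs_to(pid: int, marker: str,
                   proc_root: Path = Path("/proc")) -> bool:
    """Return True only if the process with this PID has `marker`
    (for example a router UUID) somewhere in its command line."""
    try:
        raw = (proc_root / str(pid) / "cmdline").read_bytes()
    except OSError:
        return False  # no process with this PID
    # Arguments in /proc/<pid>/cmdline are separated by NUL bytes.
    cmdline = raw.replace(b"\0", b" ").decode(errors="replace")
    return marker in cmdline


def is_stale(pid_file: Path, router_id: str,
             proc_root: Path = Path("/proc")) -> bool:
    """Return True if pid_file records a PID that does not belong to a
    process started for router_id (dead, or reused by another router)."""
    pid = read_pid(pid_file)
    if pid is None:
        return False  # nothing recorded, so nothing to be stale
    return not pid_belongs_to(pid, router_id, proc_root)
```

With this kind of check, a leftover PID file for router A containing a
PID that now belongs to keepalived for router B is reported as stale
rather than as a running process.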

I suspect that change
https://review.opendev.org/c/openstack/neutron/+/895832 might be the
source of the issue (introduced in Caracal but backported to Antelope).

A workaround is to delete all the PID files before restarting
neutron-l3-agent, which is being proposed in kolla-ansible:
https://review.opendev.org/c/openstack/kolla-ansible/+/934383
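
As a rough sketch of that workaround (not the kolla-ansible change
itself), the cleanup could look like the following. The directory and
the "*.pid*" glob pattern are assumptions for illustration and must be
adapted to the actual layout of the deployment. This should only be run
while neutron-l3-agent and its keepalived processes are stopped, since
it also removes PID files that are still valid.

```python
from pathlib import Path
from typing import List


def remove_pid_files(directory: Path, pattern: str = "*.pid*") -> List[Path]:
    """Delete regular files in `directory` matching `pattern` and
    return the list of removed paths.

    `pattern` is a hypothetical default; check how PID files are
    actually named in your deployment before using it.
    """
    removed = []
    for path in sorted(directory.glob(pattern)):
        if path.is_file():
            path.unlink()
            removed.append(path)
    return removed
```

For example, remove_pid_files(Path("/path/to/ha_confs")) would be
called before the agent starts, where /path/to/ha_confs is a
placeholder for the directory holding the keepalived PID files.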

It is probably easier for this bug to happen in a containerized
environment because the PIDs start from 1 after each restart of the
containers.

Version: Seen with both 2023.1 and 2024.1 using recent code.

** Affects: neutron
     Importance: Undecided
         Status: New

** Description changed:

- On ML2/OVS deployments with many Neutron L3 routers, we can see the
- following kind of errors in neutron-l3-agent logs:
+ On ML2/OVS deployments with many Neutron L3 routers in HA mode, we can
+ see the following kind of errors in neutron-l3-agent logs:
  
  2024-11-07 03:14:58.109 1289 ERROR neutron.agent.linux.external_process [-] keepalived for router with uuid d34b2bf3-878c-431d-946f-b8766555f5dc not found. The process should not have died
  2024-11-07 03:14:58.110 1289 WARNING neutron.agent.linux.external_process [-] Respawning keepalived for uuid d34b2bf3-878c-431d-946f-b8766555f5dc
  
  Only a small, random subset of all the routers is affected.
  
  This appears to be due to the presence of old PID files for keepalived,
  which can make neutron-l3-agent fail to properly detect that keepalived
  has not yet been started for a specific router, if another keepalived
  process (for another router) has already been started using the same
  PID.
  
  I suspect that change
  https://review.opendev.org/c/openstack/neutron/+/895832 might be the
  source of the issue (introduced in Caracal but backported to Antelope).
  
  A workaround is to delete all the PID files before restarting
  neutron-l3-agent, which is being proposed in kolla-ansible:
  https://review.opendev.org/c/openstack/kolla-ansible/+/934383
  
  It is probably easier for this bug to happen in a containerized
  environment because the PIDs start from 1 after each restart of the
  containers.
  
  Version: Seen with both 2023.1 and 2024.1 using recent code.

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/2096802

Title:
  keepalived spawn failures in neutron-l3-agent

Status in neutron:
  New

Bug description:
  On ML2/OVS deployments with many Neutron L3 routers in HA mode, we can
  see the following kind of errors in neutron-l3-agent logs:

  2024-11-07 03:14:58.109 1289 ERROR neutron.agent.linux.external_process [-] keepalived for router with uuid d34b2bf3-878c-431d-946f-b8766555f5dc not found. The process should not have died
  2024-11-07 03:14:58.110 1289 WARNING neutron.agent.linux.external_process [-] Respawning keepalived for uuid d34b2bf3-878c-431d-946f-b8766555f5dc

  Only a small, random subset of all the routers is affected.

  This appears to be due to the presence of old PID files for
  keepalived, which can make neutron-l3-agent fail to properly detect
  that keepalived has not yet been started for a specific router, if
  another keepalived process (for another router) has already been
  started using the same PID.

  I suspect that change
  https://review.opendev.org/c/openstack/neutron/+/895832 might be the
  source of the issue (introduced in Caracal but backported to
  Antelope).

  A workaround is to delete all the PID files before restarting
  neutron-l3-agent, which is being proposed in kolla-ansible:
  https://review.opendev.org/c/openstack/kolla-ansible/+/934383

  It is probably easier for this bug to happen in a containerized
  environment because the PIDs start from 1 after each restart of the
  containers.

  Version: Seen with both 2023.1 and 2024.1 using recent code.

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/2096802/+subscriptions

Follow ups

[Bug 2096802] Re: keepalived spawn failures in neutron-l3-agent
From: Brian Haley, 2025-01-30
[Bug 2096802] Re: keepalived spawn failures in neutron-l3-agent
From: Pierre Riteau, 2025-01-30