[Bug 2096802] Re: keepalived spawn failures in neutron-l3-agent
*** This bug is a duplicate of bug 1561046 ***
https://bugs.launchpad.net/bugs/1561046
Thanks for confirming, and I see the 2023.2 backport, thanks. I did one
for 2023.1 as well.
I'm going to close as a duplicate of the other bug just so there is a
reference.
** This bug has been marked a duplicate of bug 1561046
If there is a /var/lib/neutron/ha_confs/<router-id>.pid then l3 agent fails to spawn a keepalived process for that router
--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/2096802
Title:
keepalived spawn failures in neutron-l3-agent
Status in neutron:
Fix Released
Bug description:
On ML2/OVS deployments with many Neutron L3 routers in HA mode, we can
see the following kind of errors in neutron-l3-agent logs:
2024-11-07 03:14:58.109 1289 ERROR neutron.agent.linux.external_process [-] keepalived for router with uuid d34b2bf3-878c-431d-946f-b8766555f5dc not found. The process should not have died
2024-11-07 03:14:58.110 1289 WARNING neutron.agent.linux.external_process [-] Respawning keepalived for uuid d34b2bf3-878c-431d-946f-b8766555f5dc
Only a small, random subset of all the routers is affected.
This appears to be due to the presence of old PID files for
keepalived, which can make neutron-l3-agent fail to properly detect
that keepalived has not yet been started for a specific router, if
another keepalived process (for another router) has already been
started using the same PID.
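The PID collision described above can be illustrated with a small Python sketch. It is not the neutron code: the function names, the simulated process table and the example keepalived command line are all assumptions made for illustration. It contrasts a check that only asks whether the recorded PID exists with one that also requires the command line to mention the router.

```python
import os
import tempfile


def read_pid(pid_file):
    """Return the integer PID stored in pid_file, or None if unreadable."""
    try:
        with open(pid_file) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return None


def naive_is_active(pid_file, process_table):
    """Hypothetical check: trust the PID file if that PID exists at all."""
    pid = read_pid(pid_file)
    return pid is not None and pid in process_table


def strict_is_active(pid_file, process_table, router_id):
    """Hypothetical check: also require the command line to name the router."""
    pid = read_pid(pid_file)
    cmdline = process_table.get(pid, "")
    return router_id in cmdline


def demo():
    """Stale PID file for router-a, while PID 12 now belongs to router-b."""
    with tempfile.TemporaryDirectory() as ha_confs:
        pid_file = os.path.join(ha_confs, "router-a.pid")
        with open(pid_file, "w") as f:
            f.write("12\n")  # left over from before the restart
        # Simulated process table (assumption): PID -> command line.
        process_table = {
            12: "keepalived -f /var/lib/neutron/ha_confs/router-b/keepalived.conf",
        }
        return (
            naive_is_active(pid_file, process_table),
            strict_is_active(pid_file, process_table, "router-a"),
        )
```

With the stale file in place, the naive check reports keepalived for router-a as running even though PID 12 belongs to another router, while the stricter check does not.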
I suspect that change
https://review.opendev.org/c/openstack/neutron/+/895832 might be the
source of the issue (introduced in Caracal but backported to
Antelope).
A workaround is to delete all the PID files before restarting
neutron-l3-agent, which is being proposed in kolla-ansible:
https://review.opendev.org/c/openstack/kolla-ansible/+/934383
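As a rough idea of what such a cleanup could look like, here is a minimal Python sketch. It is not the kolla-ansible change itself: the function name is hypothetical, and it assumes the PID files live under the ha_confs directory and end in ".pid". It is meant to run only while neutron-l3-agent is stopped.

```python
import tempfile
from pathlib import Path


def remove_pid_files(ha_confs_dir):
    """Delete every *.pid file under ha_confs_dir and return their names.

    Assumption: PID files are identified purely by the .pid suffix.
    """
    removed = []
    for pid_file in sorted(Path(ha_confs_dir).rglob("*.pid")):
        if pid_file.is_file():
            pid_file.unlink()
            removed.append(pid_file.name)
    return removed


def demo():
    """Run the cleanup on a throwaway directory with made-up file names."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "router-a.pid").write_text("12\n")
        (root / "router-b").mkdir()
        (root / "router-b" / "keepalived.conf").write_text("# config\n")
        removed = remove_pid_files(root)
        conf_kept = (root / "router-b" / "keepalived.conf").exists()
        return removed, conf_kept
```

Other files in the directory, such as keepalived configuration, are left untouched, so the agent can respawn keepalived for each router on startup.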
It is probably easier for this bug to happen in a containerized
environment because the PIDs start from 1 after each restart of the
containers.
Version: Seen with both 2023.1 and 2024.1 using recent code.
To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/2096802/+subscriptions