yahoo-eng-team team mailing list archive

[Bug 2033980] Re: Neutron fails to respawn radvd due to corrupt pid file

 

Reviewed:  https://review.opendev.org/c/openstack/neutron/+/895832
Committed: https://opendev.org/openstack/neutron/commit/c3b855a10080ab5b7d33f42aaee02e5ed50a4fdf
Submitter: "Zuul (22348)"
Branch:    master

commit c3b855a10080ab5b7d33f42aaee02e5ed50a4fdf
Author: Brian Haley <haleyb.dev@xxxxxxxxx>
Date:   Tue Sep 19 13:25:41 2023 -0400

    Remove obsolete PID files before start
    
    External processes, such as radvd, can refuse to start
    and throw an exception such as:
    
      "Unable to convert value in $pidfile"
    
    because the given pidfile has more than one PID in it.
    The situation can happen when the neutron node is reset
    and the obsolete PID files are not cleaned before neutron
    is started.
    
    This commit adds PID file cleanup before external
    process start.
    
    Closes-bug: #2033980
    Change-Id: Id62bf18067d0b144c3e8825c7603cc1e51dca052
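
The cleanup described in the commit message can be sketched roughly as below. This is a minimal illustration, not the code from the neutron change: the names `remove_stale_pid_file`, `start_external_process` and `fake_spawn` are hypothetical, and the sketch assumes the external process writes its own PID file on startup.

```python
import os


def remove_stale_pid_file(pid_file):
    """Delete pid_file if it exists; return True if a file was removed."""
    try:
        os.unlink(pid_file)
        return True
    except FileNotFoundError:
        # Nothing left over from a previous run.
        return False


def start_external_process(pid_file, spawn):
    """Clear any leftover PID file, then call spawn(pid_file).

    `spawn` is a hypothetical callable standing in for whatever
    actually launches the daemon (radvd in the bug report).
    """
    remove_stale_pid_file(pid_file)
    return spawn(pid_file)


def fake_spawn(pid_file):
    """Stand-in for a daemon that records its own PID on startup."""
    with open(pid_file, "w") as f:
        f.write("4242\n")
    return 4242
```

Removing the file unconditionally before start means the spawned process always begins with a fresh PID file, whatever a previous run or a node reset left behind.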


** Changed in: neutron
       Status: In Progress => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/2033980

Title:
  Neutron fails to respawn radvd due to corrupt pid file

Status in kolla-ansible:
  Invalid
Status in neutron:
  Fix Released

Bug description:
  **Bug Report**

  What happened:

  Periodically, radvd seems to die and neutron is not able to respawn
  it. I don't know why it dies.

  In my neutron-l3-agent.log, the following error occurs once per
  minute:

  ```
  2023-09-03 14:37:07.514 16 ERROR neutron.agent.linux.utils [-] Unable to convert value in /var/lib/neutron/external/pids/ea759c71-0f4d-4be9-a761-83843ce04d9a.pid.radvd
  2023-09-03 14:37:07.514 16 ERROR neutron.agent.linux.external_process [-] radvd for router with uuid ea759c71-0f4d-4be9-a761-83843ce04d9a not found. The process should not have died
  2023-09-03 14:37:07.514 16 WARNING neutron.agent.linux.external_process [-] Respawning radvd for uuid ea759c71-0f4d-4be9-a761-83843ce04d9a
  2023-09-03 14:37:07.514 16 ERROR neutron.agent.linux.utils [-] Unable to convert value in /var/lib/neutron/external/pids/ea759c71-0f4d-4be9-a761-83843ce04d9a.pid.radvd
  2023-09-03 14:37:07.762 16 ERROR neutron.agent.linux.utils [-] Exit code: 255; Cmd: ['ip', 'netns', 'exec', 'qrouter-ea759c71-0f4d-4be9-a761-83843ce04d9a', 'env', 'PROCESS_TAG=radvd-ea759c71-0f4d-4be9-a761-83843ce04d9a', 'radvd', '-C', '/var/lib/neutron/ra/ea759c71-0f4d-4be9-a761-83843ce04d9a.radvd.conf', '-p', '/var/lib/neutron/external/pids/ea759c71-0f4d-4be9-a761-83843ce04d9a.pid.radvd', '-m', 'syslog', '-u', 'neutron']; Stdin: ; Stdout: ; Stderr:
  ```

  The pid file appears to contain 2 pids, one on each line:

  ```
  $ docker exec -it neutron_l3_agent cat /var/lib/neutron/external/pids/ea759c71-0f4d-4be9-a761-83843ce04d9a.pid.radvd
  853
  1161
  ```
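
A plausible reading of the "Unable to convert value" error is that the file contents are converted to a single integer, which fails when the file holds two lines. The sketch below illustrates that failure mode only; `parse_pid` is a hypothetical helper, not neutron's actual function.

```python
def parse_pid(text):
    """Return the PID as an int, or None if text is not exactly one integer."""
    try:
        return int(text.strip())
    except ValueError:
        # Two PIDs separated by a newline, or an empty file, end up here.
        return None


single = parse_pid("853\n")          # a healthy PID file
corrupt = parse_pid("853\n1161\n")   # the contents shown in this report
```

With these inputs, `single` is the integer 853 and `corrupt` is `None`, because `int()` rejects a string with a newline between the digits.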

  Deleting the file then properly respawns radvd:

  ```
  2023-09-03 14:38:07.515 16 ERROR neutron.agent.linux.external_process [-] radvd for router with uuid ea759c71-0f4d-4be9-a761-83843ce04d9a not found. The process should not have died
  2023-09-03 14:38:07.516 16 WARNING neutron.agent.linux.external_process [-] Respawning radvd for uuid ea759c71-0f4d-4be9-a761-83843ce04d9a
  ```

  What you expected to happen:

  Radvd is respawned without needing manual intervention. Presumably the
  new pid should overwrite the contents of the pid file, but it appears
  to be appended instead. I'm not sure whether this is a kolla issue or
  a neutron issue.

  How to reproduce it (minimal and precise): Unsure, since I don't know
  how radvd ends up dying in the first place. You could likely reproduce
  this by deploying kolla-ansible and then manually killing radvd.

  **Environment**:
  * OS (e.g. from /etc/os-release):
  NAME="Rocky Linux"
  VERSION="9.2 (Blue Onyx)"
  ID="rocky"
  ID_LIKE="rhel centos fedora"
  VERSION_ID="9.2"
  PLATFORM_ID="platform:el9"
  PRETTY_NAME="Rocky Linux 9.2 (Blue Onyx)"
  ANSI_COLOR="0;32"
  LOGO="fedora-logo-icon"
  CPE_NAME="cpe:/o:rocky:rocky:9::baseos"
  HOME_URL="https://rockylinux.org/"
  BUG_REPORT_URL="https://bugs.rockylinux.org/"
  SUPPORT_END="2032-05-31"
  ROCKY_SUPPORT_PRODUCT="Rocky-Linux-9"
  ROCKY_SUPPORT_PRODUCT_VERSION="9.2"
  REDHAT_SUPPORT_PRODUCT="Rocky Linux"
  REDHAT_SUPPORT_PRODUCT_VERSION="9.2"

  * Kernel (e.g. `uname -a`):
  Linux lon1 5.14.0-284.25.1.el9_2.x86_64 #1 SMP PREEMPT_DYNAMIC Wed Aug 2 14:53:30 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

  * Docker version if applicable (e.g. `docker version`):
  Client: Docker Engine - Community
   Version:           24.0.5
   API version:       1.43
   Go version:        go1.20.6
   Git commit:        ced0996
   Built:             Fri Jul 21 20:36:54 2023
   OS/Arch:           linux/amd64
   Context:           default

  Server: Docker Engine - Community
   Engine:
    Version:          24.0.5
    API version:      1.43 (minimum version 1.12)
    Go version:       go1.20.6
    Git commit:       a61e2b4
    Built:            Fri Jul 21 20:35:17 2023
    OS/Arch:          linux/amd64
    Experimental:     false
   containerd:
    Version:          1.6.22
    GitCommit:        8165feabfdfe38c65b599c4993d227328c231fca
   runc:
    Version:          1.1.8
    GitCommit:        v1.1.8-0-g82f18fe
   docker-init:
    Version:          0.19.0
    GitCommit:        de40ad0

  * Kolla-Ansible version (e.g. `git head or tag or stable branch` or pip package version if using release):
  16.1.0 (stable/2023.1)

  * Docker image Install type (source/binary): Default installed by kolla-ansible
  * Docker image distribution: rocky
  * Are you using official images from Docker Hub or self built? official
  * If self built - Kolla version and environment used to build: not applicable
  * Share your inventory file, globals.yml and other configuration files if relevant: Likely not relevant.

To manage notifications about this bug go to:
https://bugs.launchpad.net/kolla-ansible/+bug/2033980/+subscriptions