sts-sponsors team mailing list archive

[Bug 1874075] Re: rabbitmq-server startup timeouts differ between SysV and systemd

 

What happens (packages from -updates):

On Groovy: Assumed to behave the same as Focal because of its systemd
service file. Untested.

On Focal: The rabbitmq service starts and stays in the 'activating'
state until the daemon notifies systemd that it has started up
(Type=notify). Every 300 seconds (5 minutes) rabbitmq logs a failure to
synchronise the message queue until rabbitmq2 returns, but the daemon
never dies. TimeoutStartSec=3600 (one hour), so the daemon keeps waiting
for up to one hour, soft resetting every 5 minutes as the queue
synchronisation timeouts occur. The service only changes to 'active'
once rabbitmq2 starts and the message queue is synced.

From what I understand, I don't think there are any problems on Focal or
Groovy. As long as rabbitmq2 comes up within an hour, things work. Note
that because of this bug, Groovy and upstream have now been changed to a
10 minute timeout, down from 1 hour.
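
As a rough check of the numbers above, the following sketch counts how
many 5 minute synchronisation attempts fit into the start timeout. The
constant and function names are illustrative and not taken from the
rabbitmq packaging; only the 300 s, 3600 s and 600 s values come from
this message.

```python
# Values quoted in this message; the names are illustrative only.
SYNC_RETRY_INTERVAL_S = 300       # rabbitmq logs a sync failure every 5 minutes
FOCAL_TIMEOUT_START_S = 3600      # TimeoutStartSec in the Focal unit (1 hour)
UPSTREAM_TIMEOUT_START_S = 600    # new Groovy/upstream timeout (10 minutes)


def sync_attempts_before_timeout(timeout_start_s: int, retry_interval_s: int) -> int:
    """Number of complete retry cycles that fit before systemd gives up."""
    return timeout_start_s // retry_interval_s


if __name__ == "__main__":
    for label, timeout in (("focal", FOCAL_TIMEOUT_START_S),
                           ("groovy/upstream", UPSTREAM_TIMEOUT_START_S)):
        attempts = sync_attempts_before_timeout(timeout, SYNC_RETRY_INTERVAL_S)
        print(f"{label}: {attempts} sync attempts within {timeout} s")
```

With the shorter upstream timeout, the peer node has far fewer retry
cycles in which to come back before systemd marks the start as failed.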

On Eoan: Assumed to behave the same as Focal because of its systemd
service file. Untested.

On Bionic: The rabbitmq service starts and runs an ExecStartPost
script that waits on the rabbitmq daemon. If this ExecStartPost script
times out (which it seems to do after 90 seconds, even though the
documentation suggests an infinite timeout), it terminates with an error
exit code, and since the unit is Type=simple, systemd marks the service
as failed. There is no Restart=on-failure in Bionic's systemd unit, so
rabbitmq stays dead. Rabbitmq dies 90 seconds after boot and will never
rejoin the cluster by itself. The machine needs to be power cycled, or
someone has to ssh in manually and restart the rabbitmq services.

On Xenial: Assumed to behave the same as Bionic because of its systemd
service file. Untested.

Suggested actions:
For Bionic: From my understanding of the problem and my testing,
replacing the systemd service file with the one from Focal solves the
problem. That file changes Type=simple to Type=notify, sets a 1 hour
timeout, and adds Restart=on-failure. Notes: I checked the source code,
and rabbitmq in Bionic does support Type=notify, although we need to add
socat as a dependency of the package. See the commit below for details:

commit: 2d6383bade61fea0b8652b72d25bb1a9f0d6133f
From: Alexey Lebedeff <alebedev@xxxxxxxxxxxx>
Date: Fri, 11 Mar 2016 17:42:15 +0300
Subject: Improve systemd integration
Link: https://github.com/rabbitmq/rabbitmq-server/commit/2d6383bade61fea0b8652b72d25bb1a9f0d6133f

GitHub issue for the above commit:
https://github.com/rabbitmq/rabbitmq-server/issues/664
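
To make the three settings mentioned for Bionic concrete, here is a
minimal Python sketch that renders them as a systemd [Service] override.
It is not the Focal service file, which contains more than these keys,
and it does not remove Bionic's ExecStartPost script. The function name
and the path in the comment are assumptions for illustration only.

```python
import configparser
import io


def render_dropin(timeout_start_sec: int = 3600) -> str:
    """Render a [Service] override with the settings named in this message."""
    parser = configparser.ConfigParser()
    parser.optionxform = str  # keep systemd's case-sensitive key names
    parser["Service"] = {
        "Type": "notify",
        "TimeoutStartSec": str(timeout_start_sec),
        "Restart": "on-failure",
    }
    buf = io.StringIO()
    # systemd accepts "Key=value"; drop the spaces configparser adds by default.
    parser.write(buf, space_around_delimiters=False)
    return buf.getvalue()


if __name__ == "__main__":
    # Hypothetical destination (an assumption, not from this message):
    # /etc/systemd/system/rabbitmq-server.service.d/override.conf
    print(render_dropin(), end="")
```

Passing a different value, for example render_dropin(600), would give
the 10 minute timeout that Groovy and upstream moved to.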

Xenial: I need to dig into this. We will likely follow the same path as
Bionic, but we need to be careful to ensure that Type=notify is
sufficiently supported in rabbitmq 3.5.7 before we SRU the change. We
will also likely need socat as a dependency, and maybe a backport of the
above commit.

** Bug watch added: github.com/rabbitmq/rabbitmq-server/issues #664
   https://github.com/rabbitmq/rabbitmq-server/issues/664

-- 
You received this bug notification because you are a member of STS
Sponsors, which is subscribed to the bug report.
https://bugs.launchpad.net/bugs/1874075

Title:
  rabbitmq-server startup timeouts differ between SysV and systemd

Status in rabbitmq-server package in Ubuntu:
  Fix Released
Status in rabbitmq-server source package in Xenial:
  Fix Committed
Status in rabbitmq-server source package in Bionic:
  Fix Committed
Status in rabbitmq-server source package in Eoan:
  Won't Fix
Status in rabbitmq-server source package in Focal:
  Fix Committed
Status in rabbitmq-server source package in Groovy:
  Fix Released
Status in rabbitmq-server package in Debian:
  New

Bug description:
  The startup timeouts were recently adjusted and synchronized between
  the SysV and systemd startup files.

  https://github.com/rabbitmq/rabbitmq-server-release/pull/129

  The new startup files should be included in this package.

  [Impact]

  After starting the RabbitMQ server process, the startup script will
  wait for the server to start by calling `rabbitmqctl wait` and will
  time out after 10 s.

  The startup time of the server depends on how quickly the Mnesia
  database becomes available and the server will time out after
  `mnesia_table_loading_retry_timeout` ms times
  `mnesia_table_loading_retry_limit` retries. By default this wait is
  30,000 ms times 10 retries, i.e. 300 s.

  The mismatch between these two timeout values might lead to the
  startup script failing prematurely while the server is still waiting
  for the Mnesia tables.

  This change introduces variable `RABBITMQ_STARTUP_TIMEOUT` and the
  `--timeout` option into the startup script. The default value for this
  timeout is set to 10 minutes (600 seconds).

  This change also updates the systemd service file to match the timeout
  values between the two service management methods.
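
  The mismatch described in this section can be checked with a few lines
  of arithmetic. This is only a sketch: the constants use the default
  values quoted above, and the function names are made up for
  illustration rather than taken from RabbitMQ or its startup scripts.

```python
# Defaults quoted in the bug description; names mirror the RabbitMQ settings.
MNESIA_TABLE_LOADING_RETRY_TIMEOUT_MS = 30_000
MNESIA_TABLE_LOADING_RETRY_LIMIT = 10
OLD_STARTUP_WAIT_S = 10     # `rabbitmqctl wait` timeout before the change
NEW_STARTUP_WAIT_S = 600    # default RABBITMQ_STARTUP_TIMEOUT after the change


def mnesia_wait_s(retry_timeout_ms: int, retry_limit: int) -> float:
    """Total time the server waits for Mnesia tables, in seconds."""
    return retry_timeout_ms * retry_limit / 1000


def script_gives_up_first(startup_wait_s: float, server_wait_s: float) -> bool:
    """True if the startup script times out while the server is still waiting."""
    return startup_wait_s < server_wait_s


if __name__ == "__main__":
    server_wait = mnesia_wait_s(MNESIA_TABLE_LOADING_RETRY_TIMEOUT_MS,
                                MNESIA_TABLE_LOADING_RETRY_LIMIT)
    for label, wait in (("old", OLD_STARTUP_WAIT_S), ("new", NEW_STARTUP_WAIT_S)):
        print(f"{label} startup wait {wait} s, server waits {server_wait} s, "
              f"script gives up first: {script_gives_up_first(wait, server_wait)}")
```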

  [Scope]

  Upstream patch:
  https://github.com/rabbitmq/rabbitmq-server-release/pull/129

  * Fix is not included in the Debian package
  * Fix is not included in any Ubuntu series

  * Groovy and Focal can apply the upstream patch as is
  * Bionic and Xenial need an additional fix in the systemd service file
    to set the `RABBITMQ_STARTUP_TIMEOUT` variable for the
    `rabbitmq-server-wait` helper script.

  [Test Case]

  In a clustered setup with two nodes, A and B:

  1. create queue on A
  2. shut down B
  3. shut down A
  4. boot B

  The broker on B will wait for A. The systemd service will wait for 10
  seconds and then fail. Boot A and the rabbitmq-server process on B
  will complete startup.

  [Regression Potential]

  This change alters the behavior of the startup scripts when the Mnesia
  database takes a long time to become available. This might lead to
  failures further down the service dependency chain.
