[Bug 1853071] [NEW] AMQP disconnects, q-reports-plugin queue grows, leading to DBDeadlocks while trying to update agent heartbeats

 

Public bug reported:

Since upgrading to Rocky, we have seen this issue pop up in several
environments, small and large. First we see various AMQP/Rabbit-related
errors - missed heartbeats from neutron-server to rabbitmq, then
repeated errors such as Socket closed, Broken pipe, etc.

This continues for a while and all agents report as dead. On the agent
side, we see RPC timeouts when trying to report state. Meanwhile, the
q-reports-plugin queue in rabbit grows to 10k+ messages, presumably
because neutron-server can't connect to Rabbit and process them.
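
For anyone wanting to check the queue depth themselves, here is a
minimal standalone sketch (not neutron code), assuming the default
guest/guest credentials and vhost on localhost; it uses kombu, which
oslo.messaging already depends on:

# Hypothetical monitoring snippet: passively declare the queue so
# nothing is created or consumed, only the counters are read.
from kombu import Connection

# Assumed broker URL - substitute the transport_url from neutron.conf.
BROKER_URL = 'amqp://guest:guest@127.0.0.1:5672//'

with Connection(BROKER_URL) as conn:
    channel = conn.channel()
    # passive=True raises if the queue does not exist and never alters it.
    name, message_count, consumer_count = channel.queue_declare(
        queue='q-reports-plugin', passive=True)
    print('q-reports-plugin: %d messages, %d consumers'
          % (message_count, consumer_count))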


Eventually we see "DBDeadlock: (_mysql_exceptions.OperationalError) (1205, 'Lock wait timeout exceeded; try restarting transaction')" errors when neutron-server tries to update stale agent heartbeats.


Examples of the various AMQP-related errors - all slightly different:

2019-11-18 07:38:55,200.200 22488 ERROR oslo.messaging._drivers.impl_rabbit [req-cba0d0fa-8e5a-42f1-a93b-4bb398b22275 - - - - -] [eba472a9-021d-4738-801b-7944aad3e3af] AMQP server 127.0.0.1:5672 closed the connection. Check login credentials: Socket closed: IOError: Socket closed

2019-11-18 07:40:22,454.454 22489 ERROR oslo.messaging._drivers.impl_rabbit [-] [6eb8d074-02c7-4add-8d91-768cbae60fdc] AMQP server on 127.0.0.1:5672 is unreachable: Too many heartbeats missed. Trying again in 1 seconds.: ConnectionForced: Too many heartbeats missed

2019-11-18 07:40:22,586.586 22489 ERROR oslo.messaging._drivers.impl_rabbit [req-0b9f092f-27f2-4be1-bdf5-2c5208e54321 - - - - -] [4b43df2c-cc3e-4442-807c-dcfd4057cb3d] AMQP server on 127.0.0.1:5672 is unreachable: [Errno 32] Broken pipe. Trying again in 1 seconds.: error: [Errno 32] Broken pipe

2019-11-18 07:42:06,010.010 22487 WARNING oslo.messaging._drivers.impl_rabbit [-] Unexpected error during heartbeart thread processing, retrying...: error: [Errno 32] Broken pipe

2019-11-18 07:58:26,692.692 22489 WARNING oslo.messaging._drivers.impl_rabbit [-] Unexpected error during heartbeart thread processing, retrying...: IOError: Socket closed

2019-11-18 07:58:26,696.696 22489 ERROR oslo.messaging._drivers.impl_rabbit [-] [84273ffb-1610-44b1-aff7-d5e4606b7f59] AMQP server on 127.0.0.1:5672 is unreachable: <AMQPError: unknown error>. Trying again in 1 seconds.: RecoverableConnectionError: <AMQPError: unknown error>

Along with the following Broken pipe stack trace in oslo.messaging:
http://paste.openstack.org/show/786312/
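
To check whether the broker itself is healthy while neutron-server is
losing its connections, a separate probe can hold its own AMQP
connection to the same localhost broker with a 60 s heartbeat and keep
checking it - if the probe stays up while the neutron-server
connections keep dropping, the problem is more likely in the
neutron/oslo.messaging process than in rabbitmq. A minimal sketch
(broker URL and credentials assumed, again using kombu):

import time
from kombu import Connection

# Assumed broker URL - adjust to the real transport_url.
BROKER_URL = 'amqp://guest:guest@127.0.0.1:5672//'

conn = Connection(BROKER_URL, heartbeat=60)
conn.connect()
try:
    while True:
        # heartbeat_check() sends a heartbeat if one is due and raises
        # once the peer's heartbeats have been missed for too long.
        conn.heartbeat_check()
        time.sleep(15)  # i.e. four checks per 60 s heartbeat interval
except Exception as exc:
    print('probe lost the connection: %r' % exc)
finally:
    conn.release()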

This continues for some time (30 min - 1 hour), all agents report as
dead, and we see the following errors in the rabbitmq broker logs:
first missed-heartbeat errors, then handshake_timeout errors:

2019-11-18 07:41:01.448 [error] <0.6126.71> closing AMQP connection <0.6126.71> (127.0.0.1:39817 -> 127.0.0.1:5672 - neutron-server:22487:ee468e25-42d7-45b8-aea0-4f6fb58a9034):
missed heartbeats from client, timeout: 60s
2019-11-18 07:41:07.665 [error] <0.18727.72> closing AMQP connection <0.18727.72> (127.0.0.1:51762 -> 127.0.0.1:5672):
{handshake_timeout,frame_header}


Eventually the rabbitmq q-reports-plugin queue has grown and neutron
reports the following DBDeadlock stack trace:

2019-11-18 08:51:14,505.505 22493 ERROR oslo_db.api [req-231004a2-d988-47b3-9730-d6b5276fdcf8 - - - - -] DB exceeded retry limit.: DBDeadlock: (_mysql_exceptions.OperationalError) (1205, 'Lock wait timeout exceeded; try restarting transaction') [SQL: u'UPDATE agents SET heartbeat_timestamp=%s WHERE agents.id = %s'] [parameters: (datetime.datetime(2019, 11, 18, 8, 50, 23, 804716), '223c754e-9d7f-4df3-b5a5-9be4eb8692b0')] (Background on this error at: http://sqlalche.me/e/e3q8)

Full stack trace here: http://paste.openstack.org/show/786313/
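
The "DB exceeded retry limit." part of that error comes from oslo.db's
retry wrapper: MySQL error 1205 (lock wait timeout) is translated to
DBDeadlock, the wrapped operation is retried, and only after the
retries run out is the error logged. A rough sketch of that pattern
around a heartbeat-style UPDATE - the table and column names mirror
the SQL in the traceback, everything else (DSN, function name, retry
count) is made up for illustration:

import datetime

from oslo_db import api as db_api
from oslo_db.sqlalchemy import engines
from sqlalchemy import text

# Assumed DSN for illustration only; engines.create_engine() installs
# oslo.db's exception filters, which map MySQL error 1205 to DBDeadlock.
engine = engines.create_engine(
    'mysql+pymysql://neutron:secret@127.0.0.1/neutron')


@db_api.wrap_db_retry(max_retries=5, retry_on_deadlock=True)
def touch_agent_heartbeat(agent_id):
    # On DBDeadlock the whole function is retried with a backoff; once
    # the retries are exhausted oslo.db logs "DB exceeded retry limit."
    with engine.begin() as conn:
        conn.execute(
            text('UPDATE agents SET heartbeat_timestamp = :ts '
                 'WHERE agents.id = :agent_id'),
            {'ts': datetime.datetime.utcnow(), 'agent_id': agent_id})


touch_agent_heartbeat('223c754e-9d7f-4df3-b5a5-9be4eb8692b0')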


The only way to recover is to stop neutron-server and rabbitmq, kill any neutron workers still dangling (which there usually are), and then restart. But the problem manifests again days or a week later.

Rabbitmq is on the same host as neutron-server - it is all localhost
communication - so we are unsure why it can't heartbeat or connect.
The subsequent DBDeadlock also leads me to think there is some
synchronization issue when neutron gets overwhelmed with outstanding
RPC messages.
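
When this next happens it may be worth capturing which transaction is
actually sitting on the agents row while the heartbeat UPDATE waits.
A rough diagnostic sketch against the information_schema.innodb_trx
view (PyMySQL, connection parameters assumed):

# Hypothetical diagnostic, to be run while the 1205 errors are occurring.
import pymysql

conn = pymysql.connect(host='127.0.0.1', user='root', password='secret',
                       database='information_schema')
try:
    with conn.cursor() as cur:
        # List open InnoDB transactions, oldest first, with the statement
        # they are currently running and how many rows they hold locked.
        cur.execute(
            "SELECT trx_mysql_thread_id, trx_started, trx_rows_locked, "
            "trx_query FROM innodb_trx ORDER BY trx_started")
        for row in cur.fetchall():
            print(row)
finally:
    conn.close()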

** Affects: neutron
     Importance: Undecided
         Status: New


** Tags: db

--
https://bugs.launchpad.net/neutron/+bug/1853071