
yahoo-eng-team team mailing list archive

[Bug 1853071] Re: AMQP disconnects, q-reports-plugin queue grows, leading to DBDeadlocks while trying to update agent heartbeats

 

Bug closed due to lack of activity; please feel free to reopen if
needed.

** Changed in: neutron
       Status: New => Won't Fix


Title:
  AMQP disconnects, q-reports-plugin queue grows, leading to
  DBDeadlocks while trying to update agent heartbeats

Status in neutron:
  Won't Fix

Bug description:
  Since upgrading to Rocky, we have seen this issue pop up in several
  environments, small and large. First we see various AMQP/RabbitMQ
  related errors: missed heartbeats from neutron-server to rabbitmq,
  then repeated errors such as Socket closed, Broken pipe, etc.

  This continues for a while until all agents report as dead. On the
  agent side, we see RPC timeouts when trying to report state.
  Meanwhile, the q-reports-plugin queue in RabbitMQ grows to 10k+
  messages, presumably because neutron-server cannot consume them while
  its connections are broken.
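
  For anyone hitting the same symptom, the backlog can be confirmed on
  the broker side; this is a hedged example, and the queue name below
  is the default neutron reports topic, so adjust it if your deployment
  differs:

    # Show message backlog and consumer count for the reports queue
    rabbitmqctl list_queues name messages consumers | grep q-reports-plugin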

  
  Eventually, we see "DBDeadlock: (_mysql_exceptions.OperationalError) (1205, 'Lock wait timeout exceeded; try restarting transaction')" errors while neutron-server is trying to update stale agent heartbeats.

  
  Examples of various AMQP-related errors, all slightly different:

  2019-11-18 07:38:55,200.200 22488 ERROR
  oslo.messaging._drivers.impl_rabbit [req-
  cba0d0fa-8e5a-42f1-a93b-4bb398b22275 - - - - -]
  [eba472a9-021d-4738-801b-7944aad3e3af] AMQP server 127.0.0.1:5672
  closed the connection. Check login credentials: Socket closed:
  IOError: Socket closed

  2019-11-18 07:40:22,454.454 22489 ERROR
  oslo.messaging._drivers.impl_rabbit [-]
  [6eb8d074-02c7-4add-8d91-768cbae60fdc] AMQP server on 127.0.0.1:5672
  is unreachable: Too many heartbeats missed. Trying again in 1
  seconds.: ConnectionForced: Too many heartbeats missed

  2019-11-18 07:40:22,586.586 22489 ERROR
  oslo.messaging._drivers.impl_rabbit
  [req-0b9f092f-27f2-4be1-bdf5-2c5208e54321 - - - - -]
  [4b43df2c-cc3e-4442-807c-dcfd4057cb3d] AMQP server on 127.0.0.1:5672
  is unreachable: [Errno 32] Broken pipe. Trying again in 1 seconds.:
  error: [Errno 32] Broken pipe

  2019-11-18 07:42:06,010.010 22487 WARNING
  oslo.messaging._drivers.impl_rabbit [-] Unexpected error during
  heartbeart thread processing, retrying...: error: [Errno 32] Broken
  pipe

  2019-11-18 07:58:26,692.692 22489 WARNING
  oslo.messaging._drivers.impl_rabbit [-] Unexpected error during
  heartbeart thread processing, retrying...: IOError: Socket closed

  2019-11-18 07:58:26,696.696 22489 ERROR
  oslo.messaging._drivers.impl_rabbit [-]
  [84273ffb-1610-44b1-aff7-d5e4606b7f59] AMQP server on 127.0.0.1:5672
  is unreachable: <AMQPError: unknown error>. Trying again in 1
  seconds.: RecoverableConnectionError: <AMQPError: unknown error>

  Along with the following Broken pipe stacktrace in oslo.messaging:
  http://paste.openstack.org/show/786312/

  This continues for some time (30 minutes to 1 hour), all agents
  report as dead, and we see the following errors in the rabbitmq
  broker logs: first missed-heartbeat errors, then handshake_timeout
  errors:

  2019-11-18 07:41:01.448 [error] <0.6126.71> closing AMQP connection <0.6126.71> (127.0.0.1:39817 -> 127.0.0.1:5672 - neutron-server:22487:ee468e25-42d7-45b8-aea0-4f6fb58a9034):
  missed heartbeats from client, timeout: 60s
  2019-11-18 07:41:07.665 [error] <0.18727.72> closing AMQP connection <0.18727.72> (127.0.0.1:51762 -> 127.0.0.1:5672):
  {handshake_timeout,frame_header}
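
  As a hedged diagnostic (not from the original report), the broker's
  view of the surviving connections, including the negotiated heartbeat
  timeout, can be checked with:

    # List AMQP connections with their state and heartbeat timeout
    rabbitmqctl list_connections name state timeout channels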


  
  Eventually we see that the RabbitMQ q-reports-plugin queue has grown and neutron-server reports the following DBDeadlock stacktrace:

  2019-11-18 08:51:14,505.505 22493 ERROR oslo_db.api
  [req-231004a2-d988-47b3-9730-d6b5276fdcf8 - - - - -] DB exceeded retry
  limit.: DBDeadlock: (_mysql_exceptions.OperationalError) (1205, 'Lock
  wait timeout exceeded; try restarting transaction') [SQL: u'UPDATE
  agents SET heartbeat_timestamp=%s WHERE agents.id = %s'] [parameters:
  (datetime.datetime(2019, 11, 18, 8, 50, 23, 804716),
  '223c754e-9d7f-4df3-b5a5-9be4eb8692b0')] (Background on this error at:
  http://sqlalche.me/e/e3q8)

  Full stacktrace here: http://paste.openstack.org/show/786313/
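
  As an illustration only (not something from the original report), the
  transactions holding and waiting on the agents-table locks can
  usually be inspected on the database host while this is happening:

    # Dump InnoDB transaction/lock state (MySQL or MariaDB) to see what
    # is blocking the agent heartbeat UPDATEs
    mysql -e "SHOW ENGINE INNODB STATUS\G"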

  
  The only way to recover is to stop neutron-server and rabbitmq, kill any neutron workers still left dangling (which they usually are), and then restart both. But then we see the problem manifest again days or a week later.
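
  Roughly, that recovery sequence looks like the sketch below (the
  systemd unit names are assumptions and vary by distribution and
  packaging):

    # Assumed unit names; adjust for your installation
    systemctl stop neutron-server rabbitmq-server
    # Kill any neutron-server workers that did not exit cleanly
    pkill -9 -f neutron-server
    systemctl start rabbitmq-server
    systemctl start neutron-server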

  RabbitMQ is on the same host as neutron-server, so it is all
  localhost communication, and we are unsure why it cannot heartbeat or
  connect. The subsequent DBDeadlock also leads me to think there is
  some synchronization issue when neutron-server gets overwhelmed with
  outstanding RPC messages.
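
  Purely as a hedged illustration of knobs worth experimenting with
  (the option names are real oslo.messaging/neutron options, but the
  values are only examples and this is not a confirmed fix), the RPC
  heartbeat and timeout settings can be raised in neutron.conf, for
  instance with crudini if it is available:

    # Example values only; tune for your environment
    crudini --set /etc/neutron/neutron.conf oslo_messaging_rabbit heartbeat_timeout_threshold 120
    crudini --set /etc/neutron/neutron.conf oslo_messaging_rabbit heartbeat_rate 2
    crudini --set /etc/neutron/neutron.conf DEFAULT rpc_response_timeout 120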

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1853071/+subscriptions


