yahoo-eng-team team mailing list archive

[Bug 1948676] Re: rpc response timeout for agent report_state is not possible

To: yahoo-eng-team@xxxxxxxxxxxxxxxxxxx
From: OpenStack Infra <1948676@xxxxxxxxxxxxxxxxxx>
Date: Thu, 28 Oct 2021 15:56:33 -0000
Reply-to: Bug 1948676 <1948676@xxxxxxxxxxxxxxxxxx>
Sender: noreply@xxxxxxxxxxxxx

Reviewed:  https://review.opendev.org/c/openstack/neutron/+/815310
Committed: https://opendev.org/openstack/neutron/commit/7d552848c272b4fbfdafdc552e54cefd25b6d46a
Submitter: "Zuul (22348)"
Branch:    master

commit 7d552848c272b4fbfdafdc552e54cefd25b6d46a
Author: Tobias Urdin <tobias.urdin@xxxxxxxxx>
Date:   Mon Oct 25 13:52:03 2021 +0000

    Set RPC timeout in PluginReportStateAPI to report_interval
    
    See more details on why this is needed in the referenced
    bug #1948676
    
    Change-Id: I8a95e80ca74edc8f8f394cefc749c4065a8e0575
    Closes-Bug: #1948676


** Changed in: neutron
       Status: In Progress => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1948676

Title:
  rpc response timeout for agent report_state is not possible

Status in neutron:
  Fix Released

Bug description:
  When hosting a large number of routers and/or networks, the RPC calls
  from the agents can take a long time, which requires us to increase
  rpc_response_timeout from the default of 60 seconds to a higher value
  so that the agents do not time out.

  This has the side effect that if RabbitMQ or neutron-server is
  restarted, all agents that are reporting at that moment will hang for
  a long time until report_state times out. During this time
  neutron-server receives no reports, causing it to mark the agents as
  down.

  When the call times out and the agent tries again, the report will
  succeed, but a full sync will be triggered for every agent that was
  previously marked as down. This in itself can cause a very high load
  on the control plane.
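
The timing problem described above can be illustrated with a small model. This is only a sketch under stated assumptions: the function names are hypothetical, the model ignores scheduling details of the real agent loop, and the values for report_interval (30 s), agent_down_time (75 s) and the raised rpc_response_timeout (600 s) are chosen for illustration rather than taken from this report.

```python
def worst_case_report_gap(report_interval: float, call_timeout: float) -> float:
    """Longest gap between two successful reports when one call hangs.

    Simplified model (assumption): the agent waits report_interval after
    its last successful report, then the next report_state call blocks
    for the full call_timeout before the agent can retry.
    """
    return report_interval + call_timeout


def agent_seen_as_down(seconds_since_last_report: float,
                       agent_down_time: float) -> bool:
    """Server-side view: the agent counts as down once its last report
    is older than agent_down_time."""
    return seconds_since_last_report > agent_down_time


# Illustrative values only (assumptions, not taken from the bug report).
REPORT_INTERVAL = 30
AGENT_DOWN_TIME = 75
RAISED_RPC_RESPONSE_TIMEOUT = 600

# report_state shares the raised global timeout: the gap far exceeds
# agent_down_time, so the agent is marked down and later does a full sync.
gap_global = worst_case_report_gap(REPORT_INTERVAL, RAISED_RPC_RESPONSE_TIMEOUT)

# report_state uses a short timeout of its own: the agent can retry
# before the server gives up on it.
gap_short = worst_case_report_gap(REPORT_INTERVAL, REPORT_INTERVAL)

print(gap_global, agent_seen_as_down(gap_global, AGENT_DOWN_TIME))
print(gap_short, agent_seen_as_down(gap_short, AGENT_DOWN_TIME))
```

In this model, the only way to keep an agent from being marked down across a short server or broker restart is to bound the report_state call by something well below agent_down_time, independently of the timeout used for the heavier RPC calls.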

  Consider a configuration change that is deployed with tooling to all
  neutron-server nodes, which are then restarted. All agents will be
  seen as down, and when they either 1) come back after
  rpc_response_timeout is reached and try again or 2) are restarted
  manually, all of them will do a full sync.

  We should have a configuration option that applies only to the RPC
  timeout for the report_state call from agents, because that timeout
  could then be lowered to stay within the window in which the agent is
  not yet seen as down.

  The old behavior can be kept by falling back to rpc_response_timeout
  by default instead of introducing a new default for this override.
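
The fallback proposed in the description could look roughly like the sketch below. The option name report_state_timeout and the helper function are hypothetical and exist only for illustration; they are not Neutron code. Note that, going by its subject line, the merged commit quoted earlier in this message sets the timeout to report_interval rather than adding a separate option.

```python
from typing import Optional


def resolve_report_state_timeout(report_state_timeout: Optional[int],
                                 rpc_response_timeout: int) -> int:
    """Pick the timeout to use for the report_state call.

    report_state_timeout is a hypothetical per-call override. When it is
    unset (None), the global rpc_response_timeout is used, which keeps
    the old behavior for deployments that do not configure the override.
    """
    if report_state_timeout is None:
        return rpc_response_timeout
    return report_state_timeout


# Override not configured: report_state keeps using the global timeout.
print(resolve_report_state_timeout(None, 600))

# Override configured: report_state times out quickly, while other RPC
# calls can still rely on the larger global value.
print(resolve_report_state_timeout(30, 600))
```

Treating "unset" as None rather than 0 keeps the fallback explicit and avoids giving a special meaning to a numeric value.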

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1948676/+subscriptions

References

[Bug 1948676] [NEW] rpc response timeout for agent report_state is not possible
From: Tobias Urdin, 2021-10-25