yahoo-eng-team team mailing list archive

[Bug 1293083] [NEW] report_interval too frequent; Causing load on service, failing high CPU usage operations

 

Public bug reported:

report_interval is how often an agent sends out a heartbeat to the
service. The Neutron service responds to these 'report_state' RPC
messages by updating the agent's heartbeat DB record. The last heartbeat
is then compared to the configured agent_down_time to determine if the
agent is up or down. The agent's status is used when scheduling networks
on DHCP and L3 agents.
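
The up/down decision described above can be sketched roughly as
follows. This is only an illustration, not Neutron's actual code: the
function name is_agent_down and the timestamps are hypothetical, and
the 9-second value is the default quoted in this report.

```python
from datetime import datetime, timedelta, timezone

# Default agent_down_time quoted in this report, in seconds.
AGENT_DOWN_TIME = 9


def is_agent_down(last_heartbeat, now, agent_down_time=AGENT_DOWN_TIME):
    """Hypothetical helper: an agent counts as down once its last
    recorded heartbeat is older than agent_down_time seconds."""
    return now - last_heartbeat > timedelta(seconds=agent_down_time)


# Arbitrary reference time, chosen only for the example.
now = datetime(2014, 1, 1, 12, 0, 0, tzinfo=timezone.utc)

# Heartbeat 4 seconds ago (one report_interval): still up.
print(is_agent_down(now - timedelta(seconds=4), now))   # False
# Heartbeat 10 seconds ago: past the 9-second threshold, so down.
print(is_agent_down(now - timedelta(seconds=10), now))  # True
```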

The defaults are 4 seconds for report_interval and 9 seconds for
agent_down_time.

On a setup with 18 agents (15 layer 2, plus L3, DHCP, and metadata)
spread across 16 nodes, and a Neutron service running on a dedicated,
powerful machine, the service used 20% CPU while otherwise idle.
Raising report_interval to 28 seconds and agent_down_time to 60 seconds
reduced CPU usage to 1% and allowed bulk operations on a larger scale
(in this case, creating 30 instances at the same time with 60 ports).
With the original values the operation failed (the instances did not
get IP addresses), and with the new values we were able to boot 60
instances successfully.

Side note: this flow will work better once the Nova-Neutron race is
resolved, but that's orthogonal to this proposal.
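
For a rough sense of scale, the heartbeat load implied by the numbers
above can be worked out as below. This is back-of-the-envelope
arithmetic only; the helper names are hypothetical, and it assumes each
agent sends exactly one report_state message per report_interval.

```python
def heartbeats_per_second(num_agents, report_interval):
    """Average report_state messages per second the service receives,
    assuming one message per agent per report_interval seconds."""
    return num_agents / report_interval


def intervals_before_down(report_interval, agent_down_time):
    """Whole report intervals that fit inside agent_down_time."""
    return agent_down_time // report_interval


AGENTS = 18  # agent count from the setup described in this report

before = heartbeats_per_second(AGENTS, 4)   # default report_interval
after = heartbeats_per_second(AGENTS, 28)   # proposed report_interval

print(f"{before:.2f} msgs/s")  # 4.50 msgs/s
print(f"{after:.2f} msgs/s")   # 0.64 msgs/s

# Both settings leave room for two full report intervals before an
# agent is declared down.
print(intervals_before_down(4, 9))    # 2
print(intervals_before_down(28, 60))  # 2
```

Under these assumptions the proposed values cut the heartbeat rate by a
factor of 7 while keeping the same margin of two report intervals
before an agent is marked down.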

** Affects: neutron
     Importance: Undecided
     Assignee: Assaf Muller (amuller)
         Status: In Progress


** Tags: production-defaults

** Changed in: neutron
     Assignee: (unassigned) => Assaf Muller (amuller)

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1293083

Title:
  report_interval too frequent; Causing load on service, failing high
  CPU usage operations

Status in OpenStack Neutron (virtual network service):
  In Progress

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1293083/+subscriptions