yahoo-eng-team team mailing list archive
Message #11673
[Bug 1293083] [NEW] report_interval too frequent; Causing load on service, failing high CPU usage operations
Public bug reported:
report_interval is how often an agent sends out a heartbeat to the
service. The Neutron service responds to these 'report_state' RPC
messages by updating the agent's heartbeat DB record. The last heartbeat
is then compared to the configured agent_down_time to determine if the
agent is up or down. The agent's status is used when scheduling networks
on DHCP and L3 agents.
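
As a rough illustration of the liveness check described above, here is a
minimal sketch. It is not Neutron's actual code: the function name
is_agent_down and its parameters are hypothetical, and the timestamps are
arbitrary.

```python
from datetime import datetime, timedelta


def is_agent_down(last_heartbeat: datetime, agent_down_time: int,
                  now: datetime) -> bool:
    """Hypothetical helper: an agent counts as down when its last
    heartbeat is older than agent_down_time seconds."""
    return now - last_heartbeat > timedelta(seconds=agent_down_time)


# Arbitrary reference time, chosen only for the example.
now = datetime(2014, 1, 1, 12, 0, 0)

# Heartbeat 5 seconds ago, agent_down_time of 9 seconds: still up.
print(is_agent_down(now - timedelta(seconds=5), 9, now))   # False

# Heartbeat 10 seconds ago, agent_down_time of 9 seconds: considered down.
print(is_agent_down(now - timedelta(seconds=10), 9, now))  # True
```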
The defaults are 4 seconds for report_interval and 9 seconds for
agent_down_time.
On a setup with 18 agents (15 layer 2, L3, DHCP, metadata) running on 16
nodes, and a Neutron service running on a dedicated, powerful machine,
the service used 20% CPU while otherwise idle. Changing report_interval to
28 seconds and agent_down_time to 60 seconds reduced CPU usage to
1% and allowed bulk operations on a larger scale (in this case,
creating 30 instances at the same time with 60 ports). With the original
values the operation failed (the instances did not get IP addresses),
and with the new values we were able to boot 60 instances successfully.
Side note: This flow will work better once the Nova-Neutron race is
resolved, but that's orthogonal to this proposal.
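
To put the numbers above in perspective, the following back-of-the-envelope
sketch estimates the aggregate heartbeat rate the service has to handle
for the 18-agent setup. It assumes every agent reports exactly once per
report_interval and ignores jitter; the function names are hypothetical
and used only for illustration.

```python
def heartbeats_per_second(num_agents: int, report_interval: float) -> float:
    """Aggregate report_state messages per second, assuming each agent
    sends exactly one heartbeat every report_interval seconds."""
    return num_agents / report_interval


def whole_intervals_in_down_time(report_interval: int,
                                 agent_down_time: int) -> int:
    """Number of complete report intervals that fit in agent_down_time."""
    return agent_down_time // report_interval


AGENTS = 18  # agent count from the setup described in this report

# Default values: report_interval=4, agent_down_time=9
print(f"{heartbeats_per_second(AGENTS, 4):.2f}")   # 4.50
print(whole_intervals_in_down_time(4, 9))          # 2

# Proposed values: report_interval=28, agent_down_time=60
print(f"{heartbeats_per_second(AGENTS, 28):.2f}")  # 0.64
print(whole_intervals_in_down_time(28, 60))        # 2
```

Under these assumptions the proposed values cut the heartbeat rate by a
factor of 7 (28 / 4), while agent_down_time still spans two whole report
intervals in both configurations.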
** Affects: neutron
Importance: Undecided
Assignee: Assaf Muller (amuller)
Status: In Progress
** Tags: production-defaults
** Changed in: neutron
Assignee: (unassigned) => Assaf Muller (amuller)
--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1293083
Title:
report_interval too frequent; Causing load on service, failing high
CPU usage operations
Status in OpenStack Neutron (virtual network service):
In Progress
To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1293083/+subscriptions