Message #37440
[Bug 1490308] [NEW] In DHCP agent's sync_state, get_active_networks_info RPC times out, when there are large number of networks.
Public bug reported:
In our scale tests, for the scenario of supporting a large number of
networks, we encountered frequent RPC timeouts for the
get_active_networks_info call in the sync_state method.
Once this timeout happens, the DHCP agent takes an indefinite amount of
time to recover, because it keeps repeating a lot of redundant work.
Suppose I am provisioning roughly the 600th tenant network and enabling
DHCP for that network fails. A resync is then scheduled for this network
alone.
Now, in the sync_state method, we fire the get_active_networks_info call,
which takes no filters. The Neutron server is slow to respond because it
has to:
1. fetch all networks from the DB that are hosted on this agent, and attempt scheduling,
2. fetch subnet info for all of those networks,
3. fetch port info for all of those networks.
By the time the response arrives, the agent has already hit the default
60-second RPC timeout.
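To illustrate why the unfiltered call gets slower as networks are added, here is a minimal sketch of the three steps above against a toy in-memory store. Everything in it is hypothetical: FakeDB, get_active_networks_info_unfiltered, build_db, the host name, and the assumption of one subnet and two ports per network are illustrative and are not Neutron's actual code or data.

```python
from dataclasses import dataclass


@dataclass
class FakeDB:
    """Hypothetical in-memory stand-in for the server DB; counts rows read."""
    networks: dict  # network_id -> host running the DHCP agent
    subnets: dict   # network_id -> list of subnet ids
    ports: dict     # network_id -> list of port ids
    rows_read: int = 0

    def networks_on_host(self, host):
        ids = [net for net, h in self.networks.items() if h == host]
        self.rows_read += len(ids)
        return ids

    def subnets_for(self, network_ids):
        rows = [s for net in network_ids for s in self.subnets.get(net, [])]
        self.rows_read += len(rows)
        return rows

    def ports_for(self, network_ids):
        rows = [p for net in network_ids for p in self.ports.get(net, [])]
        self.rows_read += len(rows)
        return rows


def get_active_networks_info_unfiltered(db, host):
    """Toy version of the unfiltered call: always touches every network."""
    network_ids = db.networks_on_host(host)  # step 1
    subnets = db.subnets_for(network_ids)    # step 2
    ports = db.ports_for(network_ids)        # step 3
    return {"networks": network_ids, "subnets": subnets, "ports": ports}


def build_db(n_networks, host="dhcp-agent-1"):
    """Assumed shape: one subnet and two ports per network."""
    networks = {f"net-{i}": host for i in range(n_networks)}
    subnets = {net: [f"{net}-subnet"] for net in networks}
    ports = {net: [f"{net}-port-a", f"{net}-port-b"] for net in networks}
    return FakeDB(networks, subnets, ports)


db = build_db(600)
get_active_networks_info_unfiltered(db, "dhcp-agent-1")
print(db.rows_read)
```

Under these assumed numbers, a resync triggered by a single failed network still reads 2400 rows (600 networks, 600 subnets, 1200 ports), and that work grows linearly with the number of networks hosted on the agent.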
Although step 1 makes sense in some cases, we don't need subnet and
port info for all the networks when we actually want to resync only
one network.
I think we need to resurrect the get_active_networks RPC and add
filtering to the get_active_networks_info RPC.
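A rough sketch of what such filtering could look like is below. It is not the actual Neutron RPC signature: the network_ids parameter, the plain-dict data layout, and the host name are assumptions made purely for illustration.

```python
def get_active_networks_info(networks, subnets, ports, host, network_ids=None):
    """Return details for networks hosted on `host`.

    `network_ids` is the hypothetical filter: when given, only those
    networks are expanded with subnet and port info. None keeps the
    existing behaviour of returning every hosted network.
    """
    hosted = [net for net, h in networks.items() if h == host]
    if network_ids is not None:
        wanted = set(network_ids)
        hosted = [net for net in hosted if net in wanted]
    return [
        {"id": net, "subnets": subnets.get(net, []), "ports": ports.get(net, [])}
        for net in hosted
    ]


# Illustrative data: 600 networks on one agent, one subnet and one port each.
HOST = "dhcp-agent-1"
networks = {f"net-{i}": HOST for i in range(600)}
subnets = {net: [f"{net}-subnet"] for net in networks}
ports = {net: [f"{net}-port"] for net in networks}

# Full sync: same result set as the unfiltered call.
full = get_active_networks_info(networks, subnets, ports, HOST)

# Resync of the one network that failed: only that network is expanded.
partial = get_active_networks_info(
    networks, subnets, ports, HOST, network_ids=["net-599"]
)
print(len(full), len(partial))
```

With a filter like this, the agent could pass only the networks flagged for resync, so the cost of a resync would depend on how many networks failed rather than on how many the agent hosts.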
P.S.: Increasing the rpc_timeout is definitely an option, but given the
room for improvement in the agent code, I do not want to resort to that
yet.
** Affects: neutron
Importance: Undecided
Assignee: Sudhakar Gariganti (sudhakar-gariganti)
Status: New
** Tags: l3-ipam-dhcp
** Changed in: neutron
Assignee: (unassigned) => Sudhakar Gariganti (sudhakar-gariganti)
--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1490308