yahoo-eng-team team mailing list archive

[Bug 1490308] [NEW] In DHCP agent's sync_state, get_active_networks_info RPC times out, when there are large number of networks.

 

Public bug reported:

In our scale tests, for the scenario of supporting a large number of
networks, we encountered frequent RPC timeouts for the
get_active_networks_info call in the sync_state method.

Once this timeout happens, it takes an indefinite amount of time for the
DHCP agent to recover, as it keeps doing a lot of redundant work.

Assume I am provisioning, say, the 600th tenant network and enabling
DHCP for that network fails. A resync is then scheduled for this network
alone.

Now, in the sync_state method, we fire the get_active_networks_info call,
which does not take any 'filters'. The Neutron server takes a long time
to return, as it has to:

1. fetch all networks from the DB which are hosted on this agent and try to schedule,
2. fetch subnet info for all networks,
3. fetch port info for all networks.

By the time the response comes back, the agent has already hit the
default 60-second timeout.

Though step 1 makes sense in some cases, we don't need to get subnet
and port info for all the networks when we actually want to resync
only one network.

I think we need to resurrect the get_active_networks RPC and add
filtering to the get_active_networks_info RPC.
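
To make the filtering proposal concrete, here is a minimal sketch of what
a filtered call could look like. It is not Neutron's actual code: the
in-memory dicts stand in for the DB, and the network_ids parameter, the
host name and the data layout are all assumptions made for illustration.

```python
# Toy in-memory stand-ins for the Neutron DB (hypothetical data, 600 networks).
NETWORKS = {
    f"net-{i}": {"id": f"net-{i}", "host": "dhcp-agent-1"}
    for i in range(1, 601)
}
SUBNETS = {
    net_id: [{"id": f"subnet-{net_id}", "network_id": net_id}]
    for net_id in NETWORKS
}
PORTS = {
    net_id: [{"id": f"port-{net_id}", "network_id": net_id}]
    for net_id in NETWORKS
}


def get_active_networks_info(host, network_ids=None):
    """Return network details for `host`.

    `network_ids` is the hypothetical filter: when it is None, every network
    hosted on the agent is returned (the unfiltered behaviour described
    above); otherwise subnets and ports are collected only for the
    requested networks.
    """
    wanted = None if network_ids is None else set(network_ids)
    result = []
    for net_id, net in NETWORKS.items():
        if net["host"] != host:
            continue
        if wanted is not None and net_id not in wanted:
            continue
        # Subnet and port info is gathered only for networks that passed
        # the filter.
        result.append(
            {**net, "subnets": SUBNETS[net_id], "ports": PORTS[net_id]}
        )
    return result


# Full sync versus a resync of the single failed network.
full = get_active_networks_info("dhcp-agent-1")
one = get_active_networks_info("dhcp-agent-1", network_ids=["net-600"])
print(len(full), len(one))
```

With such a filter, a resync of one network would ask the server for the
subnets and ports of that network only, instead of all 600.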

P.S.: Increasing the rpc_timeout is definitely an option, but given the
room for improvement in the agent code, I do not want to resort to that
yet.

** Affects: neutron
     Importance: Undecided
     Assignee: Sudhakar Gariganti (sudhakar-gariganti)
         Status: New


** Tags: l3-ipam-dhcp

** Changed in: neutron
     Assignee: (unassigned) => Sudhakar Gariganti (sudhakar-gariganti)

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1490308

Title:
  In DHCP agent's sync_state, get_active_networks_info RPC times out,
  when there are large number of networks.

Status in neutron:
  New

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1490308/+subscriptions

