yahoo-eng-team team mailing list archive

[Bug 1813703] [NEW] [L2] [summary] ovs-agent issues at large scale

 

Public bug reported:

[L2] [summary] ovs-agent issues at large scale

We recently tested the ovs-agent with the openvswitch flow-based
security group and hit several issues at large scale. This bug gives
us a central place to track the following problems.

Problems:
(1) RPC timeouts during ovs-agent restart
(2) the local connection to ovs-vswitchd is dropped or times out
(3) ovs-agent fails to restart
(4) ovs-agent restart takes too long (15-40+ minutes)
(5) unexpected loss of flows
(6) unexpected loss of tunnels
(7) flows with multiple cookies (stale flows)
(8) dump-flows takes a long time
(9) troubleshooting is really hard when one VM loses its connection, because the flow tables are almost unreadable (they reach 30k+ flows).
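
To make problem (9) a bit more concrete, here is a minimal sketch (not part of the agent) that counts flows per table from the text output of dump-flows, so you can see where the 30k+ flows sit. The function name, the sample dump and its cookie values are made up for illustration, and the sketch assumes each flow line carries a "table=N" field; lines without one are counted under table 0.

```python
import re
from collections import Counter

# Matches the "table=N" field that dump-flows prints for each flow.
TABLE_RE = re.compile(r"\btable=(\d+)")

# Hypothetical sample output; real dumps are much larger.
SAMPLE_DUMP = """\
NXST_FLOW reply (xid=0x4):
 cookie=0x1a2b, duration=5.1s, table=0, n_packets=0, n_bytes=0, priority=2,in_port=1 actions=drop
 cookie=0x1a2b, duration=5.1s, table=71, n_packets=0, n_bytes=0, priority=0 actions=drop
 cookie=0x1a2b, duration=5.1s, table=71, n_packets=0, n_bytes=0, priority=10,in_port=7 actions=drop
"""


def flows_per_table(dump_text):
    """Return a Counter mapping table id -> number of flows."""
    counts = Counter()
    for line in dump_text.splitlines():
        if "cookie=" not in line:
            continue  # skip the reply header and blank lines
        match = TABLE_RE.search(line)
        # Assumption of this sketch: a missing table field means table 0.
        table_id = int(match.group(1)) if match else 0
        counts[table_id] += 1
    return counts


if __name__ == "__main__":
    for table_id, count in flows_per_table(SAMPLE_DUMP).most_common():
        print(f"table {table_id}: {count} flows")
```

On a real host the input would be the saved output of something like "ovs-ofctl dump-flows br-int", taken once and analysed offline.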

The problems can be seen in the following scenarios:
(1) 2000-3000 ports related to one single security group (or one remote security group)
(2) create 2000-3000 VMs in one single subnet (network)
(3) create 2000-3000 VMs under one single security group

Yes, scale is the main problem: as the VM count on one host approaches
150-200 (while the number of ports in one subnet or security group
approaches 2000), the ovs-agent restart gets worse.

Test ENV:
stable/queens

Deployment topology:
neutron-server, the database and the message queue each have their own dedicated physical hosts, with at least 3 nodes per service.

Configurations:
ovs-agent was set up with l2pop and the ovs flow-based security group, and the config was basically as follows:
[agent]
enable_distributed_routing = True
l2_population = True
tunnel_types = vxlan
arp_responder = True
prevent_arp_spoofing = True
extensions = qos
report_interval = 60

[ovs]
bridge_mappings = tenant:br-vlan,external:br-ex
local_ip = 10.114.4.48

[securitygroup]
firewall_driver = openvswitch
enable_security_group = True
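
For anyone comparing their own environment against this setup, the following is a rough sketch that reads an agent config with Python's standard configparser and reports the options listed above. The function name and the returned keys are hypothetical, and the embedded text only repeats a subset of the options from this report; it does not reflect how the agent itself loads its configuration.

```python
import configparser

# Subset of the options from this report, embedded for a self-contained run.
AGENT_INI = """
[agent]
enable_distributed_routing = True
l2_population = True
tunnel_types = vxlan
arp_responder = True

[securitygroup]
firewall_driver = openvswitch
enable_security_group = True
"""


def summarize_agent_config(ini_text):
    """Return the options from this report as a plain dict."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    driver = parser.get("securitygroup", "firewall_driver", fallback="")
    return {
        "l2_population": parser.getboolean(
            "agent", "l2_population", fallback=False),
        "distributed_routing": parser.getboolean(
            "agent", "enable_distributed_routing", fallback=False),
        "arp_responder": parser.getboolean(
            "agent", "arp_responder", fallback=False),
        "tunnel_types": parser.get("agent", "tunnel_types", fallback=""),
        "firewall_driver": driver,
        # True when the flow-based security group driver is selected.
        "ovs_flow_firewall": driver == "openvswitch",
    }


if __name__ == "__main__":
    for key, value in summarize_agent_config(AGENT_INI).items():
        print(f"{key} = {value}")
```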

Some issue tracking:
(1) mostly caused by the large number of ports related to one security group or in one network
(2) unnecessary RPC calls during ovs-agent restart
(3) inefficient database query conditions
(4) the full sync is redone again and again if any exception is raised in rpc_loop
(5) cleaning stale flows dumps all flows first (not once, but multiple times), which is really time-consuming
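
As an illustration of item (5) only, and not the agent's actual cleanup code, the sketch below groups a single dump by cookie and picks out the flows whose cookie is not in the set currently in use, so the dump is parsed once instead of being fetched repeatedly. The function names, the sample dump and the cookie values are all hypothetical.

```python
import re
from collections import defaultdict

# Matches the hexadecimal cookie printed at the start of each flow line.
COOKIE_RE = re.compile(r"\bcookie=(0x[0-9a-fA-F]+)")

# Hypothetical single dump containing two cookies.
SAMPLE_DUMP = """\
NXST_FLOW reply (xid=0x4):
 cookie=0x1a2b, duration=5.1s, table=0, n_packets=0, n_bytes=0, priority=2,in_port=1 actions=drop
 cookie=0x9f9f, duration=900.0s, table=0, n_packets=0, n_bytes=0, priority=2,in_port=1 actions=drop
 cookie=0x1a2b, duration=5.1s, table=71, n_packets=0, n_bytes=0, priority=0 actions=drop
"""


def group_by_cookie(dump_text):
    """Parse the dump once and return {cookie: [flow lines]}."""
    groups = defaultdict(list)
    for line in dump_text.splitlines():
        match = COOKIE_RE.search(line)
        if match:
            groups[int(match.group(1), 16)].append(line.strip())
    return groups


def stale_flows(dump_text, current_cookies):
    """Return flows whose cookie is not in current_cookies."""
    groups = group_by_cookie(dump_text)
    return {cookie: flows for cookie, flows in groups.items()
            if cookie not in current_cookies}


if __name__ == "__main__":
    # Assumption: 0x1a2b is the only cookie in use after the restart.
    for cookie, flows in stale_flows(SAMPLE_DUMP, {0x1A2B}).items():
        print(f"stale cookie {cookie:#x}: {len(flows)} flows")
```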

So this is a summary bug for all the scale issues we have met.

Some potential solutions:
Increasing config options such as rpc_response_timeout, of_connect_timeout, of_request_timeout, ovsdb_timeout, etc.
does not help much, and these changes can make the restart take much longer. The issues can still be seen.

One workaround is to disable the openvswitch flow-based security group;
the ovs-agent can then restart in less than 10 minutes.

** Affects: neutron
     Importance: Undecided
         Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1813703

Title:
  [L2] [summary] ovs-agent issues at large scale

Status in neutron:
  New

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1813703/+subscriptions

