yahoo-eng-team team mailing list archive

[Bug 1938202] Re: Cannot adjust number of resources in one agent step during device handling.

 

Reviewed:  https://review.opendev.org/c/openstack/neutron/+/802596
Committed: https://opendev.org/openstack/neutron/commit/1221cf1b24484d61622f16937f1d3a08a5c2c27f
Submitter: "Zuul (22348)"
Branch:    master

commit 1221cf1b24484d61622f16937f1d3a08a5c2c27f
Author: Mitya_Eremeev <mitossvyaz@xxxxxxx>
Date:   Tue Jul 27 21:14:02 2021 +0300

    Agent RPC step size in config.
    
    Number of resources for neutron to divide
    the large RPC call data sets always equals 100.
    In "big" networks even these numbers can be
    still huge and cause service timeouts.
    Now we can decrease these numbers in config.
    Default value equals 20 now.
    
    Closes-Bug: 1938202
    Change-Id: Idf545ad31398ded460b6c2ae1675dd5e9ae71440
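
With this change the step size is no longer hard-coded and can be lowered in
neutron.conf. A minimal sketch, assuming the option introduced by this commit
is named rpc_resources_processing_step and is read from the [DEFAULT] section
(check the configuration reference of your release for the exact option name
and group):

  [DEFAULT]
  # Lower the batch size further if bulk_pull replies still time out.
  rpc_resources_processing_step = 10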


** Changed in: neutron
       Status: In Progress => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1938202

Title:
  Cannot adjust number of resources in one agent step during device
  handling.

Status in neutron:
  Fix Released

Bug description:
  Sometimes certain RPC calls time out during port updates on compute nodes.
  It may happen when the OVS agent finds a new port on br-int, i.e. during a very regular operation.
  Timeout example from the logs:

  File "/usr/lib/python2.7/dist-
  packages/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py",
  line 2384,

  in rpc_loop\n    port_info, provisioning_needed)\n', '  File
  "/usr/lib/python2.7/dist-packages/osprofiler/profiler.py", line 158,

  in wrapper\n    result = f(*args, **kwargs)\n', '  File
  "/usr/lib/python2.7/dist-
  packages/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py",
  line 1831,

  in process_network_ports\n    port_info.get(\'updated\', set()))\n', '
  File "/usr/lib/python2.7/dist-
  packages/neutron/agent/securitygroups_rpc.py", line 262,

  in setup_port_filters\n
  self.prepare_devices_filter(new_devices)\n', '  File
  "/usr/lib/python2.7/dist-
  packages/neutron/agent/securitygroups_rpc.py", line 125,

  in decorated_function\n    *args, **kwargs)\n', '  File
  "/usr/lib/python2.7/dist-
  packages/neutron/agent/securitygroups_rpc.py", line 137,

  in prepare_devices_filter\n    self._apply_port_filter(device_ids)\n',
  '  File "/usr/lib/python2.7/dist-
  packages/neutron/agent/securitygroups_rpc.py", line 147,

  in _apply_port_filter\n    self.context, list(device_ids)[i:i +
  step])\n', '  File "/usr/lib/python2.7/dist-
  packages/neutron/api/rpc/handlers/securitygroups_rpc.py", line 233,

  in security_group_info_for_devices\n    result =
  self.security_group_info_for_ports(context, ports)\n', '  File
  "/usr/lib/python2.7/dist-
  packages/neutron/db/securitygroups_rpc_base.py", line 220,

  in security_group_info_for_ports\n    return
  self._get_security_group_member_ips(context, sg_info)\n', '  File
  "/usr/lib/python2.7/dist-
  packages/neutron/db/securitygroups_rpc_base.py", line 224,

  in _get_security_group_member_ips\n    context,
  sg_info[\'sg_member_ips\'].keys())\n', '  File
  "/usr/lib/python2.7/dist-
  packages/neutron/api/rpc/handlers/securitygroups_rpc.py", line 328,

  in _select_ips_for_remote_group\n    for p in
  self.rcache.get_resources(\'Port\', filters):\n', '  File
  "/usr/lib/python2.7/dist-packages/neutron/agent/resource_cache.py",
  line 120,

  in get_resources\n    self._flood_cache_for_query(rtype,
  **filters)\n', '  File "/usr/lib/python2.7/dist-
  packages/neutron/agent/resource_cache.py", line 76,

  in _flood_cache_for_query\n    filter_kwargs=filter_kwargs)\n', '
  File "/usr/lib/python2.7/dist-packages/oslo_log/helpers.py", line 67,

  in wrapper\n    return method(*args, **kwargs)\n', '  File
  "/usr/lib/python2.7/dist-
  packages/neutron/api/rpc/handlers/resources_rpc.py", line 112,

  in bulk_pull\n    version=resource_type_cls.VERSION,
  filter_kwargs=filter_kwargs)\n', '  File "/usr/lib/python2.7/dist-
  packages/neutron/common/rpc.py", line 189,

  in call\n    time.sleep(wait)\n', '  File "/usr/lib/python2.7/dist-
  packages/oslo_utils/excutils.py", line 220,

  in __exit__\n    self.force_reraise()\n', '  File
  "/usr/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 196,

  in force_reraise\n    six.reraise(self.type_, self.value, self.tb)\n',
  '  File "/usr/lib/python2.7/dist-packages/neutron/common/rpc.py", line
  165,

  in call\n    return self._original_context.call(ctxt, method,
  **kwargs)\n', '  File "/usr/lib/python2.7/dist-
  packages/oslo_messaging/rpc/client.py", line 174,

  in call\n    retry=self.retry)\n', '  File "/usr/lib/python2.7/dist-
  packages/oslo_messaging/transport.py", line 131,

  in _send\n    timeout=timeout, retry=retry)\n', '  File
  "/usr/lib/python2.7/dist-
  packages/oslo_messaging/_drivers/amqpdriver.py", line 625,

  in send\n    retry=retry)\n', '  File "/usr/lib/python2.7/dist-
  packages/oslo_messaging/_drivers/amqpdriver.py", line 614,

  in _send\n    result = self._waiter.wait(msg_id, timeout)\n', '  File
  "/usr/lib/python2.7/dist-
  packages/oslo_messaging/_drivers/amqpdriver.py", line 506,

  in wait\n    message = self.waiters.get(msg_id, timeout=timeout)\n', '
  File "/usr/lib/python2.7/dist-
  packages/oslo_messaging/_drivers/amqpdriver.py", line 394,

  in get\n    \'to message ID %s\' % msg_id)\n']: MessagingTimeout:
  Timed out waiting for a reply to message ID
  1b9cbbfe84f84d7581d53640a9d75222

  The reason for the RPC timeouts is scale.
  We have about a hundred compute nodes with hundreds of networks and create large stacks, which causes huge amounts of data to be sent between neutron-server and the OVS agents.
  That often leads to RPC timeouts.

  The increased load in this scenario is caused by the L2pop feature, which
  tells ovs-agents how to properly set up flooding flows. Each newly
  spawned VM of the stack causes all computes that have ports on the
  same network to receive notifications. Upon notification the OVS agent
  requests l2pop data, which is a lot of data. It all goes through RPC and
  is heavy on the neutron-server side.

  The workaround of increasing the RPC timeout does not actually work,
  because it is effectively in place already: agents use exponential back-off
  on RPC calls, doubling the timeout after each failed attempt. The timeouts
  still happen because they depend on external factors, such as how big the
  stacks being created are.

  The call that times out is “bulk_pull”, which sometimes returns about
  2000 objects (ports) and takes > 120 seconds to complete.

  What I noticed is that the behaviour is not quite linear: e.g. up to 100
  objects are returned within 0.1 seconds, and then there is a big,
  disproportionate jump in response time. Perhaps it depends on what extra
  info those ports carry (I guess it is security-group-related info).

  Looking at the agent code, there is only a single place where bulk_pull
  is used, which is the RemoteResourceCache.

  Inside of it, one method fetches resources by ID, so it looks like it
  can't produce that many resources in a query anyway.

  The other one, which gets resources by an "arbitrary" query, can, and it
  is used only in security group handling…
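
  To make the two access paths concrete, here is a rough sketch of how the
  agent-side cache is used (simplified, method names written from memory;
  the real class is neutron.agent.resource_cache.RemoteResourceCache):

    # Cheap path: look up a single object by ID; at most one resource per call.
    port = rcache.get_resource_by_id('Port', port_id)

    # Expensive path: arbitrary filter query; on a cache miss this floods the
    # cache through a bulk_pull RPC and may bring back thousands of objects.
    ports = rcache.get_resources('Port', {'security_group_ids': sg_ids})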

  The number of resources that neutron agent-side functions use to process
  large sets in one step is always 100.
  The number of resources into which neutron divides large RPC call data
  sets is always 20.
  In "big" networks even these numbers can still be huge and cause service
  timeouts.
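
  From memory, the hard-coded step sizes mentioned above live in
  neutron/common/_constants.py and look roughly like this (names and values
  as I recall them, not verified against the linked revision):

    # neutron/common/_constants.py (approximate)
    AGENT_RES_PROCESSING_STEP = 100  # agent-side batch size for large sets
    RPC_RES_PROCESSING_STEP = 20     # batch size used to split large RPC payloads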

  When the ovs-agent configures the OVS firewall, it sends an RPC query
  asking for ports, using security group IDs as filters.

  The query is "give me all ports related to given security group ids".

  In our environment a few security group IDs corresponding to big networks result in about 3000 ports being fetched by this query.
  When loaded, each port is decorated with a few extra objects by SQLAlchemy, which issues subqueries for each port.
  As a result we end up with thousands of SQL queries for a single RPC request.

  My suggestion is to change the place where this RPC request is built so that, instead of sending multiple security group IDs at once, the agent sends them one by one.
  While this will not improve overall performance, it will reduce the time taken by each RPC request, distribute the load between neutron-server instances better, and avoid RPC timeouts.
  Also, this is an easy fix and does not require changes to internal RPC APIs.
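
  A minimal sketch of that idea, with the filter key and the cache handle
  named here only for illustration (the real call sites are in
  neutron/agent/securitygroups_rpc.py and
  neutron/api/rpc/handlers/securitygroups_rpc.py):

    # Instead of one query filtered by all security group IDs at once, e.g.
    #   ports = rcache.get_resources('Port', {'security_group_ids': sg_ids})
    # issue one smaller query per security group ID and merge the results.
    def ports_for_security_groups(rcache, sg_ids):
        ports = []
        for sg_id in sg_ids:
            ports.extend(
                rcache.get_resources('Port', {'security_group_ids': (sg_id,)}))
        return ports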

  https://opendev.org/openstack/neutron/src/branch/master/neutron/common/_constants.py#L58

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1938202/+subscriptions


