yahoo-eng-team team mailing list archive

[Bug 1795992] Re: retry_select_destinations decorator can make a mess with allocations in placement in a large multi-create request

 

Reviewed:  https://review.openstack.org/607735
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=5af632e9cab670cc25c2f627bb0d4c0a02258277
Submitter: Zuul
Branch:    master

commit 5af632e9cab670cc25c2f627bb0d4c0a02258277
Author: Matt Riedemann <mriedem.os@xxxxxxxxx>
Date:   Wed Oct 3 18:57:45 2018 -0400

    Use long_rpc_timeout in select_destinations RPC call
    
    Conductor RPC calls the scheduler to get hosts during
    server create, which in a multi-create request with a
    lot of servers and the default rpc_response_timeout, can
    trigger a MessagingTimeout. Due to the old
    retry_select_destinations decorator, conductor will retry
    the select_destinations RPC call up to max_attempts times,
    so thrice by default. This can clobber the scheduler and
    placement while the initial scheduler worker is still
    trying to process the beefy request and allocate resources
    in placement.
    
    This has been recreated in a devstack test patch [1] and
    shown to fail with 1000 instances in a single request with
    the default rpc_response_timeout of 60 seconds. Changing the
    rpc_response_timeout to 300 avoids the MessagingTimeout and
    retry loop.
    
    Since Rocky we have the long_rpc_timeout config option which
    defaults to 1800 seconds. The RPC client can thus be changed
    to heartbeat the scheduler service during the RPC call every
    $rpc_response_timeout seconds with a hard timeout of
    $long_rpc_timeout. That change is made here.
    
    As a result, the problematic retry_select_destinations
    decorator is also no longer necessary and removed here. That
    decorator was added in I2b891bf6d0a3d8f45fd98ca54a665ae78eab78b3
    and was a hack for scheduler high availability where a
    MessagingTimeout was assumed to be a result of the scheduler
    service dying so retrying the request was reasonable to hit
    another scheduler worker, but is clearly not sufficient
    in the large multi-create case, and long_rpc_timeout is a
    better fit for that HA type scenario to heartbeat the scheduler
    service.
    
    [1] https://review.openstack.org/507918/
    
    Change-Id: I87d89967bbc5fbf59cf44d9a63eb6e9d477ac1f3
    Closes-Bug: #1795992


** Changed in: nova
       Status: In Progress => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1795992

Title:
  retry_select_destinations decorator can make a mess with allocations
  in placement in a large multi-create request

Status in OpenStack Compute (nova):
  Fix Released
Status in OpenStack Compute (nova) rocky series:
  Triaged

Bug description:
  There is no rate-limiting on the min/max count parameters when
  creating multiple servers in a single request besides the 'instances'
  quota that the tenant has. However, some tenants could have a
  relatively high instances quota (100 or more).
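
  For illustration, such a multi-create request is a single server-create
  call that passes min_count/max_count in the body; a minimal Python sketch
  (min_count/max_count are the real API fields, everything else here is a
  placeholder):

    # Illustrative only: the body of one POST /v2.1/servers request asking
    # nova to build 1000 servers at once. The image/flavor references are
    # placeholders.
    create_body = {
        'server': {
            'name': 'bulk-test',
            'imageRef': '<image uuid>',
            'flavorRef': '<flavor id>',
            'min_count': 1000,
            'max_count': 1000,
        }
    }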

  The scheduler RPC API client code wraps select_destinations with the
  retry_select_destinations decorator:

  https://github.com/openstack/nova/blob/cce3208cc28268e4b50e155c205bcab9f1da2a4b/nova/scheduler/utils.py#L921
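
  That decorator is essentially a retry-on-MessagingTimeout wrapper around
  the RPC call. A simplified sketch of the pattern (not the exact nova
  code; [scheduler]max_attempts is the real option driving the retry
  count):

    import functools

    import oslo_messaging as messaging
    from oslo_log import log as logging

    from nova import conf

    CONF = conf.CONF
    LOG = logging.getLogger(__name__)


    def retry_select_destinations(select_destinations):
        """Retry the select_destinations RPC call on MessagingTimeout."""
        @functools.wraps(select_destinations)
        def wrapped(*args, **kwargs):
            # max_attempts defaults to 3, i.e. up to two retries after the
            # first call times out; each retry re-runs scheduling for the
            # whole set of instances.
            max_retries = max(0, CONF.scheduler.max_attempts - 1)
            for attempt in range(max_retries + 1):
                try:
                    return select_destinations(*args, **kwargs)
                except messaging.MessagingTimeout:
                    if attempt == max_retries:
                        raise
                    LOG.warning('Retrying select_destinations after a '
                                'MessagingTimeout, attempt %s of %s.',
                                attempt + 1, max_retries)
        return wrapped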

  With the default rpc_response_timeout value, a select_destinations call
  for a large number of instances in a multi-create request can fairly
  easily fail with a MessagingTimeout and be retried by this decorator.
  Each retry makes the scheduler process the same set of instances again,
  potentially creating duplicate allocations in placement and then failing
  to clean up those allocations because the consumer's generation has
  changed, as seen in the logs from this recreate patch:

  https://review.openstack.org/#/c/507918/

  http://logs.openstack.org/18/507918/8/check/tempest-full/a9f3849/controller/logs/screen-n-sch.txt.gz?level=TRACE#_Oct_02_23_29_12_475481

  Oct 02 23:29:12.377430 ubuntu-xenial-limestone-regionone-0002536892 nova-scheduler[22653]: WARNING nova.scheduler.client.report [None req-f4fe43ea-d117-4b7d-a3a4-23dcb59f3058 admin admin] Unable to submit allocation for instance 63ae7544-7693-4749-886b-024dc93f09f9 (409 {"errors": [{"status": 409, "request_id": "req-0e85117c-871c-46c2-9e01-53c84e811b44", "code": "placement.undefined_code", "detail": "There was a conflict when trying to complete your request.\n\n Unable to allocate inventory: Unable to create allocation for 'MEMORY_MB' on resource provider 'b7709a93-f14c-42ed-addf-9736fb721728'. The requested amount would exceed the capacity.  ", "title": "Conflict"}]})
  Oct 02 23:29:12.472553 ubuntu-xenial-limestone-regionone-0002536892 nova-scheduler[22653]: WARNING nova.scheduler.client.report [None req-f4fe43ea-d117-4b7d-a3a4-23dcb59f3058 admin admin] Unable to delete allocation for instance 6962f92b-7dca-4912-aeb2-dcae03c4b52e: (409 {"errors": [{"status": 409, "request_id": "req-13df41fe-cb55-49f1-a998-09b34e48f05b", "code": "placement.concurrent_update", "detail": "There was a conflict when trying to complete your request.\n\n consumer generation conflict - expected null but got 3  ", "title": "Conflict"}]})
  Oct 02 23:29:12.475481 ubuntu-xenial-limestone-regionone-0002536892 nova-scheduler[22653]: ERROR oslo_messaging.rpc.server [None req-f4fe43ea-d117-4b7d-a3a4-23dcb59f3058 admin admin] Exception during message handling: AllocationDeleteFailed: Failed to delete allocations for consumer 6962f92b-7dca-4912-aeb2-dcae03c4b52e. Error: {"errors": [{"status": 409, "request_id": "req-13df41fe-cb55-49f1-a998-09b34e48f05b", "code": "placement.concurrent_update", "detail": "There was a conflict when trying to complete your request.\n\n consumer generation conflict - expected null but got 3  ", "title": "Conflict"}]}
  Oct 02 23:29:12.475908 ubuntu-xenial-limestone-regionone-0002536892 nova-scheduler[22653]: ERROR oslo_messaging.rpc.server Traceback (most recent call last):
  Oct 02 23:29:12.476075 ubuntu-xenial-limestone-regionone-0002536892 nova-scheduler[22653]: ERROR oslo_messaging.rpc.server   File "/usr/local/lib/python2.7/dist-packages/oslo_messaging/rpc/server.py", line 163, in _process_incoming
  Oct 02 23:29:12.476251 ubuntu-xenial-limestone-regionone-0002536892 nova-scheduler[22653]: ERROR oslo_messaging.rpc.server     res = self.dispatcher.dispatch(message)
  Oct 02 23:29:12.476424 ubuntu-xenial-limestone-regionone-0002536892 nova-scheduler[22653]: ERROR oslo_messaging.rpc.server   File "/usr/local/lib/python2.7/dist-packages/oslo_messaging/rpc/dispatcher.py", line 265, in dispatch
  Oct 02 23:29:12.476617 ubuntu-xenial-limestone-regionone-0002536892 nova-scheduler[22653]: ERROR oslo_messaging.rpc.server     return self._do_dispatch(endpoint, method, ctxt, args)
  Oct 02 23:29:12.476803 ubuntu-xenial-limestone-regionone-0002536892 nova-scheduler[22653]: ERROR oslo_messaging.rpc.server   File "/usr/local/lib/python2.7/dist-packages/oslo_messaging/rpc/dispatcher.py", line 194, in _do_dispatch
  Oct 02 23:29:12.476969 ubuntu-xenial-limestone-regionone-0002536892 nova-scheduler[22653]: ERROR oslo_messaging.rpc.server     result = func(ctxt, **new_args)
  Oct 02 23:29:12.477123 ubuntu-xenial-limestone-regionone-0002536892 nova-scheduler[22653]: ERROR oslo_messaging.rpc.server   File "/usr/local/lib/python2.7/dist-packages/oslo_messaging/rpc/server.py", line 226, in inner
  Oct 02 23:29:12.477277 ubuntu-xenial-limestone-regionone-0002536892 nova-scheduler[22653]: ERROR oslo_messaging.rpc.server     return func(*args, **kwargs)
  Oct 02 23:29:12.477439 ubuntu-xenial-limestone-regionone-0002536892 nova-scheduler[22653]: ERROR oslo_messaging.rpc.server   File "/opt/stack/nova/nova/scheduler/manager.py", line 169, in select_destinations
  Oct 02 23:29:12.477624 ubuntu-xenial-limestone-regionone-0002536892 nova-scheduler[22653]: ERROR oslo_messaging.rpc.server     allocation_request_version, return_alternates)
  Oct 02 23:29:12.477806 ubuntu-xenial-limestone-regionone-0002536892 nova-scheduler[22653]: ERROR oslo_messaging.rpc.server   File "/opt/stack/nova/nova/scheduler/filter_scheduler.py", line 91, in select_destinations
  Oct 02 23:29:12.477971 ubuntu-xenial-limestone-regionone-0002536892 nova-scheduler[22653]: ERROR oslo_messaging.rpc.server     allocation_request_version, return_alternates)
  Oct 02 23:29:12.478124 ubuntu-xenial-limestone-regionone-0002536892 nova-scheduler[22653]: ERROR oslo_messaging.rpc.server   File "/opt/stack/nova/nova/scheduler/filter_scheduler.py", line 256, in _schedule
  Oct 02 23:29:12.478268 ubuntu-xenial-limestone-regionone-0002536892 nova-scheduler[22653]: ERROR oslo_messaging.rpc.server     claimed_instance_uuids)
  Oct 02 23:29:12.478412 ubuntu-xenial-limestone-regionone-0002536892 nova-scheduler[22653]: ERROR oslo_messaging.rpc.server   File "/opt/stack/nova/nova/scheduler/filter_scheduler.py", line 276, in _ensure_sufficient_hosts
  Oct 02 23:29:12.478634 ubuntu-xenial-limestone-regionone-0002536892 nova-scheduler[22653]: ERROR oslo_messaging.rpc.server     self._cleanup_allocations(context, claimed_uuids)
  Oct 02 23:29:12.478808 ubuntu-xenial-limestone-regionone-0002536892 nova-scheduler[22653]: ERROR oslo_messaging.rpc.server   File "/opt/stack/nova/nova/scheduler/filter_scheduler.py", line 301, in _cleanup_allocations
  Oct 02 23:29:12.478965 ubuntu-xenial-limestone-regionone-0002536892 nova-scheduler[22653]: ERROR oslo_messaging.rpc.server     self.placement_client.delete_allocation_for_instance(context, uuid)
  Oct 02 23:29:12.479121 ubuntu-xenial-limestone-regionone-0002536892 nova-scheduler[22653]: ERROR oslo_messaging.rpc.server   File "/opt/stack/nova/nova/scheduler/client/__init__.py", line 37, in __run_method
  Oct 02 23:29:12.479463 ubuntu-xenial-limestone-regionone-0002536892 nova-scheduler[22653]: ERROR oslo_messaging.rpc.server     return getattr(self.instance, __name)(*args, **kwargs)
  Oct 02 23:29:12.479665 ubuntu-xenial-limestone-regionone-0002536892 nova-scheduler[22653]: ERROR oslo_messaging.rpc.server   File "/opt/stack/nova/nova/scheduler/client/report.py", line 78, in wrapper
  Oct 02 23:29:12.479849 ubuntu-xenial-limestone-regionone-0002536892 nova-scheduler[22653]: ERROR oslo_messaging.rpc.server     return f(self, *a, **k)
  Oct 02 23:29:12.480013 ubuntu-xenial-limestone-regionone-0002536892 nova-scheduler[22653]: ERROR oslo_messaging.rpc.server   File "/opt/stack/nova/nova/scheduler/client/report.py", line 2126, in delete_allocation_for_instance
  Oct 02 23:29:12.480168 ubuntu-xenial-limestone-regionone-0002536892 nova-scheduler[22653]: ERROR oslo_messaging.rpc.server     error=r.text)
  Oct 02 23:29:12.480343 ubuntu-xenial-limestone-regionone-0002536892 nova-scheduler[22653]: ERROR oslo_messaging.rpc.server AllocationDeleteFailed: Failed to delete allocations for consumer 6962f92b-7dca-4912-aeb2-dcae03c4b52e. Error: {"errors": [{"status": 409, "request_id": "req-13df41fe-cb55-49f1-a998-09b34e48f05b", "code": "placement.concurrent_update", "detail": "There was a conflict when trying to complete your request.\n\n consumer generation conflict - expected null but got 3  ", "title": "Conflict"}]}
  Oct 02 23:29:12.480508 ubuntu-xenial-limestone-regionone-0002536892 nova-scheduler[22653]: ERROR oslo_messaging.rpc.server 

  http://logs.openstack.org/18/507918/8/check/tempest-full/a9f3849/controller/logs/screen-n-super-cond.txt.gz?level=TRACE#_Oct_02_23_27_15_133261

  Oct 02 23:27:15.133261 ubuntu-xenial-limestone-regionone-0002536892 nova-conductor[24583]: WARNING nova.scheduler.utils [None req-f4fe43ea-d117-4b7d-a3a4-23dcb59f3058 admin admin] Retrying select_destinations after a MessagingTimeout, attempt 1 of 2.: MessagingTimeout: Timed out waiting for a reply to message ID 8195ae78cdce460f8ecf9a8d0c616709
  Oct 02 23:28:15.142976 ubuntu-xenial-limestone-regionone-0002536892 nova-conductor[24583]: WARNING nova.scheduler.utils [None req-f4fe43ea-d117-4b7d-a3a4-23dcb59f3058 admin admin] Retrying select_destinations after a MessagingTimeout, attempt 2 of 2.: MessagingTimeout: Timed out waiting for a reply to message ID 3aa6c2d17721416aaf7224af9198d3ce
  Oct 02 23:29:15.155442 ubuntu-xenial-limestone-regionone-0002536892 nova-conductor[24583]: ERROR nova.conductor.manager [None req-f4fe43ea-d117-4b7d-a3a4-23dcb59f3058 admin admin] Failed to schedule instances: MessagingTimeout: Timed out waiting for a reply to message ID 832f7daa5e764687921a00ebedaa5648
  Oct 02 23:29:15.155708 ubuntu-xenial-limestone-regionone-0002536892 nova-conductor[24583]: ERROR nova.conductor.manager Traceback (most recent call last):
  Oct 02 23:29:15.155868 ubuntu-xenial-limestone-regionone-0002536892 nova-conductor[24583]: ERROR nova.conductor.manager   File "/opt/stack/nova/nova/conductor/manager.py", line 1210, in schedule_and_build_instances
  Oct 02 23:29:15.156075 ubuntu-xenial-limestone-regionone-0002536892 nova-conductor[24583]: ERROR nova.conductor.manager     instance_uuids, return_alternates=True)
  Oct 02 23:29:15.156281 ubuntu-xenial-limestone-regionone-0002536892 nova-conductor[24583]: ERROR nova.conductor.manager   File "/opt/stack/nova/nova/conductor/manager.py", line 725, in _schedule_instances
  Oct 02 23:29:15.156432 ubuntu-xenial-limestone-regionone-0002536892 nova-conductor[24583]: ERROR nova.conductor.manager     return_alternates=return_alternates)
  Oct 02 23:29:15.156601 ubuntu-xenial-limestone-regionone-0002536892 nova-conductor[24583]: ERROR nova.conductor.manager   File "/opt/stack/nova/nova/scheduler/utils.py", line 907, in wrapped
  Oct 02 23:29:15.156813 ubuntu-xenial-limestone-regionone-0002536892 nova-conductor[24583]: ERROR nova.conductor.manager     return func(*args, **kwargs)
  Oct 02 23:29:15.156975 ubuntu-xenial-limestone-regionone-0002536892 nova-conductor[24583]: ERROR nova.conductor.manager   File "/opt/stack/nova/nova/scheduler/client/__init__.py", line 53, in select_destinations
  Oct 02 23:29:15.157136 ubuntu-xenial-limestone-regionone-0002536892 nova-conductor[24583]: ERROR nova.conductor.manager     instance_uuids, return_objects, return_alternates)
  Oct 02 23:29:15.157283 ubuntu-xenial-limestone-regionone-0002536892 nova-conductor[24583]: ERROR nova.conductor.manager   File "/opt/stack/nova/nova/scheduler/client/__init__.py", line 37, in __run_method
  Oct 02 23:29:15.157449 ubuntu-xenial-limestone-regionone-0002536892 nova-conductor[24583]: ERROR nova.conductor.manager     return getattr(self.instance, __name)(*args, **kwargs)
  Oct 02 23:29:15.157603 ubuntu-xenial-limestone-regionone-0002536892 nova-conductor[24583]: ERROR nova.conductor.manager   File "/opt/stack/nova/nova/scheduler/client/query.py", line 42, in select_destinations
  Oct 02 23:29:15.157757 ubuntu-xenial-limestone-regionone-0002536892 nova-conductor[24583]: ERROR nova.conductor.manager     instance_uuids, return_objects, return_alternates)
  Oct 02 23:29:15.157904 ubuntu-xenial-limestone-regionone-0002536892 nova-conductor[24583]: ERROR nova.conductor.manager   File "/opt/stack/nova/nova/scheduler/rpcapi.py", line 158, in select_destinations
  Oct 02 23:29:15.158137 ubuntu-xenial-limestone-regionone-0002536892 nova-conductor[24583]: ERROR nova.conductor.manager     return cctxt.call(ctxt, 'select_destinations', **msg_args)
  Oct 02 23:29:15.158304 ubuntu-xenial-limestone-regionone-0002536892 nova-conductor[24583]: ERROR nova.conductor.manager   File "/usr/local/lib/python2.7/dist-packages/oslo_messaging/rpc/client.py", line 179, in call
  Oct 02 23:29:15.158515 ubuntu-xenial-limestone-regionone-0002536892 nova-conductor[24583]: ERROR nova.conductor.manager     retry=self.retry)
  Oct 02 23:29:15.158723 ubuntu-xenial-limestone-regionone-0002536892 nova-conductor[24583]: ERROR nova.conductor.manager   File "/usr/local/lib/python2.7/dist-packages/oslo_messaging/transport.py", line 133, in _send
  Oct 02 23:29:15.158910 ubuntu-xenial-limestone-regionone-0002536892 nova-conductor[24583]: ERROR nova.conductor.manager     retry=retry)
  Oct 02 23:29:15.159059 ubuntu-xenial-limestone-regionone-0002536892 nova-conductor[24583]: ERROR nova.conductor.manager   File "/usr/local/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py", line 584, in send
  Oct 02 23:29:15.160294 ubuntu-xenial-limestone-regionone-0002536892 nova-conductor[24583]: ERROR nova.conductor.manager     call_monitor_timeout, retry=retry)
  Oct 02 23:29:15.160454 ubuntu-xenial-limestone-regionone-0002536892 nova-conductor[24583]: ERROR nova.conductor.manager   File "/usr/local/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py", line 573, in _send
  Oct 02 23:29:15.160611 ubuntu-xenial-limestone-regionone-0002536892 nova-conductor[24583]: ERROR nova.conductor.manager     call_monitor_timeout)
  Oct 02 23:29:15.160770 ubuntu-xenial-limestone-regionone-0002536892 nova-conductor[24583]: ERROR nova.conductor.manager   File "/usr/local/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py", line 459, in wait
  Oct 02 23:29:15.160924 ubuntu-xenial-limestone-regionone-0002536892 nova-conductor[24583]: ERROR nova.conductor.manager     message = self.waiters.get(msg_id, timeout=timeout)
  Oct 02 23:29:15.161099 ubuntu-xenial-limestone-regionone-0002536892 nova-conductor[24583]: ERROR nova.conductor.manager   File "/usr/local/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py", line 336, in get
  Oct 02 23:29:15.161253 ubuntu-xenial-limestone-regionone-0002536892 nova-conductor[24583]: ERROR nova.conductor.manager     'to message ID %s' % msg_id)
  Oct 02 23:29:15.161412 ubuntu-xenial-limestone-regionone-0002536892 nova-conductor[24583]: ERROR nova.conductor.manager MessagingTimeout: Timed out waiting for a reply to message ID 832f7daa5e764687921a00ebedaa5648
  Oct 02 23:29:15.161556 ubuntu-xenial-limestone-regionone-0002536892 nova-conductor[24583]: ERROR nova.conductor.manager 

  Since Rocky we've had the long_rpc_timeout config option which should
  really be used for the select_destinations RPC call instead of the
  retry_select_destinations decorator.
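
  Concretely, that means preparing the select_destinations call with
  call_monitor_timeout=rpc_response_timeout (so oslo.messaging expects
  periodic heartbeats from the scheduler service instead of failing the
  call) and a hard deadline of long_rpc_timeout. A hedged sketch of that
  pattern in the scheduler rpcapi (not the literal nova diff; see the
  commit above for the actual change):

    from nova import conf

    CONF = conf.CONF


    def select_destinations(client, ctxt, version, **msg_args):
        # 'client' is assumed to be the existing oslo.messaging RPCClient
        # for the scheduler topic. call_monitor_timeout keeps the call
        # alive via heartbeats every rpc_response_timeout seconds, while
        # timeout=long_rpc_timeout (1800s by default) is the hard deadline.
        cctxt = client.prepare(
            version=version,
            call_monitor_timeout=CONF.rpc_response_timeout,
            timeout=CONF.long_rpc_timeout)
        return cctxt.call(ctxt, 'select_destinations', **msg_args)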

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1795992/+subscriptions

