yahoo-eng-team mailing list archive - Message #75874
[Bug 1795992] Re: retry_select_destinations decorator can make a mess with allocations in placement in a large multi-create request
Reviewed: https://review.openstack.org/607735
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=5af632e9cab670cc25c2f627bb0d4c0a02258277
Submitter: Zuul
Branch: master
commit 5af632e9cab670cc25c2f627bb0d4c0a02258277
Author: Matt Riedemann <mriedem.os@xxxxxxxxx>
Date: Wed Oct 3 18:57:45 2018 -0400
Use long_rpc_timeout in select_destinations RPC call
Conductor RPC calls the scheduler to get hosts during
server create, which, in a multi-create request with a
lot of servers and the default rpc_response_timeout, can
trigger a MessagingTimeout. Due to the old
retry_select_destinations decorator, conductor will retry
the select_destinations RPC call up to max_attempts times,
so thrice by default. This can clobber the scheduler and
placement while the initial scheduler worker is still
trying to process the beefy request and allocate resources
in placement.
This has been recreated in a devstack test patch [1] and
shown to fail with 1000 instances in a single request with
the default rpc_response_timeout of 60 seconds. Changing the
rpc_response_timeout to 300 avoids the MessagingTimeout and
retry loop.
Since Rocky we have the long_rpc_timeout config option which
defaults to 1800 seconds. The RPC client can thus be changed
to heartbeat the scheduler service during the RPC call every
$rpc_response_timeout seconds with a hard timeout of
$long_rpc_timeout. That change is made here.
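In code, the shape of that change (a simplified sketch of the
nova/scheduler/rpcapi.py call site using oslo.messaging's call
monitoring; the surrounding variable names are assumptions) is
roughly:

    # Heartbeat the scheduler every rpc_response_timeout seconds
    # while letting the call itself run for up to long_rpc_timeout
    # seconds.
    cctxt = self.client.prepare(
        version=version,
        call_monitor_timeout=CONF.rpc_response_timeout,
        timeout=CONF.long_rpc_timeout)
    return cctxt.call(ctxt, 'select_destinations', **msg_args)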
As a result, the problematic retry_select_destinations
decorator is also no longer necessary and removed here. That
decorator was added in I2b891bf6d0a3d8f45fd98ca54a665ae78eab78b3
as a hack for scheduler high availability: a MessagingTimeout
was assumed to mean the scheduler service had died, so retrying
the request was a reasonable way to reach another scheduler
worker. That is clearly not sufficient in the large multi-create
case, and long_rpc_timeout, which heartbeats the scheduler
service, is a better fit for that HA scenario.
[1] https://review.openstack.org/507918/
Change-Id: I87d89967bbc5fbf59cf44d9a63eb6e9d477ac1f3
Closes-Bug: #1795992
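For context, a retry-on-MessagingTimeout decorator of the kind being
removed behaves roughly like the following (an illustrative sketch,
not the exact nova implementation; the real decorator lives in
nova/scheduler/utils.py):

    import functools

    import oslo_messaging as messaging

    def retry_on_messaging_timeout(max_attempts=3):
        """Retry the decorated RPC call when it raises MessagingTimeout."""
        def outer(func):
            @functools.wraps(func)
            def wrapped(*args, **kwargs):
                for attempt in range(1, max_attempts + 1):
                    try:
                        return func(*args, **kwargs)
                    except messaging.MessagingTimeout:
                        if attempt == max_attempts:
                            raise
                        # The scheduler may still be processing the
                        # original request; retrying replays the same
                        # scheduling work and the same placement
                        # allocations.
            return wrapped
        return outer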
** Changed in: nova
Status: In Progress => Fix Released
--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1795992
Title:
retry_select_destinations decorator can make a mess with allocations
in placement in a large multi-create request
Status in OpenStack Compute (nova):
Fix Released
Status in OpenStack Compute (nova) rocky series:
Triaged
Bug description:
There is no rate-limiting on the min/max count parameters when
creating multiple servers in a single request beyond the tenant's
'instances' quota. However, some tenants could have a relatively
high instances quota (100 or more).
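For illustration, a multi-create request is an ordinary server create
with min_count/max_count set in the body; roughly the following, where
the endpoint, token, image and flavor values are all placeholders:

    import requests

    COMPUTE_ENDPOINT = 'http://controller/compute/v2.1'  # placeholder
    TOKEN = '...'        # placeholder keystone token
    IMAGE_UUID = '...'   # placeholder image UUID
    FLAVOR_ID = '...'    # placeholder flavor id

    # Ask nova to build 1000 servers in a single POST /servers request.
    body = {"server": {"name": "many",
                       "imageRef": IMAGE_UUID,
                       "flavorRef": FLAVOR_ID,
                       "min_count": 1000,
                       "max_count": 1000}}
    requests.post(COMPUTE_ENDPOINT + '/servers', json=body,
                  headers={'X-Auth-Token': TOKEN})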
Because of the retry_select_destinations decorator in the scheduler
RPC API client code:
https://github.com/openstack/nova/blob/cce3208cc28268e4b50e155c205bcab9f1da2a4b/nova/scheduler/utils.py#L921
With the default rpc_response_timeout value, it is relatively easy for
a select_destinations call covering a large number of instances in a
multi-create request to fail with a MessagingTimeout and be retried.
The retry makes the scheduler process the same set of instances again,
potentially creating duplicate allocations in placement, and the
scheduler subsequently fails to clean up allocations because the
consumer generation has changed, as seen in the logs from this
recreate patch:
https://review.openstack.org/#/c/507918/
http://logs.openstack.org/18/507918/8/check/tempest-full/a9f3849/controller/logs/screen-n-sch.txt.gz?level=TRACE#_Oct_02_23_29_12_475481
Oct 02 23:29:12.377430 ubuntu-xenial-limestone-regionone-0002536892 nova-scheduler[22653]: WARNING nova.scheduler.client.report [None req-f4fe43ea-d117-4b7d-a3a4-23dcb59f3058 admin admin] Unable to submit allocation for instance 63ae7544-7693-4749-886b-024dc93f09f9 (409 {"errors": [{"status": 409, "request_id": "req-0e85117c-871c-46c2-9e01-53c84e811b44", "code": "placement.undefined_code", "detail": "There was a conflict when trying to complete your request.\n\n Unable to allocate inventory: Unable to create allocation for 'MEMORY_MB' on resource provider 'b7709a93-f14c-42ed-addf-9736fb721728'. The requested amount would exceed the capacity. ", "title": "Conflict"}]})
Oct 02 23:29:12.472553 ubuntu-xenial-limestone-regionone-0002536892 nova-scheduler[22653]: WARNING nova.scheduler.client.report [None req-f4fe43ea-d117-4b7d-a3a4-23dcb59f3058 admin admin] Unable to delete allocation for instance 6962f92b-7dca-4912-aeb2-dcae03c4b52e: (409 {"errors": [{"status": 409, "request_id": "req-13df41fe-cb55-49f1-a998-09b34e48f05b", "code": "placement.concurrent_update", "detail": "There was a conflict when trying to complete your request.\n\n consumer generation conflict - expected null but got 3 ", "title": "Conflict"}]})
Oct 02 23:29:12.475481 ubuntu-xenial-limestone-regionone-0002536892 nova-scheduler[22653]: ERROR oslo_messaging.rpc.server [None req-f4fe43ea-d117-4b7d-a3a4-23dcb59f3058 admin admin] Exception during message handling: AllocationDeleteFailed: Failed to delete allocations for consumer 6962f92b-7dca-4912-aeb2-dcae03c4b52e. Error: {"errors": [{"status": 409, "request_id": "req-13df41fe-cb55-49f1-a998-09b34e48f05b", "code": "placement.concurrent_update", "detail": "There was a conflict when trying to complete your request.\n\n consumer generation conflict - expected null but got 3 ", "title": "Conflict"}]}
Oct 02 23:29:12.475908 ubuntu-xenial-limestone-regionone-0002536892 nova-scheduler[22653]: ERROR oslo_messaging.rpc.server Traceback (most recent call last):
Oct 02 23:29:12.476075 ubuntu-xenial-limestone-regionone-0002536892 nova-scheduler[22653]: ERROR oslo_messaging.rpc.server File "/usr/local/lib/python2.7/dist-packages/oslo_messaging/rpc/server.py", line 163, in _process_incoming
Oct 02 23:29:12.476251 ubuntu-xenial-limestone-regionone-0002536892 nova-scheduler[22653]: ERROR oslo_messaging.rpc.server res = self.dispatcher.dispatch(message)
Oct 02 23:29:12.476424 ubuntu-xenial-limestone-regionone-0002536892 nova-scheduler[22653]: ERROR oslo_messaging.rpc.server File "/usr/local/lib/python2.7/dist-packages/oslo_messaging/rpc/dispatcher.py", line 265, in dispatch
Oct 02 23:29:12.476617 ubuntu-xenial-limestone-regionone-0002536892 nova-scheduler[22653]: ERROR oslo_messaging.rpc.server return self._do_dispatch(endpoint, method, ctxt, args)
Oct 02 23:29:12.476803 ubuntu-xenial-limestone-regionone-0002536892 nova-scheduler[22653]: ERROR oslo_messaging.rpc.server File "/usr/local/lib/python2.7/dist-packages/oslo_messaging/rpc/dispatcher.py", line 194, in _do_dispatch
Oct 02 23:29:12.476969 ubuntu-xenial-limestone-regionone-0002536892 nova-scheduler[22653]: ERROR oslo_messaging.rpc.server result = func(ctxt, **new_args)
Oct 02 23:29:12.477123 ubuntu-xenial-limestone-regionone-0002536892 nova-scheduler[22653]: ERROR oslo_messaging.rpc.server File "/usr/local/lib/python2.7/dist-packages/oslo_messaging/rpc/server.py", line 226, in inner
Oct 02 23:29:12.477277 ubuntu-xenial-limestone-regionone-0002536892 nova-scheduler[22653]: ERROR oslo_messaging.rpc.server return func(*args, **kwargs)
Oct 02 23:29:12.477439 ubuntu-xenial-limestone-regionone-0002536892 nova-scheduler[22653]: ERROR oslo_messaging.rpc.server File "/opt/stack/nova/nova/scheduler/manager.py", line 169, in select_destinations
Oct 02 23:29:12.477624 ubuntu-xenial-limestone-regionone-0002536892 nova-scheduler[22653]: ERROR oslo_messaging.rpc.server allocation_request_version, return_alternates)
Oct 02 23:29:12.477806 ubuntu-xenial-limestone-regionone-0002536892 nova-scheduler[22653]: ERROR oslo_messaging.rpc.server File "/opt/stack/nova/nova/scheduler/filter_scheduler.py", line 91, in select_destinations
Oct 02 23:29:12.477971 ubuntu-xenial-limestone-regionone-0002536892 nova-scheduler[22653]: ERROR oslo_messaging.rpc.server allocation_request_version, return_alternates)
Oct 02 23:29:12.478124 ubuntu-xenial-limestone-regionone-0002536892 nova-scheduler[22653]: ERROR oslo_messaging.rpc.server File "/opt/stack/nova/nova/scheduler/filter_scheduler.py", line 256, in _schedule
Oct 02 23:29:12.478268 ubuntu-xenial-limestone-regionone-0002536892 nova-scheduler[22653]: ERROR oslo_messaging.rpc.server claimed_instance_uuids)
Oct 02 23:29:12.478412 ubuntu-xenial-limestone-regionone-0002536892 nova-scheduler[22653]: ERROR oslo_messaging.rpc.server File "/opt/stack/nova/nova/scheduler/filter_scheduler.py", line 276, in _ensure_sufficient_hosts
Oct 02 23:29:12.478634 ubuntu-xenial-limestone-regionone-0002536892 nova-scheduler[22653]: ERROR oslo_messaging.rpc.server self._cleanup_allocations(context, claimed_uuids)
Oct 02 23:29:12.478808 ubuntu-xenial-limestone-regionone-0002536892 nova-scheduler[22653]: ERROR oslo_messaging.rpc.server File "/opt/stack/nova/nova/scheduler/filter_scheduler.py", line 301, in _cleanup_allocations
Oct 02 23:29:12.478965 ubuntu-xenial-limestone-regionone-0002536892 nova-scheduler[22653]: ERROR oslo_messaging.rpc.server self.placement_client.delete_allocation_for_instance(context, uuid)
Oct 02 23:29:12.479121 ubuntu-xenial-limestone-regionone-0002536892 nova-scheduler[22653]: ERROR oslo_messaging.rpc.server File "/opt/stack/nova/nova/scheduler/client/__init__.py", line 37, in __run_method
Oct 02 23:29:12.479463 ubuntu-xenial-limestone-regionone-0002536892 nova-scheduler[22653]: ERROR oslo_messaging.rpc.server return getattr(self.instance, __name)(*args, **kwargs)
Oct 02 23:29:12.479665 ubuntu-xenial-limestone-regionone-0002536892 nova-scheduler[22653]: ERROR oslo_messaging.rpc.server File "/opt/stack/nova/nova/scheduler/client/report.py", line 78, in wrapper
Oct 02 23:29:12.479849 ubuntu-xenial-limestone-regionone-0002536892 nova-scheduler[22653]: ERROR oslo_messaging.rpc.server return f(self, *a, **k)
Oct 02 23:29:12.480013 ubuntu-xenial-limestone-regionone-0002536892 nova-scheduler[22653]: ERROR oslo_messaging.rpc.server File "/opt/stack/nova/nova/scheduler/client/report.py", line 2126, in delete_allocation_for_instance
Oct 02 23:29:12.480168 ubuntu-xenial-limestone-regionone-0002536892 nova-scheduler[22653]: ERROR oslo_messaging.rpc.server error=r.text)
Oct 02 23:29:12.480343 ubuntu-xenial-limestone-regionone-0002536892 nova-scheduler[22653]: ERROR oslo_messaging.rpc.server AllocationDeleteFailed: Failed to delete allocations for consumer 6962f92b-7dca-4912-aeb2-dcae03c4b52e. Error: {"errors": [{"status": 409, "request_id": "req-13df41fe-cb55-49f1-a998-09b34e48f05b", "code": "placement.concurrent_update", "detail": "There was a conflict when trying to complete your request.\n\n consumer generation conflict - expected null but got 3 ", "title": "Conflict"}]}
Oct 02 23:29:12.480508 ubuntu-xenial-limestone-regionone-0002536892 nova-scheduler[22653]: ERROR oslo_messaging.rpc.server
http://logs.openstack.org/18/507918/8/check/tempest-full/a9f3849/controller/logs/screen-n-super-cond.txt.gz?level=TRACE#_Oct_02_23_27_15_133261
Oct 02 23:27:15.133261 ubuntu-xenial-limestone-regionone-0002536892 nova-conductor[24583]: WARNING nova.scheduler.utils [None req-f4fe43ea-d117-4b7d-a3a4-23dcb59f3058 admin admin] Retrying select_destinations after a MessagingTimeout, attempt 1 of 2.: MessagingTimeout: Timed out waiting for a reply to message ID 8195ae78cdce460f8ecf9a8d0c616709
Oct 02 23:28:15.142976 ubuntu-xenial-limestone-regionone-0002536892 nova-conductor[24583]: WARNING nova.scheduler.utils [None req-f4fe43ea-d117-4b7d-a3a4-23dcb59f3058 admin admin] Retrying select_destinations after a MessagingTimeout, attempt 2 of 2.: MessagingTimeout: Timed out waiting for a reply to message ID 3aa6c2d17721416aaf7224af9198d3ce
Oct 02 23:29:15.155442 ubuntu-xenial-limestone-regionone-0002536892 nova-conductor[24583]: ERROR nova.conductor.manager [None req-f4fe43ea-d117-4b7d-a3a4-23dcb59f3058 admin admin] Failed to schedule instances: MessagingTimeout: Timed out waiting for a reply to message ID 832f7daa5e764687921a00ebedaa5648
Oct 02 23:29:15.155708 ubuntu-xenial-limestone-regionone-0002536892 nova-conductor[24583]: ERROR nova.conductor.manager Traceback (most recent call last):
Oct 02 23:29:15.155868 ubuntu-xenial-limestone-regionone-0002536892 nova-conductor[24583]: ERROR nova.conductor.manager File "/opt/stack/nova/nova/conductor/manager.py", line 1210, in schedule_and_build_instances
Oct 02 23:29:15.156075 ubuntu-xenial-limestone-regionone-0002536892 nova-conductor[24583]: ERROR nova.conductor.manager instance_uuids, return_alternates=True)
Oct 02 23:29:15.156281 ubuntu-xenial-limestone-regionone-0002536892 nova-conductor[24583]: ERROR nova.conductor.manager File "/opt/stack/nova/nova/conductor/manager.py", line 725, in _schedule_instances
Oct 02 23:29:15.156432 ubuntu-xenial-limestone-regionone-0002536892 nova-conductor[24583]: ERROR nova.conductor.manager return_alternates=return_alternates)
Oct 02 23:29:15.156601 ubuntu-xenial-limestone-regionone-0002536892 nova-conductor[24583]: ERROR nova.conductor.manager File "/opt/stack/nova/nova/scheduler/utils.py", line 907, in wrapped
Oct 02 23:29:15.156813 ubuntu-xenial-limestone-regionone-0002536892 nova-conductor[24583]: ERROR nova.conductor.manager return func(*args, **kwargs)
Oct 02 23:29:15.156975 ubuntu-xenial-limestone-regionone-0002536892 nova-conductor[24583]: ERROR nova.conductor.manager File "/opt/stack/nova/nova/scheduler/client/__init__.py", line 53, in select_destinations
Oct 02 23:29:15.157136 ubuntu-xenial-limestone-regionone-0002536892 nova-conductor[24583]: ERROR nova.conductor.manager instance_uuids, return_objects, return_alternates)
Oct 02 23:29:15.157283 ubuntu-xenial-limestone-regionone-0002536892 nova-conductor[24583]: ERROR nova.conductor.manager File "/opt/stack/nova/nova/scheduler/client/__init__.py", line 37, in __run_method
Oct 02 23:29:15.157449 ubuntu-xenial-limestone-regionone-0002536892 nova-conductor[24583]: ERROR nova.conductor.manager return getattr(self.instance, __name)(*args, **kwargs)
Oct 02 23:29:15.157603 ubuntu-xenial-limestone-regionone-0002536892 nova-conductor[24583]: ERROR nova.conductor.manager File "/opt/stack/nova/nova/scheduler/client/query.py", line 42, in select_destinations
Oct 02 23:29:15.157757 ubuntu-xenial-limestone-regionone-0002536892 nova-conductor[24583]: ERROR nova.conductor.manager instance_uuids, return_objects, return_alternates)
Oct 02 23:29:15.157904 ubuntu-xenial-limestone-regionone-0002536892 nova-conductor[24583]: ERROR nova.conductor.manager File "/opt/stack/nova/nova/scheduler/rpcapi.py", line 158, in select_destinations
Oct 02 23:29:15.158137 ubuntu-xenial-limestone-regionone-0002536892 nova-conductor[24583]: ERROR nova.conductor.manager return cctxt.call(ctxt, 'select_destinations', **msg_args)
Oct 02 23:29:15.158304 ubuntu-xenial-limestone-regionone-0002536892 nova-conductor[24583]: ERROR nova.conductor.manager File "/usr/local/lib/python2.7/dist-packages/oslo_messaging/rpc/client.py", line 179, in call
Oct 02 23:29:15.158515 ubuntu-xenial-limestone-regionone-0002536892 nova-conductor[24583]: ERROR nova.conductor.manager retry=self.retry)
Oct 02 23:29:15.158723 ubuntu-xenial-limestone-regionone-0002536892 nova-conductor[24583]: ERROR nova.conductor.manager File "/usr/local/lib/python2.7/dist-packages/oslo_messaging/transport.py", line 133, in _send
Oct 02 23:29:15.158910 ubuntu-xenial-limestone-regionone-0002536892 nova-conductor[24583]: ERROR nova.conductor.manager retry=retry)
Oct 02 23:29:15.159059 ubuntu-xenial-limestone-regionone-0002536892 nova-conductor[24583]: ERROR nova.conductor.manager File "/usr/local/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py", line 584, in send
Oct 02 23:29:15.160294 ubuntu-xenial-limestone-regionone-0002536892 nova-conductor[24583]: ERROR nova.conductor.manager call_monitor_timeout, retry=retry)
Oct 02 23:29:15.160454 ubuntu-xenial-limestone-regionone-0002536892 nova-conductor[24583]: ERROR nova.conductor.manager File "/usr/local/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py", line 573, in _send
Oct 02 23:29:15.160611 ubuntu-xenial-limestone-regionone-0002536892 nova-conductor[24583]: ERROR nova.conductor.manager call_monitor_timeout)
Oct 02 23:29:15.160770 ubuntu-xenial-limestone-regionone-0002536892 nova-conductor[24583]: ERROR nova.conductor.manager File "/usr/local/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py", line 459, in wait
Oct 02 23:29:15.160924 ubuntu-xenial-limestone-regionone-0002536892 nova-conductor[24583]: ERROR nova.conductor.manager message = self.waiters.get(msg_id, timeout=timeout)
Oct 02 23:29:15.161099 ubuntu-xenial-limestone-regionone-0002536892 nova-conductor[24583]: ERROR nova.conductor.manager File "/usr/local/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py", line 336, in get
Oct 02 23:29:15.161253 ubuntu-xenial-limestone-regionone-0002536892 nova-conductor[24583]: ERROR nova.conductor.manager 'to message ID %s' % msg_id)
Oct 02 23:29:15.161412 ubuntu-xenial-limestone-regionone-0002536892 nova-conductor[24583]: ERROR nova.conductor.manager MessagingTimeout: Timed out waiting for a reply to message ID 832f7daa5e764687921a00ebedaa5648
Oct 02 23:29:15.161556 ubuntu-xenial-limestone-regionone-0002536892 nova-conductor[24583]: ERROR nova.conductor.manager
Since Rocky we have had the long_rpc_timeout config option, which
should be used for the select_destinations RPC call instead of the
retry_select_destinations decorator.
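In nova.conf terms that looks something like the following (the values
shown are the defaults mentioned above; rpc_response_timeout then acts
as the heartbeat interval for calls that use long_rpc_timeout):

    [DEFAULT]
    # Per-reply timeout; becomes the heartbeat interval for RPC calls
    # that opt into the long timeout.
    rpc_response_timeout = 60
    # Hard upper bound for long-running RPC calls such as
    # select_destinations (available since Rocky).
    long_rpc_timeout = 1800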
To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1795992/+subscriptions