yahoo-eng-team team mailing list archive

[Bug 1770220] Re: report client allocation retry handling insufficient

To: yahoo-eng-team@xxxxxxxxxxxxxxxxxxx
From: OpenStack Infra <1770220@xxxxxxxxxxxxxxxxxx>
Date: Sat, 12 May 2018 12:12:35 -0000
Reply-to: Bug 1770220 <1770220@xxxxxxxxxxxxxxxxxx>
Sender: bounces@xxxxxxxxxxxxx

Reviewed:  https://review.openstack.org/567506
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=20b3a04f1373633444c1064acca232f9f59cc133
Submitter: Zuul
Branch:    master

commit 20b3a04f1373633444c1064acca232f9f59cc133
Author: Radoslav Gerganov <rgerganov@xxxxxxxxxx>
Date:   Thu May 10 11:20:01 2018 +0300

    Add random sleep between retry calls to placement
    
    Attempt to reduce the resource provider contention by adding a random
    sleep between retry calls to placement.
    
    Change-Id: I4b7217f652dc2f5ff59f9d6a0178fa8f9325f706
    Closes-Bug: #1770220


** Changed in: nova
       Status: In Progress => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1770220

Title:
  report client allocation retry handling insufficient

Status in OpenStack Compute (nova):
  Fix Released

Bug description:
  In stress testing of a nova+placement scenario where there is only one
  nova-compute process (and thus only one resource provider) but more
  than one nova-scheduler thread, it is fairly easy to trigger the
  "Failed scheduler client operation claim_resources: out of retries:
  Retry" error found near
  https://github.com/openstack/nova/blob/master/nova/scheduler/client/report.py#L110

  (In a quick test on a devstack with a fake compute driver, out of 100
  separate requests to boot one server, 13 failed for this reason.)

  If we imagine 4 threads:

  * A is one nova-scheduler
  * B is one placement request/response
  * C is another nova-scheduler
  * D is a different placement request/response

  A starts a PUT to /allocations, request B, at the start of which it
  reads the resource provider and gets a generation, and then, for
  whatever reason, waits for a while. Then C starts a PUT to
  /allocations, request D, which reads the same resource provider and
  the same generation but actually completes, getting to increment the
  generation before B.

  When B gets to increment generation, it fails because now the
  generation it has is no good for the increment procedure.
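
  The race described above can be reproduced in miniature. The following
  is only a sketch under stated assumptions: FakeProvider,
  GenerationConflict and put_allocation are hypothetical names invented
  for illustration and are not placement's actual code. It shows just
  the compare-and-increment idea, where a write succeeds only if the
  generation the writer read is still the current one.

```python
import threading


class GenerationConflict(Exception):
    """Raised when the caller's generation no longer matches the stored one."""


class FakeProvider:
    """Hypothetical in-memory resource provider with a generation counter."""

    def __init__(self):
        self.generation = 0
        self.allocations = []
        self._lock = threading.Lock()

    def read_generation(self):
        return self.generation

    def put_allocation(self, consumer, seen_generation):
        # Compare-and-increment: succeed only if nobody wrote since we read.
        with self._lock:
            if seen_generation != self.generation:
                raise GenerationConflict(consumer)
            self.allocations.append(consumer)
            self.generation += 1


provider = FakeProvider()
gen_b = provider.read_generation()  # request B reads the generation, then stalls
gen_d = provider.read_generation()  # request D reads the same generation
provider.put_allocation("server-from-C", gen_d)  # D completes first

try:
    provider.put_allocation("server-from-A", gen_b)  # B's generation is stale
    b_conflicted = False
except GenerationConflict:
    b_conflicted = True
```

  In this toy model request B always loses once D has written, which is
  the situation in which the scheduler client has to retry.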

  This is all working as expected but apparently is not ideal for high
  concurrency with low numbers of compute nodes.

  The current retry loop has no sleep() and only counts up to 3
  retries. It might make sense for it to do a random sleep before
  retrying (so as to introduce a bit of jitter in the system), and
  perhaps retry more times.
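
  As a rough illustration of that suggestion, a retry loop with a random
  sleep might look like the sketch below. This is not nova's
  implementation: the Retry exception here, the retries decorator, and
  the max_attempts and max_sleep values are all assumptions chosen for
  the example. The sleep function is injectable only so the example can
  run without actually waiting.

```python
import functools
import random
import time


class Retry(Exception):
    """Hypothetical signal that the wrapped operation should be retried."""


def retries(max_attempts=4, max_sleep=0.1, sleep=time.sleep):
    """Retry the wrapped function on Retry, sleeping a random time between tries."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except Retry:
                    if attempt == max_attempts:
                        raise  # out of retries, let the caller see the failure
                    # Random jitter so competing callers are less likely to
                    # collide again on the very next attempt.
                    sleep(random.uniform(0, max_sleep))

        return wrapper

    return decorator


calls = []
sleeps = []


@retries(max_attempts=4, max_sleep=0.1, sleep=sleeps.append)
def claim():
    # Simulated claim that hits a generation conflict on its first two tries.
    calls.append(1)
    if len(calls) < 3:
        raise Retry()
    return True


result = claim()
```

  Here the simulated claim fails twice and succeeds on the third
  attempt, with one randomized delay recorded before each retry.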

  Input desired. Thoughts?

  Another option, of course, is "don't run with so few compute nodes",
  but we can likely expect this kind of stress testing. It was a
  real-life stress test, one that worked fine in older
  (pre-claims-in-the-scheduler) versions, that exposed this, so we may
  wish to make it happier.

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1770220/+subscriptions

References

[Bug 1770220] [NEW] report client allocation retry handling insufficient
From: Chris Dent, 2018-05-09