yahoo-eng-team team mailing list archive

[Bug 1542491] Re: Scheduler update_aggregates race causes incorrect aggregate information

 

Setting this to medium severity.

There is an existing race in how the cache is updated.
The workaround is to periodically restart the scheduler to clear the cache.

This looks like it affects all stable releases of OpenStack.
However, it is unlikely, though not impossible, that a fix for this can be backported.

Given the above, I'm marking this as medium: there is a relatively simple workaround, even if detecting the
issue is not trivial.


** Changed in: nova
   Importance: Undecided => Medium

** Changed in: nova
       Status: Opinion => Triaged

** Changed in: nova
     Assignee: jingtao (liang888) => (unassigned)

** Tags added: api

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1542491

Title:
  Scheduler update_aggregates race causes incorrect aggregate
  information

Status in OpenStack Compute (nova):
  Triaged
Status in Ubuntu:
  Invalid

Bug description:
  It appears that if nova-api receives simultaneous requests to add a
  server to a host aggregate, then a race occurs that can lead to nova-
  scheduler having incorrect aggregate information in memory.

  One observed effect of this is that nova-scheduler will sometimes
  think fewer hosts are members of the aggregate than the nova database
  records, and will filter out a host that should not be filtered.

  Restarting nova-scheduler fixes the issue, as it reloads the aggregate
  information on startup.

  Nova package versions: 1:2015.1.2-0ubuntu2~cloud0

  Reproduce steps:

  Create a new os-aggregate and then populate an os-aggregate with
  simultaneous API POSTs, note timestamps:

  2016-02-04 20:17:08.538 13648 INFO nova.osapi_compute.wsgi.server [req-d07a006e-134a-46d8-9815-6becec5b185c 41812fc01c6549ac8ed15c6dab05c670 326d453c2bd440b4a7160489b632d0a8 - - -] 10.120.13.3 "POST /v2.1/326d453c2bd440b4a7160489b632d0a8/os-aggregates HTTP/1.1" status: 200 len: 439 time: 0.1865470
  2016-02-04 20:17:09.204 13648 INFO nova.osapi_compute.wsgi.server [req-a0402297-9337-46d6-96d2-066e230e45e1 41812fc01c6549ac8ed15c6dab05c670 326d453c2bd440b4a7160489b632d0a8 - - -] 10.120.13.2 "POST /v2.1/326d453c2bd440b4a7160489b632d0a8/os-aggregates/1/action HTTP/1.1" status: 200 len: 506 time: 0.2995598
  2016-02-04 20:17:09.243 13648 INFO nova.osapi_compute.wsgi.server [req-0f543525-c34e-418a-91a9-894d714ee95b 41812fc01c6549ac8ed15c6dab05c670 326d453c2bd440b4a7160489b632d0a8 - - -] 10.120.13.2 "POST /v2.1/326d453c2bd440b4a7160489b632d0a8/os-aggregates/1/action HTTP/1.1" status: 200 len: 519 time: 0.3140590
  2016-02-04 20:17:09.273 13649 INFO nova.osapi_compute.wsgi.server [req-2f8d80b0-726f-4126-a8ab-a2eae3f1a385 41812fc01c6549ac8ed15c6dab05c670 326d453c2bd440b4a7160489b632d0a8 - - -] 10.120.13.2 "POST /v2.1/326d453c2bd440b4a7160489b632d0a8/os-aggregates/1/action HTTP/1.1" status: 200 len: 506 time: 0.3759601
  2016-02-04 20:17:09.275 13649 INFO nova.osapi_compute.wsgi.server [req-80ab6c86-e521-4bf0-ab67-4de9d0eccdd3 41812fc01c6549ac8ed15c6dab05c670 326d453c2bd440b4a7160489b632d0a8 - - -] 10.120.13.1 "POST /v2.1/326d453c2bd440b4a7160489b632d0a8/os-aggregates/1/action HTTP/1.1" status: 200 len: 506 time: 0.3433032

  Schedule a VM
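
  The simultaneous POSTs in the reproduce steps could be issued with
  something like the sketch below. It is only an illustration: the
  helper names, the base URL, and the token handling are assumptions
  and are not taken from the original report. The request path follows
  the os-aggregates action URL seen in the logs above, and the body uses
  the add_host action.

```python
import json
import threading
import urllib.request


def build_add_host_request(base_url, aggregate_id, host, token):
    """Build the POST that asks nova-api to add one host to an aggregate.

    base_url is assumed to already include the API version and project,
    e.g. "http://<api-host>:8774/v2.1/<project-id>" (hypothetical).
    """
    body = json.dumps({"add_host": {"host": host}}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/os-aggregates/{aggregate_id}/action",
        data=body,
        headers={"Content-Type": "application/json", "X-Auth-Token": token},
        method="POST",
    )


def add_hosts_concurrently(base_url, aggregate_id, hosts, token,
                           send=urllib.request.urlopen):
    """Send one add_host request per host, all released at the same moment.

    `send` is injectable so the function can be exercised without a
    real API endpoint.
    """
    if not hosts:
        return
    barrier = threading.Barrier(len(hosts))

    def worker(host):
        request = build_add_host_request(base_url, aggregate_id, host, token)
        barrier.wait()  # hold every thread until all requests are built
        send(request)

    threads = [threading.Thread(target=worker, args=(h,)) for h in hosts]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
```

  Calling add_hosts_concurrently() against a test deployment with the
  hosts to be added should produce closely spaced action requests
  similar to the ones logged above. Whether the race is hit will vary
  from run to run.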

  Expected Result:
  nova-scheduler Availability Zone filter returns all members of the aggregate

  Actual Result:
  nova-scheduler believes there is only one hypervisor in the aggregate. The number will vary as it is a race:

  2016-02-05 07:48:04.411 13600 DEBUG nova.filters [req-c24338b5-a3b8-4864-8140-04ea6fbcf68f 41812fc01c6549ac8ed15c6dab05c670 326d453c2bd440b4a7160489b632d0a8 - - -] Starting with 4 host(s) get_filtered_objects /usr/lib/python2.7/dist-packages/nova/filters.py:70
  2016-02-05 07:48:04.411 13600 DEBUG nova.filters [req-c24338b5-a3b8-4864-8140-04ea6fbcf68f 41812fc01c6549ac8ed15c6dab05c670 326d453c2bd440b4a7160489b632d0a8 - - -] Filter RetryFilter returned 4 host(s) get_filtered_objects /usr/lib/python2.7/dist-packages/nova/filters.py:84
  2016-02-05 07:48:04.412 13600 DEBUG nova.scheduler.filters.availability_zone_filter [req-c24338b5-a3b8-4864-8140-04ea6fbcf68f 41812fc01c6549ac8ed15c6dab05c670 326d453c2bd440b4a7160489b632d0a8 - - -] Availability Zone 'temp' requested. (oshv0, oshv0) ram:122691 disk:13404160 io_ops:0 instances:0 has AZs: nova host_passes /usr/lib/python2.7/dist-packages/nova/scheduler/filters/availability_zone_filter.py:62
  2016-02-05 07:48:04.412 13600 DEBUG nova.scheduler.filters.availability_zone_filter [req-c24338b5-a3b8-4864-8140-04ea6fbcf68f 41812fc01c6549ac8ed15c6dab05c670 326d453c2bd440b4a7160489b632d0a8 - - -] Availability Zone 'temp' requested. (oshv2, oshv2) ram:122691 disk:13403136 io_ops:0 instances:0 has AZs: nova host_passes /usr/lib/python2.7/dist-packages/nova/scheduler/filters/availability_zone_filter.py:62
  2016-02-05 07:48:04.413 13600 DEBUG nova.scheduler.filters.availability_zone_filter [req-c24338b5-a3b8-4864-8140-04ea6fbcf68f 41812fc01c6549ac8ed15c6dab05c670 326d453c2bd440b4a7160489b632d0a8 - - -] Availability Zone 'temp' requested. (oshv1, oshv1) ram:122691 disk:13404160 io_ops:0 instances:0 has AZs: nova host_passes /usr/lib/python2.7/dist-packages/nova/scheduler/filters/availability_zone_filter.py:62
  2016-02-05 07:48:04.413 13600 DEBUG nova.filters [req-c24338b5-a3b8-4864-8140-04ea6fbcf68f 41812fc01c6549ac8ed15c6dab05c670 326d453c2bd440b4a7160489b632d0a8 - - -] Filter AvailabilityZoneFilter returned 1 host(s) get_filtered_objects /usr/lib/python2.7/dist-packages/nova/filters.py:84

  Nova API calls show the correct number of members.

  
  I suspect that it is caused by the simultaneous processing or out-of-order receipt of update_aggregates RPC calls.
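
The suspected out-of-order receipt can be illustrated with a toy model. This is a minimal sketch and not nova's code: it assumes each update message carries a full snapshot of the aggregate's host list and that the receiver simply keeps the last snapshot it is given. The class, function, and host names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class SchedulerCache:
    """Toy stand-in for the scheduler's in-memory aggregate view."""
    hosts_by_aggregate: dict = field(default_factory=dict)

    def update_aggregate(self, aggregate_id, hosts_snapshot):
        # Last writer wins: there is no version or timestamp check,
        # so a stale snapshot can overwrite a newer one.
        self.hosts_by_aggregate[aggregate_id] = set(hosts_snapshot)


def snapshots_after_each_add(hosts):
    """Return the snapshot that would be sent after each successive add."""
    return [hosts[: i + 1] for i in range(len(hosts))]


hosts = ["hv0", "hv1", "hv2", "hv3"]  # hypothetical host names
updates = snapshots_after_each_add(hosts)

# Messages delivered in the order they were produced.
in_order = SchedulerCache()
for snapshot in updates:
    in_order.update_aggregate(1, snapshot)

# Same messages, but the earliest snapshot is delivered last.
out_of_order = SchedulerCache()
for snapshot in updates[1:] + updates[:1]:
    out_of_order.update_aggregate(1, snapshot)

print(sorted(in_order.hosts_by_aggregate[1]))
print(sorted(out_of_order.hosts_by_aggregate[1]))
```

In this model the in-order cache ends with all four hosts, while the out-of-order cache ends with a single host even though every message was delivered. That is consistent with the AvailabilityZoneFilter result above, and with the database (the source of the snapshots) still being correct.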

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1542491/+subscriptions


