yahoo-eng-team team mailing list archive

[Bug 1798806] [NEW] Race condition between RT and scheduler

 

Public bug reported:

The HostState object used by the scheduler derives its own values from
the 'stats' property of the compute node, e.g.:

    self.stats = compute.stats or {}
    self.num_instances = int(self.stats.get('num_instances', 0))
    self.num_io_ops = int(self.stats.get('io_workload', 0))
    self.failed_builds = int(self.stats.get('failed_builds', 0))

These values are used for both filtering and weighing compute hosts.
However, the 'stats' property of the compute node is cleared and then
populated again during the periodic update_available_resources(). The
clearing occurs in RT._copy_resources(), which preserves only the old
value of 'failed_builds'. This creates a race condition between the RT
and the scheduler: if the scheduler reads 'stats' between the clear and
the repopulation, the HostState object may be populated with wrong
values for 'num_io_ops' and 'num_instances', leading to incorrect
scheduling decisions.
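
The window described above can be illustrated with a minimal,
single-threaded sketch. It is not nova code: FakeComputeNode,
FakeHostState and the two helper functions are hypothetical stand-ins,
and the way the intermediate state reaches the scheduler in a real
deployment is not modeled. It only shows what a reader sees if it looks
at 'stats' after the clear and before the repopulation.

```python
# Illustrative stand-ins only; none of these are nova's real classes.

class FakeComputeNode:
    """Hypothetical compute node record carrying a 'stats' dict."""

    def __init__(self, stats):
        self.stats = dict(stats)


class FakeHostState:
    """Derives scheduler values from 'stats', as in the snippet above."""

    def __init__(self, compute):
        self.stats = compute.stats or {}
        self.num_instances = int(self.stats.get('num_instances', 0))
        self.num_io_ops = int(self.stats.get('io_workload', 0))
        self.failed_builds = int(self.stats.get('failed_builds', 0))


def clear_stats_keeping_failed_builds(compute):
    """Assumed first step of the periodic update: drop everything
    except the old 'failed_builds' value."""
    compute.stats = {'failed_builds': compute.stats.get('failed_builds', 0)}


def repopulate_stats(compute, num_instances, io_workload):
    """Assumed second step: write the recomputed values back."""
    compute.stats['num_instances'] = num_instances
    compute.stats['io_workload'] = io_workload


node = FakeComputeNode(
    {'num_instances': 5, 'io_workload': 2, 'failed_builds': 1})

before = FakeHostState(node)             # consistent view
clear_stats_keeping_failed_builds(node)
during = FakeHostState(node)             # read inside the race window
repopulate_stats(node, num_instances=5, io_workload=2)
after = FakeHostState(node)              # consistent view again

for label, hs in (('before', before), ('during', during), ('after', after)):
    print(label, hs.num_instances, hs.num_io_ops, hs.failed_builds)
```

In this sketch the "during" view reports zero instances and zero I/O
operations while 'failed_builds' stays intact, so a host that is
actually busy would look idle to filters and weighers that rely on those
two values.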

** Affects: nova
     Importance: High
     Assignee: Radoslav Gerganov (rgerganov)
         Status: In Progress


** Tags: scheduler

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1798806

Title:
  Race condition between RT and scheduler

Status in OpenStack Compute (nova):
  In Progress

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1798806/+subscriptions

