yahoo-eng-team team mailing list archive

[Bug 1490837] Re: Sporadic incoherent metrics when driver.get_host_cpu_stats takes longer than 1 second to execute

 

** Changed in: nova
       Status: Fix Committed => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1490837

Title:
  Sporadic incoherent metrics when driver.get_host_cpu_stats takes
  longer than 1 second to execute

Status in OpenStack Compute (nova):
  Fix Released

Bug description:
  When using the libvirt CPU monitor (i.e., virt_driver) for metrics
  collection, I sporadically noticed cases where the values for
  cpu.user.percent + cpu.kernel.percent + cpu.idle.percent didn't sum
  to 100, as they should.  This wasn't happening very often, so it
  was quite difficult to track down, but after adding several debug
  logs I was eventually able to find the cause.

  If you look at this code:
  https://github.com/openstack/nova/blob/master/nova/compute/monitors/cpu/virt_driver.py#L52

  ... you'll notice a built-in assumption: within a given "round" of
  metrics gathering, the collective time to call
  metric_driver.get_metric(metric_name) (keep in mind there are 10
  metrics right now) won't exceed 1 second (if it does, the call is
  treated as the "next" round of metric collection). So for the first
  metric in a round we refresh the host CPU stats, and the subsequent
  n-1 calls simply use the cache, thus yielding a coherent answer
  (i.e., the percentages all sum to 100% as you'd expect).
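
  To make the assumption concrete, here is a minimal, self-contained
  sketch of a stats cache that stamps its timestamp before calling the
  driver. It is not the nova code: FakeClock, FakeDriver,
  StampBeforeMonitor and run_round are hypothetical names, the clock is
  simulated so the result is deterministic, only three metrics are
  modelled, and the freshness check is simplified to a float comparison.

```python
class FakeClock:
    """Manually advanced clock (hypothetical) so the run is deterministic."""

    def __init__(self):
        self.t = 0.0

    def now(self):
        return self.t

    def advance(self, seconds):
        self.t += seconds


class FakeDriver:
    """Hypothetical stand-in for the virt driver; not a real nova class."""

    def __init__(self, clock, latency):
        self.clock = clock
        self.latency = latency  # simulated seconds spent per stats call
        self.calls = 0

    def get_host_cpu_stats(self):
        self.calls += 1
        self.clock.advance(self.latency)
        # Every snapshot sums to 100, but each one has a different split.
        user = 10 * self.calls
        kernel = 5 * self.calls
        return {"user": user, "kernel": kernel, "idle": 100 - user - kernel}


class StampBeforeMonitor:
    """Caches host stats, stamping the cache BEFORE the driver call."""

    def __init__(self, driver, clock):
        self.driver = driver
        self.clock = clock
        self.data = {}

    def _update_data(self):
        now = self.clock.now()
        ts = self.data.get("timestamp")
        if ts is not None and now - ts <= 1.0:
            return  # cache is considered fresh: still the same round
        self.data = {"timestamp": now}  # stamped before the slow call
        self.data.update(self.driver.get_host_cpu_stats())

    def get_metric(self, name):
        self._update_data()
        return self.data[name]


def run_round(latency):
    """Collect one round of metrics; return (driver calls, sum of percents)."""
    clock = FakeClock()
    driver = FakeDriver(clock, latency)
    monitor = StampBeforeMonitor(driver, clock)
    total = sum(monitor.get_metric(n) for n in ("user", "kernel", "idle"))
    return driver.calls, total


print(run_round(0.1))  # fast driver: one refresh, coherent sum
print(run_round(1.5))  # slow driver: every metric triggers a refresh
```

  With a 0.1 second driver call the round uses a single snapshot. With
  a 1.5 second call the stamp is already stale when the call returns,
  so each metric in the round is read from a different snapshot.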

  However, in some cases (e.g., when the system is under stress), this code:
  https://github.com/openstack/nova/blob/master/nova/compute/monitors/cpu/virt_driver.py#L60

  ... takes more than 1 second to execute, which then causes [within the
  "same" metrics round] the data to be refreshed, thus yielding
  potentially incoherent results (e.g., summation of percentages < 100
  or > 100 -- makes for some interesting data points).  :-)

  The fix is simple: record the cache timestamp *after* the host stats
  have been collected rather than before... problem solved.
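
  A minimal sketch of that ordering, under stated assumptions: this is
  not the committed nova patch, and FakeClock, FakeDriver,
  StampAfterMonitor and run_round are hypothetical names. The clock is
  simulated so that a 1.5 second driver call can be reproduced
  deterministically.

```python
class FakeClock:
    """Manually advanced clock (hypothetical) so the run is deterministic."""

    def __init__(self):
        self.t = 0.0

    def now(self):
        return self.t

    def advance(self, seconds):
        self.t += seconds


class FakeDriver:
    """Hypothetical stand-in for the virt driver; not a real nova class."""

    def __init__(self, clock, latency):
        self.clock = clock
        self.latency = latency  # simulated seconds spent per stats call
        self.calls = 0

    def get_host_cpu_stats(self):
        self.calls += 1
        self.clock.advance(self.latency)
        # Every snapshot sums to 100, but each one has a different split.
        user = 10 * self.calls
        kernel = 5 * self.calls
        return {"user": user, "kernel": kernel, "idle": 100 - user - kernel}


class StampAfterMonitor:
    """Caches host stats, stamping the cache AFTER the driver call."""

    def __init__(self, driver, clock):
        self.driver = driver
        self.clock = clock
        self.data = {}

    def _update_data(self):
        ts = self.data.get("timestamp")
        if ts is not None and self.clock.now() - ts <= 1.0:
            return  # cache is considered fresh: still the same round
        stats = self.driver.get_host_cpu_stats()  # may take over 1 second
        # Stamp only once the stats are in hand, so the freshness window
        # starts when the data became available, not when we asked for it.
        self.data = dict(stats, timestamp=self.clock.now())

    def get_metric(self, name):
        self._update_data()
        return self.data[name]


def run_round(latency):
    """Collect one round of metrics; return (driver calls, sum of percents)."""
    clock = FakeClock()
    driver = FakeDriver(clock, latency)
    monitor = StampAfterMonitor(driver, clock)
    total = sum(monitor.get_metric(n) for n in ("user", "kernel", "idle"))
    return driver.calls, total


print(run_round(1.5))  # slow driver, yet a single refresh per round
```

  Because the stamp is taken after the slow call returns, the remaining
  metrics in the round are served from the same snapshot even when the
  driver call itself exceeds the 1 second window.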

  P.S.  This problem is occurring on Liberty (and I suspect it would
  happen on older releases too).

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1490837/+subscriptions

