yahoo-eng-team team mailing list archive
Message #42673
[Bug 1490837] Re: Sporadic incoherent metrics when driver.get_host_cpu_stats takes longer than 1 second to execute
** Changed in: nova
Status: Fix Committed => Fix Released
--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1490837
Title:
Sporadic incoherent metrics when driver.get_host_cpu_stats takes
longer than 1 second to execute
Status in OpenStack Compute (nova):
Fix Released
Bug description:
When using the libvirt CPU monitor (i.e., virt_driver) for metrics
collection, I sporadically saw cases where cpu.user.percent +
cpu.kernel.percent + cpu.idle.percent did not add up to 100, as it
should. It happened rarely, so it was difficult to track down, but
after adding several debug logs I was eventually able to find the
cause.
If you look at this code:
https://github.com/openstack/nova/blob/master/nova/compute/monitors/cpu/virt_driver.py#L52
... you'll notice a built-in assumption: within a given "round" of
metrics gathering, the collective time to call
metric_driver.get_metric(metric_name) (keep in mind there are 10
metrics right now) won't exceed 1 second. If it does, the call is
treated as the "next" round of metric collection. So for the first
metric in a round we refresh the host CPU stats, and the subsequent
n-1 calls simply use the cache, which yields a coherent answer (i.e.,
the percentages all sum to 100%, as you'd expect).
However, in some cases (e.g., when the system is under stress), I've
seen cases where this code:
https://github.com/openstack/nova/blob/master/nova/compute/monitors/cpu/virt_driver.py#L60
... takes more than 1 second to execute, which causes the data to be
refreshed again within the "same" metrics round, yielding potentially
incoherent results (e.g., percentages summing to less than or more
than 100 -- makes for some interesting data points). :-)
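To make the failure mode concrete, here is a minimal simulation. It is
not the nova code: the class, the fake clock and the sample values are
all hypothetical. It assumes the cache timestamp is taken before the
stats call and that each stats call takes 1.5 simulated seconds.

```python
class FakeClock:
    """Simulated clock so the example runs instantly and deterministically."""

    def __init__(self):
        self.now = 0.0

    def advance(self, seconds):
        self.now += seconds


class CachedCpuMonitor:
    """Hypothetical stand-in for a monitor that caches host CPU stats."""

    def __init__(self, clock, fetch_stats):
        self.clock = clock
        self.fetch_stats = fetch_stats
        self.data = {}
        self.refreshes = 0

    def _update_data(self):
        # Timestamp is read BEFORE the (possibly slow) stats call.
        now = self.clock.now
        timestamp = self.data.get("timestamp")
        if timestamp is not None and now - timestamp <= 1:
            return  # cache is considered fresh
        self.data = {"timestamp": now}
        self.data.update(self.fetch_stats())
        self.refreshes += 1

    def get_metric(self, name):
        self._update_data()
        return self.data[name]


def run_slow_round():
    clock = FakeClock()
    # Made-up samples; each one sums to 100 on its own.
    samples = iter([
        {"user": 30, "kernel": 20, "idle": 50},
        {"user": 60, "kernel": 30, "idle": 10},
        {"user": 10, "kernel": 10, "idle": 80},
    ])

    def slow_fetch():
        clock.advance(1.5)  # the stats call itself takes longer than 1 second
        return next(samples)

    monitor = CachedCpuMonitor(clock, slow_fetch)
    total = sum(monitor.get_metric(n) for n in ("user", "kernel", "idle"))
    return total, monitor.refreshes
```

In this simulation every get_metric() call sees a stale timestamp and
refreshes, so the three percentages come from three different samples
and run_slow_round() returns a total of 140 after 3 refreshes.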
The fix is simple: record the cache timestamp *after* the host stats
have been collected rather than before. Problem solved.
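A sketch of that ordering, under the same caveats: the names and values
below are hypothetical and this is not the committed patch. The only
point is that the clock is read after the slow call returns, so the
time spent collecting stats no longer counts against the 1-second
window.

```python
class FakeClock:
    """Simulated clock so the example runs instantly and deterministically."""

    def __init__(self):
        self.now = 0.0

    def advance(self, seconds):
        self.now += seconds


class FixedCpuMonitor:
    """Hypothetical monitor that timestamps the cache after collecting stats."""

    def __init__(self, clock, fetch_stats):
        self.clock = clock
        self.fetch_stats = fetch_stats
        self.data = {}
        self.refreshes = 0

    def _update_data(self):
        timestamp = self.data.get("timestamp")
        if timestamp is not None and self.clock.now - timestamp <= 1:
            return  # cache is still fresh
        stats = self.fetch_stats()
        # Read the clock only AFTER the (possibly slow) call has returned.
        self.data = dict(stats, timestamp=self.clock.now)
        self.refreshes += 1

    def get_metric(self, name):
        self._update_data()
        return self.data[name]


def run_fixed_round():
    clock = FakeClock()
    # Made-up samples; only the first should ever be consumed in one round.
    samples = iter([
        {"user": 30, "kernel": 20, "idle": 50},
        {"user": 60, "kernel": 30, "idle": 10},
    ])

    def slow_fetch():
        clock.advance(1.5)  # still slower than the 1-second window
        return next(samples)

    monitor = FixedCpuMonitor(clock, slow_fetch)
    total = sum(monitor.get_metric(n) for n in ("user", "kernel", "idle"))
    return total, monitor.refreshes
```

Here the stats call is just as slow, but the round refreshes once and
all three metrics come from the same sample, so run_fixed_round()
returns a total of 100 after 1 refresh.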
P.S. This problem is occurring on Liberty (and I suspect it would
happen on older releases too).
To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1490837/+subscriptions