yahoo-eng-team team mailing list archive

[Bug 1838819] [NEW] Docs needed for tunables at large scale

To: yahoo-eng-team@xxxxxxxxxxxxxxxxxxx
From: Matt Riedemann <mriedem.os@xxxxxxxxx>
Date: Fri, 02 Aug 2019 20:11:08 -0000
Reply-to: Bug 1838819 <1838819@xxxxxxxxxxxxxxxxxx>
Sender: bounces@xxxxxxxxxxxxx

Public bug reported:

Every once in a while, configuration options come up in IRC that need
to be tweaked at large scale (Blizzard, CERN, etc.). Once a deployment
hits hundreds or thousands of compute nodes, these options need to be
changed from their defaults to avoid killing the control plane.

One such option is this:

https://docs.openstack.org/nova/latest/configuration/config.html#DEFAULT.heal_instance_info_cache_interval

From a Blizzard operator:

(3:04:18 PM) eandersson: mriedem, we had to set heal_instance_info_cache high because it was killing our control plane
(3:05:41 PM) eandersson: It was getting real heavy on large sites with 1k nodes
(3:06:26 PM) eandersson: We also ended up adding a variance
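
The log does not say how the variance was added. As a rough sketch of
the general idea only (not Nova code and not Blizzard's patch), a
random offset can be applied to a periodic interval so that a thousand
compute nodes do not all refresh at the same moment. The function name
and both parameters below are hypothetical:

```python
import random


def jittered_interval(base_seconds, variance_seconds, rng=None):
    """Return base_seconds shifted by a random offset.

    The offset is drawn uniformly from
    [-variance_seconds, +variance_seconds]. Both names are
    hypothetical; Nova has no option called "variance".
    """
    if base_seconds <= 0:
        raise ValueError("base_seconds must be positive")
    if variance_seconds < 0:
        raise ValueError("variance_seconds must not be negative")
    rng = rng or random
    offset = rng.uniform(-variance_seconds, variance_seconds)
    # Never let the jitter push the interval down to zero or below.
    return max(1.0, base_seconds + offset)


if __name__ == "__main__":
    # Each compute node would compute its own delay independently.
    for node in ("compute-001", "compute-002", "compute-003"):
        print(node, round(jittered_interval(600, 120), 1))
```

With a base of 600 seconds and a variance of 120, each node waits
somewhere between 480 and 720 seconds, which spreads the load on the
control plane instead of concentrating it.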

Similarly, CERN had to totally disable this one:

https://docs.openstack.org/nova/latest/configuration/config.html#compute.resource_provider_association_refresh

They rely on a SIGHUP or a restart of the service whenever they need
to refresh that cache.
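
To make the two overrides concrete, here is a minimal sketch that
renders them as a nova.conf fragment using Python's configparser. The
option names and sections come from the links above. The values are
assumptions: 600 seconds is only an illustrative "high" interval (the
operator did not give a number), and 0 is, as far as I can tell from
the linked option docs, the value that disables the periodic refresh,
so check it against your release. The helper name is hypothetical.

```python
import configparser
import io


def large_scale_overrides(heal_interval_seconds=600):
    """Render the two overrides from this report as nova.conf text.

    heal_interval_seconds is an illustrative value, not a
    recommendation from the bug report.
    """
    cfg = configparser.ConfigParser()
    cfg["DEFAULT"] = {
        # Raise the interval so cache healing runs less often.
        "heal_instance_info_cache_interval": str(heal_interval_seconds),
    }
    cfg["compute"] = {
        # Assumed: 0 disables the periodic refresh; verify per release.
        "resource_provider_association_refresh": "0",
    }
    buf = io.StringIO()
    cfg.write(buf)
    return buf.getvalue()


if __name__ == "__main__":
    print(large_scale_overrides())
```

The output is meant to be merged by hand into the nova.conf used by
the nova-compute service, not to replace the file.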

We should put these options in the admin docs as we come across them,
so the knowledge isn't lost when new operators/users come along and
hit scaling issues.

** Affects: nova
     Importance: Undecided
         Status: New


** Tags: docs performance

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1838819

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1838819/+subscriptions