nagios-charmers team mailing list archive

[Bug 1877400] [NEW] Need ability to tune service checks to non-default notification profiles

 

Public bug reported:

Currently, service checks created with
charmhelpers.contrib.charmsupport.nrpe.add_check() are defined with
max_check_attempts = 4 and retry_check_interval = 1.  This means that
when a service fault is detected, 4 consecutive checks of that service
must return a non-OK result before it becomes a HARD fault that
triggers notification through alerting (pagerduty, email, etc).
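
As a rough illustration of the timing these two settings imply, here is
a minimal sketch.  It assumes the first non-OK result counts as attempt
1, that retries are spaced retry_check_interval apart, and that Nagios
uses its default interval_length of 60 seconds; the function names are
hypothetical and are not part of charmhelpers.

```python
import math


def minutes_until_hard(max_check_attempts, retry_interval=1, interval_length=60):
    """Approximate minutes from the first non-OK result to a HARD state.

    Assumes the first failure is attempt 1, so only
    (max_check_attempts - 1) retries are needed after it.
    """
    if max_check_attempts < 1:
        raise ValueError("max_check_attempts must be at least 1")
    retries = max_check_attempts - 1
    return retries * retry_interval * interval_length / 60


def attempts_for_delay(delay_minutes, retry_interval=1, interval_length=60):
    """Smallest max_check_attempts that delays the HARD state by delay_minutes."""
    retry_minutes = retry_interval * interval_length / 60
    return math.ceil(delay_minutes / retry_minutes) + 1


# Current defaults described in this report: 4 attempts, 1 minute apart.
print(minutes_until_hard(4))
# Attempts needed to hold off notification for 2 hours at 1-minute retries.
print(attempts_for_delay(120))
```

Under these assumptions the current defaults notify roughly 3 minutes
after the first failed check, while a 2 hour delay at the same retry
interval would need max_check_attempts = 121.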

Some checks defined in NRPE and by other charms are known to cross
their thresholds temporarily and then recover on their own, producing
alerts that self-resolve.  One example is rabbitmq-server's unconsumed
messages threshold: when a nova/neutron node restarts, unconsumed
fanout queues can swell for up to 30 minutes until nova or neutron
reaps them.  It would be very useful to expose different
max_check_attempts options so that charm developers and nrpe check
developers can identify which checks should alert immediately and which
should, potentially, not alert unless they have been failing for 2
hours.
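
One way such a feature could look is sketched below: named notification
profiles that a check author picks from, rendered into a Nagios service
definition.  This is only an illustration of the request, not the
charmhelpers API; NotificationProfile, PROFILES, render_service, the
profile names, and the example host and command are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NotificationProfile:
    """Hypothetical per-check tuning; intervals are in Nagios time units."""
    max_check_attempts: int
    retry_check_interval: int


# Hypothetical profile names and values, chosen only for illustration.
PROFILES = {
    "default": NotificationProfile(max_check_attempts=4, retry_check_interval=1),
    "immediate": NotificationProfile(max_check_attempts=1, retry_check_interval=1),
    # (25 - 1) retries * 5 minutes = about 2 hours before a HARD state.
    "slow": NotificationProfile(max_check_attempts=25, retry_check_interval=5),
}


def render_service(host_name, shortname, check_command, profile="default"):
    """Render a Nagios service definition using the chosen profile."""
    p = PROFILES[profile]  # raises KeyError for an unknown profile name
    lines = [
        "define service {",
        f"    host_name               {host_name}",
        f"    service_description     {shortname}",
        f"    check_command           {check_command}",
        f"    max_check_attempts      {p.max_check_attempts}",
        f"    retry_check_interval    {p.retry_check_interval}",
        "}",
    ]
    return "\n".join(lines) + "\n"


print(render_service("rabbit-0", "rabbitmq_queue", "check_nrpe!check_rabbitmq_queue", "slow"))
```

Keeping the tuning in a small set of named profiles, rather than raw
numbers on every check, is one possible design; passing
max_check_attempts directly per check would serve the same goal.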

See https://bugs.launchpad.net/charm-hw-health/+bug/1876931 for an
example.  There, the ability to ignore IPMI hardware timeouts for a
couple of hours would reduce operational overhead: such timeouts
normally self-resolve, while an actual issue would persist well past
the check attempt window.

** Affects: charm-nagios
     Importance: Undecided
         Status: New

** Affects: charm-nrpe
     Importance: Undecided
         Status: New

-- 
You received this bug notification because you are a member of Nagios
Charm developers, which is subscribed to Nagios Charm.
https://bugs.launchpad.net/bugs/1877400

Title:
  Need ability to tune service checks to non-default notification
  profiles

Status in Nagios Charm:
  New
Status in NRPE Charm:
  New

To manage notifications about this bug go to:
https://bugs.launchpad.net/charm-nagios/+bug/1877400/+subscriptions

