nagios-charmers team mailing list archive
Message #00923
[Bug 1877400] [NEW] Need ability to tune service checks to non-default notification profiles
Public bug reported:
Currently, when using
charmhelpers.contrib.charmsupport.nrpe.add_check(), service checks are
defined with max_check_attempts = 4 and retry_check_interval = 1.
This means that when a service fault is detected, 4 consecutive checks
of that service must return a non-OK result before it becomes a HARD
fault that requires notification through alerting (PagerDuty, email,
etc.).
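As a rough illustration of that timing, the sketch below estimates how
long a fault must persist before Nagios treats it as HARD. It assumes
the Nagios default interval_length of 60 seconds (so
retry_check_interval = 1 means one minute) and that the first failing
check counts as attempt 1. The function name is hypothetical and is not
part of charmhelpers.

```python
def minutes_until_hard(max_check_attempts, retry_check_interval,
                       interval_length_s=60):
    """Estimate minutes between the first non-OK result and the HARD state.

    Assumption: the first failing check is attempt 1, and each remaining
    attempt is scheduled retry_check_interval time units later, where one
    unit is interval_length_s seconds (60 by default in Nagios).
    """
    retries = max(max_check_attempts - 1, 0)
    return retries * retry_check_interval * interval_length_s / 60


# Values quoted in this bug report for add_check() defaults.
print(minutes_until_hard(4, 1))
```

Under these assumptions the current defaults leave only a few minutes
between the first failed check and a notification.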
Some checks defined in NRPE and by other charms are known to cross
their thresholds in an ebb and flow that results in self-resolved
alerts. One example is rabbitmq-server's unconsumed messages threshold:
we know that when a nova/neutron node restarts, unconsumed fanout
queues swell for up to 30 minutes until nova or neutron reaps them
after some time has passed.
It would be very useful to offer different max_check_attempts options
to charm developers and NRPE check developers, so they can identify
which checks should alert immediately and which checks should,
potentially, not alert unless they have been failing for 2 hours.
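One way to picture the request is a small table of named notification
profiles. The sketch below is only an illustration: the profile names,
the PROFILES table, and both helper functions are hypothetical and do
not exist in charmhelpers, charm-nagios, or charm-nrpe. It reuses the
directive names quoted in this report and assumes one interval unit is
one minute.

```python
import math


def attempts_for_delay(delay_minutes, retry_interval_minutes):
    """Smallest max_check_attempts whose retries span at least delay_minutes.

    Assumption: the first failing check is attempt 1, so N attempts cover
    (N - 1) retry intervals.
    """
    if retry_interval_minutes <= 0:
        raise ValueError("retry interval must be positive")
    return math.ceil(delay_minutes / retry_interval_minutes) + 1


# Hypothetical profiles a charm author might choose from.
PROFILES = {
    "immediate": {"max_check_attempts": 1, "retry_check_interval": 1},
    "default": {"max_check_attempts": 4, "retry_check_interval": 1},
    "slow_2h": {
        "max_check_attempts": attempts_for_delay(120, 5),
        "retry_check_interval": 5,
    },
}


def render_overrides(profile_name):
    """Render a profile as directive lines for a service definition."""
    profile = PROFILES[profile_name]
    return "\n".join(
        f"    {name:<24}{value}" for name, value in profile.items()
    )


print(render_overrides("slow_2h"))
```

A check such as the rabbitmq-server queue threshold could then opt into
a slower profile, while checks that need immediate attention keep a
single attempt.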
See https://bugs.launchpad.net/charm-hw-health/+bug/1876931 for an
example where the ability to ignore IPMI hardware timeouts for a couple
of hours would reduce operational overhead. That service is known to
have issues that self-resolve in normal circumstances, while an actual
issue would continue well past the check attempt timing.
** Affects: charm-nagios
Importance: Undecided
Status: New
** Affects: charm-nrpe
Importance: Undecided
Status: New
** Also affects: charm-nagios
Importance: Undecided
Status: New
--
You received this bug notification because you are a member of Nagios
Charm developers, which is subscribed to Nagios Charm.
https://bugs.launchpad.net/bugs/1877400
Title:
Need ability to tune service checks to non-default notification
profiles
Status in Nagios Charm:
New
Status in NRPE Charm:
New
To manage notifications about this bug go to:
https://bugs.launchpad.net/charm-nagios/+bug/1877400/+subscriptions