[Bug 1906321] [NEW] We should allow tuning of host/service notification_interval
Public bug reported:
I've found through testing that current Nagios on Bionic does not
re-notify alerts after a downtime ends if they had already alerted prior
to the downtime. While Nagios does have a "DOWNTIMEEND" notification
upon a downtime completing, that appears to be different from what we
need: we need alerts which are still in an error state to re-alert.
The simplest way I can see to do this is to set notification_interval
in the base host/service configs
(/etc/nagios3/conf.d/generic-{host,service}_nagios2.cfg) from 0 to some
other value, e.g. 10 or 20. This assumes the Nagios default
interval_length of 60 seconds, meaning those values would give 10- or
20-minute re-notification intervals.
Doing this in a sane way may take some nuance.
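For illustration, a minimal sketch of the kind of change I mean (the
template layout here is assumed from the stock Ubuntu packaging; only
the notification_interval line actually changes):

    # Relevant lines of /etc/nagios3/conf.d/generic-service_nagios2.cfg (sketch only)
    define service {
        name                    generic-service
        notification_interval   20    ; was 0 (notify once); 20 * interval_length (60s) = 20 minutes
        register                0     ; template definition, not a real service
    }

The generic-host template would need the same treatment, and ideally
the value would be exposed as a charm config option rather than
hard-coded.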
The main use case for the above is when PagerDuty integration is in
use. Per testing, repeat notifications from Nagios to PagerDuty do not
appear to create additional PagerDuty events when one already exists
for the host/service in question. This is true even when events are
snoozed in PagerDuty. Notifications also aren't sent during downtimes
or while a Nagios-side "ack" is in place. The key change would be that
when a downtime expires or a Nagios-side ack is removed, and the
PagerDuty event was marked as resolved during that downtime/ack,
re-notification would create a new event in PagerDuty. This largely
mitigates the "leakage" where something remains a problem in Nagios
but never reaches PagerDuty because the corresponding event was marked
"resolved" at some point in the past.
The key weakness I can see with this approach is that email
notifications don't ignore the "duplicate" alerts. The repeat
notifications would result in extra emails, which may be alarming to
whoever is receiving them. "Snoozing" the PagerDuty events wouldn't
prevent email notifications from being sent; those would persist until
the alert is properly resolved (or downtimed/acked) via Nagios. So if
there is a way we can also enable a slightly different policy for
email versus PagerDuty alerts, that would also help.
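One direction that might cover the email-vs-PagerDuty split (untested
on my side, so treat it as a sketch; the hostgroup, service and contact
group names are hypothetical) is to combine the non-zero
notification_interval above with a notification escalation, since when
an escalation matches a notification number only the escalation's
contact_groups are notified:

    # Sketch: first notification goes to the normal contacts (email included),
    # repeats go only to the PagerDuty contact group.
    define serviceescalation {
        hostgroup_name          monitored-hosts   ; hypothetical hostgroup covering the services
        service_description     Check load        ; hypothetical service description
        first_notification      2                 ; applies from the second notification onward
        last_notification       0                 ; 0 = keep applying indefinitely
        notification_interval   20                ; keep re-notifying every 20 minutes
        contact_groups          pagerduty         ; hypothetical contact group for the PagerDuty integration
    }

With something like that, mailboxes would see a single alert per
problem while PagerDuty keeps being re-notified, including after a
downtime or ack lapses.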
** Affects: charm-nagios
Importance: Undecided
Status: New
--
You received this bug notification because you are a member of Nagios
Charm developers, which is subscribed to Nagios Charm.
https://bugs.launchpad.net/bugs/1906321
To manage notifications about this bug go to:
https://bugs.launchpad.net/charm-nagios/+bug/1906321/+subscriptions