nagios-charmers team mailing list archive

[Bug 1906321] Re: We should allow tuning of host/service notification_interval

To: nagios-charmers@xxxxxxxxxxxxxxxxxxx
From: Eric Chen <1906321@xxxxxxxxxxxxxxxxxx>
Date: Fri, 07 Jul 2023 03:44:03 -0000
Reply-to: Bug 1906321 <1906321@xxxxxxxxxxxxxxxxxx>
Sender: noreply@xxxxxxxxxxxxx

This charm is no longer actively maintained. Please consider using the new Canonical Observability Stack instead
(https://charmhub.io/topics/canonical-observability-stack).
I will close this feature request.

** Changed in: charm-nagios
Status: New => Won't Fix

--
You received this bug notification because you are a member of Nagios
Charm developers, which is subscribed to Nagios Charm.
https://bugs.launchpad.net/bugs/1906321

Title:
We should allow tuning of host/service notification_interval

Status in Nagios Charm:
Won't Fix

Bug description:
Testing suggests that the current Nagios on Bionic does not re-notify
alerts after a downtime ends if they had already alerted prior to the
downtime. While Nagios does send a "DOWNTIMEEND" notification when a
downtime completes, that is different from what we need: alerts which
are still in an error state should re-alert.

The simplest way I can see to do this is by changing the
notification_interval in the base host/service configs
(/etc/nagios3/conf.d/generic-{host,service}_nagios2.cfg) from 0 to
some other value, e.g. 10 or 20. This assumes the Nagios default
interval_length of 60 seconds, meaning those values would give 10 or
20 minute re-notification intervals.
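
As a rough illustration of the arithmetic and the config change described
above, here is a minimal Python sketch. The function names and the sample
config text are hypothetical and are not taken from the charm; the sketch
only assumes that notification_interval is counted in units of
interval_length seconds.

```python
import re


def renotify_minutes(notification_interval: int, interval_length: int = 60) -> float:
    """Minutes between repeat notifications (0 means no repeats)."""
    return notification_interval * interval_length / 60


def set_notification_interval(cfg_text: str, value: int) -> str:
    """Rewrite every notification_interval directive in the given config text."""
    pattern = re.compile(r"^(\s*notification_interval\s+)\d+", re.MULTILINE)
    # A callable replacement avoids ambiguity between the group reference
    # and the digits of the new value.
    return pattern.sub(lambda m: f"{m.group(1)}{value}", cfg_text)


# Hypothetical sample, shaped like a Nagios object definition.
sample = """define service{
        name                    generic-service
        notification_interval   0
        }
"""

if __name__ == "__main__":
    print(set_notification_interval(sample, 10))
    print(renotify_minutes(10), "minutes between repeat notifications")
```

With the default interval_length of 60, a value of 10 maps to 10 minutes
and a value of 20 to 20 minutes, matching the figures above.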

Doing this in a sane way may take some nuance.

The main use case for the above is when PagerDuty integration is in
use. Per testing, repeat notifications from Nagios to PagerDuty do not
appear to create additional PagerDuty events when one already exists
for the host/service in question. This is true even when events are
snoozed in PagerDuty. Notifications also aren't sent during downtimes
or while a problem is acked on the Nagios side. The key change would
be this: when a downtime expires or a Nagios-side ack is removed, and
the event in PagerDuty was marked as resolved during that
downtime/ack, re-notification would cause a new event to be created in
PagerDuty. This would largely mitigate "leakage", where something
continues to be a problem in Nagios but does not reach PagerDuty
because it had been marked "resolved" at some point in the past.

The key weakness I can see with this approach is that email
notification doesn't ignore the "duplicate" alerts. The repeat
notifications would result in extra emails, which may be alarming to
whoever is receiving the alerts. "Snoozing" the PagerDuty events
wouldn't prevent email notifications from being sent; those would
persist until the alert is properly resolved (or downtimed/acked) via
Nagios. So, if there is a way to apply a slightly different policy to
email alerts than to PagerDuty alerts, that would also help.
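
One possible way to get a different policy for email, sketched here
purely as an assumption and not as something the charm provides, would
be to wrap the email notification command in a small filter that drops
repeat PROBLEM notifications. The helper below is hypothetical; it
assumes the wrapper is passed the values of the Nagios macros
$NOTIFICATIONTYPE$ and $SERVICENOTIFICATIONNUMBER$ (or
$HOSTNOTIFICATIONNUMBER$), and that the notification number increases
with each repeat.

```python
def should_send_email(notification_type: str, notification_number: int) -> bool:
    """Decide whether an email should go out for this notification.

    Hypothetical policy: the first PROBLEM notification is emailed,
    repeats are suppressed, and every other notification type
    (e.g. RECOVERY, DOWNTIMEEND) is passed through unchanged.
    """
    if notification_type != "PROBLEM":
        return True
    return notification_number <= 1


if __name__ == "__main__":
    for args in [("PROBLEM", 1), ("PROBLEM", 2), ("RECOVERY", 0)]:
        print(args, should_send_email(*args))
```

The PagerDuty notification command would be left unwrapped, so repeat
notifications would still reach PagerDuty while email recipients would
only see the first alert.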

To manage notifications about this bug go to:
https://bugs.launchpad.net/charm-nagios/+bug/1906321/+subscriptions

References

[Bug 1906321] [NEW] We should allow tuning of host/service notification_interval
From: Paul Goins, 2020-11-30