nagios-charmers team mailing list archive

[Bug 1906321] [NEW] We should allow tuning of host/service notification_interval

 

Public bug reported:

Testing suggests that the current Nagios on Bionic does not re-notify
after a downtime ends if the alert had already fired before the downtime
began.  Nagios does send a "DOWNTIMEEND" notification when a downtime
completes, but that is not what we need: we need alerts that are still
in an error state to alert again.

The simplest way I can see to do this is to change
notification_interval in the base host/service configs
(/etc/nagios3/conf.d/generic-{host,service}_nagios2.cfg) from 0 (which
means Nagios sends only the initial notification) to some other value,
e.g. 10 or 20.  This assumes the Nagios default interval_length of 60
seconds, meaning those would be 10- or 20-minute re-notification
intervals.
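
As a rough illustration of the arithmetic above, here is a minimal
Python sketch.  The helper names are hypothetical and are not part of
charm-nagios; the only things taken from this report are the
notification_interval directive and the default interval_length of 60
seconds.

```python
# Sketch only: helper names here are hypothetical, not part of charm-nagios.

DEFAULT_INTERVAL_LENGTH = 60  # seconds per Nagios time unit (the Nagios default)


def renotify_seconds(notification_interval, interval_length=DEFAULT_INTERVAL_LENGTH):
    """Seconds between repeat notifications; 0 means no re-notification."""
    if notification_interval < 0 or interval_length <= 0:
        raise ValueError("notification_interval must be >= 0 and interval_length > 0")
    return notification_interval * interval_length


def render_directive(notification_interval):
    """Render the directive line as it might appear in a generic template."""
    return "        notification_interval           {}".format(notification_interval)


if __name__ == "__main__":
    for value in (0, 10, 20):
        print(render_directive(value), "->", renotify_seconds(value) // 60, "minutes")
```

A charm config option could feed a value like this into the templates,
with 0 kept as the default so existing deployments do not change
behaviour.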

Doing this in a sane way may take some nuance.

The main use case is when PagerDuty integration is in use.  Per
testing, repeat notifications from Nagios to PagerDuty do not appear to
create additional PagerDuty events when one already exists for the
host/service in question.  This is true even when events are snoozed in
PagerDuty.  Nagios also does not send notifications during downtimes or
while a problem is acked on the Nagios side.  The key change would be
that when a downtime expires or a Nagios-side ack is removed, and the
event in PagerDuty was marked as resolved during that downtime/ack,
re-notification would create a new event in PagerDuty.  That largely
mitigates "leakage", where something continues to be a problem in
Nagios but never reaches PagerDuty because it was marked "resolved" at
some point in the past.
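
The behaviour described above can be modelled with a small Python
sketch.  This is a toy model of what was observed in testing, not the
PagerDuty API: the IncidentStore class, its methods, and the
"web01/HTTP" key are all made up for illustration.

```python
# Toy model of the observed de-duplication; not the real PagerDuty API.


class IncidentStore:
    """Tracks open events keyed by host/service (hypothetical model)."""

    def __init__(self):
        self.open_keys = set()
        self.created = 0

    def trigger(self, key):
        """Handle a notification; return True only if a new event is created."""
        if key in self.open_keys:
            return False  # an event already exists, so the repeat is ignored
        self.open_keys.add(key)
        self.created += 1
        return True

    def resolve(self, key):
        """Mark the event for this host/service as resolved."""
        self.open_keys.discard(key)


if __name__ == "__main__":
    store = IncidentStore()
    key = "web01/HTTP"  # hypothetical host/service
    print(store.trigger(key))  # first alert creates an event
    print(store.trigger(key))  # repeat while the event is open: no new event
    store.resolve(key)         # resolved in PagerDuty during a downtime
    print(store.trigger(key))  # re-notification after downtime: new event
```

With notification_interval left at 0 the last trigger never happens,
which is the "leakage" case.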

The key weakness I can see with this approach is that email
notification doesn't ignore the "duplicate" alerts.  The repeat
notifications would result in extra emails, which may be alarming to
whoever is receiving the alerts.  "Snoozing" the PagerDuty events
wouldn't prevent email notifications from being sent; those would
persist until the alert is properly resolved (or downtimed/acked) via
Nagios.  So, if there is a way to apply a slightly different policy to
email and PagerDuty alerts, that would also help.

** Affects: charm-nagios
     Importance: Undecided
         Status: New

-- 
You received this bug notification because you are a member of Nagios
Charm developers, which is subscribed to Nagios Charm.
https://bugs.launchpad.net/bugs/1906321

Title:
  We should allow tuning of host/service notification_interval

Status in Nagios Charm:
  New

To manage notifications about this bug go to:
https://bugs.launchpad.net/charm-nagios/+bug/1906321/+subscriptions

