← Back to team overview

launchpad-reviewers team mailing list archive

Re: [Merge] ~ilasc/canonical-is-prometheus:add-lp-ppa-publisher-alert into canonical-is-prometheus:master

 

Review: Needs Fixing



Diff comments:

> diff --git a/ols/launchpad.rules b/ols/launchpad.rules
> index 2db3f47..9a404ba 100644
> --- a/ols/launchpad.rules
> +++ b/ols/launchpad.rules
> @@ -23,6 +23,17 @@ groups:
>          labels:
>            severity: warning
>  
> +      - alert: LaunchpadPPAPublisherStuck
> +        expr: absent_over_time(lp_script_activity_count{env='production', name='publish-distro'}[30m])

Please make this `env='production',service='launchpad-ppa',name='publish-distro'`, to make sure that we don't catch other publisher instances.

My experience is that 30 minutes is a bit too much of a hair-trigger for this; the PPA publisher can be very bursty depending on what people are trying to publish, and is especially sensitive to large packages or large PPAs.  For example, there was a `publish-distro` run yesterday evening that took 2836 seconds (admittedly while PS5 was having some trouble, but it wouldn't have struck me as a definite indicator of trouble on its own), and we also need to allow a little slack for gaps between runs since the cron job also includes `process-accepted`.

Perhaps in future we can add finer-grained metrics so that we can have more sensitive alerts, but for now, I'd suggest `1h` as an initial value here so that we don't exhaust IS with false positives.  If need be, we have the option of adding more sensitive Grafana alerts that go to the ~launchpad-alerts channel (though I don't think that's particularly necessary here).

> +        for: 5m
> +        annotations:
> +          summary: ppa publisher has not run for 30m

For human-readable text, make this "PPA publisher ...".

> +          dashboard_url: https://grafana.admin.canonical.com/d/000000044/telegraf-host?orgId=1&from=now-2h&to=now&var-juju_controller=prodstack5-prodstack5-prodstack-is&var-juju_model=All&var-service=launchpad-ppa&var-juju_unit=launchpad-ppa%2F2

Drop `&var-juju_unit=launchpad-ppa%2F2` - it doesn't add any extra selectivity since that Juju application only has a single unit, and we generally want to avoid specifying individual units where possible.

> +          description: Launchpad Script {{ $labels.name }} is not running as expected.
> +          playbook_url: https://wiki.canonical.com/InformationInfrastructure/IS/LaunchpadScripts#LaunchpadPPAPublisherStuck
> +        labels:
> +          severity: warning
> +
>        - alert: LaunchpadFlagExpiredMembershipsStuck
>          expr: absent_over_time(lp_script_activity_count{env='production',host='loganberry',name='flag-expired-memberships'}[24h])
>          for: 1h


-- 
https://code.launchpad.net/~ilasc/canonical-is-prometheus/+git/canonical-is-prometheus/+merge/433207
Your team Launchpad code reviewers is subscribed to branch canonical-is-prometheus:master.



References