launchpad-reviewers team mailing list archive
-
launchpad-reviewers team
-
Mailing list archive
-
Message #29395
Re: [Merge] ~ilasc/canonical-is-prometheus:add-lp-ppa-publisher-alert into canonical-is-prometheus:master
Review: Needs Fixing
Diff comments:
> diff --git a/ols/launchpad.rules b/ols/launchpad.rules
> index 2db3f47..9a404ba 100644
> --- a/ols/launchpad.rules
> +++ b/ols/launchpad.rules
> @@ -23,6 +23,17 @@ groups:
> labels:
> severity: warning
>
> + - alert: LaunchpadPPAPublisherStuck
> + expr: absent_over_time(lp_script_activity_count{env='production', name='publish-distro'}[30m])
Please make this `env='production',service='launchpad-ppa',name='publish-distro'`, to make sure that we don't catch other publisher instances.
My experience is that 30 minutes is a bit too much of a hair-trigger for this; the PPA publisher can be very bursty depending on what people are trying to publish, and is especially sensitive to large packages or large PPAs. For example, there was a `publish-distro` run yesterday evening that took 2836 seconds (admittedly while PS5 was having some trouble, but it wouldn't have struck me as a definite indicator of trouble on its own), and we also need to allow a little slack for gaps between runs since the cron job also includes `process-accepted`.
Perhaps in future we can add finer-grained metrics so that we can have more sensitive alerts, but for now, I'd suggest `1h` as an initial value here so that we don't exhaust IS with false positives. If need be, we have the option of adding more sensitive Grafana alerts that go to the ~launchpad-alerts channel (though I don't think that's particularly necessary here).
> + for: 5m
> + annotations:
> + summary: ppa publisher has not run for 30m
For human-readable text, make this "PPA publisher ...".
> + dashboard_url: https://grafana.admin.canonical.com/d/000000044/telegraf-host?orgId=1&from=now-2h&to=now&var-juju_controller=prodstack5-prodstack5-prodstack-is&var-juju_model=All&var-service=launchpad-ppa&var-juju_unit=launchpad-ppa%2F2
Drop `&var-juju_unit=launchpad-ppa%2F2` - it doesn't add any extra selectivity since that Juju application only has a single unit, and we generally want to avoid specifying individual units where possible.
> + description: Launchpad Script {{ $labels.name }} is not running as expected.
> + playbook_url: https://wiki.canonical.com/InformationInfrastructure/IS/LaunchpadScripts#LaunchpadPPAPublisherStuck
> + labels:
> + severity: warning
> +
> - alert: LaunchpadFlagExpiredMembershipsStuck
> expr: absent_over_time(lp_script_activity_count{env='production',host='loganberry',name='flag-expired-memberships'}[24h])
> for: 1h
--
https://code.launchpad.net/~ilasc/canonical-is-prometheus/+git/canonical-is-prometheus/+merge/433207
Your team Launchpad code reviewers is subscribed to branch canonical-is-prometheus:master.
References