canonical-ubuntu-qa team mailing list archive: Message #04297
[Merge] ~hyask/autopkgtest-cloud:skia/stats into autopkgtest-cloud:master
Skia has proposed merging ~hyask/autopkgtest-cloud:skia/stats into autopkgtest-cloud:master.
Requested reviews:
Canonical's Ubuntu QA (canonical-ubuntu-qa)
For more details, see:
https://code.launchpad.net/~hyask/autopkgtest-cloud/+git/autopkgtest-cloud/+merge/467038
Stats collection from the autopkgtest log files, to help diagnose datacenter- or architecture-specific issues.
--
Your team Canonical's Ubuntu QA is requested to review the proposed merge of ~hyask/autopkgtest-cloud:skia/stats into autopkgtest-cloud:master.
diff --git a/charms/focal/autopkgtest-web/webcontrol/README.md b/charms/focal/autopkgtest-web/webcontrol/README.md
index ca35397..b45f7f6 100644
--- a/charms/focal/autopkgtest-web/webcontrol/README.md
+++ b/charms/focal/autopkgtest-web/webcontrol/README.md
@@ -39,3 +39,14 @@ popd
Note: Having everything in a folder that is not in the charm folder avoids
having issue next time you want to build the charm.
+
+
+## Analyzing production log files in bulk to gather stats
+
+There is a self-documented Jupyter notebook in this folder, called
+`stats.ipynb`. You can experiment with it as follows:
+```
+sudo apt install jupyter-notebook
+# From the current folder:
+jupyter-notebook
+```
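+
+If you prefer working from a shell, the collection step documented in the
+notebook can be run directly. This is a sketch of the same command the
+notebook uses (see `./stats.py --help` for all available flags):
+```
+# Fetch a fresh autopkgtest.db and collect stats for the last four days,
+# including today:
+./stats.py --collect-stats --download-db --since-days-ago 4 --until-days-ago 0
+```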
diff --git a/charms/focal/autopkgtest-web/webcontrol/stats.ipynb b/charms/focal/autopkgtest-web/webcontrol/stats.ipynb
new file mode 100644
index 0000000..2dbaddc
--- /dev/null
+++ b/charms/focal/autopkgtest-web/webcontrol/stats.ipynb
@@ -0,0 +1,195 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "id": "f9cee084",
+ "metadata": {},
+ "source": [
+ "# Analyze data from autopkgtest log files\n",
+ "\n",
+ "Two steps process:\n",
+ "1. Gather data with `./stats.py --collect-stats --download-db --since-days-ago 4 --until-days-ago 0` \n",
+ " This will fetch the latest `autopkgtest.db`, and collect data for the last four days including today. \n",
+ " Not giving the last two options will default to the last 30 complete days (that excludes today, which is by nature incomplete).\n",
+ "2. Play with the cells below to display what you want\n",
+ "\n",
+ "For convenience, you can run the scraping command from the cell just below, but if you do work on this a bit more, don't hesitate to run it from the shell instead if it's easier.\n",
+ "\n",
+ "The `stats.py` script is also the place where you can tweak the scrapping, like adding metrics. You can also use that to look for a specific regex, a bit like `retry-autopkgtest-regressions`, but with the power to select the time range, and not depend on what is present in the excuses page only."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "48a388c7",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "%run ./stats.py --collect-stats --download-db --since-days-ago 4 --until-days-ago 0"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "3d3540d9",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import sqlite3\n",
+ "import pandas as pd\n",
+ "import matplotlib.pyplot as plt\n",
+ "\n",
+ "# Update this path with the corresponding path to the database you want to analyze\n",
+ "db_path = \"./autopkgtest_2024-06-03 09:55:39.367132_with_stats.db\"\n",
+ "\n",
+ "db = sqlite3.connect(f\"file:{db_path}?mode=ro\")\n",
+ "sqlite3.paramstyle = \"named\""
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "aa686716",
+ "metadata": {},
+ "source": [
+ "## Averaged time to first boot\n",
+ "\n",
+ "The next cell shows the \"time to first boot\", which correspond to the time taken from starting `autopkgtest` to the point where it is able to print the `dpkg` architecture from inside the testbed, meaning the VM has been started, booted, and reached over SSH.\n",
+ "\n",
+ "Feel free to add more filtering, like the following, if you want to refine the data.\n",
+ "```sql\n",
+ "AND arch = \"arm64\"\n",
+ "AND datacenter = \"bos03\"\n",
+ "```\n",
+ "You can also seggregate by `autopkgtest-cloud-worker` unit by tweaking the `SELECT` and `GROUP BY` with the following:\n",
+ "```sql\n",
+ "SELECT concat(host, '-', datacenter, '-', arch) as datacenter\n",
+ "...\n",
+ "GROUP BY host, datacenter, date\n",
+ "```"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "d16523b1",
+ "metadata": {
+ "scrolled": true
+ },
+ "outputs": [],
+ "source": [
+ "with db as db_con:\n",
+ " db_con.row_factory = sqlite3.Row\n",
+ " \n",
+ " query = \"\"\"\n",
+ " SELECT AVG(first_boot_time) as first_boot_time, concat(datacenter, '-', arch) as datacenter, substr(tests_stats.run_id, 1, 8) as date\n",
+ " FROM tests_stats\n",
+ " JOIN result ON result.run_id=tests_stats.run_id\n",
+ " JOIN test ON test.id=result.test_id\n",
+ " WHERE first_boot_time IS NOT NULL\n",
+ " AND arch != 'armhf'\n",
+ " GROUP BY datacenter, date\n",
+ " \"\"\"\n",
+ " df = pd.read_sql_query(query, db_con)\n",
+ " # Get the date as datetime object\n",
+ " df[\"date\"] = pd.to_datetime(df.date)\n",
+ "\n",
+ " # Display data as a graph\n",
+ " plt.figure(figsize=(14, 5))\n",
+ "\n",
+ " # Plot point for each dc-arch over time\n",
+ " for dc in df['datacenter'].sort_values().unique():\n",
+ " dc_data = df[df['datacenter'] == dc]\n",
+ " plt.plot(dc_data['date'], dc_data['first_boot_time'], label=dc)\n",
+ "\n",
+ " # Add some title and labels\n",
+ " plt.title('Time to first boot for each datacenter-arch over time')\n",
+ " plt.xlabel('Date')\n",
+ " plt.ylabel('Time to first boot')\n",
+ " plt.legend()\n",
+ "\n",
+ " # Plot the graph\n",
+ " plt.show()\n",
+ " print(df)\n",
+ "\n",
+ " "
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "2ceba19c",
+ "metadata": {},
+ "source": [
+ "## Cumulated boot attempts\n",
+ "\n",
+ "The next cell show the cumulated boot attempts. Sometimes, the `nova` script is unable to reach the VM on first try, but will retry some amount of time, logging the failures. Depending on the particular issues, this can show some boot instabilities, network trouble, or anything. \n",
+ "Since this graph shows cumulated boot attempts, it actually kinda counts the number of time a job had to retry to boot successfully, since the `boot_attempts` is almost always 1. This graph isn't scaled to the number of runned jobs to get a relative percentage, so this is heavily dependent on the number of actually run jobs. Still, this is somehow useful to get a hunch of instabilities in some particular datacenters.\n",
+ "\n",
+ "The same kind of tweaking to the query than the previous cell can be done here."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "e4906b31",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "with db as db_con:\n",
+ " db_con.row_factory = sqlite3.Row\n",
+ " \n",
+ " query = \"\"\"\n",
+ " SELECT COUNT(boot_attempts) as boot_attempts, concat(datacenter, '-', arch) as datacenter, substr(tests_stats.run_id, 1, 8) as date\n",
+ " FROM tests_stats\n",
+ " JOIN result ON result.run_id=tests_stats.run_id\n",
+ " JOIN test ON test.id=result.test_id\n",
+ " WHERE boot_attempts IS NOT NULL\n",
+ " AND arch != \"armhf\"\n",
+ " GROUP BY datacenter, date\n",
+ " ORDER BY date\n",
+ " \"\"\"\n",
+ " df = pd.read_sql_query(query, db_con)\n",
+ " # Get the date as datetime object\n",
+ " df[\"date\"] = pd.to_datetime(df.date)\n",
+ "\n",
+ " # Display data as a graph\n",
+ " plt.figure(figsize=(14, 5))\n",
+ "\n",
+ " # Plot point for each dc-arch over time\n",
+ " for dc in df['datacenter'].unique():\n",
+ " dc_data = df[df['datacenter'] == dc]\n",
+ " plt.plot(dc_data['date'], dc_data['boot_attempts'], label=dc)\n",
+ "\n",
+ " # Add some title and labels\n",
+ " plt.title('Cumulated boot attempts for each datacenter-arch over time')\n",
+ " plt.xlabel('Date')\n",
+ " plt.ylabel('Cumulated boot attempts')\n",
+ " plt.legend()\n",
+ "\n",
+ " # Plot the graph\n",
+ " plt.show()\n",
+ " print(df)\n"
+ ]
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3 (ipykernel)",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.12.3"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
diff --git a/charms/focal/autopkgtest-web/webcontrol/stats.py b/charms/focal/autopkgtest-web/webcontrol/stats.py
new file mode 100755
index 0000000..9970b2c
--- /dev/null
+++ b/charms/focal/autopkgtest-web/webcontrol/stats.py
@@ -0,0 +1,322 @@
+#!/usr/bin/python3
+
+import argparse
+import datetime
+import logging
+import os
+import re
+import sqlite3
+import sys
+from urllib.request import urlretrieve
+
+import requests
+
+sqlite3.paramstyle = "named"
+
+
+class Run:
+ """
+    This represents a job run.
+ To build this object, give its constructor a dict with at least the
+ following keys:
+ - package
+ - release
+ - arch
+ - run_id
+ It is very common to build that from a row coming from a DB query selecting
+ at least those fields.
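+
+    A minimal usage sketch (package and run_id values are hypothetical):
+
+        run = Run({"package": "glibc", "release": "noble",
+                   "arch": "amd64", "run_id": "20240603_120000_abcdef"})
+        run.extract_data()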
+ """
+
+ def __init__(self, row):
+ self.package = row["package"]
+ self.release = row["release"]
+ self.arch = row["arch"]
+ self.run_id = row["run_id"]
+ self.log = None
+ self.log_lines = None
+ self.host = None
+ self.datacenter = None
+ self.first_boot_time = None
+ self.boot_attempts = None
+
+ def __str__(self):
+ return (
+ f"{self.package} ({self.release}/{self.arch}) - {self.log_url()}"
+ )
+
+ def as_row_dict(self):
+ return {
+ "run_id": self.run_id,
+ "first_boot_time": self.first_boot_time,
+ "boot_attempts": self.boot_attempts,
+ "host": self.host,
+ "datacenter": self.datacenter,
+ }
+
+ def log_url(self):
+ log_address_template = (
+ "https://autopkgtest.ubuntu.com/"
+ "results/autopkgtest-%(release)s/%(release)s/%(arch)s/"
+ "%(package_prefix)s/%(package)s/%(run_id)s/log.gz"
+ )
+ if self.package.startswith("lib"):
+ prefix = "lib" + self.package[3]
+ else:
+ prefix = self.package[0]
+ return log_address_template % {
+ "release": self.release,
+ "arch": self.arch,
+ "package_prefix": prefix,
+ "package": self.package,
+ "run_id": self.run_id,
+ }
+
+ def download_log(self):
+ try:
+ r = requests.get(self.log_url())
+ self.log = r.text
+ self.log_lines = self.log.splitlines()
+ except Exception as e:
+ print(e)
+ self.log = ""
+ self.log_lines = []
+
+ def extract_data(self):
+ """
+ This is the main part of this script, where we call all the heuristics
+ to extract the data from the logs:
+        - worker host
+        - datacenter
+        - time to first boot (time until a shell is available in the testbed)
+        - number of nova boot attempts
+ """
+ self.download_log()
+ self.extract_host()
+ self.extract_datacenter()
+ self.extract_first_boot_time()
+ self.extract_boot_attempts()
+
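+    # Worker unit number, from the autopkgtest command-line log line, e.g. a
+    # line containing "host juju-7f2275-prod-proposed-migration-environment-3;"
+    # (illustrative line derived from the pattern below).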
+ host_pattern = re.compile(
+ r"host juju-7f2275-prod-proposed-migration-environment-(\d);"
+ )
+
+ def extract_host(self):
+ try:
+ lines = [
+ l
+ for l in self.log_lines
+ if "command line: /home/ubuntu/autopkgtest/runner/autopkgtest"
+ in l
+ ]
+ m = self.host_pattern.search(lines[0])
+ if m:
+ self.host = m.group(1)
+ except Exception as e:
+ print(e)
+
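+    # Datacenter: for cloud runs, the part of the nova security-group name
+    # before the first dash (e.g. "@bos03-xxx.secgroup" yields "bos03"); for
+    # armhf, the LXD remote address (e.g. "lxd-armhf-10.0.0.1:"). Examples are
+    # illustrative, derived from the patterns below.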
+ datacenter_pattern = re.compile(r"@(.*)\.secgroup ")
+ lxd_remote_pattern = re.compile(r"lxd-armhf-(\d*\.\d*\.\d*\.\d*):")
+
+ def extract_datacenter(self):
+ try:
+ lines = [
+ l
+ for l in self.log_lines
+ if "command line: /home/ubuntu/autopkgtest/runner/autopkgtest"
+ in l
+ ]
+ if self.arch == "armhf":
+ m = self.lxd_remote_pattern.search(lines[0])
+ if m:
+ self.datacenter = m.group(1)
+ else:
+ m = self.datacenter_pattern.search(lines[0])
+ if m:
+ self.datacenter = m.group(1).split("-")[0]
+ except Exception as e:
+ print(e)
+
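+    # Time to first boot: the elapsed-seconds prefix of the log line reporting
+    # the testbed dpkg architecture, e.g. "  42s ... testbed dpkg architecture:
+    # amd64" (illustrative).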
+ boot_time_pattern = re.compile(r"^ *(\d*)s .* testbed dpkg architecture: ")
+
+ def extract_first_boot_time(self):
+ try:
+ lines = [
+ l for l in self.log_lines if "testbed dpkg architecture: " in l
+ ]
+ if lines:
+ m = self.boot_time_pattern.search(lines[0])
+ if m:
+ self.first_boot_time = int(m.group(1))
+ except Exception as e:
+ print(e)
+
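+    # Number of nova boot retries, from lines like
+    # "nova boot failed (attempt #2)".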
+ boot_attempts_pattern = re.compile(r"nova boot failed \(attempt #(\d*)\)")
+
+ def extract_boot_attempts(self):
+ try:
+ lines = [
+ l for l in self.log_lines if "nova boot failed (attempt #" in l
+ ]
+ if lines:
+ m = self.boot_attempts_pattern.search(lines[0])
+ if m:
+ self.boot_attempts = int(m.group(1))
+ except Exception as e:
+ print(e)
+
+
+def get_stats(db_con, since_days_ago, until_days_ago, limit):
+ try:
+ db_con.execute(
+ "CREATE TABLE IF NOT EXISTS tests_stats("
+ " run_id CHAR[30], "
+ " first_boot_time INTEGER, "
+ " boot_attempts INTEGER, "
+ " datacenter CHAR[10],"
+ " host CHAR[10],"
+ " PRIMARY KEY(run_id))"
+ )
+ print("Created tests_stats table")
+ except sqlite3.OperationalError as e:
+ if "already exists" not in str(e):
+ raise
+
+ start_date = datetime.datetime.now() - datetime.timedelta(
+ days=since_days_ago
+ )
+ end_date = datetime.datetime.now() - datetime.timedelta(
+ days=until_days_ago
+ )
+ print(f"Fetching stats from log files from {start_date} to {end_date}")
+
+ db_con.row_factory = sqlite3.Row
+ i = 0
+ failed = []
+ date = start_date
+ while date <= end_date:
+ for row in db_con.execute(
+ f"""
+ SELECT *
+ FROM result
+ JOIN test ON test.id=result.test_id
+ LEFT JOIN tests_stats ON tests_stats.run_id = result.run_id
+ WHERE tests_stats.run_id IS NULL
+ AND result.run_id LIKE '{date.strftime("%Y%m%d")}%'
+ AND arch = 'arm64'
+ ORDER BY RANDOM()
+ LIMIT {limit}
+ """
+ ):
+            r = Run(row)
+            print(r)
+            try:
+                r.extract_data()
+ db_con.execute(
+ """INSERT INTO tests_stats(
+ run_id,
+ first_boot_time,
+ boot_attempts,
+ host,
+ datacenter
+ )
+ VALUES (
+ :run_id,
+ :first_boot_time,
+ :boot_attempts,
+ :host,
+ :datacenter
+ )""",
+ r.as_row_dict(),
+ )
+ db_con.commit()
+ print(f"{i}\t| Extracted {r}")
+ print(f"\t| {r.as_row_dict()}")
+ except Exception as e:
+ failed.append((str(r), repr(e)))
+ i += 1
+        date += datetime.timedelta(days=1)
+
+    if failed:
+        print(f"Failed to extract stats for {len(failed)} runs:")
+        for failed_run, error in failed:
+            print(f"\t{failed_run}: {error}")
+
+
+def download_db():
+ url = "http://autopkgtest.ubuntu.com/static/autopkgtest.db"
+ dst = f"./autopkgtest_{datetime.datetime.now()}_with_stats.db"
+ print(f"Downloading database in {dst}")
+ urlretrieve(url, dst)
+ return dst
+
+
+if __name__ == "__main__":
+ parser = argparse.ArgumentParser(
+ description="Fetch AMQP queues into a file"
+ )
+ parser.add_argument(
+ "--debug",
+ action="store_true",
+ help="Show debugging logs",
+ )
+ parser.add_argument(
+ "--collect-stats",
+ action="store_true",
+ help="Collect the data from the logs, to populate the DB",
+ )
+ parser.add_argument(
+ "--download-db",
+ action="store_true",
+ help="Download a fresh database from autopkgtest.ubuntu.com to populate with stats",
+ )
+ parser.add_argument(
+ "--database",
+ type=str,
+ default="",
+ help="Populate this given database with stats",
+ )
+ parser.add_argument(
+ "--since-days-ago",
+ type=int,
+ default=30,
+ help="Fetching data for a time range starting X days ago (default: 30)",
+ )
+ parser.add_argument(
+ "--until-days-ago",
+ type=int,
+ default=1,
+ help="Fetch data for a time range ending X days ago (default: 1, meaning yesterday)",
+ )
+ parser.add_argument(
+ "--limit",
+ type=int,
+ default=2000,
+ help="Number of jobs per days to fetch stats from (default: 2000)",
+ )
+ parser.add_argument(
+ "--clear-stats",
+ action="store_true",
+ help="Clear the DB from the existing stats, to start from scratch",
+ )
+
+ args = parser.parse_args()
+
+ if "DEBUG" in os.environ or args.debug:
+ logging.basicConfig(level=logging.DEBUG)
+
+ if args.download_db:
+ db_path = download_db()
+ elif args.database:
+ db_path = args.database
+ else:
+ print(
+ "You need to supply either --download-db or --database for the data collection to work"
+ )
+ sys.exit(1)
+
+ autopkgtest_db = sqlite3.connect(db_path)
+ with autopkgtest_db as db_con:
+ if args.clear_stats:
+ print("Clearing stats table")
+ db_con.execute("DROP TABLE IF EXISTS tests_stats")
+ if args.collect_stats:
+ get_stats(
+ db_con, args.since_days_ago, args.until_days_ago, args.limit
+ )