← Back to team overview

launchpad-reviewers team mailing list archive

[Merge] lp:~stub/launchpad/db-deploy into lp:launchpad

 

Stuart Bishop has proposed merging lp:~stub/launchpad/db-deploy into lp:launchpad.

Requested reviews:
  Launchpad code reviewers (launchpad-reviewers)
Related bugs:
  Bug #777559 in Launchpad itself: "DB deploy preflight check must ignore SSO database connections"
  https://bugs.launchpad.net/launchpad/+bug/777559
  Bug #826653 in Launchpad itself: "preflight check could report patches to be applied"
  https://bugs.launchpad.net/launchpad/+bug/826653

For more details, see:
https://code.launchpad.net/~stub/launchpad/db-deploy/+merge/81562

= Summary =

Smooth out some fastdowntime db update issues.

 - We consistently block on some long running transactions we could ignore.
 - Prerollout checks should include confirmation of what patches will be applied, per Bug #826653

== Proposed fix ==

Report patches pending application in preflight.py

Define a list of BAD_USERS. These connections we know can have long running transactions but we have confirmed are safe to kill. This is a whitelist to ensure we open bugs for badly behaved scripts.

== Pre-implementation notes ==

== Implementation details ==

== Tests ==

== Demo and Q/A ==


= Launchpad lint =

Checking for conflicts and issues in changed files.

Linting changed files:
  database/schema/preflight.py

./database/schema/preflight.py
       7: '_pythonpath' imported but unused
-- 
https://code.launchpad.net/~stub/launchpad/db-deploy/+merge/81562
Your team Launchpad code reviewers is requested to review the proposed merge of lp:~stub/launchpad/db-deploy into lp:launchpad.
=== modified file 'database/schema/preflight.py'
--- database/schema/preflight.py	2011-09-23 00:53:40 +0000
+++ database/schema/preflight.py	2011-11-08 11:52:29 +0000
@@ -15,6 +15,7 @@
 
 from datetime import timedelta
 from optparse import OptionParser
+import os.path
 import time
 
 import psycopg2
@@ -30,6 +31,7 @@
     logger_options,
     )
 import replication.helpers
+import upgrade
 
 
 # Ignore connections by these users.
@@ -50,7 +52,18 @@
     'publish_ftpmaster',
     ])
 
+# If these users have long running transactions, just kill 'em. Entries
+# added here must come with a bug number, a if part of Launchpad holds
+# open a long running transaction it is a bug we need to fix.
+BAD_USERS = frozenset([
+    'karma',  # Bug #863109
+    'rosettaadmin',  # Bug #863122
+    ])
+
 # How lagged the cluster can be before failing the preflight check.
+# If this is set too low, perfectly normal state will abort rollouts. If
+# this is set too high, then we will have unacceptable downtime as
+# replication needs to catch up before the database patches will apply.
 MAX_LAG = timedelta(seconds=60)
 
 
@@ -83,11 +96,19 @@
             self.lpmain_nodes = set(
                 node for node in self.nodes
                 if node.node_id in lpmain_node_ids)
+
+            # Store a reference to the lpmain origin.
+            lpmain_master_node_id = replication.helpers.get_master_node(
+                master_con, 1).node_id
+            self.lpmain_master_node = [
+                node for node in self.lpmain_nodes
+                    if node.node_id == lpmain_master_node_id][0]
         else:
             node = replication.helpers.Node(None, None, None, True)
             node.con = master_con
             self.nodes = set([node])
             self.lpmain_nodes = self.nodes
+            self.lpmain_master_node = node
 
     def check_is_superuser(self):
         """Return True if all the node connections are as superusers."""
@@ -176,6 +197,11 @@
         max_secs defines what is long running. For database rollouts,
         this will be short. Even if the transaction is benign like a
         autovacuum task, we should wait until things have settled down.
+
+        We ignore transactions held open by BAD_USERS. These are bugs
+        that need to be fixed, but we have determined that rudely aborting
+        them is fine for now and there is no need to block a rollout on
+        their behalf.
         """
         success = True
         for node in self.nodes:
@@ -190,10 +216,15 @@
                     AND datname=current_database()
                 """ % max_secs)
             for datname, usename, age, current_query in cur.fetchall():
-                self.log.fatal(
-                    "%s has transaction by %s open %s",
-                    datname, usename, age)
-                success = False
+                if usename in BAD_USERS:
+                    self.log.info(
+                        "%s has transactions by %s open %s (ignoring)",
+                        datname, usename, age)
+                else:
+                    self.log.fatal(
+                        "%s has transaction by %s open %s",
+                        datname, usename, age)
+                    success = False
         if success:
             self.log.info("No long running transactions detected.")
         return success
@@ -247,6 +278,13 @@
         else:
             return True
 
+    def report_patches(self):
+        """Report what patches are due to be applied from this tree."""
+        con = self.lpmain_master_node.con
+        upgrade.log = self.log
+        for patch_num, patch_file in upgrade.get_patchlist(con):
+            self.log.info("%s is pending", os.path.basename(patch_file))
+
     def check_all(self):
         """Run all checks.
 
@@ -257,6 +295,8 @@
             # to pg_stat_activity
             return False
 
+        self.report_patches()
+
         success = True
         if not self.check_replication_lag():
             success = False
@@ -312,6 +352,9 @@
                         self.log.fatal(
                             "Unable to kill %s [%s] on %s",
                             usename, procpid, datname)
+                    elif usename in BAD_USERS:
+                        self.log.info(
+                            "Killed %s [%s] on %s", usename, procpid, datname)
                     else:
                         self.log.warning(
                             "Killed %s [%s] on %s", usename, procpid, datname)