launchpad-reviewers team mailing list archive
Message #29986
[Merge] ~cjwatson/launchpad:charm-librarian-doc-migration into launchpad:master
Colin Watson has proposed merging ~cjwatson/launchpad:charm-librarian-doc-migration into launchpad:master.
Commit message:
charm: Document launchpad-librarian migration process
Requested reviews:
Launchpad code reviewers (launchpad-reviewers)
For more details, see:
https://code.launchpad.net/~cjwatson/launchpad/+git/launchpad/+merge/442615
This is loosely based on our last production migration, although that was a lift-and-shift of a pre-cloud deployment into a manually-managed VM, so a number of the details differ.
--
Your team Launchpad code reviewers is requested to review the proposed merge of ~cjwatson/launchpad:charm-librarian-doc-migration into launchpad:master.
diff --git a/charm/launchpad-librarian/README.md b/charm/launchpad-librarian/README.md
index 6567dc0..bd2a657 100644
--- a/charm/launchpad-librarian/README.md
+++ b/charm/launchpad-librarian/README.md
@@ -23,3 +23,75 @@ You will normally want to mount a persistent volume on
`/srv/launchpad/librarian/`. (Even when writing uploads to Swift, this is
currently used as a temporary spool; it is therefore not currently valid to
deploy more than one unit of this charm.)
+
+## Migrating between instances
+
+Only one instance of the librarian may be active at any one time, and very
+little downtime is acceptable on production. This means that we have to be
+especially careful when redeploying. The general procedure is as follows:
+
+1. Deploy a new unit with `active=false`. This runs the librarian in a
+   more or less read-only mode: downloads are possible, but cron jobs that
+   would modify the database, the contents of the librarian, or the contents
+   of Swift are disabled. If uploads happen they will be spooled locally,
+   but that's low-risk since only Launchpad itself uploads to the librarian.
+   This allows connectivity to be tested safely.
+
+1. Ensure that a Ceph volume is mounted persistently on
+ `/srv/launchpad/librarian/`. On production this should be a 2 TiB volume
+ to allow some breathing room if uploading to Swift is temporarily
+ unavailable.
+
+1. On the librarian unit, run `systemctl status
+ launchpad-librarian@1.service` to ensure that the librarian is running.
+ (If it crashes then it will restart automatically, so make sure that it's
+ been running for at least a few minutes.) You may need to ensure that
+ the appropriate firewall rules exist to give it access to the Launchpad
+ database, Launchpad's XML-RPC appserver, Swift, and some other details;
+ for Canonical's production deployment, it should be enough to add the new
+ unit to `services/lp/librarian/servers` in our firewall configuration.
+
+1. Find a librarian URL of something at least a day old from Launchpad (the
+ `.dsc` of an older source package in Ubuntu will do) and check that you
+ can fetch it from any of the public download ports of the new unit.
+ There is one public download port per worker, assigned sequentially
+ starting from `port_download_base`. This checks basic database and Swift
+ connectivity.
+
+1. Use `rsync` to copy the temporary spool from the old unit; on pre-Juju
+ production instances this lived in
+ `/srv/launchpadlibrarian.net/production/librarian/`, while on instances
+ of this charm it lives in `/srv/launchpad/librarian/`. The `rsync`
+ process should run as the `launchpad` user on the new unit, and should
+ _not_ use the `--delete` option; extra copies of files aren't a problem,
+ and will be cleaned up by automatic garbage collection after the
+ migration is complete. Keep this running in a loop throughout the
+ migration; once it has caught up it should only take a minute or so per
+ iteration.
+
+1. As `stg-launchpad@launchpad-bastion-ps5.internal`, run `lpndt
+ service-stop cron-fdt` to disable all cron jobs, then (after a minute)
+ `lpndt service-stop buildd-manager` to stop `buildd-manager`.
+
+1. Comment out the `librarian-gc` and `librarian-feed-swift` cron jobs on
+ the old unit (if it was deployed using this charm and is in a different
+ Juju application, you can do this by setting `active=false` using Juju),
+ and wait for the associated processes to stop.
+
+1. Switch the `haproxy` frontends over to the new unit. On production,
+ you'll need to update the IP addresses of the `dl_librarian_[1-6]`,
+ `ul_librarian_[1-6]`, `dl_librarian_internal_[1-6]`, and
+ `ul_librarian_internal_[1-6]` servers.
+
+1. Ensure that librarian access via the web frontend still works.
+
+1. Set `active=true` on the new unit using Juju.
+
+1. Check that logs from `librarian-feed-swift` (and later `librarian-gc`,
+ which only runs daily) look good.
+
+1. Stop the `rsync` loop.
+
+1. As `stg-launchpad@launchpad-bastion-ps5.internal`, run `lpndt
+   service-start buildd-manager` to start `buildd-manager`, then (after a
+   minute) `lpndt service-start cron-fdt` to enable all cron jobs.
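
The connectivity check and spool sync in the steps above can be sketched in shell. This is only an illustration: the hostnames, the worker count, and the `port_download_base` value below are assumptions, not values from the charm; check the actual deployment configuration before use.

```shell
#!/bin/sh
# Sketch of the per-worker connectivity check and rsync loop described
# above.  Hostnames and port_download_base are assumed values.
set -eu

# There is one public download port per worker, assigned sequentially
# starting from port_download_base; assuming workers are numbered from 1.
download_port() {
    port_download_base=8000   # assumed; read the real value from the charm config
    worker="$1"
    echo $((port_download_base + worker - 1))
}

# Check that a known, at-least-day-old librarian file is fetchable from
# each worker's public download port (exercises database and Swift
# connectivity); placeholders stand in for a real librarian URL:
# for worker in 1 2 3 4; do
#     curl -sfI "http://new-unit.internal:$(download_port "$worker")/<id>/<file>.dsc"
# done

# Keep the temporary spool synced from the old unit throughout the
# migration, run as the launchpad user and without --delete:
# while true; do
#     rsync -a old-unit.internal:/srv/launchpad/librarian/ /srv/launchpad/librarian/
#     sleep 60
# done
```

The commented-out loops are deliberately inert so the sketch can be read (or sourced) without touching any real unit; only the port arithmetic runs as written.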