Re: DB replication lag events
On Wed, Aug 25, 2010 at 8:09 AM, Steve McInerney
<steve.mcinerney@xxxxxxxxxxxxx> wrote:
> Heya Stub,
>
> We've had a couple of DB replication lag alerts lately; in my case, two
> this week at what 'felt' like similar times.
>
> Below is the list of errors from Nagios over the past week (times are
> BST, +0100):
>
> Event Start Time -- Event End Time   Event Duration   Event/State Type   Event/State Information
> 2010-08-13 19:36:47 -- 2010-08-13 19:47:17 0d 0h 10m 30s SERVICE CRITICAL (HARD) ERROR: Replication lag above threshold of 300 seconds or slave not recognised
> 2010-08-14 03:03:50 -- 2010-08-14 03:26:50 0d 0h 23m 0s SERVICE CRITICAL (HARD) ERROR: Replication lag above threshold of 300 seconds or slave not recognised
> 2010-08-15 23:45:11 -- 2010-08-15 23:50:51 0d 0h 5m 40s SERVICE CRITICAL (HARD) ERROR: Replication lag above threshold of 300 seconds or slave not recognised
> 2010-08-19 20:35:45 -- 2010-08-19 20:54:25 0d 0h 18m 40s SERVICE CRITICAL (HARD) ERROR: Replication lag above threshold of 300 seconds or slave not recognised
> 2010-08-22 20:09:25 -- 2010-08-22 20:27:25 0d 0h 18m 0s SERVICE CRITICAL (HARD) ERROR: Replication lag above threshold of 300 seconds or slave not recognised
> 2010-08-24 02:48:05 -- 2010-08-24 03:24:55 0d 0h 36m 50s SERVICE CRITICAL (HARD) ERROR: Replication lag above threshold of 300 seconds or slave not recognised
> 2010-08-25 01:34:22 -- 2010-08-25 01:36:10 0d 0h 1m 48s+
>
>
> So the one I saw yesterday was a bit later, but a biggish chunk: ~35
> mins.
>
>
> Doing some log surfing:
>
> <@:9027> 2010-08-24 02:45:34 BST LOG: checkpoint starting: time
> <@:9027> 2010-08-24 02:46:53 BST LOG: checkpoint complete: wrote 2025
> buffers (0.3%); 0 transaction log file(s) added, 0 removed, 7 recycled;
> write=79.033 s, sync=0.051 s, total=79.106 s
>
> That's the postgres writer process; but I see similar log entries *all*
> the time, so I'm somewhat inclined to discount cause/effect here. :-/
Lag is caused by updates happening faster than we can replicate them,
so the high write activity you saw is a symptom rather than a cause.
Running utilities/report-database-stats.py tells me that just before
the high lag we had a spike in pofiletranslator write activity,
probably from the pofilestats_daily or maybe the poimport database user.
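(For context: the lag that check alerts on is whatever Slony-I reports
per subscriber in its sl_status view. A minimal sketch of that sort of
check, assuming a cluster schema named "_main" and a placeholder DSN,
neither of which is taken from this thread:

import psycopg2

LAG_THRESHOLD_SECS = 300  # the threshold the Nagios check alerts on

def max_replication_lag(dsn):
    """Worst per-subscriber lag in seconds, as reported by Slony-I."""
    con = psycopg2.connect(dsn)
    try:
        cur = con.cursor()
        # sl_status lives in the Slony cluster schema; st_lag_time is
        # an interval saying how far behind each subscriber is.
        cur.execute(
            "SELECT COALESCE(MAX(EXTRACT(EPOCH FROM st_lag_time)), 0) "
            "FROM _main.sl_status")
        return float(cur.fetchone()[0])
    finally:
        con.close()

if __name__ == '__main__':
    lag = max_replication_lag('dbname=launchpad_prod')  # placeholder DSN
    state = 'CRITICAL' if lag > LAG_THRESHOLD_SECS else 'OK'
    print('%s: replication lag %.0f seconds' % (state, lag))
)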
postgres@wildcherry:~/launchpad/utilities$ ./report-database-stats.py \
    --from='2010-08-24 01:00' --until='2010-08-24 01:45'
[...]
== Most Written Tables ==
sl_log_1 || 130.31 tuples/sec
sl_seqlog || 124.54 tuples/sec
pofiletranslator || 75.73 tuples/sec
oauthnonce || 43.48 tuples/sec
translationmessage || 15.36 tuples/sec
pofile || 6.18 tuples/sec
sl_confirm || 6.08 tuples/sec
libraryfilealias || 4.45 tuples/sec
libraryfiledownloadcount || 3.94 tuples/sec
branchrevision || 3.36 tuples/sec
karma || 3.14 tuples/sec
potranslation || 3.01 tuples/sec
bugnotificationrecipient || 2.95 tuples/sec
revisioncache || 2.06 tuples/sec
databasereplicationlag || 1.97 tuples/sec
== Most Active Users ==
lpnet || 83.80% CPU
xmlrpc || 76.46% CPU
edge || 65.79% CPU
pofilestats_daily || 52.90% CPU
lucille || 46.64% CPU
poimport || 32.97% CPU
slony || 30.27% CPU
postgres || 18.50% CPU
fiera || 13.35% CPU
translations_import_queue_gardener || 8.76% CPU
distributionmirror || 6.70% CPU
checkwatches || 4.63% CPU
generateppahtaccess || 4.45% CPU
lagmon || 1.40% CPU
uploader || 0.51% CPU
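I haven't looked at how report-database-stats.py computes these, but
numbers of this shape can be had by differencing two samples of
pg_stat_user_tables. A rough sketch of that technique follows; it is
not the actual script (which evidently works from stored history,
given the --from/--until options), and the DSN is a placeholder:

import time
import psycopg2

def sample_writes(cur):
    # pg_stat_user_tables keeps cumulative per-table counters; inserts,
    # updates and deletes together approximate "tuples written".
    cur.execute(
        "SELECT relname, n_tup_ins + n_tup_upd + n_tup_del "
        "FROM pg_stat_user_tables")
    return dict(cur.fetchall())

def most_written(dsn, interval=60, top=15):
    con = psycopg2.connect(dsn)
    # Autocommit, because pg_stat_* values are frozen for the life of a
    # transaction and the second sample must see fresh numbers.
    con.autocommit = True
    cur = con.cursor()
    before = sample_writes(cur)
    time.sleep(interval)
    after = sample_writes(cur)
    rates = sorted(
        ((after[table] - before.get(table, 0)) / float(interval), table)
        for table in after)
    for rate, table in reversed(rates[-top:]):
        print('%s || %.2f tuples/sec' % (table, rate))

if __name__ == '__main__':
    most_written('dbname=launchpad_prod')  # placeholder DSN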
I'm not sure what the script behind that user is doing. Perhaps it is
committing all its changes in a single transaction? Perhaps it is
sometimes touching far more rows than expected?
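If the single-big-transaction theory is right, the usual remedy is to
commit in batches, so Slony ships a steady stream of small SYNCs the
slaves can keep pace with, instead of one giant transaction to replay
after the fact. A hedged sketch; the table, column, and batch size are
purely illustrative, not what pofilestats_daily actually runs:

import psycopg2

BATCH_SIZE = 1000  # illustrative; tune against observed lag

def touch_in_batches(dsn, ids):
    con = psycopg2.connect(dsn)  # placeholder DSN
    cur = con.cursor()
    for start in range(0, len(ids), BATCH_SIZE):
        batch = ids[start:start + BATCH_SIZE]
        # Illustrative statement only.
        cur.execute(
            "UPDATE pofiletranslator "
            "SET date_last_touched = CURRENT_TIMESTAMP "
            "WHERE id = ANY(%s)", (batch,))
        # Commit each slice: every commit becomes its own Slony SYNC,
        # so the slaves replay many small transactions instead of one
        # huge one long after it happened.
        con.commit()
    con.close()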
--
Stuart Bishop <stuart@xxxxxxxxxxxxxxxx>
http://www.stuartbishop.net/