Re: DB replication lag events
On Wed, Aug 25, 2010 at 8:09 AM, Steve McInerney
<steve.mcinerney@xxxxxxxxxxxxx> wrote:
> Heya Stub,
>
> We've had a couple of DB replication lag alerts lately; in my case, two
> this week at what 'felt' like similar times.
>
> Below is the list of errors from Nagios over the past week (times are
> BST, +0100):
>
> Event Start Time -- Event End Time   Event Duration   Event/State Type   Event/State Information
> 2010-08-13 19:36:47 -- 2010-08-13 19:47:17 0d 0h 10m 30s SERVICE CRITICAL (HARD) ERROR: Replication lag above threshold of 300 seconds or slave not recognised
> 2010-08-14 03:03:50 -- 2010-08-14 03:26:50 0d 0h 23m 0s SERVICE CRITICAL (HARD) ERROR: Replication lag above threshold of 300 seconds or slave not recognised
> 2010-08-15 23:45:11 -- 2010-08-15 23:50:51 0d 0h 5m 40s SERVICE CRITICAL (HARD) ERROR: Replication lag above threshold of 300 seconds or slave not recognised
> 2010-08-19 20:35:45 -- 2010-08-19 20:54:25 0d 0h 18m 40s SERVICE CRITICAL (HARD) ERROR: Replication lag above threshold of 300 seconds or slave not recognised
> 2010-08-22 20:09:25 -- 2010-08-22 20:27:25 0d 0h 18m 0s SERVICE CRITICAL (HARD) ERROR: Replication lag above threshold of 300 seconds or slave not recognised
> 2010-08-24 02:48:05 -- 2010-08-24 03:24:55 0d 0h 36m 50s SERVICE CRITICAL (HARD) ERROR: Replication lag above threshold of 300 seconds or slave not recognised
> 2010-08-25 01:34:22 -- 2010-08-25 01:36:10 0d 0h 1m 48s+
>
>
> So the one I saw yesterday was a bit later, but a biggish chunk: ~35
> mins.
>
>
> Doing some log surfing:
>
> <@:9027> 2010-08-24 02:45:34 BST LOG: checkpoint starting: time
> <@:9027> 2010-08-24 02:46:53 BST LOG: checkpoint complete: wrote 2025
> buffers (0.3%); 0 transaction log file(s) added, 0 removed, 7 recycled;
> write=79.033 s, sync=0.051 s, total=79.106 s
>
> That's the postgres writer process; but I see similar log entries *all*
> the time, so I'm somewhat inclined to discount cause/effect here. :-/
Lag is caused by updates happening faster than we can replicate them,
so the high write activity you saw is a symptom rather than a cause.
Running utilities/report-database-stats.py tells me that just before
the high lag we had a spike in pofiletranslator write activity,
probably from the pofilestats_daily or maybe the poimport database user.
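(For context: the lag that check alerts on is whatever Slony-I reports
per subscriber in its sl_status view. A minimal sketch of that sort of
check, assuming a cluster schema named "_main" and a placeholder DSN,
neither of which is taken from this thread:

import psycopg2

LAG_THRESHOLD_SECS = 300  # the threshold the Nagios check alerts on

def max_replication_lag(dsn):
    """Worst per-subscriber lag in seconds, as reported by Slony-I."""
    con = psycopg2.connect(dsn)
    try:
        cur = con.cursor()
        # sl_status lives in the Slony cluster schema; st_lag_time is
        # an interval saying how far behind each subscriber is.
        cur.execute(
            "SELECT COALESCE(MAX(EXTRACT(EPOCH FROM st_lag_time)), 0) "
            "FROM _main.sl_status")
        return float(cur.fetchone()[0])
    finally:
        con.close()

if __name__ == '__main__':
    lag = max_replication_lag('dbname=launchpad_prod')  # placeholder DSN
    state = 'CRITICAL' if lag > LAG_THRESHOLD_SECS else 'OK'
    print('%s: replication lag %.0f seconds' % (state, lag))
)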
postgres@wildcherry:~/launchpad/utilities$ ./report-database-stats.py \
    --from='2010-08-24 01:00' --until='2010-08-24 01:45'
[...]
== Most Written Tables ==
sl_log_1 || 130.31 tuples/sec
sl_seqlog || 124.54 tuples/sec
pofiletranslator || 75.73 tuples/sec
oauthnonce || 43.48 tuples/sec
translationmessage || 15.36 tuples/sec
pofile || 6.18 tuples/sec
sl_confirm || 6.08 tuples/sec
libraryfilealias || 4.45 tuples/sec
libraryfiledownloadcount || 3.94 tuples/sec
branchrevision || 3.36 tuples/sec
karma || 3.14 tuples/sec
potranslation || 3.01 tuples/sec
bugnotificationrecipient || 2.95 tuples/sec
revisioncache || 2.06 tuples/sec
databasereplicationlag || 1.97 tuples/sec
== Most Active Users ==
lpnet || 83.80% CPU
xmlrpc || 76.46% CPU
edge || 65.79% CPU
pofilestats_daily || 52.90% CPU
lucille || 46.64% CPU
poimport || 32.97% CPU
slony || 30.27% CPU
postgres || 18.50% CPU
fiera || 13.35% CPU
translations_import_queue_gardener || 8.76% CPU
distributionmirror || 6.70% CPU
checkwatches || 4.63% CPU
generateppahtaccess || 4.45% CPU
lagmon || 1.40% CPU
uploader || 0.51% CPU
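I haven't looked at how report-database-stats.py computes these, but
numbers of this shape can be had by differencing two samples of
pg_stat_user_tables. A rough sketch of that technique follows; it is
not the actual script (which evidently works from stored history,
given the --from/--until options), and the DSN is a placeholder:

import time
import psycopg2

def sample_writes(cur):
    # pg_stat_user_tables keeps cumulative per-table counters; inserts,
    # updates and deletes together approximate "tuples written".
    cur.execute(
        "SELECT relname, n_tup_ins + n_tup_upd + n_tup_del "
        "FROM pg_stat_user_tables")
    return dict(cur.fetchall())

def most_written(dsn, interval=60, top=15):
    con = psycopg2.connect(dsn)
    # Autocommit, because pg_stat_* values are frozen for the life of a
    # transaction and the second sample must see fresh numbers.
    con.autocommit = True
    cur = con.cursor()
    before = sample_writes(cur)
    time.sleep(interval)
    after = sample_writes(cur)
    rates = sorted(
        ((after[table] - before.get(table, 0)) / float(interval), table)
        for table in after)
    for rate, table in reversed(rates[-top:]):
        print('%s || %.2f tuples/sec' % (table, rate))

if __name__ == '__main__':
    most_written('dbname=launchpad_prod')  # placeholder DSN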
I'm not sure what the script behind that user is doing. Perhaps it is
committing all its changes in a single transaction? Perhaps it is
sometimes touching far more rows than expected?
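If the single-big-transaction theory is right, the usual remedy is to
commit in batches, so Slony ships a steady stream of small SYNCs the
slaves can keep pace with, instead of one giant transaction to replay
after the fact. A hedged sketch; the table, column, and batch size are
purely illustrative, not what pofilestats_daily actually runs:

import psycopg2

BATCH_SIZE = 1000  # illustrative; tune against observed lag

def touch_in_batches(dsn, ids):
    con = psycopg2.connect(dsn)  # placeholder DSN
    cur = con.cursor()
    for start in range(0, len(ids), BATCH_SIZE):
        batch = ids[start:start + BATCH_SIZE]
        # Illustrative statement only.
        cur.execute(
            "UPDATE pofiletranslator "
            "SET date_last_touched = CURRENT_TIMESTAMP "
            "WHERE id = ANY(%s)", (batch,))
        # Commit each slice: every commit becomes its own Slony SYNC,
        # so the slaves replay many small transactions instead of one
        # huge one long after it happened.
        con.commit()
    con.close()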
--
Stuart Bishop <stuart@xxxxxxxxxxxxxxxx>
http://www.stuartbishop.net/