pbxt-discuss team mailing list archive
-
pbxt-discuss team
-
Mailing list archive
-
Message #00121
Re: PBXT "hangs" sometimes but regularly
Hi Paul, Erkan
I will open a bug report and try to reproduce it.
BR,
Vlad
> -----Original Message-----
> From: Paul McCullagh [mailto:paul.mccullagh@xxxxxxxxx]
> Sent: Thursday, December 02, 2010 4:32 PM
> To: Vladimir Kolesnikov
> Cc: PBXT Discuss; erkan yanar
> Subject: Re: [Pbxt-discuss] PBXT "hangs" sometimes but regularly
>
> Hi Vlad,
>
> There are a number of things which are strange about this test.
>
> Firstly the flush takes 25 minutes. From the xtstat dump we can see
> that the main flushing on the index file stops at 101202 13:00:13.
> This is correct according to the "select now()" below (12:35:46 + 24
> minutes).
>
> This may not be the problem, but rather this:
>
> 1. Checkpointing does not seem to be happening, because to-chkpt is at
> 14.9G, and seems to remain there.
>
> The lack of checkpoint may be explained be second bit of strange
> activity:
>
> 2. Lots of little flusing on the ilog, but no update to the index data:
>
> ind ind ind ind ind ind ind ind ind ilog ilog
> ilog ilog ilog
> in out rd.ms wr.ms syncs/ms hits miss %use %dty in out
> rd.ms wr.ms syncs/ms
> 0 0 0 0 0/0 0 0 100 0.0 0
> 54 0 0 6/3
> 0 0 0 0 0/0 0 0 100 0.0 0
> 36 0 0 4/2
> 0 0 0 0 0/0 0 0 100 0.0 0
> 54 0 0 6/3
>
> This is very strange and need to be investigated. Basically, the ilog
> should be written, then the ilog is read, and the index file is
> written. That normally concludes the flushing of the index.
>
> It looks like the checkpointer thread is in a loop here, and does not
> get to the end of flushing the index, and so it does not get to the
> end of the checkpoint.
>
> @Vlad or Erkan, could you please report this as a bug, with the
> information we have so far. Next we will have to see if we can repeat
> this strange index "non-flush" behavior. Maybe using a similar test,
> but a little less data.
>
> On Dec 2, 2010, at 12:31 PM, erkan yanar wrote:
>
> > Hi Paul,
> >
> > I repeatet it for you;)
> >
> > mysql> select now(); truncate rein;
> > +---------------------+
> > | now() |
> > +---------------------+
> > | 2010-12-02 12:35:46 |
> > +---------------------+
> > 1 row in set (0.00 sec)
> >
> > Query OK, 0 rows affected (24 min 49.47 sec)
> >
> > So it was successfull.
> >
> > The xtstat-ouptut is attached. So maybe yo see something that might
> > helpfull.
> >
> > Regards
> > Erkan
> >
> > P.S.
> > Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s
> > avgrq-sz avgqu-sz await svctm %util
> > dm-0 0.00 0.00 0.00 666.00 0.00 9040.00
> > 13.57 186.48 327.86 1.50 100.00
> >
> >
> > On Thu, Dec 02, 2010 at 09:13:48AM +0100, Paul McCullagh wrote:
> >> Hi Erkan,
> >>
> >> Yes. Strange pause. But I cannot see what the engine is doing in
> that
> >> time.
> >>
> >> There must be activity in some missing statistics, for example,
> >> record, row or index file writing.
> >>
> >> Unfortunately, when displaying all statistics, the output is very
> >> wide. What xtstat needs is a feature which logs all statistics but
> >> just displays what you select.
> >>
> >> Then, if you see a problem that does not show up in the displayed
> >> statistics you can still go back and check the others at the time.
> >>
> >> On Nov 30, 2010, at 3:05 AM, erkan yanar wrote:
> >>
> >>> Moin Paul
> >>>
> >>> On Mon, Nov 29, 2010 at 10:23:38AM +0100, Paul McCullagh wrote:
> >>>>
> >>>>
> >>>> The problem here is the index cache is full, and the thread which
> >>>> is
> >>>> flushing the index is hanging in sync (i.e. flush to disk). The
> >>>> rest
> >>>> of the system is hanging because it needs a clean page in the
> >>>> index.
> >>>>
> >>>> I did quite a bit of work to improve this in 1.1 and 1.5.
> >>>>
> >>>> But in general, if the index cache is really large, it need to be
> >>>> flushed more regularly to ensure that too many changes do not
> >>>> accumulate. Checkpointing should do this, but there is a bug in
> >>>> 1.0.
> >>>>
> >>>> You checkpoint frequency is 128M, xtstat says to-chkpt = 1049M.
> >>>> This
> >>>> means the checkpoint has not occurred as often as he should.
> >>>>
> >>>> This problem should be fixed in 1.1.
> >>>>
> >>>
> >>> While 1.5.02 Beta handles IO a lot better than 1.0.11-7, you may be
> >>> interested in that checkpointing issue:
> >>> pbxt_checkpoint_frequency = 67108864
> >>>
> >>> time time xact xact xact xact stat stat ilog
> >>> ilog ilog ilog ilog xlog xlog xlog xlog xlog xlog xlog
> >>> xlog to to to sweep row row row row
> >>> curr msec commt rollb waits dirty read write in out
> >>> rd.ms wr.ms syncs/ms in out rd.ms wr.ms syncs sy.ms hits
> miss
> >>> chkpt write sweep waits sel ins upd del
> >>> 101130 03:45:13 5001 198t 0 0 10.9t 0 198t 320M
> >>> 0 83 0 0/0 0 21.1M 0 12 9 3576 69.6t
> >>> 0 813M 9833K 0 13 0 198t 0 0
> >>> 101130 03:45:18 5001 9501 0 0 5148 0 9501 0
> >>> 0 0 0 0/0 0 0 0 0 0 5001 0
> >>> 0 813M 9833K 0 1 0 9501 0 0
> >>> 101130 03:45:23 5001 0 0 0 5148 0 0 0
> >>> 0 0 0 0/0 0 0 0 0 0 5001 0
> >>> 0 813M 9833K 0 0 0 0 0 0
> >>> 101130 03:45:28 5001 80.1t 0 0 13.8t 0 80.1t 0
> >>> 0 0 0 0/0 0 8192K 0 4 1 4993 1
> >>> 0 814M 17.6M 115M 5 0 80.1t 0 0
> >>> 101130 03:45:33 5001 179t 0 0 10.3t 0 179t 0
> >>> 0 0 0 0/0 1244K 19.7M 0 8 20 4395 119t
> >>> 39 837M 6772K 0 13 0 179t 0 0
> >>> 101130 03:45:38 5001 12.7t 0 0 9084 0 12.7t 0
> >>> 0 0 0 0/0 0 0 0 0 0 5001 5507
> >>> 0 837M 3756K 0 1 0 12.7t 0 0
> >>> 101130 03:45:43 5001 78.4t 0 0 8907 0 78.4t 0
> >>> 0 0 0 0/0 0 8192K 0 3 1 4994 18.1t
> >>> 0 840M 8192K 0 5 0 78.4t 0 0
> >>> 101130 03:45:48 5001 156t 0 0 4410 0 156t 0
> >>> 0 0 0 0/0 0 16.0M 0 8 2 4985 67.5t
> >>> 0 856M 8192K 0 11 0 156t 0 0
> >>> 101130 03:45:53 5001 78.9t 0 0 7213 0 78.9t 0
> >>> 0 0 0 0/0 0 8192K 0 6 1 4991 1
> >>> 0 864M 16.0M 0 7 0 78.9t 0 0
> >>>
> >>>
> >>> Best Regards
> >>> Erkan
> >>>
> >>> --
> >>> über den grenzen muß die freiheit wohl wolkenlos sein
> >>
> >
> > --
> > über den grenzen muß die freiheit wohl wolkenlos sein
> > <xtstat-truncate.log.gz>
References