← Back to team overview

pbxt-discuss team mailing list archive

Re: PBXT "hangs" sometimes but regularly

 

Hi Erkan,

On Nov 29, 2010, at 4:30 AM, erkan yanar wrote:


Playing around from time to time PBXT stops doing anything:

xtstat:

time xact row rec rec rec rec rec rec rec ind ind ind ind ind ind ilog ilog ilog xlog xlog xlog xlog xlog xlog data data data data to to to xact msec commt ins in out syncs/ms hits miss frees %use in out syncs/ms hits miss %use in out syncs/ms in out syncs msec hits miss in out syncs msec chkpt write sweep dirty 5001 0 0 0 0 0/0 0 0 0 76.0 0 0 0/5001 0 0 100 0 0 0/0 0 0 0 0 0 0 0 0 0 0 1049M 400M 85.6M 3865 5001 0 0 0 0 0/0 0 0 0 76.0 0 0 0/5001 0 0 100 0 0 0/0 0 0 0 0 0 0 0 0 0 0 1049M 400M 85.6M 3865 5001 0 0 0 0 0/0 0 0 0 76.0 0 0 0/5001 0 0 100 0 0 0/0 0 0 0 0 0 0 0 0 0 0 1049M 400M 85.6M 3865 5001 0 0 0 0 0/0 0 0 0 76.0 0 0 0/5001 0 0 100 0 0 0/0 0 0 0 0 0 0 0 0 0 0 1049M 400M 85.6M 3865 5001 0 0 0 0 0/0 0 0 0 76.0 0 0 0/5001 0 0 100 0 0 0/0 0 0 0 0 0 0 0 0 0 0 1049M 400M 85.6M 3865 5001 0 0 0 0 0/0 0 0 0 76.0 0 0 0/5001 0 0 100 0 0 0/0 0 0 0 0 0 0 0 0 0 0 1049M 400M 85.6M 3865 5000 0 0 0 0 0/0 0 0 0 76.0 0 0 0/5000 0 0 100 0 0 0/0 0 0 0 0 0 0 0 0 0 0 1049M 400M 85.6M 3865 5001 0 0 0 0 0/0 0 0 0 76.0 0 0 0/5001 0 0 100 0 0 0/0 0 0 0 0 0 0 0 0 0 0 1049M 400M 85.6M 3865 5001 0 0 0 0 0/0 0 0 0 76.0 0 0 0/5001 0 0 100 0 0 0/0 0 0 0 0 0 0 0 0 0 0 1049M 400M 85.6M 3865 5000 0 0 0 0 0/0 0 0 0 76.0 0 0 0/5000 0 0 100 0 0 0/0 0 0 0 0 0 0 0 0 0 0 1049M 400M 85.6M 3865 5001 0 0 0 0 0/0 0 0 0 76.0 0 0 0/5001 0 0 100 0 0 0/0 0 0 0 0 0 0 0 0 0 0 1049M 400M 85.6M 3865 5001 0 0 0 0 0/0 0 0 0 76.0 0 0 0/5001 0 0 100 0 0 0/0 0 0 0 0 0 0 0 0 0 0 1049M 400M 85.6M 3865 5000 0 0 0 0 0/0 0 0 0 76.0 0 0 0/5000 0 0 100 0 0 0/0 0 0 0 0 0 0 0 0 0 0 1049M 400M 85.6M 3865 5001 0 0 0 0 0/0 0 0 0 76.0 0 0 0/5001 0 0 100 0 0 0/0 0 0 0 0 0 0 0 0 0 0 1049M 400M 85.6M 3865 5001 0 0 0 0 0/0 0 0 0 76.0 0 0 0/5001 0 0 100 0 0 0/0 0 0 0 0 0 0 0 0 0 0 1049M 400M 85.6M 3865 5000 0 0 0 0 0/0 0 0 0 76.0 0 0 0/5000 0 0 100 0 0 0/0 0 0 0 0 0 0 0 0 0 0 1049M 400M 85.6M 3865 5000 0 0 0 0 0/0 0 0 0 76.0 0 0 0/5000 0 0 100 0 0 0/0 0 0 0 0 0 0 0 0 0 0 1049M 400M 85.6M 3865 5000 0 0 0 0 0/0 0 0 0 76.0 0 0 0/5000 0 0 100 0 0 0/0 0 0 0 0 0 0 0 0 0 0 1049M 400M 85.6M 3865 5000 0 0 0 0 0/0 0 0 0 76.0 0 0 0/5000 0 0 100 0 0 0/0 0 0 0 0 0 0 0 0 0 0 1049M 400M 85.6M 3865 5000 0 0 0 0 0/0 0 0 0 76.0 0 0 0/5000 0 0 100 0 0 0/0 0 0 0 0 0 0 0 0 0 0 1049M 400M 85.6M 3865 5000 0 0 0 0 0/0 0 0 0 76.0 0 0 0/5000 0 0 100 0 0 0/0 0 0 0 0 0 0 0 0 0 0 1049M 400M 85.6M 3865 5001 0 0 0 0 0/0 0 0 0 76.0 0 0 0/5001 0 0 100 0 0 0/0 0 0 0 0 0 0 0 0 0 0 1049M 400M 85.6M 3865 5001 0 0 0 0 0/0 0 0 0 76.0 0 0 0/5001 0 0 100 0 0 0/0 0 0 0 0 0 0 0 0 0 0 1049M 400M 85.6M 3865 5001 0 0 0 0 0/0 0 0 0 76.0 0 0 0/5001 0 0 100 0 0 0/0 0 0 0 0 0 0 0 0 0 0 1049M 400M 85.6M 3865 5000 0 0 0 0 0/0 0 0 0 76.0 0 0 0/5000 0 0 100 0 0 0/0 0 0 0 0 0 0 0 0 0 0 1049M 400M 85.6M 3865 time xact row rec rec rec rec rec rec rec ind ind ind ind ind ind ilog ilog ilog xlog xlog xlog xlog xlog xlog data data data data to to to xact msec commt ins in out syncs/ms hits miss frees %use in out syncs/ms hits miss %use in out syncs/ms in out syncs msec hits miss in out syncs msec chkpt write sweep dirty 5000 0 0 0 0 0/0 0 0 0 76.0 0 0 0/5000 0 0 100 0 0 0/0 0 0 0 0 0 0 0 0 0 0 1049M 400M 85.6M 3865 5001 0 0 0 0 0/0 0 0 0 76.0 0 0 0/5001 0 0 100 0 0 0/0 0 0 0 0 0 0 0 0 0 0 1049M 400M 85.6M 3865 5000 0 0 0 0 0/0 0 0 0 76.0 0 0 0/5000 0 0 100 0 0 0/0 0 0 0 0 0 0 0 0 0 0 1049M 400M 85.6M 3865 5001 0 0 0 0 0/0 0 0 0 76.0 0 0 0/5001 0 0 100 0 0 0/0 0 0 0 0 0 0 0 0 0 0 1049M 400M 85.6M 3865 5000 0 0 0 0 0/0 0 0 0 76.0 0 0 0/5000 0 0 100 0 0 0/0 0 0 0 0 0 0 0 0 0 0 1049M 400M 85.6M 3865 5000 0 0 0 0 0/0 0 0 0 76.0 0 0 0/5000 0 0 100 0 0 0/0 0 0 0 0 0 0 0 0 0 0 1049M 400M 85.6M 3865 5000 0 0 0 0 0/0 0 0 0 76.0 0 0 0/5000 0 0 100 0 0 0/0 0 0 0 0 0 0 0 0 0 0 1049M 400M 85.6M 3865 5000 0 0 0 0 0/0 0 0 0 76.0 0 0 0/5000 0 0 100 0 0 0/0 0 0 0 0 0 0 0 0 0 0 1049M 400M 85.6M 3865 5000 0 0 0 0 0/0 0 0 0 76.0 0 0 0/5000 0 0 100 0 0 0/0 0 0 0 0 0 0 0 0 0 0 1049M 400M 85.6M 3865 5001 0 0 0 0 0/0 0 0 0 76.0 0 0 0/5001 0 0 100 0 0 0/0 0 0 0 0 0 0 0 0 0 0 1049M 400M 85.6M 3865 5000 0 0 0 0 0/0 0 0 0 76.0 0 0 0/5000 0 0 100 0 0 0/0 0 0 0 0 0 0 0 0 0 0 1049M 400M 85.6M 3865 5000 0 0 0 0 0/0 0 0 0 76.0 0 0 0/5000 0 0 100 0 0 0/0 0 0 0 0 0 0 0 0 0 0 1049M 400M 85.6M 3865 5000 0 0 0 0 0/0 0 0 0 76.0 0 0 0/5000 0 0 100 0 0 0/0 0 0 0 0 0 0 0 0 0 0 1049M 400M 85.6M 3865 5000 0 0 0 0 0/0 0 0 0 76.0 0 0 0/5000 0 0 100 0 0 0/0 0 0 0 0 0 0 0 0 0 0 1049M 400M 85.6M 3865 5000 0 0 0 0 0/0 0 0 0 76.0 0 0 0/5000 0 0 100 0 0 0/0 0 0 0 0 0 0 0 0 0 0 1049M 400M 85.6M 3865 5000 0 0 0 0 0/0 0 0 0 76.0 0 0 0/5000 0 0 100 0 0 0/0 0 0 0 0 0 0 0 0 0 0 1049M 400M 85.6M 3865 5000 0 0 0 0 0/0 0 0 0 76.0 0 0 0/5000 0 0 100 0 0 0/0 0 0 0 0 0 0 0 0 0 0 1049M 400M 85.6M 3865 5000 0 0 0 0 0/0 0 0 0 76.0 0 0 0/5000 0 0 100 0 0 0/0 0 0 0 0 0 0 0 0 0 0 1049M 400M 85.6M 3865 5001 0 0 0 0 0/0 0 0 0 76.0 0 0 0/5001 0 0 100 0 0 0/0 0 0 0 0 0 0 0 0 0 0 1049M 400M 85.6M 3865


The problem here is the index cache is full, and the thread which is flushing the index is hanging in sync (i.e. flush to disk). The rest of the system is hanging because it needs a clean page in the index.

I did quite a bit of work to improve this in 1.1 and 1.5.

But in general, if the index cache is really large, it need to be flushed more regularly to ensure that too many changes do not accumulate. Checkpointing should do this, but there is a bug in 1.0.

You checkpoint frequency is 128M, xtstat says to-chkpt = 1049M. This means the checkpoint has not occurred as often as he should.

This problem should be fixed in 1.1.

While I suggest there should be something done.

Processlist:

| 311367 | root | localhost | test | Query | 701 | Updating | UPDATE rein SET id3=round(rand()*10000) where id2=@id | | 311366 | root | localhost | test | Query | 690 | Updating | UPDATE rein SET id3=round(rand()*10000) where id2=@id | | 311368 | root | localhost | test | Query | 701 | Updating | UPDATE rein SET id3=round(rand()*10000) where id2=@id | | 311369 | root | localhost | test | Query | 701 | Updating | UPDATE rein SET id3=round(rand()*10000) where id2=@id | | 311372 | root | localhost | test | Query | 701 | Updating | UPDATE rein SET id3=round(rand()*10000) where id2=@id | | 311373 | root | localhost | test | Query | 694 | Updating | UPDATE rein SET id3=round(rand()*10000) where id2=@id | | 311376 | root | localhost | test | Query | 691 | Updating | UPDATE rein SET id3=round(rand()*10000) where id2=@id | | 311377 | root | localhost | test | Query | 698 | Updating | UPDATE rein SET id3=round(rand()*10000) where id2=@id | | 311379 | root | localhost | test | Query | 701 | Updating | UPDATE rein SET id3=round(rand()*10000) where id2=@id | | 311380 | root | localhost | test | Query | 701 | Updating | UPDATE rein SET id3=round(rand()*10000) where id2=@id | | 311382 | root | localhost | test | Query | 696 | Updating | UPDATE rein SET id3=round(rand()*10000) where id2=@id | | 311384 | root | localhost | test | Query | 699 | Updating | UPDATE rein SET id3=round(rand()*10000) where id2=@id | | 311385 | root | localhost | test | Query | 693 | Updating | UPDATE rein SET id3=round(rand()*10000) where id2=@id | | 311388 | root | localhost | test | Query | 691 | Updating | UPDATE rein SET id3=round(rand()*10000) where id2=@id | | 311389 | root | localhost | test | Query | 691 | Updating | UPDATE rein SET id3=round(rand()*10000) where id2=@id | | 311390 | root | localhost | test | Query | 696 | Updating | UPDATE rein SET id3=round(rand()*10000) where id2=@id | | 311391 | root | localhost | test | Query | 701 | Updating | UPDATE rein SET id3=round(rand()*10000) where id2=@id | | 311392 | root | localhost | test | Query | 699 | Updating | UPDATE rein SET id3=round(rand()*10000) where id2=@id | | 311395 | root | localhost | test | Query | 687 | Updating | UPDATE rein SET id3=round(rand()*10000) where id2=@id | | 311399 | root | localhost | test | Query | 690 | Updating | UPDATE rein SET id3=round(rand()*10000) where id2=@id | | 311400 | root | localhost | test | Query | 701 | Updating | UPDATE rein SET id3=round(rand()*10000) where id2=@id | | 311404 | root | localhost | test | Query | 694 | Updating | UPDATE rein SET id3=round(rand()*10000) where id2=@id | | 311405 | root | localhost | test | Query | 690 | Updating | UPDATE rein SET id3=round(rand()*10000) where id2=@id | | 311406 | root | localhost | test | Query | 694 | Updating | UPDATE rein SET id3=round(rand()*10000) where id2=@id | | 311407 | root | localhost | test | Query | 699 | Updating | UPDATE rein SET id3=round(rand()*10000) where id2=@id | | 311408 | root | localhost | test | Query | 691 | Updating | UPDATE rein SET id3=round(rand()*10000) where id2=@id | | 311409 | root | localhost | test | Query | 688 | Updating | UPDATE rein SET id3=round(rand()*10000) where id2=@id | | 311411 | root | localhost | test | Query | 694 | Updating | UPDATE rein SET id3=round(rand()*10000) where id2=@id | | 311412 | root | localhost | test | Query | 694 | Updating | UPDATE rein SET id3=round(rand()*10000) where id2=@id | | 311414 | root | localhost | test | Query | 701 | Updating | UPDATE rein SET id3=round(rand()*10000) where id2=@id | | 311416 | root | localhost | test | Query | 696 | Updating | UPDATE rein SET id3=round(rand()*10000) where id2=@id | | 311419 | root | localhost | test | Query | 691 | Updating | UPDATE rein SET id3=round(rand()*10000) where id2=@id | | 311422 | root | localhost | test | Query | 687 | Updating | UPDATE rein SET id3=round(rand()*10000) where id2=@id | | 311423 | root | localhost | test | Query | 690 | Updating | UPDATE rein SET id3=round(rand()*10000) where id2=@id | | 311424 | root | localhost | test | Query | 696 | Updating | UPDATE rein SET id3=round(rand()*10000) where id2=@id | | 311425 | root | localhost | test | Query | 691 | Updating | UPDATE rein SET id3=round(rand()*10000) where id2=@id | | 311426 | root | localhost | test | Query | 691 | Updating | UPDATE rein SET id3=round(rand()*10000) where id2=@id | | 311427 | root | localhost | test | Query | 689 | Updating | UPDATE rein SET id3=round(rand()*10000) where id2=@id |



After a while it starts doing some tings again:
time xact row rec rec rec rec rec rec rec ind ind ind ind ind ind ilog ilog ilog xlog xlog xlog xlog xlog xlog data data data data to to to xact msec commt ins in out syncs/ms hits miss frees %use in out syncs/ms hits miss %use in out syncs/ms in out syncs msec hits miss in out syncs msec chkpt write sweep dirty 5000 0 0 0 0 0/0 0 0 0 76.0 0 0 0/5000 0 0 100 0 0 0/0 0 0 0 0 0 0 0 0 0 0 1049M 400M 85.6M 3865 5001 0 0 0 0 0/0 0 0 0 76.0 0 0 1/1307 0 0 100 0 2753M 2/1 0 0 0 0 0 0 0 0 0 0 888M 400M 85.6M 3865 5001 0 0 0 0 0/0 0 0 0 76.0 0 0 0/0 0 0 100 0 0 0/0 0 0 0 0 0 0 0 0 0 0 888M 400M 85.6M 3865 5001 0 0 0 40 3/1300 0 0 0 76.0 0 0 0/0 0 0 100 0 0 1/8714 0 0 0 0 0 0 0 0 0 0 888M 400M 85.6M 3865 5001 0 0 0 0 0/0 0 0 0 76.0 0 0 0/0 0 0 100 0 0 0/0 0 0 0 0 0 0 0 0 0 0 888M 400M 85.6M 3865 5001 0 0 0 0 0/0 0 0 0 76.0 0 0 0/0 0 0 100 0 0 0/0 0 0 0 0 0 0 0 0 0 0 888M 400M 85.6M 3865 5001 0 0 0 0 0/0 0 0 0 76.0 0 2906M 0/474 0 0 100 2758M 9 2/24.4t 0 0 0 0 0 0 0 0 0 0 888M 400M 85.6M 3865

ca. 110x
5001 0 0 0 0 0/0 0 0 0 76.0 0 0 0/5001 0 0 100 0 0 0/0 0 0 0 0 0 0 0 0 0 0 888M 400M 85.6M 3865
then

time xact row rec rec rec rec rec rec rec ind ind ind ind ind ind ilog ilog ilog xlog xlog xlog xlog xlog xlog data data data data to to to xact msec commt ins in out syncs/ms hits miss frees %use in out syncs/ms hits miss %use in out syncs/ms in out syncs msec hits miss in out syncs msec chkpt write sweep dirty 5001 0 0 0 0 0/0 0 0 0 76.0 0 0 0/5001 0 0 100 0 0 0/0 0 0 0 0 0 0 0 0 0 0 888M 400M 85.6M 3865 5001 8 0 0 38.7M 0/0 13.0t 2 0 76.0 98304 0 1/438 17.9t 6 100 0 274K 3/1 0 130K 2 3320 2587t 0 0 0 0 0 358M 296M 85.7M 3871 5001 0 0 0 47.2M 0/0 0 0 0 76.0 0 0 0/0 0 0 100 0 0 0/0 0 0 0 5001 3027t 0 0 0 0 0 358M 166M 85.7M 3871 5001 4 0 0 44.0M 0/0 6482 0 0 76.0 65536 288K 1/1 8888 4 100 274K 16.7M 4/10.2t 0 66560 3 852 2013t 0 0 0 0 0 358M 32.7M 85.8M 3874 5001 0 0 0 14.2M 0/0 0 0 0 76.0 0 0 0/0 0 0 100 0 0 0/0 0 0 0 0 852t 0 0 0 0 0 358M 5550 85.8M 3874 5001 0 0 0 0 0/0 0 0 0 76.0 0 0 0/0 0 0 100 0 0 0/0 0 0 0 0 0 0 0 0 0 0 358M 5550 85.8M 3874 5001 42 0 0 0 0/0 63.8t 9 0 76.0 400K 89.3M 14/261 87.3t 25 100 84.8M 83.1M 56/17.4t 0 620K 16 15 16 0 0 0 0 0 359M 617K 86.4M 3902 5001 0 0 0 0 0/0 0 0 0 76.0 0 0 0/0 0 0 100 0 0 0/0 0 0 0 0 0 0 0 0 0 0 359M 617K 86.4M 3902 5001 0 0 0 0 0/0 0 0 0 76.0 0 0 0/0 0 0 100 0 0 0/0 0 0 0 0 0 0 0 0 0 0 359M 617K 86.4M 3902 5001 0 0 0 0 0/0 0 0 0 76.0 0 0 0/0 0 0 100 0 0 0/0 0 0 0 0 0 0 0 0 0 0 359M 617K 86.4M 3902 5001 30 0 0 0 0/0 1017t 5 0 76.0 176K 70.8M 17/1489 2987t 11 100 67.3M 292M 68/17.3t 0 10.2M 19 2321 20 0 0 0 0 0 361M 10.8M 69.3M 3104 5001 0 0 0 0 0/0 247t 0 0 76.0 0 0 0/0 741t 0 100 0 0 0/0 0 0 0 5001 0 0 0 0 0 0 361M 10.8M 64.9M 2898 5001 0 0 0 0 0/0 0 0 0 76.0 0 0 0/0 0 0 100 0 0 0/0 0 0 0 5001 0 0 0 0 0 0 361M 10.8M 64.9M 2898 5001 0 0 0 0 0/0 0 0 0 76.0 0 0 0/0 0 0 100 0 0 0/0 0 0 0 5001 0 0 0 0 0 0 361M 10.8M 64.9M 2898 5001 0 0 0 0 0/0 1850t 0 0 76.0 65536 0 0/0 5557t 4 100 0 0 0/0 0 26.9M 21 4268 45.1t 0 0 0 0 0 382M 37.7M 53.1M 1352 5001 0 0 0 0 0/0 964t 0 0 76.0 16384 254M 0/3361 2896t 1 100 241M 9 2/22.2t 0 14.0M 5 3770 36.3t 0 0 0 0 0 394M 51.7M 47.4M 549 5001 0 0 0 0 0/0 377t 0 0 76.0 288K 0 0/5001 1145t 18 100 0 0 0/0 0 8192K 1 4993 1 0 0 0 0 0 400M 59.7M 45.7M 234

This activity above is checkpointing. This pauses in-between where there is not flushing and no other activity are a bug in 1.0, as far as I can remember.

and even the processlist shinks:
| 311395 | root | localhost | test | Query | 1781 | Updating | UPDATE rein SET id3=round(rand()*10000) where id2=@id | | 311409 | root | localhost | test | Query | 1782 | Updating | UPDATE rein SET id3=round(rand()*10000) where id2=@id | | 311419 | root | localhost | test | Query | 1785 | Updating | UPDATE rein SET id3=round(rand()*10000) where id2=@id | | 311422 | root | localhost | test | Query | 1781 | Updating | UPDATE rein SET id3=round(rand()*10000) where id2=@id | | 311429 | root | localhost | test | Sleep | 750 | | NULL |


MariaDB [test]> show global variables like '%pbxt%';
+------------------------------+-------+
| Variable_name                | Value |
+------------------------------+-------+
| pbxt_auto_increment_mode     | 0     |
| pbxt_checkpoint_frequency    | 128M  |
| pbxt_data_file_grow_size     | 50M   |
| pbxt_data_log_threshold      | 256M  |
| pbxt_flush_log_at_trx_commit | 0     |
| pbxt_garbage_threshold       | 50    |
| pbxt_index_cache_size        | 8G    |
| pbxt_log_buffer_size         | 512M  |
| pbxt_log_cache_size          | 1G    |
| pbxt_log_file_count          | 3     |
| pbxt_log_file_threshold      | 32MB  |
| pbxt_max_threads             | 1507  |
| pbxt_offline_log_function    | 0     |
| pbxt_record_cache_size       | 12G   |
| pbxt_row_file_grow_size      | 10M   |
| pbxt_support_xa              | ON    |
| pbxt_sweeper_priority        | 2     |
| pbxt_transaction_buffer_size | 8M    |
--->> Wildest guessing as I didn't found best practices on tuning them.

The settings are fine. In 1.1 I would recommend decreasing (which means using a higher value) the checkpoint frequency. But in 1.0 this value is not correctly used anyway.

MariaDB [test]> select version();
+-------------------+
| version()         |
+-------------------+
| 5.2.3-MariaDB-log |

MariaDB [test]> select * from information_schema.PLUGINS where PLUGIN_NAME='PBXT'\G
*************************** 1. row ***************************
          PLUGIN_NAME: PBXT
       PLUGIN_VERSION: 0.1
        PLUGIN_STATUS: ACTIVE
          PLUGIN_TYPE: STORAGE ENGINE
  PLUGIN_TYPE_VERSION: 50203.0
       PLUGIN_LIBRARY: NULL
PLUGIN_LIBRARY_VERSION: NULL
        PLUGIN_AUTHOR: Paul McCullagh, PrimeBase Technologies GmbH
PLUGIN_DESCRIPTION: High performance, multi-versioning transactional engine
       PLUGIN_LICENSE: GPL
      PLUGIN_MATURITY: Gamma
  PLUGIN_AUTH_VERSION: 1.0.11-7 Pre-GA


# free -m
total used free shared buffers cached Mem: 96682 74052 22630 0 180 32567
-/+ buffers/cache:      41304      55378
Swap:         2047          0       2047

CPU:
2xX5550

Im clueless, how to interpret this behavior

Basically the system is filling up the cache and then hitting a wall. All background threads are way behind, mostly because of this (I would guess, assuming you have no long running transactions or SELECTs).

Unfortunately, all the work on flushing behavior was done in 1.1 and 1.5. For example, 1.1 include multiply flush threads for better throughput.

Best regards,

Paul








Follow ups

References