← Back to team overview

maria-developers team mailing list archive

Re: MDEV-4956 - Reduce usage of LOCK_open: TABLE_SHARE::tdc.used_tables

 

Hi Sergei,

below are results of "operf -e LLC_MISSES:10000:0x41 --pid `pidof mysqld`".
Tested with performance schema and general log off. tc_acquire_table() seem
to be inlined, so it appears as tdc_acquire_share().

Also note that my_malloc_size_cb_func() looks like another bottleneck.

Misses summary for tc_acquire_table()/tc_release_table()
--------------------------------------------------------
4.5408 + 3.0090= 7.5498% (10.0)
2.9916 + 2.6908= 5.6824% (10.0 + MDEV4956)
2.3502 + 1.7159= 4.0661% (10.0 + MDEV4956 + no unused_tables)

10.0 (rev.3912)
---------------
CPU: Intel Sandy Bridge microarchitecture, speed 2.701e+06 MHz (estimated)
Counted LLC_MISSES events (Last level cache demand requests from this core that missed the LLC) with a unit mask of 0x41 (No unit mask) count 10000
samples  %        image name               symbol name
40292    34.8688  no-vmlinux               /no-vmlinux
21638    18.7256  libpthread-2.15.so       pthread_mutex_lock
6718      5.8138  libpthread-2.15.so       pthread_mutex_unlock
5247      4.5408  mysqld                   tc_release_table(TABLE*)
3477      3.0090  mysqld                   tdc_acquire_share(THD*, char const*, char const*, char const*, unsigned int, unsigned int, TABLE**)
3410      2.9510  mysqld                   TABLE::init(THD*, TABLE_LIST*)
3316      2.8697  mysqld                   my_malloc_size_cb_func
3126      2.7053  mysqld                   open_tables(THD*, TABLE_LIST**, unsigned int*, unsigned int, Prelocking_strategy*)
2360      2.0424  mysqld                   dispatch_command(enum_server_command, THD*, char*, unsigned int)
2254      1.9506  libpthread-2.15.so       pthread_rwlock_unlock
2152      1.8623  libpthread-2.15.so       __lll_lock_wait
1867      1.6157  mysqld                   heap_info
1525      1.3197  mysqld                   tdc_refresh_version()
1441      1.2470  mysqld                   read_lock_type_for_table(THD*, Query_tables_list*, TABLE_LIST*)
1374      1.1891  mysqld                   make_join_statistics(JOIN*, List<TABLE_LIST>&, Item*, st_dynamic_array*)
1218      1.0541  mysqld                   get_lock_data(THD*, TABLE**, unsigned int, unsigned int)
1212      1.0489  mysqld                   THD::enter_stage(PSI_stage_info_v1 const*, PSI_stage_info_v1*, char const*, char const*, unsigned int)
1062      0.9191  mysqld                   _my_thread_var
945       0.8178  libc-2.15.so             __memset_sse2
875       0.7572  mysqld                   thr_unlock
724       0.6266  mysqld                   lock_tables(THD*, TABLE_LIST*, unsigned int, unsigned int)
673       0.5824  mysqld                   thr_multi_lock


10.0 + MDEV4956
---------------
CPU: Intel Sandy Bridge microarchitecture, speed 2.701e+06 MHz (estimated)
Counted LLC_MISSES events (Last level cache demand requests from this core that missed the LLC) with a unit mask of 0x41 (No unit mask) count 10000
samples  %        image name               symbol name
43070    35.7909  no-vmlinux               /no-vmlinux
22661    18.8311  libpthread-2.15.so       pthread_mutex_lock
7960      6.6147  libpthread-2.15.so       pthread_mutex_unlock
3600      2.9916  mysqld                   tdc_acquire_share(THD*, char const*, char const*, char const*, unsigned int, unsigned int, TABLE**)
3498      2.9068  mysqld                   my_malloc_size_cb_func
3238      2.6908  mysqld                   tc_release_table(TABLE*)
3189      2.6500  mysqld                   TABLE::init(THD*, TABLE_LIST*)
2611      2.1697  libpthread-2.15.so       __lll_lock_wait
2414      2.0060  mysqld                   dispatch_command(enum_server_command, THD*, char*, unsigned int)
2360      1.9611  mysqld                   open_tables(THD*, TABLE_LIST**, unsigned int*, unsigned int, Prelocking_strategy*)
2250      1.8697  mysqld                   heap_info
1889      1.5697  mysqld                   read_lock_type_for_table(THD*, Query_tables_list*, TABLE_LIST*)
1557      1.2939  mysqld                   make_join_statistics(JOIN*, List<TABLE_LIST>&, Item*, st_dynamic_array*)
1378      1.1451  mysqld                   THD::enter_stage(PSI_stage_info_v1 const*, PSI_stage_info_v1*, char const*, char const*, unsigned int)
1319      1.0961  mysqld                   get_lock_data(THD*, TABLE**, unsigned int, unsigned int)
1293      1.0745  mysqld                   thr_multi_lock
1267      1.0529  libpthread-2.15.so       pthread_rwlock_unlock
1211      1.0063  mysqld                   _my_thread_var
1136      0.9440  libc-2.15.so             __memset_sse2
887       0.7371  mysqld                   thr_unlock
736       0.6116  mysqld                   free_root
698       0.5800  mysqld                   tdc_refresh_version()


10.0 + MDEV4956 + no unused_tables
----------------------------------
CPU: Intel Sandy Bridge microarchitecture, speed 2.701e+06 MHz (estimated)
Counted LLC_MISSES events (Last level cache demand requests from this core that missed the LLC) with a unit mask of 0x41 (No unit mask) count 10000
samples  %        image name               symbol name
41828    35.5653  no-vmlinux               /no-vmlinux
22377    19.0266  libpthread-2.15.so       pthread_mutex_lock
8219      6.9884  libpthread-2.15.so       pthread_mutex_unlock
3578      3.0423  mysqld                   my_malloc_size_cb_func
3244      2.7583  mysqld                   TABLE::init(THD*, TABLE_LIST*)
2764      2.3502  mysqld                   tdc_acquire_share(THD*, char const*, char const*, char const*, unsigned int, unsigned int, TABLE**)
2630      2.2362  libpthread-2.15.so       __lll_lock_wait
2570      2.1852  mysqld                   dispatch_command(enum_server_command, THD*, char*, unsigned int)
2407      2.0466  mysqld                   open_tables(THD*, TABLE_LIST**, unsigned int*, unsigned int, Prelocking_strategy*)
2233      1.8987  mysqld                   heap_info
2027      1.7235  mysqld                   read_lock_type_for_table(THD*, Query_tables_list*, TABLE_LIST*)
2018      1.7159  mysqld                   tc_release_table(TABLE*)
1537      1.3069  mysqld                   make_join_statistics(JOIN*, List<TABLE_LIST>&, Item*, st_dynamic_array*)
1404      1.1938  mysqld                   THD::enter_stage(PSI_stage_info_v1 const*, PSI_stage_info_v1*, char const*, char const*, unsigned int)
1368      1.1632  mysqld                   _my_thread_var
1343      1.1419  mysqld                   thr_multi_lock
1284      1.0918  mysqld                   get_lock_data(THD*, TABLE**, unsigned int, unsigned int)
1213      1.0314  libc-2.15.so             __memset_sse2
1162      0.9880  libpthread-2.15.so       pthread_rwlock_unlock
799       0.6794  mysqld                   thr_unlock
753       0.6403  mysqld                   tdc_refresh_version()
717       0.6096  mysqld                   THD::set_open_tables(TABLE*)

Regards,
Sergey

On Mon, Sep 16, 2013 at 08:50:49PM +0400, Sergey Vojtovich wrote:
> Hi Sergei,
> 
> just found another interesting test result. I added dummy LOCK_table_share mutex
> lock and unlock to tc_acquire_table() and tc_release_table() (before locking
> LOCK_open), just to measure pure mutex wait time.
> 
> Test execution time: 45s
> LOCK_open wait time: 34s
> LOCK_table_share wait time: 0.8s
> 
> +--------------------------------------------------------+------------+----------------+
> | event_name                                             | count_star | sum_timer_wait |
> +--------------------------------------------------------+------------+----------------+
> | wait/synch/mutex/sql/LOCK_open                         |     585690 | 34298972259258 |
> | wait/synch/mutex/mysys/THR_LOCK::mutex                 |     585604 |  4560420039042 |
> | wait/synch/mutex/sql/TABLE_SHARE::tdc.LOCK_table_share |     585710 |   794564626359 |
> | wait/synch/rwlock/sql/LOCK_tdc                         |     290940 |   237751940139 |
> | wait/synch/mutex/sql/THD::LOCK_thd_data                |    1838668 |   219829105251 |
> | wait/synch/rwlock/innodb/hash table locks              |     683395 |   159792339294 |
> | wait/synch/rwlock/innodb/btr_search_latch              |     290892 |   138915354207 |
> | wait/synch/mutex/innodb/trx_sys_mutex                  |      62940 |    78334973451 |
> | wait/synch/rwlock/innodb/index_tree_rw_lock            |     167822 |    49323455349 |
> | wait/synch/rwlock/sql/MDL_lock::rwlock                 |      41970 |    31436713938 |
> +--------------------------------------------------------+------------+----------------+
> 
> Regards,
> Sergey
> 
> On Mon, Sep 16, 2013 at 04:46:41PM +0400, Sergey Vojtovich wrote:
> > Hi Sergei,
> > 
> > On Sat, Sep 14, 2013 at 04:44:28PM +0200, Sergei Golubchik wrote:
> > > Hi, Sergey!
> > > 
> > > On Sep 13, Sergey Vojtovich wrote:
> > > > Hi Sergei,
> > > > 
> > > > comments inline and a question: 10.0 throughput is twice lower than 5.6
> > > > in a specific case. It is known to be caused by tc_acquire_table() and
> > > > tc_release_table(). Do we want to fix it? If yes - how?
> > > 
> > > How is it caused by tc_acquire_table/tc_release_table?
> > Threads spend a lot of time waiting for LOCK_open in these functions.
> > Because protected by LOCK_open code takes a lot of time to execute.
> > 
> > > In what specific case?
> > The case is: many threads access one table (read-only OLTP).
> > 
> > > 
> > > > > > > Why per-share lists are updated under the global mutex?
> > > > > > Alas, it doesn't solve CPU cache coherence problem.
> > > > > It doesn't solve CPU cache coherence problem, yes.
> > > > > And it doesn't help if you have only one hot table.
> > > > > But it certainly helps if many threads access many tables.
> > > > Ok, let's agree to agree: it will help in certain cases. Most probably it
> > > > won't improve situation much if all threads access single table.
> > > 
> > > Of course.
> > > 
> > > > We could try to ensure that per-share mutex is on the same cache line as
> > > > free_tables and used_tables list heads. In this case I guess
> > > > mysql_mutex_lock(&share->tdc.LOCK_table_share) will load list heads into
> > > > CPU cache along with mutex structure. OTOH we still have to read per-TABLE
> > > > prev/next pointers. And in 5.6 per-partition mutex should less frequently
> > > > jump out of CPU cache than our per-share mutex. Worth trying?
> > > 
> > > Did you benchmark that these cache misses are a problem?
> > > What is the main problem that impacts the performance?
> > We (Axel and me) did a lot of different benchmarks before we concluded
> > cache misses to be the main problem. Please let me known if you're interested
> > in specific results - we either find them in benchmark archives or benchmark
> > again.
> > 
> > One of interesting results I just found is as following...
> > 10.0.4, read-only OLTP, 64 threads, tps ~10000
> > +---------------------------------------------+------------+-----------------+
> > | event_name                                  | count_star | sum_timer_wait  |
> > +---------------------------------------------+------------+-----------------+
> > | wait/synch/mutex/sql/LOCK_open              |    2784632 | 161835901661916 |
> > | wait/synch/mutex/mysys/THR_LOCK::mutex      |    2784556 |  28804019775192 |
> > ...skip...
> > 
> > Note that LOCK_open and THR_LOCK::mutex are contested equally, but wait time
> > differs ~6x.
> > 
> > Removing used_tables from tc_acquire_table/tc_release_table makes sum_timer_wait
> > go down from 161s to 100s.
> > 
> > Regards,
> > Sergey
> > 
> > _______________________________________________
> > Mailing list: https://launchpad.net/~maria-developers
> > Post to     : maria-developers@xxxxxxxxxxxxxxxxxxx
> > Unsubscribe : https://launchpad.net/~maria-developers
> > More help   : https://help.launchpad.net/ListHelp
> 
> _______________________________________________
> Mailing list: https://launchpad.net/~maria-developers
> Post to     : maria-developers@xxxxxxxxxxxxxxxxxxx
> Unsubscribe : https://launchpad.net/~maria-developers
> More help   : https://help.launchpad.net/ListHelp


References