yahoo-eng-team team mailing list archive

[Bug 1649616] [NEW] Keystone Token Flush job does not complete in HA deployed environment

 

Public bug reported:

The Keystone token flush job can get into a state where it will never
complete because the transaction size exceeds the MySQL Galera maximum
transaction (writeset) size, wsrep_max_ws_size (1073741824).


Steps to Reproduce:
1. Authenticate many times
2. Observe that the keystone token flush job runs for a very long time, depending on disk speed (>20 hours in my environment)
3. Observe errors in mysqld.log indicating a transaction that is too large


Actual results:
Expired tokens are not actually flushed from the database, and keystone.log shows no errors.  Errors appear only in mysqld.log.


Expected results:
Expired tokens are removed from the database.


Additional info:
It is likely that you can demonstrate this with fewer than 1 million tokens: the token table holding more than 1 million tokens is larger than 13GiB, while the max transaction size is 1GiB.  My token benchmarking Browbeat job creates more tokens than needed.
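
As a rough back-of-the-envelope check of the sizes above, the sketch below estimates how many tokens might fit in one 1GiB writeset. It assumes the writeset grows linearly with the size of the rows deleted, treats 13GiB as the table size (the report only says "larger than"), and uses the most recent "Total expired tokens removed" count from keystone.log as the token count. None of this is measured; it only illustrates the order of magnitude.

```python
# Rough estimate only: assumes writeset size scales linearly with the size of
# the rows being deleted, which is a simplification.
GIB = 1024 ** 3

table_bytes = 13 * GIB        # "larger than 13GiB" in the report, so a lower bound
token_count = 1_089_202       # assumption: latest count reported in keystone.log
wsrep_max_ws_size = 1 * GIB   # 1073741824

bytes_per_token = table_bytes / token_count
tokens_per_writeset = int(wsrep_max_ws_size // bytes_per_token)

print(f"approx. {bytes_per_token:.0f} bytes per token")
print(f"approx. {tokens_per_writeset} tokens per 1GiB writeset")
```

Under these assumptions the limit would be reached well below 1 million tokens, which is consistent with the expectation stated above.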

Once the token flush job cannot complete, the token table will never
decrease in size, and eventually the cloud will run out of disk space.

Furthermore, the flush job consumes disk I/O capacity.  This was
demonstrated on slow disks (a single 7.2K SATA disk).  On faster disks
you have more capacity to generate tokens, so you can reach the number
of tokens needed to exceed the transaction size even faster.
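
To illustrate why one very large delete is the problem, here is a minimal sketch of deleting expired rows in small batches, each committed as its own transaction, so no single transaction grows with the size of the table. This is not Keystone's code: the table schema, the function name flush_expired_in_batches, and the batch size are hypothetical, and it uses SQLite from the Python standard library as a stand-in. The SQL is SQLite-flavored; MySQL would need different syntax (for example, DELETE ... LIMIT).

```python
import sqlite3
import time


def flush_expired_in_batches(conn, now, batch_size=1000):
    """Delete expired rows in small transactions; return the total removed."""
    total = 0
    while True:
        with conn:  # commits on success, so each batch is its own transaction
            cur = conn.execute(
                "DELETE FROM token WHERE id IN ("
                "  SELECT id FROM token WHERE expires < ? LIMIT ?)",
                (now, batch_size),
            )
        if cur.rowcount == 0:
            return total
        total += cur.rowcount


# Demo on an in-memory stand-in table (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE token (id INTEGER PRIMARY KEY, expires REAL)")
now = time.time()
rows = [(now - 60,)] * 2500 + [(now + 3600,)] * 10  # 2500 expired, 10 valid
conn.executemany("INSERT INTO token (expires) VALUES (?)", rows)
conn.commit()

removed = flush_expired_in_batches(conn, now)
remaining = conn.execute("SELECT COUNT(*) FROM token").fetchone()[0]
print(removed, remaining)
```

The trade-off is more round trips in exchange for a bounded transaction size per commit.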

Log evidence:
[root@overcloud-controller-0 log]# grep " Total expired" /var/log/keystone/keystone.log
2016-12-08 01:33:40.530 21614 INFO keystone.token.persistence.backends.sql [-] Total expired tokens removed: 1082434
2016-12-09 09:31:25.301 14120 INFO keystone.token.persistence.backends.sql [-] Total expired tokens removed: 1084241
2016-12-11 01:35:39.082 4223 INFO keystone.token.persistence.backends.sql [-] Total expired tokens removed: 1086504
2016-12-12 01:08:16.170 32575 INFO keystone.token.persistence.backends.sql [-] Total expired tokens removed: 1087823
2016-12-13 01:22:18.121 28669 INFO keystone.token.persistence.backends.sql [-] Total expired tokens removed: 1089202
[root@overcloud-controller-0 log]# tail mysqld.log 
161208  1:33:41 [Warning] WSREP: transaction size limit (1073741824) exceeded: 1073774592
161208  1:33:41 [ERROR] WSREP: rbr write fail, data_len: 0, 2
161209  9:31:26 [Warning] WSREP: transaction size limit (1073741824) exceeded: 1073774592
161209  9:31:26 [ERROR] WSREP: rbr write fail, data_len: 0, 2
161211  1:35:39 [Warning] WSREP: transaction size limit (1073741824) exceeded: 1073774592
161211  1:35:40 [ERROR] WSREP: rbr write fail, data_len: 0, 2
161212  1:08:16 [Warning] WSREP: transaction size limit (1073741824) exceeded: 1073774592
161212  1:08:17 [ERROR] WSREP: rbr write fail, data_len: 0, 2
161213  1:22:18 [Warning] WSREP: transaction size limit (1073741824) exceeded: 1073774592
161213  1:22:19 [ERROR] WSREP: rbr write fail, data_len: 0, 2
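
The WSREP warnings in the mysqld.log excerpt above can be checked with a short script. This sketch extracts the limit and the attempted size from each warning and prints the difference. The sample lines are copied from the excerpt; the regular expression and the helper name overshoots are my own.

```python
import re

# Sample lines copied from the mysqld.log excerpt.
LOG = """\
161208  1:33:41 [Warning] WSREP: transaction size limit (1073741824) exceeded: 1073774592
161208  1:33:41 [ERROR] WSREP: rbr write fail, data_len: 0, 2
161213  1:22:18 [Warning] WSREP: transaction size limit (1073741824) exceeded: 1073774592
"""

PATTERN = re.compile(r"transaction size limit \((\d+)\) exceeded: (\d+)")


def overshoots(text):
    """Yield (limit, attempted, attempted - limit) for each WSREP warning."""
    for line in text.splitlines():
        match = PATTERN.search(line)
        if match:
            limit, attempted = int(match.group(1)), int(match.group(2))
            yield limit, attempted, attempted - limit


results = list(overshoots(LOG))
for limit, attempted, over in results:
    print(f"limit={limit} attempted={attempted} over_by={over}")
```

Every warning in the excerpt reports the same attempted size, 32768 bytes over the limit. This may suggest the transaction is rejected as soon as it crosses the limit rather than after its full size is known, but that is an inference from the log, not something confirmed here.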


A graph of the disk utilization issue is attached.  In that graph, the entire job runs from the first spike in disk utilization (~5:18 UTC) and culminates in about 90 minutes of pegging the disk (between 1:09 UTC and 2:43 UTC).

** Affects: keystone
     Importance: Undecided
         Status: New

** Attachment added: "Disk IO % util on Controller when Token Flush is running."
   https://bugs.launchpad.net/bugs/1649616/+attachment/4791197/+files/Token_flush-Disk_io.png

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Identity (keystone).
https://bugs.launchpad.net/bugs/1649616
