enterprise-support team mailing list archive

[Bug 1767105] [NEW] race condition on rmm for module ldap (ldap cache) - part II

To: enterprise-support@xxxxxxxxxxxxxxxxxxx
From: Rafael David Tinoco <rafael.tinoco@xxxxxxxxxxxxx>
Date: Thu, 26 Apr 2018 11:48:31 -0000
Reply-to: Bug 1767105 <1767105@xxxxxxxxxxxxxxxxxx>
Sender: bounces@xxxxxxxxxxxxx
Public bug reported:

[Impact]

 * Apache users using ldap module might face this if using multiple
threads and shared memory activated for apr memory allocator (default in
Ubuntu).

[Test Case]

 * Configure apache to use ldap module, for authentication e.g., and wait for the race condition to happen.
 * Analysis made out of a dump from a production environment.
 * Bug has been reported multiple times upstream in the past 10 years.

[Regression Potential]

 * ldap module has broken locking mechanism when using apr mem mgmt.
 * ldap would continue to have broken locking mechanism.
 * race conditions could still exist.
 * could break ldap module.
 * patch is upstreamed in next version to be released.

[Other Info]

ORIGINAL CASE DESCRIPTION:

This is related to LP: #1752683. The locking mechanism for LDAP was
fixed, since it is now obtained from server config merge like it was
supposed to. Problem is that it likely has a race condition in its
logic, causing the ldap module to still fail in some conditions.

Problem summary:

apr_rmm_init acts as a relocatable memory management initialization

it is used in: mod_auth_digest and util_ldap_cache

From the dump that was brought to my knowledge, in the following sequence:

- util_ldap_compare_node_copy()
- util_ald_strdup()
- apr_rmm_calloc()
- find_block_of_size()

Had a "cache->rmm_addr" with no lock at "find_block_of_size()"

cache->rmm_addr->lock { type = apr_anylock_none }

And an invalid "next" offset (out of rmm->base->firstfree).

This rmm_addr was initialized with NULL as a locking mechanism:

From apr-utils:

apr_rmm_init()

    if (!lock) {	<-- 2nd argument to apr_rmm_init()
        nulllock.type = apr_anylock_none;	<--- found in the dump
        nulllock.lock.pm = NULL;
        lock = &nulllock;
    }

From apache:

# mod_auth_digest

    sts = apr_rmm_init(&client_rmm,
                       NULL, /* no lock, we'll do the locking ourselves */
                       apr_shm_baseaddr_get(client_shm),
                       shmem_size, ctx);

# util_ldap_cache

        result = apr_rmm_init(&st->cache_rmm, NULL,
                              apr_shm_baseaddr_get(st->cache_shm), size,
                              st->pool);

It appears that the ldap module chose to use "rmm" for memory allocation, using
the shared memory approach, but without explicitly defining a lock for it.
Without it, it's up to the caller to guarantee that there are locks for rmm
synchronization (just like mod_auth_digest does, using global mutexes).

Because of that, there was a race condition in "find_block_of_size" and a call
touching "rmm->base->firstfree", possibly "move_block()", in a multi-threaded
apache environment, since there were no lock guarantees inside rmm logic (lock
was "apr_anylock_none" and the locking calls don't do anything).
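
To make the "locking calls don't do anything" point concrete, here is a
minimal, self-contained model. It is not APR's code: model_anylock_t,
model_lock(), model_unlock() and model_alloc_is_protected() are hypothetical
names, and only two lock kinds are modeled. It only shows that a lock of type
"none" reports success while providing no exclusion, so any protection has to
come from the caller.

```c
/* Simplified stand-in for an "any lock": only two kinds are modeled. */
typedef enum { MODEL_LOCK_NONE, MODEL_LOCK_FLAG } model_lock_type;

typedef struct {
    model_lock_type type;
    int held;                 /* only meaningful for MODEL_LOCK_FLAG */
} model_anylock_t;

/* With type "none" the call succeeds but does nothing. Returns 0. */
int model_lock(model_anylock_t *l)
{
    if (l->type == MODEL_LOCK_NONE)
        return 0;             /* no-op: nobody is excluded */
    l->held = 1;
    return 0;
}

int model_unlock(model_anylock_t *l)
{
    if (l->type == MODEL_LOCK_NONE)
        return 0;             /* no-op */
    l->held = 0;
    return 0;
}

/* Hypothetical allocator entry point: returns 1 if the free list was
 * actually protected while it would have been walked, 0 otherwise. */
int model_alloc_is_protected(model_anylock_t *l)
{
    int protected_walk;

    model_lock(l);
    protected_walk = (l->type != MODEL_LOCK_NONE) && l->held;
    model_unlock(l);
    return protected_walk;
}
```

With a lock of type MODEL_LOCK_NONE the function returns 0, which is the
situation the dump shows for cache->rmm_addr->lock.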

In find_block_of_size:

    apr_rmm_off_t next = rmm->base->firstfree;

We have:

    rmm->base->firstfree
 Decimal:356400
 Hex:0x57030

But "next" turned into:

Name : next
 Decimal:8320808657351632189
 Hex:0x737973636970653d

Causing:

        struct rmm_block_t *blk = (rmm_block_t*)((char*)rmm->base + next);

        if (blk->size == size)

To segfault.
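
A quick way to see why the dereference faults is to check whether an offset
can lie inside the segment at all. The sketch below is a diagnostic aid only,
not something the quoted rmm code does and not a fix for the race. The segment
size (512 KiB) and header size (16 bytes) used in the usage note are
assumptions, since the dump excerpt does not give them, and
offset_in_segment() is a hypothetical helper.

```c
#include <stdint.h>

/* Returns 1 if a block header of header_size bytes starting at offset
 * `next` lies entirely inside a segment of segment_size bytes, else 0. */
int offset_in_segment(uint64_t next, uint64_t segment_size,
                      uint64_t header_size)
{
    if (header_size > segment_size)
        return 0;
    /* Written this way to avoid overflow in next + header_size. */
    return next <= segment_size - header_size;
}
```

Under those assumed sizes, offset_in_segment(0x57030, 524288, 16) returns 1,
while offset_in_segment(0x737973636970653d, 524288, 16) returns 0: adding the
corrupted "next" to rmm->base points far outside any mapped shared memory.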

Upstream bugs:

https://bz.apache.org/bugzilla/show_bug.cgi?id=58483
https://bz.apache.org/bugzilla/show_bug.cgi?id=60296
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=814980#15

Upstream bugs fixed the "outer" lock, which is not obtained from the
config merge function in all conditions. Since apr_rmm_init() is called
with no lock, the caller should take care of the locking. Unfortunately,
the "outer" lock does not seem to be working:

---- Explanation of this new bug:

LDAP_CACHE_LOCK() is either missing a barrier or it is not enough for
subsequent calls to APR with NULL locking (passed to apr_rmm_init()).
After the patch for bug
https://bugs.launchpad.net/ubuntu/+source/apache2/+bug/1752683 was
applied, the end user still complains about segfaults, and core dumps
show the same issue: a race condition on rmm->base->firstfree in
function "find_block_of_size".

In the dump, in find_block_of_size():

    apr_rmm_off_t next = rmm->base->firstfree;
    ...
    while(next) {
        struct rmm_block_t *blk = (rmm_block_t*)((char*)rmm->base + next);

blk gets value 0x5772e56b36226557 because "next" was corrupted (value:
0x57726573553d554f). This happens because the lock

    APR_ANYLOCK_LOCK(&rmm->lock);

in apr_rmm_calloc() is apr_anylock_none, like previously reported by me.
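
It can help to look at the corrupted offsets as raw bytes. The sketch below
prints a 64-bit value as text, least significant byte first, which is the
in-memory order on a little-endian host (an assumption here; the report does
not state the architecture). offset_as_text() is a hypothetical helper, not
part of APR or apache.

```c
#include <stddef.h>
#include <stdint.h>

/* Writes the 8 bytes of `value` into out[], least significant byte
 * first, replacing non-printable bytes with '.'. The result is
 * NUL-terminated, so out must hold at least 9 chars. */
void offset_as_text(uint64_t value, char out[9])
{
    size_t i;

    for (i = 0; i < 8; i++) {
        unsigned char b = (unsigned char)(value >> (8 * i));
        out[i] = (b >= 0x20 && b <= 0x7e) ? (char)b : '.';
    }
    out[8] = '\0';
}
```

For the two corrupted values quoted in this report, 0x737973636970653d
yields "=epicsys" and 0x57726573553d554f yields "OU=UserW", while the sane
offset 0x57030 yields "0p......". Both corrupted values read like fragments
of string data, which would be consistent with string copies landing on top
of the free list, although that is an inference and not something the dump
excerpt proves.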

For the sake of exercising possibilities, if mod_ldap is calling APR RMM
with external locking, it would be using LDAP_CACHE_LOCK. My current
stack trace is this:

Thread #19 7092 (Suspended : Container)	
	kill() at syscall-template.S:84 0x7ff7e9911767	
	<signal handler called>() at 0x7ff7e9cb7390	
	find_block_of_size() at apr_rmm.c:106 0x7ff7ea10e25a	
	apr_rmm_calloc() at apr_rmm.c:342 0x7ff7ea10ea68	
	util_ald_alloc() at util_ldap_cache_mgr.c:105 0x7ff7e3369b3d	
	util_ldap_compare_node_copy() at util_ldap_cache.c:257 0x7ff7e3369784	
	util_ald_cache_insert() at util_ldap_cache_mgr.c:501 0x7ff7e336a310	
	uldap_cache_compare() at util_ldap.c:1,183 0x7ff7e33662d3	
	ldapgroup_check_authorization() at mod_authnz_ldap.c:925 0x7ff7e8459937	
	apply_authz_sections() at mod_authz_core.c:737 0x7ff7e4bb99fa	
	apply_authz_sections() at mod_authz_core.c:751 0x7ff7e4bb9c01	
	authorize_user_core() at mod_authz_core.c:840 0x7ff7e4bb9dca	
	ap_run_auth_checker() at request.c:91 0x56127e692f00	
	ap_process_request_internal() at request.c:335 0x56127e695d57	
	ap_process_async_request() at http_request.c:408 0x56127e6b4690	
	ap_process_request() at http_request.c:445 0x56127e6b4850	
	ap_process_http_sync_connection() at http_core.c:210 0x56127e6b091e	
	ap_process_http_connection() at http_core.c:251 0x56127e6b091e	
	ap_run_process_connection() at connection.c:41 0x56127e6a6bf0	
	ap_process_connection() at connection.c:213 0x56127e6a7000	
	process_socket() at worker.c:631 0x7ff7e2f51f8b	
	worker_thread() at worker.c:990 0x7ff7e2f51f8b	
	start_thread() at pthread_create.c:333 0x7ff7e9cad6ba	
	clone() at clone.S:109 0x7ff7e99e341d	

Which means uldap_cache_compare() would have synchronized access to APR
RMM calls through LDAP_CACHE_LOCK() macro. This doesn't seem to be the
case as the lock doesn't seem to be acquired.

LDAP_CACHE_LOCK() translates into:

do {
    if (st->util_ldap_cache_lock)
        apr_global_mutex_lock(st->util_ldap_cache_lock);
} while (0);

After the change proposed for this bug (where "util_ldap_cache_lock"
would come from the ldap_merge_config), it seems that st has
util_ldap_cache_lock and util_ldap_cache all set:

Name : util_ldap_cache_lock
	Hex:0x7ff7ea75aee0

Name : util_ldap_cache
	Hex:0x7ff7e0e51038

Meaning that it got the ldap_cache and ldap_cache_lock from the merge
config function.

From the mutex acquire logic, for the apr_global_mutex_lock() ->
apr_proc_mutex_lock():

apr_status_t apr_proc_mutex_lock(apr_proc_mutex_t *mutex)
{
    return mutex->meth->acquire(mutex);
}

And it would translate into:

st->util_ldap_cache_lock->proc_mutex->meth->acquire ==
proc_mutex_fcntl_acquire()

And from that logic:

static apr_status_t proc_mutex_fcntl_acquire(apr_proc_mutex_t *mutex)
{
    int rc;

    do {
        rc = fcntl(mutex->interproc->filedes, F_SETLKW, &proc_mutex_lock_it);
    } while (rc < 0 && errno == EINTR);
    if (rc < 0) {
        return errno;
    }
    mutex->curr_locked=1;
    return APR_SUCCESS;
}

We would guarantee mutex lock through a file descriptor to the file:

"/var/lock/apache2/ldap-cache.1368" (filedes == 15)

And the "mutex->curr_locked" would be 1.
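
One general property of fcntl() record locks is worth keeping in mind when
reading the above: POSIX associates them with the process, so a second lock
request on the same file from the same process replaces the first one instead
of conflicting with it. The sketch below demonstrates only that property. It
is not a diagnosis of this bug, since how APR wraps the process mutex for
threads is not shown in this report. The scratch file name and the helper
names are hypothetical, and F_SETLK is used instead of F_SETLKW so the demo
can never block.

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Takes a whole-file write lock on fd without blocking.
 * Returns 0 on success, -1 on failure (errno set by fcntl). */
int try_write_lock(int fd)
{
    struct flock fl;

    memset(&fl, 0, sizeof(fl));
    fl.l_type = F_WRLCK;
    fl.l_whence = SEEK_SET;
    fl.l_start = 0;
    fl.l_len = 0;               /* 0 means "through end of file" */
    return fcntl(fd, F_SETLK, &fl);
}

/* Locks the same scratch file twice from one process.
 * Returns 1 if both attempts succeed, 0 if the second one is refused,
 * -1 if the file could not be set up or the first lock failed. */
int same_process_relock_succeeds(void)
{
    char path[64];
    int fd, first, second;

    /* Hypothetical scratch file; O_EXCL avoids reusing an existing one. */
    snprintf(path, sizeof(path), "/tmp/fcntl-demo-%ld.lock", (long)getpid());
    fd = open(path, O_RDWR | O_CREAT | O_EXCL, 0600);
    if (fd < 0)
        return -1;
    unlink(path);               /* the file goes away once fd is closed */

    first = try_write_lock(fd);
    second = try_write_lock(fd);
    close(fd);

    if (first != 0)
        return -1;
    return second == 0 ? 1 : 0;
}
```

On a POSIX system same_process_relock_succeeds() returns 1: the second
request is granted even though the first lock was never released. An fcntl()
lock on its own therefore serializes processes, not threads inside one
process.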

Unfortunately, considering my stack trace, during the cache insertion:

find_block_of_size() at apr_rmm.c:106 0x7ff7ea10e25a	
apr_rmm_calloc() at apr_rmm.c:342 0x7ff7ea10ea68	
util_ald_alloc() at util_ldap_cache_mgr.c:105 0x7ff7e3369b3d	
util_ldap_compare_node_copy() at util_ldap_cache.c:257 0x7ff7e3369784	
util_ald_cache_insert() at util_ldap_cache_mgr.c:501 0x7ff7e336a310	
uldap_cache_compare() at util_ldap.c:1,183 0x7ff7e33662d3	

Name : st->util_ldap_cache_lock
	Details:0x7ff7ea75aee0
	Default:0x7ff7ea75aee0
	Decimal:140702767230688
	Hex:0x7ff7ea75aee0
	Binary:11111111111011111101010011101011010111011100000
	Octal:03777375235327340

Name : proc_mutex
	Details:0x7ff7ea75aef8
	Default:0x7ff7ea75aef8
	Decimal:140702767230712
	Hex:0x7ff7ea75aef8
	Binary:11111111111011111101010011101011010111011111000
	Octal:03777375235327370

Name : curr_locked
	Details:0
	Default:0
	Decimal:0
	Hex:0x0
	Binary:0
	Octal:0

I have curr_locked = 0

** Affects: apache2 (Ubuntu)
     Importance: Undecided
         Status: New

-- 
You received this bug notification because you are a member of Ubuntu
Server/Client Support Team, which is subscribed to apache2 in Ubuntu.
Matching subscriptions: Ubuntu Server/Client Support Team
https://bugs.launchpad.net/bugs/1767105

Title:
  race condition on rmm for module ldap (ldap cache) - part II

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/apache2/+bug/1767105/+subscriptions
Follow ups

[Bug 1767105] Re: race condition on rmm for module ldap (ldap cache) - part II
From: Andreas Hasenack, 2018-05-21