← Back to team overview

mysql-proxy-discuss team mailing list archive

Re: funnel - a multiplexer plugin for mysql-proxy

 

Hi again,

On Thu, Mar 5, 2009 at 9:46 AM, nick loeve <trickie@xxxxxxxxx> wrote:
> Hi Kay
>
> On Thu, Mar 5, 2009 at 3:39 AM, Kay Röpke <Kay.Roepke@xxxxxxx> wrote:
>>
>> On Mar 4, 2009, at 4:07 PM, nick loeve wrote:
>>
>>> Im tearing my hair out to spot the point where concurrent memory
>>> alloc/free is happening. If anyone cares to take a look that would be
>>> awesome!
>>
>>
>> just a quick note on this one: have you tried running it in valgrind? you
>> seem to be running linux judging from the earlier stacktrace.
>
> Yeah Im on Linux. I have run it through valgrind+memcheck, but I think
> the overhead of memcheck it too much to trigger what i suspect is a
> race condition.
>
>>
>> [...lot's of core reading interlude...]
>>
>> ok, i've gave it a more thorough look:
>> i'm seeing that network_backlog_add() will add a network_mysqld_con* to the
>> backlog queue in the global chassis struct.
>> further, chassis_event_threads_init_thread will use EV_PERSIST for listening
>> to events on that file descriptor, causing the event to not be discarded
>> after it fired.
>> then, when checking for stuff in the backlog, we pull one item out of the
>> global queue and associate it with our thread.
>> two questions come to mind:
>> 1) are the two events now potentially delivered to two different threads? is
>> this causing the problem?
>
> Hmm, I assumed not, but I will have a deeper look today. The two
> events are on two different file descriptors, and we push the timer
> event that restarts a connections state machine from backlog into the
> current thread's event_base, using the chassis_event_add_timer_local.
>
>> 2) is this comment & code founded in dependable behavior of libevent?
>>        /* the timer will fire immediately */
>>        event_set(&(backlog_item->client->event), -1, 0,
>> network_mysqld_con_handle, backlog_item);
>>   especially the 0 as the events we are interested in. i couldn't find in
>> the docs and libevent sources are too hard to follow for a cursory glance
>> (at least right now).
>
> Well the timer will fire at the next libevent event dispatch run. At
> least that is my understanding. Agreed, the libevent docs are pretty
> simple. I set the events to 0, as I do not want to inadvertantly
> corrupt the network queue structures by modifying the socket->to_read
> variable. So by not setting a file descriptor and using events == 0 I
> was hoping to resume from the same spot we went into backlog. Thats
> why I also had to introduce the READ_QUERY_FROM_BACKLOG connection
> state, as we have already read the query from the client before going
> to backlog, and we don't want to wait for another READ event that will
> not be coming.
>
>>
>> with regard to question one: it should be easy to find this out, by adding a
>> mutex to the connection struct and doing a trylock/lock duo first thing in
>> network_mysqld_con_handle and logging a failed trylock. this should of
>> course never fail, because only one thread is supposed to be touching that
>> connection at any given time. (this would make a good addition to 'debug'
>> builds, btw. i suspect this will not be the last time we run into this kind
>> of trouble. we could do it with an atomic int to be less intrusive but
>> that's somewhat besides the point)
>> you should not see crashes of this sort with the mutex, since then it's
>> guaranteed to be safe.
>>
>> note that it's a PITA to add the unlock, because there are return statements
>> in the middle of con_handle. if you are using gcc and wanna have some fun,
>> try adding a local variable for the mutex with __attribute__ ((cleanup
>> (some_function))) as its attribute. in some_function you'll get the address
>> of the variable as your parameter. then call unlock on the mutex...
>> refer to
>> http://gcc.gnu.org/onlinedocs/gcc/Variable-Attributes.html#Variable-Attributes :)
>> (neat hack, i might add. although describing it took longer than to define a
>> macro replacing 'return' with 'g_unlock(con->mutex); return' for the scope
>> of con_handle :P)
>
> I did try exactly what you described, except i did not bork if I
> couldn't get the lock. Ill have to try that out.
>
>>
>> anyway, i suspect you will see output from a failing trylock at some point.
>> if that happens i'd try to make the backlog queue thread-local. that should
>> also cure the problem.
>> this appears to be another unfortunate side effect of pinning connections to
>> one thread, although the implications of not doing so aren't exactly giving
>> me pleasant thoughts either...
>
> Ok, ill have a deeper dig, and try out the thread-local backlog.

Ok, well i found my issue finally. If I do not add the event when
placing a backend connection back in the connection pool, then i get
no crash :)

I have to work out a better way of detecting a wait_timeout etc on the
backend, as the event_del we do when pulling a connection from the
pool must fiddle with the event_base, which could belong to another
thread.

Previous to discovering this, I added the trylock/lock to
network_mysqld_con_handle, and I did see multiple threads trying to
run the same connection's state machine, although in practice (at
least in my circumstance) this should not be an issue as we return
from network_mysqld_con_handle immediately after either adding the
connection to the backlog or a WAIT_FOR_EVENT.

I also made the backlog thread-local, although this didn't help, and I
really want a global backlog.

But thanks again, you reminded me to give helgrind a go, which
although pretty flaky, managed to point me in the right direction.

Cheers, ill have an update of where i am at soon

-- 
Nick Loeve



Follow ups

References