maria-developers team mailing list archive

Thread
Date
Re: MySQL / Maria Windows Port

To: Jeremiah Gowdy <basharteg@xxxxxxxxx>
From: Michael Widenius <monty@xxxxxxxxxxxx>
Date: Sun, 15 Mar 2009 15:17:50 +0200
Cc: maria-developers@xxxxxxxxxxxxxxxxxxx
In-reply-to: <49B63942.9000307@gmail.com>
Reply-to: monty@xxxxxxxxxxxx
Hi!

<cut>

>> The my_pwrite() code now looks like:
>> 
>> ....
>> #if defined(_WIN32)
>> readbytes= my_win_pread(Filedes, Buffer, Count, offset);
>> #else
>> ...
>> 
>> And there is ew my_winfile.c file.
>> 
>> Is that your code?
<cut>

Jeremiah> Yeah, this is definitely 99% my code.  Someone spent the time to clean 
Jeremiah> it, normalize it, and add a few fixes and polish, but it's essentially 
Jeremiah> the same as what I submitted.  Unfortunately, there's no one line 
Jeremiah> attribution I was promised for contributing the code.  :)  But at least 
Jeremiah> it's in there, that's a great start.

Sorry about the attribution; I will ensure that when we pull/implement
this to MariaDB we will add it!

<cut>

>> Could you take a look a the MySQL-6.0 mysys/windows code and see if
>> you think if it's now good enough.
>> 
>> If not, do you think we should take the 6.0 code as a base for our
>> tree or should we start with your code?
>> 

Jeremiah> The MySQL 6.0 code for Windows file handling is a good starting point, 
Jeremiah> but you can see the primary "hack" of my code is the my_open_osfhandle 
Jeremiah> function, which is a table specifically for emulating the C style 
Jeremiah> integer file handles.  If I recall correctly (it has been years) the 
Jeremiah> only reasons I had to implement this were:

Jeremiah> As you said, the my_file_info structure, which uses the file descriptor 
Jeremiah> itself as an index.

Jeremiah> my_file.c
Jeremiah>   if ((uint) fd < my_file_limit && my_file_info[fd].type != UNOPEN)

Jeremiah> There you can see the opacity of the file descriptor is violated by 
Jeremiah> using it as a bounds check for the array, although that's obviously 
Jeremiah> follows using it as an index into the array.  When I made the patch my 
Jeremiah> understanding was that I needed to keep as many of the changes as 
Jeremiah> possible within Windows ifdefs, and that changing the structure that 
Jeremiah> contains all of the file handles significantly would be too much change 
Jeremiah> to non-Windows code.

Yes, we should to strive to keep most of the code 'similar' to ensure
easy pulls between MySQL versions.

Jeremiah> There are obviously a lot of ways to resolve this.  If I were writing it 
Jeremiah> from scratch, I wouldn't pass the file descriptor around to different 
Jeremiah> functions, I would pass a pointer to the my_file_info structure around 
Jeremiah> instead, and all of the mysys file handling code would take a 
Jeremiah> my_file_info structure pointer as the file desc parameter.  There's no 
Jeremiah> need to index into an array if you already have the pointer to the 
Jeremiah> struct in question.  The code would become portable, and the 
Jeremiah> serialization on file open/close (fairly minor) would disappear.

Agree that a much better way would be to move away from int's and have
a structure instead.

It should however not be that hard to do this properly as most of the
MySQL code is using the type 'File' for the file handle.

My old idea was to create a new library, my_file, and copy all the
file handling and file name functions from this to my_file and then
change this to use a pointer to a structure instead of a File object.

The File objects should also contains the file name in parts and one
should be able to manipulate the file name without having the file open.

The changes in the code would mainly be:

- Create the new library.
- Add a separate call to populate the File object with the name parts
- Change all calls to my_open / my_read to my_file_open, my_file_read
  etc... (Can be done trivially with one replace command)
- Change error handling to instead of checking if my_file_open returns
  -1 see if it returns MY_INVALID_FILE
- Over time, change all file and file name handling functions to use the new
  file handling functions.

You can find the original description of this task here:
http://forge.mysql.com/worklog/task.php?id=314

I have included a halfdone version if this code in this email.


basharteg> If things are
basharteg> being designed in a way that isn't entirely UNIX-centric, it's easy to make
basharteg> MySQL for Windows fly.  If we get to the point where we're willing to look
basharteg> at async socket I/O and thread queues / I/O Completion Ports for work
basharteg> dispatch, we can do even more.

>> I don't know much about that, so I would be really thankful for any
>> help I can get in this area.

Jeremiah> It's going to take a fair amount of design work to figure out how to 
Jeremiah> adapt MySQL to this kind of model.  It could be implemented on all 
Jeremiah> platforms with epoll on Linux, kqueue on FreeBSD, and I/O Completion 
Jeremiah> Ports on Windows.  Event notification I/O is the key to serious 
Jeremiah> scalability, but it requires significant design changes, especially when 
Jeremiah> you're coming from a 1:1 threading model.

I have always been worried about the fact that if you have other
threads doing IO, you will always get a thread switch + some atomic
operations / mutex when doing any read or write operations.

The 1:1 model makes handling io operations much easier and if things
are buffered then this can be superior compared to having a pool for data.

Also, if all the threads are mixing cpu bound and io operations, this
gives you somewhat balanced load that works reasonable good in many
cases.

I understand that pools are superior in some context; I am just not
yet sure when it's best to use a pool and when to use 1:1 threads.

<cut>

>> In MySQL 6.0 we you can choose at startup if you want to have one
>> thread per connection or a pool of threads to handle all connections.
>> I wrote the patch in such a way that it's < 10 min of works to pull
>> this code back to 5.1 (or MariaDB); Will do this next week.
>> 
>> http://dev.mysql.com/doc/refman/6.0/en/connection-threads.html
>> 

Jeremiah> This change makes significant progress towards moving to event 
Jeremiah> notification I/O.  That means you broke the state information up between 
Jeremiah> thread and connection and eliminated the bad thread == connection 
Jeremiah> assumptions.

Yes. The patch is now in lp:maria

basharteg> So in terms of MySQL, it definitely runs with a 1:1 threading model using
basharteg> blocking I/O.  While this isn't uncommon for UNIX applications, you'll
basharteg> notice the "hot" UNIX apps that are more interested in the C10K (
basharteg> http://www.kegel.com/c10k.html) target for their scalability are starting to
basharteg> move to async I/O implementations.  This has produced some good output like
basharteg> lighttpd (http://www.lighttpd.net/).

>> I haven't yet made up my mind what's the best route for MariaDB in
>> this case; I am waiting for a benchmark or nice code that will make up
>> my mind for me...

The thread goes both ways.

When MySQL was created, it was more common for people to use using
async instead of an 1:1 thread model.  As far as people have told me,
after MySQL and as the thread libraries got better a lot of programs
changed to us a 1:1 thread model.

Don't think it has yet been proved what is the best model/mix long term.
A lot depends on the worklog and what the threads are doing.

For example in the MySQL case:

- If all your queries are short and there are very little conflicts
  a pool-of-threads version is much better than a 1:1 model.
- If you have a big mix long running queries and short running
  queries, a 1:1 thread application is superior to simple pole based
  one.

Jeremiah> Yeah, benchmarks are tough too, because serial benchmarks will tend to 
Jeremiah> show the overhead involved in asynchronous I/O operations.  You have to 
Jeremiah> decide that you're developing for the multi-core world and accept the 
Jeremiah> minor overhead in exchange for the significant scalability enhancements, 
Jeremiah> and design your benchmarks to implement concurrent loads.

See above.  What is best is mostly depending one what the threads are
supposed to do. Multi-core is not the most critical part here.

<cut>

>> Note that the thread pool is still 'beta' quality, in the sense it
>> still have some notable week spots:
>> 
>> - If all threads are working or hang (in lock tables) then no new
>> connections can be established, one can't examine what is going on
>> and one can't kill offending threads (as one can't send in new queries)
>> 
>> The easyest fix would be to have another port where one could connect
>> with the 1:1 model and check what's going on.

I added this to the MariaDB thread-pool implementation that is now pushed.

Jeremiah> Having an administrative connection is something MySQL has needed for a 
Jeremiah> very long time anyway.  However, as someone who uses thread pools 
Jeremiah> extensively on a daily basis, you can handle the cases of thread pool 
Jeremiah> exhaustion in several ways, including safety timer events, timeouts on 
Jeremiah> lock waits, event based notification on lock acquisition (may be 
Jeremiah> difficult).

This is not easy when you don't want to abort any of the running
queries.

As the lock between threads can bappen very low, for example inside a
storage engine for which you don't have any control, none of the above
solutions are easy or even possible to implement.

Jeremiah>  But as someone who runs a fairly large MySQL system for a 
Jeremiah> telephony company, I can tell you that the 1:1 threading model doesn't 
Jeremiah> save you from thread exhaustion any more than having a thread pool 
Jeremiah> would.  When we get threads stuck waiting for locks our servers can 
Jeremiah> easily become thread exhausted with no way to kill the threads.  

As long as you limit max_connections under the number of threads your
system can handle, you should always be able to connect as a SUPER
user and fix things.

Jeremiah> Besides, in a good design, you shouldn't have just one thread pool.  The 
Jeremiah> thread pool that handles I/O events (like accepting new connections) 
Jeremiah> should be different from the worker thread pool that executes SQL 
Jeremiah> statements.

In the current thread pool implementation, the threads are waiting in
the pool for any request on connection port; In this case the
connection/login handling and reading commands from the user are both
io events.

Jeremiah> In the Windows async I/O world, we consider connections to be relatively 
Jeremiah> inexpensive.  I don't set connection limits in my applications.  Each 
Jeremiah> connection is a little bit of non-paged pool kernel memory, plus the 
Jeremiah> little struct I use to keep the connection state information, plus one 
Jeremiah> async I/O read pending (among thousands).

You also need to keep the connection state, which in MySQL means the
THD object.  This is a semi-expensive object (10 K)

Jeremiah> But yes, one of the basics of our design is having at least two thread 
Jeremiah> pools, one for socket I/O (at least so far as new connections, posting 
Jeremiah> reads, and receiving queries) and one for SQL statements.  The socket 
Jeremiah> I/O thread pool should not be something that deadlocks, since whatever 
Jeremiah> locking is used, it would be far more deterministic than the locking 
Jeremiah> involved in executing SQL statements.

In the current implementation, we have only one pool and in MySQL:s
case it may be enough.

The reason for this is that when the thread gets a signal that there
is data to be read, it will have all data it needs in the buffer (as
the client commands are mostly very short and if they are not short,
the client is working full time to send it as they are always send
whole).
 
This means that there is in practice no deadlocks when it comes to
reading data and you win the context switch from reading data and then
giving it to a worker thread.

Jeremiah> Another thing to throw out there since I'm already being terribly 
Jeremiah> verbose is, each query should have a timeout value, and when that 
Jeremiah> timeout value is exceeded, the query should terminate.  One of the worst 
Jeremiah> parts about working in MySQL is the idea of a stuck query, especially 
Jeremiah> when you need a separate connection to kill it.  In systems like 
Jeremiah> Microsoft SQL, the default query timeout is 2 minutes, and if you need 
Jeremiah> to exceed that or you don't want a timeout, you can change it as part of 
Jeremiah> the connection settings.  Runaway queries running indefinitely doesn't 
Jeremiah> make any practical sense.

This agains depends on the application. Most applications need to do
long statistical queries from time to time and you don't want to kill
these!

Same thing with any queries that does critical updates or moves things
around (like ALTER TABLE, REPAIR etc).

In other words;  I agree we need a timeout mechanism. I am however not
sure what should be the best default value for this.

That said, Guilhem did once implement a timeout for queries but this
was unfortunately never pushed becasue it changed the code too much at
the time when the patch was made and we where close to GA.

We should dig out this patch and fix it for the current code and add it.

Jeremiah>  Also, when we get async I/O in place, you 
Jeremiah> could eventually add an extension to the protocol that allows the client 
Jeremiah> to cancel the query in progress.  Since the incoming I/Os are handled by 
Jeremiah> I/O threads rather than the SQL threads, you can still receive commands 
Jeremiah> while the query is in progress.  So even as you're executing a query for 
Jeremiah> someone, you have an async read on the socket waiting for the next 
Jeremiah> command.  If the next command is received while the query is still in 
Jeremiah> progress, you check to see if it's a cancel command, or simply a 
Jeremiah> pipelined SQL query.  If it's a cancel command, you flag your SQL thread 
Jeremiah> to stop when it can, and if it's a pipelined SQL query they're probably 
Jeremiah> violating the MySQL protocol anyway since they need to wait for a 
Jeremiah> response to the previous query, but if we allowed pipelined queries some 
Jeremiah> day in the future, they've lost their right to cancel this query and you 
Jeremiah> simply queue the SQL query until the current query is finished.  In the 
Jeremiah> far future, multiple queries on the same connection are fine, and are 
Jeremiah> given query identifiers and all handled through async I/O, and all 
Jeremiah> queries are cancellable.

Now you can achieve the same thing by just sending the kill one another
connection (could for example be used through the extra-port).

Drizzle has done some work in the direction of sending many queries on
the same connection.  We should look at this when their implementation
is ready.

I am still a bit sceptical of having two thread pools;  My major worry
is that it will for normal users slow down things instead of giving a
speed increase (because of the extra context switch and mutex/atomic
operations to shuffle things around)

We need to find someone that can write a prototype and show the
benefits trough some agreed to benchmark...

basharteg> We need to find the points in there where I/Os are taking place that can be
basharteg> adapted to an async model.  Basically, the disk reads that are happening to
basharteg> satisfy the query, the network reads to get the query, the network writes to
basharteg> send the results, and the network waits for the next query.  All of those
basharteg> things need to happen without a thread being associated with this
basharteg> connection/query.

We need a proof of concept and benchmark of this...

>> For mysqld this isn't trivial as we can't easily change the code for
>> the engines.
>> 
>> What I don't like with the async model is that for reads on disk it
>> doesn't speed up things as much and it may even slow down thing as you
>> now have two threads communication and the original is just waiting
>> for the other to take the message, execute it, and tell the other that
>> it's now ready.

Jeremiah> Well, you can avoid async I/O for disk I/O if you think it's going to be 
Jeremiah> a performance negative.  On single or few drive SATA systems, you're 
Jeremiah> right.  On high end I/O subsystems, async I/O can definitely have a 
Jeremiah> throughput benefit, especially when you have an advanced storage 
Jeremiah> subsystem driver that combines operations.  As we get to Solid State 
Jeremiah> Disks in the future, it's going to be the same thing as with the 
Jeremiah> multi-core CPUs we're adapting to.  If you're not doing multiple things 
Jeremiah> at once, your performance is going to be weak.  For now though, you can 
Jeremiah> leave out async disk I/O, but async socket I/O is the only way to get up 
Jeremiah> to tens of thousands of connections and beyond.  You shouldn't be afraid 
Jeremiah> of the overhead of using multiple threads and multiple thread pools.  If 
Jeremiah> done properly, the penalty to serial operations is minimal, and the 
Jeremiah> performance gains are significant (in addition to the robustness 
Jeremiah> enhancements previously discussed).

The pool-of-threads code we have now already gives us 100,000 of
connections.

But I am still afread of splitting the current thread pool to 2 parts...

<cut>

>> What would be the advantages of using Apache portable runtime?
>> Should this be instead of mysys ?
>> 
>> The benefits of using mysys are:
>> - Well working and integrated with all MySQL parts
>> - Well integrated with dbug statements
>> - Provides a lot of extra features, like hiding windows thread
>> implementation.
>> 

Jeremiah> I can understand moving to apr from mysys might not be feasible, I just 
Jeremiah> bring it up because they've abstracted async and event driven I/O and a 
Jeremiah> lot of other platform differences very well.  It works great on 
Jeremiah> Windows,  FreeBSD, and Linux.  Probably not something we can go for 
Jeremiah> right now, but they may have some useful designs and hints.

I am happy to take any good ideas from their libraires and add them to
ours and even move some of our code in their direction (if there is a
clear gain for this).

Regards,
Monty
Attachment: my_file.tar.gz
Description: Binary data