maria-developers team mailing list archive

Thread
Date
Re: SQL Parser

To: Jim Idle <jimi@xxxxxxxxxxxxxxxxx>,maria-developers@xxxxxxxxxxxxxxxxxxx
From: Michael Widenius <monty@xxxxxxxxxxxx>
Date: Sat, 9 May 2009 10:35:31 +0300
Reply-to: monty@xxxxxxxxxxxx
Hi!

I fixed the maria-developers address so that everyone else can follow
and participate in the discussion.

jim> jim> Monty,
jim> jim>
jim> jim> My T-SQL parser is now complete and a number of people are using
jim> jim> the Java, C# and C version. Would you like to have a look at it?
jim> 
jim> Monty> Yes, I would be intrested to know more about this!
jim>      
jim> OK - I have my demo online now. If you visit http://www.temporal-wave.com
jim>  
jim> and take the Online Demos->T-SQL menu option you can enter T-SQL
jim> batches and see the parse results. Over the next day or so, I am
jim> adding demos of my C# and VB.Net parsers, so if the site goes up and
jim> down then just wait a few minutes and try again :-) Once I have
jim> evaluated some of the things you mention below then I will let you
jim> have a scan at the source code.

ok. I will take a look as soon as I have got through the emails that
has batched up during the last 3 weeks.

jim>     jim> I think that I could steal a lot of common syntax from this for a
jim>     new MySQL parser. There are 1100 regression tests culled from the
jim>     T-SQL language spec and user examples; on my Linux system the parse
jim>     time, including building and walking the AST is:
jim>     jim>
jim>     jim> jimi(50_64)-clean: wc -l *.sql
jim>     jim> ...
jim>     jim> 12923 total
jim>     jim>
jim>     jim> jimi(50_64)-clean: time tsqlc . >out
jim>     jim> 0.15user 0.19system 0:00.34elapsed 100%CPU (0avgtext+0avgdata
jim>     0maxresident)k
jim>     jim> 0inputs+232outputs (0major+102772minor)pagefaults 0swaps
jim> 
jim>     jim> Which includes the time taken to read each of the 1100 files from
jim>     the file system and to print a message for each saying the parse was
jim>     successful of course.
jim>     jim> Anyway, if you want to take a look at the code, just let me know :-)
jim> 
jim>     Monty> The big question is how much work it would be to
jim>            replace the MySQLparser with your T-SQL parser.

jim> Indeed. Obviously syntactical changes are just slog really and
jim> most of that is ensuring that there is enough coverage in regression
jim> tests to exercise all the syntactic differences between T-SQL and
jim> MySql, or rather, all the syntax of MySQL. A lot of commonality but
jim> still a fair bit of work. What is the definitive language reference
jim> for MySQL Marina? I can locate reference manuals and ANSI specs of
jim> course, but if you have one particular document in mind, then it would
jim> be useful to work off that so I can evaluate the syntactical changes I
jim> would need to make.

Unfortunately there is no other spec than the sql_yacc.yy file that
comes with the MySQL source code.
jim> 
jim>     I would like to do this in two steps:
jim> 
jim>     a) Replace the mysql_yacc.yy file, with as few changes as possible in
jim>       the rest of the MySQL source.
jim>      
jim> 
jim> ANTLR is essentially the same gestalt as bison: one or more source files define the grammar and contain actions in C, triggered at the parse points; These grammar definition files are converted into C source by invoking an external program (in the case of ANTLR it is a Java program) and the resulting .c/.h files are compiled with the rest of the code.
jim> 
jim> I wrote the C output to be machine independent. The idea being that you can generate the .c files on one machine, check that source code in and use it on any other machine without re-generating. This can be useful when targeting systems without Java machines (or ones that don't work very well). So, like many other systems you can have a 'build completely from scratch' option and a 'use pre-generated files option'.
jim> 
jim> In the main then, I suspect that the bigger changes to the MySQL base would be build changes. I say this based upon the assumption that any new parser should build the same AST for a query as the yacc based version does. This means the new parser makes the same calls as the old one and thus the code external to the parser would probably require little, if anything in the way of change (but see b below).

That sounds very good.

jim>     b) Optimze MySQL to be even faster, with the new parser at a base.
jim> 
jim> Obviously there are two aspects, being the speed of the parser itself, which I don't think will be an issue, and then what kind of transformations are done on the resulting AST for optimization purposes. If we use exactly the same AST as MySQL does now, then improving MySQL performance is the same problem domain as it is now. However, perhaps there is opportunity for a new parser to provide more information about the query that would aid optimization. Possibilities here include modifications to the AST, external information structures that current code can use when walking the AST, or even a completely new AST. The latter would probably have quite an effect on the front end of the parse->AST->execution phase of course, so I would think that the first step is to try to achieve your step a with the existing AST as output.

Agree.

jim> Of course maintaining and changing an ANTLR based grammar is somewhat easier than changing bison/yacc, if for no other reason than ANTLR doesn't need much in the way of code assists for correct parsing, whereas Bison quite often does.

jim>     The a) option is quite important to be able to do easy merges with the
jim>     main MySQL tree.
jim> 
jim> I have time this week to take a look at this. Mechanically I know it is easy enough, but I need to look at the calls that are made from the current yacc/bison parser and see if there is anything in there that would be awkward. The only things I could think of would be if the existing code relies on internal structures of bison rather than more generic structures. That should be easy to define.

The MySQL code doesn't rely on bison structures at all (as far as I know).

Did you get time to do the analyse?

jim> 
jim> A few other things that may be of note:
jim> 
jim> 1) The ANTLR generated code is free threading so long as any action code that you place withing the parser is also free threading.

In other words, same as with bison

jim> 2) The only system calls it uses is {m|c|re}alloc/free, which the C runtime component can re-direct (say to the MySQL overrides). For completely free threading, those calls also need to be free threading and so they are the only limit to threading of course.

Sounds perfect.

I would *really* like to see a new parser for MariaDB. Anything that
we can do to help get this done ?

Regards,
Monty
Follow ups

Re: SQL Parser
From: Antony T Curtis, 2009-05-09