
maria-developers team mailing list archive

Re: SQL Parser

 


On 9 May 2009, at 10:22, Jim Idle wrote:

Antony T Curtis wrote:

On 9 May 2009, at 00:35, Michael Widenius wrote:


jim> Indeed. Obviously syntactical changes are just slog really, and
jim> most of that is ensuring that there is enough coverage in regression
jim> tests to exercise all the syntactic differences between T-SQL and
jim> MySQL, or rather, all the syntax of MySQL. A lot of commonality, but
jim> still a fair bit of work. What is the definitive language reference
jim> for MySQL/Maria? I can locate reference manuals and ANSI specs of
jim> course, but if you have one particular document in mind, then it would
jim> be useful to work off that so I can evaluate the syntactical changes I
jim> would need to make.

Unfortunately there is no other spec than the sql_yacc.yy file that
comes with the MySQL source code.

I have a tool somewhere which can build .dot files from the .yy files, in the same way that ANTLRWorks draws its syntax diagrams. Copies of the tool exist somewhere on the old MySQL internal repositories, but I am sure I have a backup copy somewhere.
Yes, I added this ability to the ANTLR C runtime too. You can see the effect here:

http://www.temporal-wave.com/index.php?option=com_psrrun&view=psrrun&Itemid=56

What I meant was: feed in the sql_yacc.yy file and output a human-understandable syntax diagram.
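The idea of turning yacc rules into a diagram can be sketched in miniature. This is a hypothetical helper, not the tool mentioned in the thread; it only handles the plain `rule: symbols ;` subset of sql_yacc.yy syntax (no actions, no %-directives) and emits a Graphviz .dot graph linking each rule to the symbols it references.

```python
import re

def yacc_rules_to_dot(grammar_text):
    """Emit a .dot digraph linking each yacc rule to the symbols it uses."""
    edges = []
    # Match "name : body ;" rule definitions (no embedded actions).
    for rule in re.finditer(r"(\w+)\s*:\s*([^;]*);", grammar_text):
        name, body = rule.group(1), rule.group(2)
        for symbol in re.findall(r"\w+", body):
            edges.append((name, symbol))
    lines = ["digraph grammar {"]
    for src, dst in edges:
        lines.append(f'  "{src}" -> "{dst}";')
    lines.append("}")
    return "\n".join(lines)

print(yacc_rules_to_dot("select_stmt: SELECT select_list FROM table_ref ;"))
```

Feeding the resulting text to `dot -Tpng` would then render a crude rule-dependency diagram; a real tool would also have to cope with bison actions and alternation.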




jim>
jim>     I would like to do this in two steps:
jim>
jim> a) Replace the mysql_yacc.yy file, with as few changes as possible in
jim>       the rest of the MySQL source.
jim>
jim>
jim> ANTLR is essentially the same gestalt as bison: one or more source files define the grammar and contain actions in C, triggered at the parse points. These grammar definition files are converted into C source by invoking an external program (in the case of ANTLR, a Java program), and the resulting .c/.h files are compiled with the rest of the code.
jim>
jim> I wrote the C output to be machine independent, the idea being that you can generate the .c files on one machine, check that source code in, and use it on any other machine without re-generating. This can be useful when targeting systems without Java virtual machines (or ones that don't work very well). So, like many other systems, you can have a 'build completely from scratch' option and a 'use pre-generated files' option.
jim>
jim> In the main, then, I suspect that the bigger changes to the MySQL base would be build changes. I say this based upon the assumption that any new parser should build the same AST for a query as the yacc-based version does. This means the new parser makes the same calls as the old one, and thus the code external to the parser would probably require little, if anything, in the way of change (but see b below).

That sounds very good.

Not possible to build the same AST as the yacc/bison based parser because there is no AST.
The execution structures are directly built during the parsing phase.
Sure - for AST, you can read "execution structures". There didn't really seem to be any established nomenclature, but when I discussed this with some people at Sun a while back they used the terms AST and tree, so I just copied them :-) So I think I would use the AST I already build to generate the execution structures as they currently exist. Having an AST means that there is a structure that can later be used for optimization (should that prove any better than optimizing the current execution structures).

For a few of us, AST != execution structures, mostly because the execution structures are very much not "abstract" and were dictated by historical and implementation issues.

I argued years ago that a good AST would allow better optimization later, cheaply.

For example, right now, there is a lot of work to try to simplify the Item trees for boolean expressions. This work would be better done with ASTs without wasting as much memory.
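The claim that boolean simplification is cheaper on an AST can be illustrated with a toy. The node shape and names below are made up for illustration; they are not MySQL's Item classes. The sketch flattens nested AND/OR nodes and drops duplicate operands in one pass over lightweight tuples, the kind of rewrite that is awkward on already-built execution trees.

```python
def flatten(op, node):
    """Yield the leaves of a nested (op, child, child, ...) tree."""
    if isinstance(node, tuple) and node[0] == op:
        for child in node[1:]:
            yield from flatten(op, child)
    else:
        yield node

def simplify(node):
    """Flatten an AND/OR tree and remove duplicate conjuncts/disjuncts."""
    if not (isinstance(node, tuple) and node[0] in ("AND", "OR")):
        return node
    seen, parts = set(), []
    for leaf in flatten(node[0], node):
        leaf = simplify(leaf)
        if leaf not in seen:
            seen.add(leaf)
            parts.append(leaf)
    return parts[0] if len(parts) == 1 else (node[0], *parts)

# ((a AND b) AND (b AND c))  ->  a AND b AND c
tree = ("AND", ("AND", "a", "b"), ("AND", "b", "c"))
print(simplify(tree))  # -> ('AND', 'a', 'b', 'c')
```

Because the nodes carry no execution baggage, the rewrite allocates almost nothing; the simplified tree would then be compiled into execution structures once, at the end.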




Agree

You will have to implement an all-new AST which then can be used to 'compile' the execution structures. (this was a minor sticking part of an earlier attempt to replace the parser - there were objections to building an AST and adding an additional stage to the execution)
Yeah - I just wrote that before I read this paragraph ;-) As I already have this structure, it can be shown to add pretty much zero overhead compared to query optimization and execution time. A good AST mostly simplifies rather than complicates things, of course. Whether my current AST needs reshaping to optimally generate MySQL's execution structures remains to be seen, but it is easy to change if needed.

++Agree

The only tricky issue is trying to generate exactly the same structures, but in some instances later there will be scope for actually creating simpler structures than what MySQL currently generates.



[...] more generic structures. That should be easy to define.

The MySQL code doesn't rely on bison structures at all (as far as I know).

AFAIK, MySQL does not use any Bison internals as such.
Yes - I could not see any reliance myself.
In a past project from many years ago, I put a PCCTS (pre-ANTLR) LL parser into MySQL, basically keeping MySQL's custom lexer but having it generate Token objects suitable for the PCCTS-generated parser instead. The project was never completed because integration of Stored Procedures became the objective for MySQL 5.0 instead of implementation of a new parser.

I have asked Stewart Smith to see if he can help make public that old source repository as a Bazaar repository as people may be able to learn from it how to replace the parser without major pain.
Sure - any prior work obviously helps, but I think that the lexing is already good. I might make changes to keyword detection to become MySQL-friendly; T-SQL has historically been a bit unsure about exactly when keywords can be used. This would reduce the DFA sizes for the parser.
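The keyword-detection point can be sketched as follows. The keyword sets here are tiny illustrative samples, not MySQL's real reserved-word list: the lexer classifies each word, and a word from the non-reserved set is allowed to fall back to being an identifier when the grammar expects one.

```python
# Context-sensitive keyword detection, illustrative sets only.
RESERVED = {"SELECT", "FROM", "WHERE"}      # never usable as identifiers
NON_RESERVED = {"ASCII", "AFTER", "BEGIN"}  # keywords also usable as identifiers

def classify(word, ident_context=False):
    """Return the token type for `word`, given whether an identifier is expected."""
    upper = word.upper()
    if upper in RESERVED:
        return "KW_" + upper
    if upper in NON_RESERVED and not ident_context:
        return "KW_" + upper
    return "IDENT"

print(classify("select"))                     # -> KW_SELECT
print(classify("after", ident_context=True))  # -> IDENT: `after` as a column name
```

Pinning down exactly which words belong in each set is the part that shrinks the parser's DFA: the fewer places a word can be either a keyword or an identifier, the fewer alternatives the generated decision logic has to carry.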


Did you get time to do the analyse?

jim>
jim> A few other things that may be of note:
jim>
jim> 1) The ANTLR generated code is free threading so long as any action code that you place withing the parser is also free threading.

In other words, same as with bison
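What "free threading" means in practice can be shown with a toy: all parse state lives in a per-call context object, never in globals, so each thread can parse its own query independently. `ParserContext` and the trivial tokenizer below are made-up stand-ins for generated parser code.

```python
import threading

class ParserContext:
    """Per-parse state: one instance per thread/query, no shared globals."""
    def __init__(self, text):
        self.tokens = text.split()
        self.pos = 0

    def next_token(self):
        tok = self.tokens[self.pos] if self.pos < len(self.tokens) else None
        self.pos += 1
        return tok

results = {}

def parse(name, sql):
    ctx = ParserContext(sql)  # all mutable state is local to this thread
    out = []
    while (tok := ctx.next_token()) is not None:
        out.append(tok.upper())
    results[name] = out

threads = [threading.Thread(target=parse, args=(f"t{i}", "select 1"))
           for i in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(results["t0"])  # -> ['SELECT', '1']
```

The same discipline applies to action code embedded in the grammar: as long as actions only touch the context object (and thread-safe externals), the generated parser stays reentrant.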

The ANTLR runtime does, of course, require Java.
No, this would be my C runtime, as Java is way too slow for this. There is no Java required at runtime, only for developers to generate the C.
So developers who wish to extend the grammar do have to have the Java Runtime installed as well as the ANTLR jars.
For the grammar, yes, as the ANTLR tool runs in Java. In practice this isn't really a problem, I think?

Yeah. When I wrote "ANTLR runtime", I meant "ANTLR compiler".




I too am interested to help if at all possible.
One thing that this would need would be a good regression testing suite. Right now I have 1300 .sql files that prove the lexer/parser/AST, but there would need to be more of these, plus runtime testing and so on.
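A suite like that is commonly driven by a golden-file harness. The sketch below is an assumption about how such a harness might look, not the one behind those 1300 files: each .sql file is parsed and the dump is compared against a stored .expected file, with the first run recording the golden output. `parse_to_text` is a placeholder for whatever the real parser exposes.

```python
from pathlib import Path

def parse_to_text(sql):
    """Stand-in for the parser under test: here, a trivial token dump."""
    return "\n".join(sql.split())

def run_suite(suite_dir):
    """Compare each .sql file's parse dump against its .expected golden file."""
    failures = []
    for sql_file in sorted(Path(suite_dir).glob("*.sql")):
        expected_file = sql_file.with_suffix(".expected")
        got = parse_to_text(sql_file.read_text())
        if not expected_file.exists():
            expected_file.write_text(got)   # first run records the golden file
        elif got != expected_file.read_text():
            failures.append(sql_file.name)
    return failures
```

The same harness extends naturally to the unit tests mentioned below it: instead of a token dump, `parse_to_text` would print the execution structures built from the AST.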

There will have to be proper unit tests to show that a certain AST builds the correct required MySQL execution structures (Item trees, JOIN, TABLE_LIST etc etc).

The best part of such an effort would be: "At last! A formal specification of what the execution structures should look like."

This would be exciting and cool if we can (at last) have a sane parser... especially as LL parsers always generate much nicer error messages for the users.

Regards,
Antony.


