
maria-developers team mailing list archive

Re: SQL Parser

 


On 9 May 2009, at 10:22, Jim Idle wrote:

Antony T Curtis wrote:

On 9 May 2009, at 00:35, Michael Widenius wrote:


jim> Indeed. Obviously syntactical changes are just slog really, and
jim> most of that is ensuring that there is enough coverage in regression
jim> tests to exercise all the syntactic differences between T-SQL and
jim> MySQL, or rather, all the syntax of MySQL. A lot of commonality, but
jim> still a fair bit of work. What is the definitive language reference
jim> for MySQL/Maria? I can locate reference manuals and ANSI specs of
jim> course, but if you have one particular document in mind, then it would
jim> be useful to work off that so I can evaluate the syntactical changes I
jim> would need to make.

Unfortunately there is no other spec than the sql_yacc.yy file that
comes with the MySQL source code.

I have a tool somewhere which can build .dot files from the .yy files, in the same way that ANTLRWorks draws its syntax diagrams. Copies of the tool exist somewhere on the old MySQL internal repositories, but I am sure I have a backup copy somewhere.
Yes, I added this ability to the ANTLR C runtime too. You can see the effect here:

http://www.temporal-wave.com/index.php?option=com_psrrun&view=psrrun&Itemid=56

What I meant was: feed in the sql_yacc.yy file and output a human-understandable syntax diagram.
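The idea of turning yacc rules into a diagram can be sketched in miniature. This is a hypothetical helper, not the tool mentioned in the thread; it only handles the plain `rule: symbols ;` subset of sql_yacc.yy syntax (no actions, no %-directives) and emits a Graphviz .dot graph linking each rule to the symbols it references.

```python
import re

def yacc_rules_to_dot(grammar_text):
    """Emit a .dot digraph linking each yacc rule to the symbols it uses."""
    edges = []
    # Match "name : body ;" rule definitions (no embedded actions).
    for rule in re.finditer(r"(\w+)\s*:\s*([^;]*);", grammar_text):
        name, body = rule.group(1), rule.group(2)
        for symbol in re.findall(r"\w+", body):
            edges.append((name, symbol))
    lines = ["digraph grammar {"]
    for src, dst in edges:
        lines.append(f'  "{src}" -> "{dst}";')
    lines.append("}")
    return "\n".join(lines)

print(yacc_rules_to_dot("select_stmt: SELECT select_list FROM table_ref ;"))
```

Feeding the resulting text to `dot -Tpng` would then render a crude rule-dependency diagram; a real tool would also have to cope with bison actions and alternation.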




jim>
jim>     I would like to do this in two steps:
jim>
jim> a) Replace the mysql_yacc.yy file, with as few changes as possible in
jim>       the rest of the MySQL source.
jim>
jim>
jim> ANTLR is essentially the same gestalt as bison: one or more source files define the grammar and contain actions in C, triggered at the parse points. These grammar definition files are converted into C source by invoking an external program (in the case of ANTLR, a Java program), and the resulting .c/.h files are compiled with the rest of the code.
jim>
jim> I wrote the C output to be machine independent, the idea being that you can generate the .c files on one machine, check that source code in, and use it on any other machine without re-generating. This can be useful when targeting systems without Java virtual machines (or ones that don't work very well). So, like many other systems, you can have a 'build completely from scratch' option and a 'use pre-generated files' option.
jim>
jim> In the main, then, I suspect that the bigger changes to the MySQL base would be build changes. I say this based upon the assumption that any new parser should build the same AST for a query as the yacc-based version does. This means the new parser makes the same calls as the old one, and thus the code external to the parser would probably require little, if anything, in the way of change (but see b below).

That sounds very good.

Not possible to build the same AST as the yacc/bison based parser because there is no AST.
The execution structures are directly built during the parsing phase.
Sure - for AST, you can read "execution structures". There didn't really seem to be any established nomenclature, but when I discussed this with some people at Sun a while back they used the terms AST and tree, so I just copied them :-) So I think I would use the AST I already build to generate the execution structures as they currently exist. Having an AST means that there is a structure that can later be used for optimization (should that prove any better than optimizing the current execution structures).

For a few of us, AST != execution structures, mostly because the execution structures are very much not "abstract" and were dictated by historical and implementation issues.

I argued years ago that a good AST would allow better optimization later, cheaply.

For example, right now, there is a lot of work to try to simplify the Item trees for boolean expressions. This work would be better done with ASTs without wasting as much memory.
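The claim that boolean simplification is cheaper on an AST can be illustrated with a toy. The node shape and names below are made up for illustration; they are not MySQL's Item classes. The sketch flattens nested AND/OR nodes and drops duplicate operands in one pass over lightweight tuples, the kind of rewrite that is awkward on already-built execution trees.

```python
def flatten(op, node):
    """Yield the leaves of a nested (op, child, child, ...) tree."""
    if isinstance(node, tuple) and node[0] == op:
        for child in node[1:]:
            yield from flatten(op, child)
    else:
        yield node

def simplify(node):
    """Flatten an AND/OR tree and remove duplicate conjuncts/disjuncts."""
    if not (isinstance(node, tuple) and node[0] in ("AND", "OR")):
        return node
    seen, parts = set(), []
    for leaf in flatten(node[0], node):
        leaf = simplify(leaf)
        if leaf not in seen:
            seen.add(leaf)
            parts.append(leaf)
    return parts[0] if len(parts) == 1 else (node[0], *parts)

# ((a AND b) AND (b AND c))  ->  a AND b AND c
tree = ("AND", ("AND", "a", "b"), ("AND", "b", "c"))
print(simplify(tree))  # -> ('AND', 'a', 'b', 'c')
```

Because the nodes carry no execution baggage, the rewrite allocates almost nothing; the simplified tree would then be compiled into execution structures once, at the end.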




Agree

You will have to implement an all-new AST which then can be used to 'compile' the execution structures. (this was a minor sticking part of an earlier attempt to replace the parser - there were objections to building an AST and adding an additional stage to the execution)
Yeah - I just wrote that before I read this paragraph ;-) As I already have this structure, it can be shown to add pretty much zero overhead compared to query optimization and execution time. A good AST mostly simplifies rather than complicates things, of course. Whether my current AST needs reshaping to optimally generate MySQL's execution structures remains to be seen, but it is easy to change if needed.

++Agree

The only tricky issue is trying to generate exactly the same structures, but in some instances later there will be scope for actually creating simpler structures than what MySQL currently generates.



[...] more generic structures. That should be easy to define.

The MySQL code doesn't rely on bison structures at all (as far as I know).

AFAIK, MySQL does not use any Bison internals as such.
Yes - I could not see any reliance myself.
In a past project from many years ago, I put a PCCTS (pre-ANTLR) LL parser into MySQL, basically keeping MySQL's custom lexer but having it generate Token objects suitable for the PCCTS-generated parser instead. The project was never completed because integration of Stored Procedures became the objective for MySQL 5.0 instead of implementation of a new parser.

I have asked Stewart Smith to see if he can help make public that old source repository as a Bazaar repository as people may be able to learn from it how to replace the parser without major pain.
Sure - any prior work obviously helps, but I think that the lexing is already good. I might make changes to keyword detection to become MySQL-friendly; T-SQL has historically been a bit unsure about exactly when keywords can be used. This would reduce the DFA sizes for the parser.
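The keyword-detection point can be sketched as follows. The keyword sets here are tiny illustrative samples, not MySQL's real reserved-word list: the lexer classifies each word, and a word from the non-reserved set is allowed to fall back to being an identifier when the grammar expects one.

```python
# Context-sensitive keyword detection, illustrative sets only.
RESERVED = {"SELECT", "FROM", "WHERE"}      # never usable as identifiers
NON_RESERVED = {"ASCII", "AFTER", "BEGIN"}  # keywords also usable as identifiers

def classify(word, ident_context=False):
    """Return the token type for `word`, given whether an identifier is expected."""
    upper = word.upper()
    if upper in RESERVED:
        return "KW_" + upper
    if upper in NON_RESERVED and not ident_context:
        return "KW_" + upper
    return "IDENT"

print(classify("select"))                     # -> KW_SELECT
print(classify("after", ident_context=True))  # -> IDENT: `after` as a column name
```

Pinning down exactly which words belong in each set is the part that shrinks the parser's DFA: the fewer places a word can be either a keyword or an identifier, the fewer alternatives the generated decision logic has to carry.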


Did you get time to do the analyse?

jim>
jim> A few other things that may be of note:
jim>
jim> 1) The ANTLR generated code is free threading so long as any action code that you place withing the parser is also free threading.

In other words, same as with bison
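What "free threading" means in practice can be shown with a toy: all parse state lives in a per-call context object, never in globals, so each thread can parse its own query independently. `ParserContext` and the trivial tokenizer below are made-up stand-ins for generated parser code.

```python
import threading

class ParserContext:
    """Per-parse state: one instance per thread/query, no shared globals."""
    def __init__(self, text):
        self.tokens = text.split()
        self.pos = 0

    def next_token(self):
        tok = self.tokens[self.pos] if self.pos < len(self.tokens) else None
        self.pos += 1
        return tok

results = {}

def parse(name, sql):
    ctx = ParserContext(sql)  # all mutable state is local to this thread
    out = []
    while (tok := ctx.next_token()) is not None:
        out.append(tok.upper())
    results[name] = out

threads = [threading.Thread(target=parse, args=(f"t{i}", "select 1"))
           for i in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(results["t0"])  # -> ['SELECT', '1']
```

The same discipline applies to action code embedded in the grammar: as long as actions only touch the context object (and thread-safe externals), the generated parser stays reentrant.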

The ANTLR runtime does, of course, require Java.
No, this would be my C runtime, as Java is way too slow for this. There is no Java required at runtime, only for developers to generate the C.
So developers who wish to extend the grammar do have to have the Java Runtime installed as well as the ANTLR jars.
For the grammar, yes, as the ANTLR tool runs in Java. In practice this isn't really a problem, I think?

Yeah. When I wrote "ANTLR runtime", I meant "ANTLR compiler".




I too am interested to help if at all possible.
One thing that this would need would be a good regression testing suite. Right now I have 1300 .sql files that prove the lexer/parser/AST, but there would need to be more of these, plus runtime testing and so on.
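A suite like that is commonly driven by a golden-file harness. The sketch below is an assumption about how such a harness might look, not the one behind those 1300 files: each .sql file is parsed and the dump is compared against a stored .expected file, with the first run recording the golden output. `parse_to_text` is a placeholder for whatever the real parser exposes.

```python
from pathlib import Path

def parse_to_text(sql):
    """Stand-in for the parser under test: here, a trivial token dump."""
    return "\n".join(sql.split())

def run_suite(suite_dir):
    """Compare each .sql file's parse dump against its .expected golden file."""
    failures = []
    for sql_file in sorted(Path(suite_dir).glob("*.sql")):
        expected_file = sql_file.with_suffix(".expected")
        got = parse_to_text(sql_file.read_text())
        if not expected_file.exists():
            expected_file.write_text(got)   # first run records the golden file
        elif got != expected_file.read_text():
            failures.append(sql_file.name)
    return failures
```

The same harness extends naturally to the unit tests mentioned below it: instead of a token dump, `parse_to_text` would print the execution structures built from the AST.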

There will have to be proper unit tests to show that a certain AST builds the correct required MySQL execution structures (Item trees, JOIN, TABLE_LIST etc etc).

The best part of such an effort would be: "At last! A formal specification of what the execution structures should look like."

This would be exciting and cool if we can (at last) have a sane parser... especially as LL parsers always generate much nicer error messages for the users.

Regards,
Antony.


