[Update: oops. I forgot that the server parses files named "whatever.php" when uploading things. The directory at pear.chiaraquartet.net/lemon (clickable link below) now contains both .phps and .php.txt files for downloading. Sorry about that.]
So, it's been a while since I wrote an entry, but I have not been idle (are you really all that surprised?) I took a hard look at the projects I am involved in, and realized rather quickly that the only thing really holding back several of them is a good parser generator. Here's the short list of projects that really need a parser generator:
- phpDocumentor. We use parsers for *everything*
- PHP_Parser. The name says it all
- Games_Chess/File_ChessPGN. For these, we need a good parser for the PGN file format.
Recently, I completed a rather nice lexer for docblocks (documentation comments) and this is now available at http://pecl.php.net/docblock, for those who are interested. About three weeks ago, I looked at the state of the parser generator world out there for PHP, and it is pretty dismal. Antlr3 will theoretically support PHP 5 generation, but I could not find any source for it, even after several hours of googling.
I finally decided that if this is ever going to happen, I'll have to get off my butt and do it. So, two weeks ago, I grabbed the source of the Lemon parser generator from its website (conveniently packaged as two files: the generator and its template). Although the 4000+ lines might have scared me off, the code is very clearly written, makes minimal use of pre-processor macros, and actually lends itself very easily to translation into PHP code. I spent about a week doing the actual transcription from C to PHP, removing all the malloc-related crap and converting the C implementation of associative arrays into PHP associative arrays, and finally I had something that works. In the past week, I've been hacking on the template, which was pretty complete, but didn't quite do everything I needed.
For example, upon a syntax error, there was no easy way to retrieve a list of expected tokens. This looked like a very hard problem until yesterday, when I broke down and added a generated array of expected tokens. Coupled with a little reduce simulator, this makes it quite easy to grab the complete list of expected tokens based on the current token and the parser stack.
Another tricky problem was that if an unexpected token occurred right at a potential end-of-input moment, the parser simply reduced to an accept, and silently restarted parsing. This is a bad thing (TM). So, I implemented a simple "is this token possible in the current state of things?" function that catches these pesky errors.
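To make those two fixes a bit more concrete, here is a rough sketch of the idea, not the code from the generated parser. Everything in it is an assumption made up for illustration: the toy grammar, the hand-written $tables array, and the function names expectedTokens() and isTokenPossible(). It also simplifies things by ignoring the current token and only following each state's default reduce, which the real thing cannot get away with.

```php
<?php
// Hypothetical hand-written tables for the toy grammar
//   S ::= A B.   A ::= a.   B ::= b.
$tables = array(
    // state => tokens that can be shifted in that state
    'expected' => array(0 => array('a'), 1 => array(), 2 => array('b'), 3 => array(), 4 => array()),
    // state => rule reduced by default in that state
    'reduce' => array(1 => 1, 3 => 2, 4 => 0),
    // rule => left-hand side and number of right-hand-side symbols
    'rules' => array(
        0 => array('lhs' => 'S', 'len' => 2),
        1 => array('lhs' => 'A', 'len' => 1),
        2 => array('lhs' => 'B', 'len' => 1),
    ),
    // state => nonterminal => state entered after a reduce
    'goto' => array(0 => array('A' => 2), 2 => array('B' => 4)),
);

// Collect every token that could be shifted from the given state stack,
// simulating default reduces on a copy of the stack along the way.
function expectedTokens(array $stack, array $tables)
{
    $tokens = array();
    while (true) {
        $state = end($stack);
        $tokens = array_merge($tokens, $tables['expected'][$state]);
        if (!isset($tables['reduce'][$state])) {
            break; // no reduce to simulate from this state
        }
        $rule = $tables['rules'][$tables['reduce'][$state]];
        // Pop one state per right-hand-side symbol...
        $stack = array_slice($stack, 0, count($stack) - $rule['len']);
        $below = end($stack);
        if (!isset($tables['goto'][$below][$rule['lhs']])) {
            break; // reduced to the start symbol: only end of input fits now
        }
        // ...then push the state reached through the left-hand side.
        $stack[] = $tables['goto'][$below][$rule['lhs']];
    }
    return array_values(array_unique($tokens));
}

// "Is this token possible in the current state of things?"
function isTokenPossible($token, array $stack, array $tables)
{
    return in_array($token, expectedTokens($stack, $tables), true);
}

$afterA = expectedTokens(array(0, 1), $tables);           // stack right after shifting "a"
$stray  = isTokenPossible('a', array(0, 2, 3), $tables);  // extra "a" after a complete "a b"
```

With these toy tables, $afterA holds just "b", and $stray is false: the extra "a" arrives at exactly the kind of end-of-input moment described above, so it can be reported as an error instead of silently restarting the parse.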
In the process, I ended up with a fully working PGN file parser that will make its way into a PEAR proposal as soon as I get around to integrating it with Games_Chess to do full validation of the contents of the PGN file. However, the parser already works 100%, even with some of the weirdest PGN things I could throw at it.
The parser generator works just like Lemon, with a few small differences. I added two new directives:
- %include_class - this works like %include, but puts stuff inside the generated parser class.
- %declare_class - this can be used to make the parser implement an interface, extend another class, etc. It is used like "%declare_class {extends blah}" (note the lack of a semicolon); the text is inserted between "class ParseyyParser" and "{".
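As a rough illustration, the two directives might appear near the top of a grammar file like this. The base class My_Base_Parser and the $errors property are made-up names, and I am assuming %include_class takes a braced block the same way Lemon's %include does:

```
%declare_class {extends My_Base_Parser}
%include_class {
    // hypothetical property; this ends up inside the generated parser class
    public $errors = array();
}
```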
The parser should be called with a loop similar to this:
$lexer = new File_ChessPGN_Lexer($contents);
$parser = new File_ChessPGN_Parser($lexer);
while ($lexer->advance($parser)) {
    $parser->doParse($lexer->token, $lexer->value);
}
$parser->doParse(0, 0); // for end of input
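If you want to see the shape of that loop without the real PGN classes, here is a minimal stand-in. ToyLexer and ToyParser are hypothetical classes written just for this example (the real lexer and generated parser do far more, and ToyParser takes no constructor argument); the only things carried over from the loop above are advance(), the token and value properties, doParse(), and the convention that token 0 means end of input.

```php
<?php
// Hypothetical stand-in for a lexer: splits the input on whitespace.
class ToyLexer
{
    const T_WORD = 1;   // assumed token code; 0 is kept for end of input
    public $token;
    public $value;
    private $words;
    private $pos = 0;

    public function __construct($contents)
    {
        $this->words = preg_split('/\s+/', trim($contents), -1, PREG_SPLIT_NO_EMPTY);
    }

    // Loads the next token into $token/$value; returns false when input runs out.
    public function advance($parser)
    {
        if ($this->pos >= count($this->words)) {
            return false;
        }
        $this->token = self::T_WORD;
        $this->value = $this->words[$this->pos++];
        return true;
    }
}

// Hypothetical stand-in for a generated parser: just records what it is fed.
class ToyParser
{
    public $seen = array();
    public $finished = false;

    public function doParse($token, $value)
    {
        if ($token === 0) {
            $this->finished = true; // end-of-input marker received
            return;
        }
        $this->seen[] = $value;
    }
}

$lexer  = new ToyLexer('1. e4 e5');
$parser = new ToyParser();
while ($lexer->advance($parser)) {
    $parser->doParse($lexer->token, $lexer->value);
}
$parser->doParse(0, 0); // for end of input
```

The point of the design is that the parser never pulls tokens itself: the caller pushes one token per doParse() call and then pushes a final 0 so the parser knows it can finish up.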
To run it, save Main.php and Lempar.php in the same directory, and take a look at the bottom of Main.php for an example run. You can also just run it from the command line:
php Main.php /path/to/Parser.y
Parser.y is the fully-functioning PGN file parser that will be integrated into File_ChessPGN. I plan to test out the parser generator and split it into multiple files, then potentially propose it as a PEAR package. It should be noted that the original Lemon authors disclaim copyright to the code. This port to PHP, however, will be licensed under the PHP License 3.01, the new BSD license, or something along those lines.
You can grab the stuff from http://pear.chiaraquartet.net/lemon.