Last week, I blogged about completing a port of the Lemon parser generator to PHP 5, which I thought was pretty cool.
However, in an email, Alex Merz pointed out that without a lexer generator to accompany lemon, it's pretty difficult to write a decent parser. The lexer in the example directory was generated using a program Alan Knowles ported from C# that was generating C# lexers. Alan modified the code to output PHP instead, and although I was able to get it to work some of the time, I had tremendous difficulty even getting C# to run on my computers when the original machine kicked the bucket. In addition, making changes to the output was really, really difficult as the output was completely uncommented and frankly made little sense. I kept running into bugs in the regular expression compilation, and so that was that.
After Alex's email, I started thinking about what it would take to write a lexer generator. Basically, a lexer generator requires parsing and compiling regular expressions, then scanning the source one character at a time to find matches. A lexer must also define precedence of the regular expressions so that the proper token is matched at the proper time. In addition, any lexer worth its salt has some support for different scanning states. For instance, different tokens are returned in the middle of a double-quoted string than are returned in global PHP code.
So, it occurred to me that perhaps simply combining regular expressions with sub-patterns could accomplish this task quite easily.
So, if you want to scan for the word "foo" followed by any grouping of letters, you could use this regular expression (Perl-compatible regular expression format):
/^(foo)|^([a-zA-Z]+)/
When used with preg_match() and specifying the optional third argument, it is possible to determine which token matched first simply by checking which is the 2nd array index. If this sounds like Greek to you, just check out the source of a generated lexer (more on that later).
Once this concept was discovered, the only task left was to decide on a format for the lexer. I've been a huge fan of re2c, partially because it makes it easy to write the patterns, and partially because the abstract lexer file itself still parses as valid C, simply because all lexer stuff is neatly contained within comments. Also, it's fast because of the way the code is generated. Implementing re2c directly in PHP is impossible simply because goto is not (yet) supported in existing PHP versions.
However, the benefit of having goto implemented in PHP would not make a faster lexer than directly using the preg functions. Why? Each opcode eats up lots more instruction cycles than the natively compiled preg matching. In other words, why re-invent the wheel when it will just be slower and bigger?
Once this decision was made, I was surprised to discover that a hand-written lexer for the new "lex2php" format was really easy to write, and did that in 1 day. Using the handy-dandy parser generator, I wrote a parser for the lexer generator that outputs the lexer (1 more day, truly demonstrating its ease of use and power). Finally, I ported the lexer for File_ChessPGN to the new format (1/2 day).
At this point, I thought it might be good to do a bit of benchmarking against the only generated lexer I had, the one for File_ChessPGN generated by csLex. Much to my delight and tremendous surprise, the newer lexer is consistently ~2.2 times faster lexing the same complex PGN file when profiled by Zend Studio (approximately 130 ms vs 285 ms).
So, today, I packaged up the PHP port of the Lemon Parser Generator as "PHP_ParserGenerator" and the lex2php lexer as "PHP_LexerGenerator" and have proposed them as additions to PEAR. Give them a whirl!
The PHP_LexerGenerator really needs a regular expression validator to catch things before it is too late, so I am working a few possibilities for that one. I'll probably end up writing a regular expression parser using PHP_Lexer/ParserGenerator, we'll see
.
P.S. if you would like to get in on the action of developing these exciting packages, don't hesitate to drop a line. I'm pretty easy to work with!
PHP_LexerGenerator proposal
PHP_ParserGenerator proposal