
That is what I've done in the past with non-Python projects using ANTLR or other parser generator tools. I haven't done that yet because:

* I want to release the software that contains this validator and I want to limit the external build-time and (especially) run-time dependencies. So far, the application requires nothing except Python 2.4.

* The RE code is easier to unit test. When you use a parser generator you usually have to nominate start productions in your grammar. In order to unit test you have to add lots of extra start productions. That tends to substantially impact the readability, performance, and/or size of the generated code.

* I keep hearing that people are getting by with regular expressions to do complex stuff, but traditionally I've always jumped directly to ANTLR and similar systems. I wanted to try doing this with regular expressions alone to get a good grasp of what those tools are buying me and to find out why so many people end up using regular expressions anyway. I found that, because this tool is just a simple validator/recognizer, those advanced tools don't really buy me much here. As you can see, the Python+RE code I posted directly mirrors the RFC 3987 grammar (there's a sketch of that idea after this list). If I switched to PLY or something similar, I wouldn't save anything in terms of LoC, and readability wouldn't be much improved.

* I've heard many times that it is best to minimize the amount of Python code that executes in the critical paths of a program; instead, we will get better performance by using native libraries (like the re module). Sometime I will probably try out a code generator as an experiment so that I can measure the performance of the re module vs. a bunch of generated Python code.
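
To make the "mirrors the grammar" point concrete, here is a minimal sketch of that approach. The rule names follow RFC 3987, but the fragments are simplified placeholders rather than the full grammar, and none of this is the actual posted code:

    import re

    # A few RFC 3987-style rules, composed the same way the ABNF is written.
    # NOTE: these fragments are simplified for illustration; the real rules
    # (ucschar, iprivate, the full authority/path productions, etc.) are larger.
    SCHEME      = r"[A-Za-z][A-Za-z0-9+.-]*"
    IUNRESERVED = r"[A-Za-z0-9._~-]"              # simplified: omits ucschar
    PCT_ENCODED = r"%[0-9A-Fa-f]{2}"
    IPCHAR      = r"(?:%s|%s|[!$&'()*+,;=:@])" % (IUNRESERVED, PCT_ENCODED)
    IHIER_PART  = r"(?:%s|/)*" % IPCHAR           # simplified: flattens authority/path

    # Each rule is built from the rules below it, just as the grammar reads.
    IRI = r"%s:%s(?:\?(?:%s)*)?(?:#(?:%s)*)?" % (SCHEME, IHIER_PART, IPCHAR, IPCHAR)

    IRI_RE = re.compile(r"\A(?:%s)\Z" % IRI)

    def is_iri(text):
        """Return True if text matches the (simplified) IRI pattern."""
        return IRI_RE.match(text) is not None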



  > I've heard many times that it is best to minimize the
  > amount of Python code that executes in the critical 
  > paths of a program; instead, we will get better 
  > performance by using native libraries (like the re 
  > module).
Keep in mind that, while the 're' module itself may be written in C, it is in effect an interpreter for another language (regexps). You can easily build a regular expression that will happily backtrack and capture its way into a performance black hole, especially since you can't trace the individual sub-patterns using normal Python profiling tools.
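
A classic illustration (a toy pattern, not anything from the validator under discussion): nested quantifiers plus an input that almost matches force the engine to try an exponential number of ways to carve up the string before it can fail.

    import re

    # (a+)+ can split a run of 'a's into groups in exponentially many ways,
    # and the trailing 'b' that never appears forces the engine to try them all.
    pattern = re.compile(r"^(a+)+b$")

    # Each extra 'a' roughly doubles the work; by ~28 characters this takes
    # seconds, and no Python-level profiler will show you where the time went.
    text = "a" * 28
    print(pattern.match(text))   # None, eventually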

Benchmarking the two implementations, as you suggest, is of course the only way to be sure in your case.


1 - Python parser generators are often just a bunch of .py files, so you wouldn't be pulling in anything external

2 - With PLY (and some others) you can use any non-terminal as the start symbol in test mode, and it has no impact on performance

4 - In the case of Python-based parser generators, the lexer is the weak part. Take PLY: its lexer is based on the re module, so you won't gain much there. You can write a manual lexer and boost your performance; a rough sketch of what that looks like is below.
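
A "manual lexer" here just means a hand-rolled character scanner. A minimal sketch, with an invented token set purely to show the shape of the thing:

    def tokenize(text):
        """Hand-written scanner: yields (kind, lexeme) pairs.
        The token set is made up for illustration only."""
        i, n = 0, len(text)
        while i < n:
            ch = text[i]
            if ch.isspace():                      # skip whitespace
                i += 1
            elif ch.isalpha():                    # NAME: a run of alphanumerics
                j = i + 1
                while j < n and text[j].isalnum():
                    j += 1
                yield ("NAME", text[i:j])
                i = j
            elif ch.isdigit():                    # NUMBER: a run of digits
                j = i + 1
                while j < n and text[j].isdigit():
                    j += 1
                yield ("NUMBER", text[i:j])
                i = j
            else:                                 # anything else: single character
                yield ("PUNCT", ch)
                i += 1

    print(list(tokenize("foo 42 + bar")))   # [('NAME', 'foo'), ('NUMBER', '42'), ...]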


1. That is typically the case with parser generators. But it is still a build-time dependency. Right now I don't really have any build-time processing except for unit tests, and I want to keep "building" as simple as possible.

2. That is good to know. That indicates to me that PLY doesn't do a lot of optimizations, but maybe the optimizations that I am used to having for LL(k) parsing are not effective or necessary for (LA)LR parsing.

4. That indicates to me that the performance won't be any better than that of the re-only approach. What isn't shown is that I compile the "iri" and "iri-reference" rules into two big regular expressions. AFAICT, performing a single match against one big regular expression instead of a series of matches against small ones will always be faster, especially if the RE engine does any (global) optimizations (see the sketch below).

By the way, an IRI is a single token in my application so my parser is really just a lexer.
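
A rough sketch of that "one big regex, one match" idea, with a micro-benchmark against matching the pieces separately. The fragments and names here are placeholders, not the actual "iri"/"iri-reference" rules, and the timings will obviously vary:

    import re
    import timeit

    # Hypothetical sub-rules; in the real validator these would be the
    # RFC 3987 fragments that make up "iri" and "iri-reference".
    SCHEME = r"[A-Za-z][A-Za-z0-9+.-]*"
    REST   = r"\S*"

    # One big compiled pattern, matched once per input...
    BIG = re.compile(r"\A(%s):(%s)\Z" % (SCHEME, REST))

    # ...versus matching each piece with its own small pattern.
    SCHEME_RE = re.compile(r"\A%s\Z" % SCHEME)
    REST_RE   = re.compile(r"\A%s\Z" % REST)

    def one_big_match(s):
        return BIG.match(s) is not None

    def many_small_matches(s):
        head, sep, tail = s.partition(":")
        return bool(sep) and SCHEME_RE.match(head) is not None \
                         and REST_RE.match(tail) is not None

    sample = "http://example.org/path?q=1"
    print(timeit.timeit(lambda: one_big_match(sample), number=100000))
    print(timeit.timeit(lambda: many_small_matches(sample), number=100000))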



