
That is what I've done in the past with non-Python projects using ANTLR or other parser generator tools. I haven't done that yet because:

* I want to release the software that contains this validator and I want to limit the external build-time and (especially) run-time dependencies. So far, the application requires nothing except Python 2.4.

* The RE code is easier to unit test. When you use a parser generator you usually have to nominate start productions in your grammar. In order to unit test you have to add lots of extra start productions. That tends to substantially impact the readability, performance, and/or size of the generated code.

* I keep hearing that people are getting by with regular expressions to do complex stuff, but traditionally I've always jumped directly to ANTLR and similar systems. I wanted to try doing this with regular expressions alone to get a good grasp of what those tools are buying me and to find out why so many people end up using regular expressions anyway. I found that, because this tool is just a simple validator/recognizer, those advanced tools don't really buy me much here. As you can see, the Python+RE code I posted directly mirrors the RFC 3987 grammar (there's a sketch of that idea after this list). If I switched to PLY or something similar, I wouldn't save anything in terms of LoC, and readability wouldn't be much improved.

* I've heard many times that it is best to minimize the amount of Python code that executes in the critical paths of a program; instead, we will get better performance by using native libraries (like the re module). Sometime I will probably try out a code generator as an experiment so that I can measure the performance of the re module vs. a bunch of generated Python code.
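
To make the "mirrors the grammar" point concrete, here is a minimal sketch of that approach. The rule names follow RFC 3987, but the fragments are simplified placeholders rather than the full grammar, and none of this is the actual posted code:

    import re

    # A few RFC 3987-style rules, composed the same way the ABNF is written.
    # NOTE: these fragments are simplified for illustration; the real rules
    # (ucschar, iprivate, the full authority/path productions, etc.) are larger.
    SCHEME      = r"[A-Za-z][A-Za-z0-9+.-]*"
    IUNRESERVED = r"[A-Za-z0-9._~-]"              # simplified: omits ucschar
    PCT_ENCODED = r"%[0-9A-Fa-f]{2}"
    IPCHAR      = r"(?:%s|%s|[!$&'()*+,;=:@])" % (IUNRESERVED, PCT_ENCODED)
    IHIER_PART  = r"(?:%s|/)*" % IPCHAR           # simplified: flattens authority/path

    # Each rule is built from the rules below it, just as the grammar reads.
    IRI = r"%s:%s(?:\?(?:%s)*)?(?:#(?:%s)*)?" % (SCHEME, IHIER_PART, IPCHAR, IPCHAR)

    IRI_RE = re.compile(r"\A(?:%s)\Z" % IRI)

    def is_iri(text):
        """Return True if text matches the (simplified) IRI pattern."""
        return IRI_RE.match(text) is not None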



  > I've heard many times that it is best to minimize the
  > amount of Python code that executes in the critical 
  > paths of a program; instead, we will get better 
  > performance by using native libraries (like the re 
  > module).
Keep in mind that, while the 're' module itself may be written in C, it is in effect an interpreter for another language (regexps). You can easily build a regular expression that will happily backtrack and capture its way into a performance black hole, especially since you can't trace the individual sub-patterns using normal Python profiling tools.
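
A classic illustration (a toy pattern, not anything from the validator under discussion): nested quantifiers plus an input that almost matches force the engine to try an exponential number of ways to carve up the string before it can fail.

    import re

    # (a+)+ can split a run of 'a's into groups in exponentially many ways,
    # and the trailing 'b' that never appears forces the engine to try them all.
    pattern = re.compile(r"^(a+)+b$")

    # Each extra 'a' roughly doubles the work; by ~28 characters this takes
    # seconds, and no Python-level profiler will show you where the time went.
    text = "a" * 28
    print(pattern.match(text))   # None, eventually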

Benchmarking the two implementations, as you suggest, is of course the only way to be sure in your case.


1 - Python parser generators are often just a bunch of .py files, so you wouldn't be pulling in anything external

2 - With PLY (and some others) you can use any non-terminal as the start symbol in test mode, and it has no impact on performance

4 - In the case of Python-based parser generators, the lexer is the weak part. Take PLY: its lexer is based on the re module, so you won't gain much there. You can write a manual lexer and boost your performance; a rough sketch of what that looks like is below.
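
A "manual lexer" here just means a hand-rolled character scanner. A minimal sketch, with an invented token set purely to show the shape of the thing:

    def tokenize(text):
        """Hand-written scanner: yields (kind, lexeme) pairs.
        The token set is made up for illustration only."""
        i, n = 0, len(text)
        while i < n:
            ch = text[i]
            if ch.isspace():                      # skip whitespace
                i += 1
            elif ch.isalpha():                    # NAME: a run of alphanumerics
                j = i + 1
                while j < n and text[j].isalnum():
                    j += 1
                yield ("NAME", text[i:j])
                i = j
            elif ch.isdigit():                    # NUMBER: a run of digits
                j = i + 1
                while j < n and text[j].isdigit():
                    j += 1
                yield ("NUMBER", text[i:j])
                i = j
            else:                                 # anything else: single character
                yield ("PUNCT", ch)
                i += 1

    print(list(tokenize("foo 42 + bar")))   # [('NAME', 'foo'), ('NUMBER', '42'), ...]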


1. That is typically the case with parser generators. But it is still a build-time dependency. Right now I don't really have any build-time processing except for unit tests, and I want to keep "building" as simple as possible.

2. That is good to know. That indicates to me that PLY doesn't do a lot of optimizations, but maybe the optimizations that I am used to having for LL(k) parsing are not effective or necessary for (LA)LR parsing.

4. That indicates to me that the performance won't be any better than that of the re-only approach. What isn't shown is that I compile the "iri" and "iri-reference" rules into two big regular expressions. AFAICT, performing a single match against one big regular expression instead of a series of matches against small ones will always be faster, especially if the RE engine does any (global) optimizations (see the sketch below).

By the way, an IRI is a single token in my application so my parser is really just a lexer.
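
A rough sketch of that "one big regex, one match" idea, with a micro-benchmark against matching the pieces separately. The fragments and names here are placeholders, not the actual "iri"/"iri-reference" rules, and the timings will obviously vary:

    import re
    import timeit

    # Hypothetical sub-rules; in the real validator these would be the
    # RFC 3987 fragments that make up "iri" and "iri-reference".
    SCHEME = r"[A-Za-z][A-Za-z0-9+.-]*"
    REST   = r"\S*"

    # One big compiled pattern, matched once per input...
    BIG = re.compile(r"\A(%s):(%s)\Z" % (SCHEME, REST))

    # ...versus matching each piece with its own small pattern.
    SCHEME_RE = re.compile(r"\A%s\Z" % SCHEME)
    REST_RE   = re.compile(r"\A%s\Z" % REST)

    def one_big_match(s):
        return BIG.match(s) is not None

    def many_small_matches(s):
        head, sep, tail = s.partition(":")
        return bool(sep) and SCHEME_RE.match(head) is not None \
                         and REST_RE.match(tail) is not None

    sample = "http://example.org/path?q=1"
    print(timeit.timeit(lambda: one_big_match(sample), number=100000))
    print(timeit.timeit(lambda: many_small_matches(sample), number=100000))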



