
[Given all the bad things I hear about PHP, the code is very readable even without any previous experience - nice.]

Here are some things a lexer for a programming language might have to deal with:

1. Comments (some languages even allow nested comments - which means plain regular expressions are out for that).

2. Continuation lines.

3. Includes (if done at the lexical level).

4. Filename/line/column number for nice error messages (tracking these per character can really hurt performance through branch mispredictions).

5. Evaluation of literals: decimal/hex/octal/binary integers, floats, strings (with escapes), etc.

6. Identifiers.
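
To make items 1 and 4 concrete, here is a minimal sketch of skipping a nested comment while keeping line/column up to date. The `/* ... */` delimiters and the name `skip_nested_comment` are assumptions for illustration, not taken from any particular language or lexer. The point is that a depth counter does what a plain regular expression can't.

```python
def skip_nested_comment(src, pos, line=1, col=1):
    """Skip a (possibly nested) /* ... */ comment that starts at src[pos].

    Returns (new_pos, line, col) just past the matching close.
    Raises SyntaxError if the comment is never closed.
    """
    assert src.startswith("/*", pos)
    start_line, start_col = line, col
    depth = 0
    n = len(src)
    while pos < n:
        if src.startswith("/*", pos):
            depth += 1          # one more level to close
            pos += 2
            col += 2
        elif src.startswith("*/", pos):
            depth -= 1
            pos += 2
            col += 2
            if depth == 0:      # outermost comment closed
                return pos, line, col
        elif src[pos] == "\n":
            pos += 1            # the branch that item 4 is about:
            line += 1           # every character has to be checked
            col = 1             # for a newline to keep positions right
        else:
            pos += 1
            col += 1
    raise SyntaxError(
        f"unterminated comment starting at {start_line}:{start_col}"
    )
```

The starting position is remembered only so the error message can point at where the comment began rather than at end of file.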

So matching keywords is mostly the straightforward part. However, I have found that matching many keywords is the perfect (and in my experience so far, the only) use case for a perfect hashing tool like gperf - it is normally much faster than any pointer-chasing trie. gperf mostly eliminated keyword matching from the profile of any lexer I've done.
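
A rough sketch of the idea behind perfect hashing for keywords, in Python for brevity. This is not gperf's algorithm or its output (gperf generates C). The hash family in `_h`, the brute-force search, and the keyword list are all made up for illustration. What it shows is the lookup shape: one hash, one string comparison, no probing and no pointer chasing.

```python
def _h(word, a, b, size):
    # Hypothetical hash family: mixes length, first and last character.
    return (len(word) * a + ord(word[0]) * b + ord(word[-1])) % size


def build_perfect_table(keywords, max_size=512):
    """Brute-force search for (a, b, size) with no collisions among keywords."""
    for size in range(len(keywords), max_size):
        for a in range(1, 32):
            for b in range(1, 32):
                table = [None] * size
                for kw in keywords:
                    slot = _h(kw, a, b, size)
                    if table[slot] is not None:
                        break  # collision, try the next parameters
                    table[slot] = kw
                else:
                    # Every keyword landed in its own slot.
                    return a, b, size, table
    raise ValueError("no collision-free parameters found")


# Example keyword set (an assumption, not from any specific language).
KEYWORDS = ["if", "else", "while", "for", "return", "break", "continue", "int"]
A, B, SIZE, TABLE = build_perfect_table(KEYWORDS)


def is_keyword(word):
    """One hash computation plus one string comparison."""
    return bool(word) and TABLE[_h(word, A, B, SIZE)] == word
```

In Python a plain `set` would of course do the job; the sketch is only meant to show why the generated C version ends up so cheap per identifier.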



Another thing:

Some languages process escapes before any other lexical step. Looking at you, Java, with \uXXXX Unicode escapes. So either you need to do a pass beforehand to unescape them, or unescape characters (and in the process do error handling, etc.!) on the fly.
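
For the on-the-fly variant, a simplified sketch of a Java-style Unicode escape decoder that the rest of the lexer could pull characters from. The name `unescape_unicode` is hypothetical, and this only approximates the real rules as I understand them: a backslash starts an escape only when preceded by an even number of contiguous backslashes, and more than one `u` is allowed.

```python
HEX_DIGITS = "0123456789abcdefABCDEF"


def unescape_unicode(src):
    """Yield the characters of src with \\uXXXX escapes decoded lazily."""
    i, n = 0, len(src)
    backslashes = 0  # contiguous raw backslashes seen just before position i
    while i < n:
        ch = src[i]
        starts_escape = (
            ch == "\\"
            and backslashes % 2 == 0
            and i + 1 < n
            and src[i + 1] == "u"
        )
        if starts_escape:
            j = i + 1
            while j < n and src[j] == "u":  # one or more 'u' markers
                j += 1
            digits = src[j:j + 4]
            if len(digits) != 4 or any(d not in HEX_DIGITS for d in digits):
                raise SyntaxError(f"malformed unicode escape at offset {i}")
            # The decoded character is emitted as-is and does not take
            # part in any further escape processing.
            yield chr(int(digits, 16))
            i = j + 4
            backslashes = 0
        else:
            yield ch
            backslashes = backslashes + 1 if ch == "\\" else 0
            i += 1
```

Note that the offsets reported here are into the raw source, which is one more reason the error handling gets fiddly: the lexer proper sees decoded characters, but diagnostics should point at the original text.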



