[Given all the bad things I hear about PHP, the code is very readable without any previous experience - nice.]
Here are some things a lexer for a programming language might have to deal with:
1. Comments (some languages even allow nested ones, which rules out regular expressions for that).
2. Continuation lines.
3. Includes (if done at the lexical level).
4. Filename/line/column numbers for nice error messages (tracking them per character can really hurt through branch mispredictions).
5. Evaluation of literals: decimal/hex/octal/binary integers, floats, strings (with escapes), etc.
6. Identifiers.
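To make items 1 and 4 concrete, here is a rough sketch of skipping a nestable comment with a depth counter while keeping line/column up to date. The `/* ... */` delimiters and the name `skip_nested_comment` are made up for illustration; a real lexer would fold this into its main loop.

```python
def skip_nested_comment(src, pos, line, col):
    """Skip a nestable /* ... */ comment that starts at src[pos].

    Returns (pos, line, col) just past the matching close.
    Raises SyntaxError naming the opening position if it never closes.
    """
    assert src.startswith("/*", pos)
    start_line, start_col = line, col
    depth = 0  # a counter is exactly what a plain regex cannot keep
    n = len(src)
    while pos < n:
        if src.startswith("/*", pos):
            depth += 1
            pos += 2
            col += 2
        elif src.startswith("*/", pos):
            depth -= 1
            pos += 2
            col += 2
            if depth == 0:
                return pos, line, col
        elif src[pos] == "\n":
            # The per-character newline check needed for line tracking.
            pos += 1
            line += 1
            col = 1
        else:
            pos += 1
            col += 1
    raise SyntaxError(
        f"unterminated comment opened at {start_line}:{start_col}"
    )
```

Calling it with `pos=0, line=1, col=1` on a source that opens with a comment gives back the position of the first character after the outermost `*/`.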
So matching keywords is mostly the straightforward part. However, I have found that matching many keywords is the perfect (and in my experience so far, the only) use case for a perfect hashing tool like gperf - it would normally be much faster than any pointer-chasing trie. gperf mostly eliminated keyword matching from the profile of any lexer I've done.
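For anyone who hasn't seen perfect hashing, a toy sketch of the idea follows. This is not gperf's algorithm or its generated code: it just brute-forces a seed for an FNV-1a-style hash until no two keywords collide. The keyword list and the names `_hash`, `build_perfect_table` and `is_keyword` are made up for illustration.

```python
KEYWORDS = ["if", "else", "while", "for", "return", "break", "continue", "int"]


def _hash(word, seed, size):
    """FNV-1a-style hash with the seed as the starting value."""
    h = seed
    for ch in word:
        h ^= ord(ch)
        h = (h * 16777619) & 0xFFFFFFFF
    h ^= h >> 16  # fold high bits down so small table sizes see them
    return h % size


def build_perfect_table(keywords):
    """Search for (seed, table) such that no two keywords share a slot."""
    for size in range(len(keywords), 4 * len(keywords) + 1):
        for seed in range(1, 4097):
            slots = [None] * size
            for kw in keywords:
                i = _hash(kw, seed, size)
                if slots[i] is not None:
                    break  # collision, try the next seed
                slots[i] = kw
            else:
                return seed, slots
    raise ValueError("no collision-free seed found in the search range")


SEED, TABLE = build_perfect_table(KEYWORDS)


def is_keyword(word):
    """One hash and one string comparison, no chain or trie to walk."""
    return TABLE[_hash(word, SEED, len(TABLE))] == word
```

The search happens once, up front (gperf does its equivalent at build time and emits C); the lookup afterwards is a single probe, which is why it tends to drop out of the profile.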
Some languages process escapes before everything else. Looking at you, Java, with your \uXXXX Unicode escapes. So you either need a pass beforehand to unescape them, or you unescape characters (and do the error handling etc. in the process!) on the fly.
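A minimal sketch of the "pass beforehand" option, loosely following Java's rule as I understand it: a backslash, one or more `u`, then four hex digits, where the backslash only counts if it is preceded by an even number of backslashes. The function name `unescape_unicode` is made up, and the sketch ignores the hard part, which is mapping positions in the translated text back to the original for error messages.

```python
import string

HEX = set(string.hexdigits)


def unescape_unicode(src):
    """Translate \\uXXXX escapes before any tokenizing happens."""
    out = []
    i = 0
    n = len(src)
    line = 1
    while i < n:
        ch = src[i]
        if ch == "\\" and i + 1 < n and src[i + 1] == "u":
            j = i + 1
            while j < n and src[j] == "u":  # \uuuu0041 is allowed too
                j += 1
            digits = src[j:j + 4]
            if len(digits) != 4 or any(d not in HEX for d in digits):
                raise SyntaxError(f"line {line}: malformed \\u escape")
            # The produced character is not rescanned for further escapes.
            out.append(chr(int(digits, 16)))
            i = j + 4
        elif ch == "\\" and i + 1 < n and src[i + 1] == "\\":
            # Consume backslashes in pairs so that the u in \\u0041
            # is not treated as the start of an escape.
            out.append("\\\\")
            i += 2
        else:
            if ch == "\n":
                line += 1
            out.append(ch)
            i += 1
    return "".join(out)
```

The on-the-fly variant does the same checks inside the lexer's "next character" routine, which is exactly where the extra branches and error paths start to hurt.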