Instead of learning 5 regular expressions by rote, just learn how they work and make your own as you need them. They really aren't hard to use; they're just incredibly compact, which makes them look unintelligible at first.
_Mastering Regular Expressions_ is the obvious classic, but a decent intro to Perl or Python will probably get to them eventually. (If you use Emacs, it's particularly easy to learn them by experimenting with M-x re-builder.) Also: not everything uses the same RE implementation, and some use non-standard extensions.
Don't become too drunk with power yet, though! There's a lot they're just not capable of doing (most notably, balancing tags around arbitrarily nested expressions). Once you have REs down, learning a lexer/parser is the next step. :)
I wish I knew a better way to get a survey of what's available in the Emacs world - something that would have clued me in to this command, for example... any tips?
Edit: I want to comment on this as well:
> learning a lexer/parser is the next step
Recently I've had occasion to write a few parsers. In the past, I'd always used parser generators (yacc and antlr) on the assumption that they made things easier. For this project that wasn't an option, so I bit the bullet and did the recursive descent thing. To my surprise, it turned out to be way, way easier than I expected. Moreover, in at least one case (a complex one), the hand-written parser code turned out shorter than the (cl-) yacc version, as well as handling more cases and reporting errors better.
The morals of the story are: (1) Don't assume something is hard without trying it; (2) It's not hard to write a classical recursive-descent parser and the skill is a lot more useful than you may realize. Wish I'd learned that a long time ago.
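For anyone wondering what "doing the recursive descent thing" looks like, here's a minimal sketch in Python under toy assumptions (a small arithmetic grammar, not the parser discussed above). Each grammar rule becomes one function, and the call stack handles the nesting:
<code>
# Minimal recursive-descent parser/evaluator for a toy grammar
# (illustrative only, not the parser from the comment above):
#   expr   -> term (('+'|'-') term)*
#   term   -> factor (('*'|'/') factor)*
#   factor -> NUMBER | '(' expr ')'
import re

class Parser:
    def __init__(self, text):
        self.tokens = re.findall(r'\d+|[-+*/()]', text)  # trivial lexer
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def eat(self, expected=None):
        tok = self.peek()
        if tok is None or (expected is not None and tok != expected):
            # Error reporting is easy: we know exactly where we are.
            raise SyntaxError('expected %r, got %r at token %d'
                              % (expected, tok, self.pos))
        self.pos += 1
        return tok

    def expr(self):
        value = self.term()
        while self.peek() in ('+', '-'):
            if self.eat() == '+':
                value += self.term()
            else:
                value -= self.term()
        return value

    def term(self):
        value = self.factor()
        while self.peek() in ('*', '/'):
            if self.eat() == '*':
                value *= self.factor()
            else:
                value /= self.factor()
        return value

    def factor(self):
        if self.peek() == '(':
            self.eat('(')
            value = self.expr()
            self.eat(')')
            return value
        return int(self.eat())

print(Parser('2 * (3 + 4)').expr())  # 14
</code>
Note how good error messages fall out naturally: at any failure you know the exact token position and which rule you were in.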
> I wish I knew a better way to get a survey of what's available in the Emacs world - something that would have clued me in to this command, for example... any tips?
I second Emacswiki as suggested, but I find quite a bit by searching through the Emacs and Elisp info pages and using M-x apropos / M-x apropos-variable. It doesn't always work though - to find something, you generally need to know what Emacs calls it.
Exactly! Learn how to write regular expressions because they are super useful, then immediately disregard the boilerplate ones provided in this article, because any self-respecting language's stdlib (or extremely common libraries/frameworks) will handle all these specific cases in an abstracted way. Not to mention, learning only these doesn't mean you understand regex, and not all the features of most regex engines are covered here (lookaheads, lookbehinds, and so on).
If I understand what you mean, you actually can balance tags around arbitrarily nested expressions if your regex parser supports recursion. The Ruby Way has an example of using recursive regexes to balance parentheses.
Once your regexes are getting that complicated, though, you're probably better off specifying things in terms of a BNF-based parser (or a recursive-descent parser, depending on the grammar).
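A small sketch of that in Python, assuming the third-party regex module (the stdlib re module has no recursion support); (?R) recurses into the whole pattern:
<code>
# Sketch: matching balanced parens with a recursive pattern.
# Requires the third-party "regex" module (pip install regex);
# the stdlib "re" module does not support recursion.
import regex

# (?R) recurses into the entire pattern, so the nesting can be
# arbitrarily deep.
balanced = regex.compile(r'\((?:[^()]|(?R))*\)')

print(bool(balanced.fullmatch('(a(b(c)d)e)')))  # True
print(bool(balanced.fullmatch('(a(b)c')))       # False: unbalanced
</code>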
The first tip is okay, but the latter 4 are just horrible. Any decent language will have libraries for parsing HTML or e-mails addresses. A regex is sure to come up short and be very fragile.
I once had to maintain some screen-scraping code that was written in Python using regular expressions. By the time I inherited it, half of the functionality no longer worked. It would have been much better off using a library like BeautifulSoup, both in terms of development time and maintainability.
BeautifulSoup alone takes care of REs 2 and 3, and there are standard libraries in Python that take care of 4 and 5. Why reinvent (less robustly, I might add) the wheel when a simple API already exists in many languages?
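For illustration, a minimal BeautifulSoup (bs4) sketch of the kind of extraction that tends to break when done with REs; the HTML and the class name here are invented for the example:
<code>
# Sketch of the BeautifulSoup (bs4) approach; the HTML and the
# class name "price" are made up for this example.
from bs4 import BeautifulSoup

html = '<div class="price">$19.99</div> <a href="/next">next</a>'
soup = BeautifulSoup(html, 'html.parser')

print(soup.find('div', class_='price').get_text())  # $19.99
print(soup.find('a')['href'])                       # /next
</code>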
I wouldn't use a single regexp for complete username validation -- if it fails, all you can display is a generic "username is not valid, it must obey rules X, Y & Z" message. I'd check min and max length separately and display an appropriate error message for that.
Also ignore leading/trailing spaces; or you'll end up puzzled why you have two "bob@example.org" users in your database even with appropriate database constraint, and bob mails you saying he can't login on the account he just paid for.
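A sketch of both points in Python (the length limits and allowed characters here are made-up placeholders): strip whitespace first, then check each rule on its own so the error message can say which rule failed:
<code>
# Made-up rules for illustration: 3-16 chars, letters/digits/underscore.
import re

def username_errors(raw):
    name = raw.strip()                    # ignore leading/trailing spaces
    errors = []
    if len(name) < 3:
        errors.append('username must be at least 3 characters')
    if len(name) > 16:
        errors.append('username must be at most 16 characters')
    if not re.fullmatch(r'[A-Za-z0-9_]*', name):
        errors.append('username may only contain letters, digits and _')
    return errors

print(username_errors('  bob  '))  # []: trimmed, then passes
print(username_errors('x!'))       # two specific messages
</code>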
5 Regular Expressions Every Web Programmer should know how to write... if he claims he knows regexps. The problem is that, in my experience, only a small minority of web developers really know regexps.
The value of regex-validating emails and URLs is limited. Validation is mostly useful for flagging completely malformed input, like people writing "asdf" or "I dont have an email, plz use fed-ex.".
Regex validation will not catch common typos (or made-up emails) because they tend to look like correct emails. So the intricate, syntactically correct email validation regexes are a solution in search of a problem.
If you really want to be sure that the email is correct, you have to send a confirmation mail. For your input validation, just check that there is a "@" somewhere.
People also try to look up the MX records for the domain to see if it has a valid mail server. This works 95% of the time--but breaks in certain cases leading to false positives.
As olavk said, just do basic validation (@ exists) and send off a confirmation e-mail. It works 100% of the time.
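A minimal sketch of that sanity check in Python (deliberately permissive; the confirmation e-mail does the real work):
<code>
# Minimal pre-confirmation sanity check (a sketch, not a validator):
# require one "@" with something on both sides, then let the
# confirmation e-mail prove deliverability.
def plausible_email(raw):
    addr = raw.strip()                    # trim spaces, per the comment above
    local, sep, domain = addr.partition('@')
    return bool(local and sep and domain)

print(plausible_email(' bob@example.org '))  # True
print(plausible_email('asdf'))               # False
</code>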
My URL (IRI) validation regex is a very complicated work in progress and depends on being post-processed by a Python function to handle Unicode characters outside the BMP. It has hundreds of lines of unit tests (not shown).
That is what I've done in the past with non-Python projects using ANTLR or other parser generator tools. I haven't done that yet because:
* I want to release the software that contains this validator and I want to limit the external build-time and (especially) run-time dependencies. So far, the application requires nothing except Python 2.4.
* The RE code is easier to unit test. When you use a parser generator you usually have to nominate start productions in your grammar. In order to unit test you have to add lots of extra start productions. That tends to substantially impact the readability, performance, and/or size of the generated code.
* I keep hearing that people are getting by with regular expressions to do complex stuff but traditionally I've always jumped directly to ANTLR and other similar systems. I wanted to try doing this just using regular expressions to get a good grasp of what those tools are buying me and to find out why so many people just end up using regular expressions. I found that, because this tool is just a simple validator/recognizer, those advanced tools don't really buy me much here. As you can see, the Python+RE code I posted directly mirrors the RFC 3987 grammar. If I switched to PLY or something similar, I wouldn't save anything in terms of LoC and readability wouldn't be that much improved.
* I've heard many times that it is best to minimize the amount of Python code that executes in the critical paths of a program; instead, we will get better performance by using native libraries (like the re module). Sometime I will probably try out a code generator as an experiment so that I can measure the performance of the re module vs. a bunch of generated Python code.
> I've heard many times that it is best to minimize the amount of Python code that executes in the critical paths of a program; instead, we will get better performance by using native libraries (like the re module).
Keep in mind that, while the 're' module itself may be written in C, it is in effect an interpreter for another language (regexps). You can easily build a regular expression that will happily backtrack and capture its way into a performance black hole, esp. since you can't trace the individual sub-patterns using normal Python profiling tools.
Benchmarking the two implementations, as you suggest, is of course the only way to be sure in your case.
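A toy demonstration of that black hole, assuming CPython's backtracking re engine: the nested quantifier in (a+)+$ forces exponentially many retries on an input that can't match:
<code>
# Toy demonstration of catastrophic backtracking in CPython's re:
# (a+)+$ on a string that almost matches retries exponentially.
import re
import time

pattern = re.compile(r'(a+)+$')

for n in (18, 22, 26):
    s = 'a' * n + 'b'                 # trailing 'b' guarantees failure
    t0 = time.perf_counter()
    pattern.match(s)
    print(n, round(time.perf_counter() - t0, 3), 'seconds')
# Each extra few characters multiplies the time: the engine itself
# is C code, but the number of steps explodes.
</code>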
1 - Python parser generators are often just a bunch of .py files, so you won't be adding any external dependency
2 - With PLY (and some others) you can use any non-terminal as the start symbol in test mode, and it won't have any impact on performance
4 - in the case of Python-based parser generators, the lexer is the weak part. Take PLY - its lexer is based on the re module, so you won't gain much there; you can write a manual lexer and boost your performance.
1. That is typically the case with parser generators. But, it is still a build-time dependency. Right now I don't really have any build-time processing except for unit tests and I want to keep "building" as simple as possible.
2. That is good to know. That indicates to me that PLY doesn't do a lot of optimizations, but maybe the optimizations that I am used to having for LL(k) parsing are not effective or necessary for (LA)LR parsing.
4. That indicates to me that the performance won't be any better than the RE-only approach. What isn't shown is that I compile the "iri" and "iri-reference" rules into two big regular expressions. AFAICT, doing one match with a big regular expression instead of a series of matches with small regular expressions will always be faster, especially if the RE engine does any (global) optimizations.
By the way, an IRI is a single token in my application so my parser is really just a lexer.
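To make the above concrete, here is a hypothetical sketch of composing one big compiled RE from named sub-patterns (far simpler than the RFC 3987 productions, and the patterns are made up):
<code>
# Hypothetical sketch of stitching small named productions into one
# compiled RE, so the whole token is matched in a single call.
import re

scheme = r'(?P<scheme>[a-z][a-z0-9+.-]*)'
host   = r'(?P<host>[^/?#\s]+)'
path   = r'(?P<path>/[^?#\s]*)?'

url_re = re.compile(scheme + '://' + host + path + r'$')

m = url_re.match('http://example.org/index.html')
print(m.group('scheme'), m.group('host'), m.group('path'))
# http example.org /index.html
</code>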
This is a terrible example. First off, you're using double-quoted syntax, but it's not in double quotes. Additionally, you're confusing when regular expression metacharacters need to be escaped because they're metacharacters and when they need to be escaped because they are in a double-quoted string. The PHP documentation is particularly terrible in this regard (telling you to put regular expressions in double-quoted strings rather than single-quoted, because PHP doesn't have a regular expression type).
[-\.\\w] means: match a dash, a dot, a backslash, or the character w.
Secondly, it's not anchored to the start or end of the input.
Thirdly, the LHS can be empty according to this regular expression.
Lastly, we've finally uncovered who it is that's keeping everyone from using + on the LHS to do sendmail style +folder references.
actually i didn't use PHP, the code i submitted is 'ported' using /regex/ ... i haven't tested it in PHP (i don't have it) ... but that's what i might use if i were forced to use PHP
the actual code is in newlisp:
<code>
(set 'p1 (regex-comp {[-\.\w]@\w+(\.\w+)+}))    ; compile the email pattern
(map (lambda (f)
       ; push every match ($0) onto E; 0x10000 marks p1 as pre-compiled
       (replace p1 (read-file f) (push $0 E) 0x10000))
     (directory "." "php|htm|html"))            ; every php/htm/html file
(set 'E (unique E))                             ; dedupe
(save "email.db" 'E)                            ; persist the collected list
</code>
my newlisp code is not used for validation, it's for scraping emails from ~30000 html pages totalling 870 MB
<pre>
~/dl $ find . | wc
29546 29548 1423727
~/dl $ du -h
870M
</pre>
I used more complex regexes (slow) but got only marginal improvement (< 0.1%) and a lot of noise
This code is pretty fast, almost an order of magnitude faster than more complex, complete email regexes
[-\.\w] means ... match - or \. or a word (a-zA-Z0-9) character ad nauseam (you can put more chars inside [] but the regex is slower)
i didn't use it for validation, so no anchoring needed ... but it's an easy modification
anyway, this is the code i use for the real world - i won't pretend that it's perfect for every corner case, but hey it works fast for me :D short code solving 90+% of cases
I don't know about the lisp code (where you use \w), but \\w matches a backslash and a w, as the first \ escapes the second. I wasn't saying you were using PHP, but I know the PHP docs were confusing on how to escape things because backslash is overloaded for string escaping and for regular expression escaping and character classes.
I can't see how a more complete (complex?) correct regular expression would produce MORE noise in your output. The very fact that you allow the empty string on the LHS means that you risk getting invalid addresses.
Continue to use that busted regular expression for parsing email addresses out of these 30,000 HTML files and please remove all my email addresses from the list you generate while you're at it.
don't worry, the html files are ad pages where people deliberately post their phone and email as contact info (public info)
the correct regex can indeed produce noise, consider this string in an ad where the poster uses '+' to join info:
"please contact the owner, her info: 4041234567+owner@gmail.com ... thx!"
the correct 'boilerplate' regex will happily take 4041234567+owner@gmail.com ... which is not what i want; thus, noise
the less pedantic regex i use will just take owner@gmail.com, which is what i want
similarly with the "?" char (legal in the local part), the correct regex will catch "sendIM?owner@gmail.com"; which again, is not what i want ...
the 'busted' regex will only take 'owner@gmail.com' ... which is what i want
that's the main reason for the omission of +, ? and other legal chars in the local part of my email regex, the second reason is speed
for email validation where what u get is just a string of less than 100 chars, it's fine to use a complete email validation regex; however, for email scraping of close to 1gb of data, boilerplate won't cut it
i'm willing to sacrifice completeness for speed
my regex doesn't allow an empty string, the char class there acts like '\S' (non-space), equivalent to [^ \t\n]
and i was wrong, '\w' is [a-zA-Z0-9_] ... i forgot '_' is included in metachar '\w'
thx, your criticism forced me to reopen my regex book
Well, sure, but when you're talking about regular expressions and don't give them a context (which you did in your PHP example above), you should be using regular expression escaping syntax, not one that is based on a context that isn't mentioned. PCRE are the same no matter which language or application uses the library, how the language serializes and represents regular expression literals is an aspect of the language, not of the regular expression engine.
Oddly, it's really unintuitive to specify a character class as you have above that is composed of the backslash and the w character in PHP. The PHP single quoted string '\\w' and '\w' are exactly the same, and end up being the same to the regular expression engine. If you want to put a backslash in your regular expression, you need to quadruple it.
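The same layering shows up in Python when raw strings aren't used, which may make the quadrupling less surprising (a small sketch):
<code>
# The regex that matches one literal backslash is \\ , and spelling
# that regex as an ordinary (non-raw) string doubles each backslash
# again, hence four in the source.
import re

s = 'C:\\temp'
print(re.search('\\\\', s).group())  # quadrupled in source, matches one backslash
print(re.search(r'\\', s).group())   # raw string: only doubled
</code>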