Instead of learning 5 regular expressions by rote, just learn how they work and make your own as you need them. They really aren't hard to use; they're just incredibly compact, which makes them look unintelligible at first.
_Mastering Regular Expressions_ is the obvious classic, but a decent intro to Perl or Python will probably get to them eventually. (If you use Emacs, it's particularly easy to learn them by experimenting with M-x re-builder.) Also: not everything uses the same RE implementation, and some use non-standard extensions.
Don't become too drunk with power yet, though! There's a lot they're just not capable of doing (most notably, balancing tags around arbitrarily nested expressions). Once you have REs down, learning a lexer/parser is the next step. :)
I wish I knew a better way to get a survey of what's available in the Emacs world - something that would have clued me in to this command, for example... any tips?
Edit: I want to comment on this as well:
> learning a lexer/parser is the next step
Recently I've had occasion to write a few parsers. In the past, I'd always used parser generators (yacc and antlr) on the assumption that they made things easier. For this project that wasn't an option, so I bit the bullet and did the recursive descent thing. To my surprise, it turned out to be way, way easier than I expected. Moreover, in at least one case (a complex one), the hand-written parser code turned out shorter than the (cl-) yacc version, as well as handling more cases and reporting errors better.
The morals of the story are: (1) Don't assume something is hard without trying it; (2) It's not hard to write a classical recursive-descent parser and the skill is a lot more useful than you may realize. Wish I'd learned that a long time ago.
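For anyone wondering what "doing the recursive descent thing" looks like, here's a minimal sketch in Python under toy assumptions (a small arithmetic grammar, not the parser discussed above). Each grammar rule becomes one function, and the call stack handles the nesting:
<code>
# Minimal recursive-descent parser/evaluator for a toy grammar
# (illustrative only, not the parser from the comment above):
#   expr   -> term (('+'|'-') term)*
#   term   -> factor (('*'|'/') factor)*
#   factor -> NUMBER | '(' expr ')'
import re

class Parser:
    def __init__(self, text):
        self.tokens = re.findall(r'\d+|[-+*/()]', text)  # trivial lexer
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def eat(self, expected=None):
        tok = self.peek()
        if tok is None or (expected is not None and tok != expected):
            # Error reporting is easy: we know exactly where we are.
            raise SyntaxError('expected %r, got %r at token %d'
                              % (expected, tok, self.pos))
        self.pos += 1
        return tok

    def expr(self):
        value = self.term()
        while self.peek() in ('+', '-'):
            if self.eat() == '+':
                value += self.term()
            else:
                value -= self.term()
        return value

    def term(self):
        value = self.factor()
        while self.peek() in ('*', '/'):
            if self.eat() == '*':
                value *= self.factor()
            else:
                value /= self.factor()
        return value

    def factor(self):
        if self.peek() == '(':
            self.eat('(')
            value = self.expr()
            self.eat(')')
            return value
        return int(self.eat())

print(Parser('2 * (3 + 4)').expr())  # 14
</code>
Note how good error messages fall out naturally: at any failure you know the exact token position and which rule you were in.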
> I wish I knew a better way to get a survey of what's available in the Emacs world - something that would have clued me in to this command, for example... any tips?
I second Emacswiki as suggested, but I find quite a bit by searching through the Emacs and Elisp info pages and using M-x apropos / M-x apropos-variable. It doesn't always work though - to find something, you generally need to know what Emacs calls it.
Exactly! Learn how to write regular expressions because they are super useful, then immediately disregard the boilerplate ones provided in this article, because any self-respecting language's stdlib (or extremely common libraries/frameworks) will handle all these specific cases in an abstracted way. Not to mention, learning only these doesn't mean you understand regex, and not all the features of most regex engines are covered here (lookaheads, lookbehinds, and so on).
If I understand what you mean, you actually can balance tags around arbitrarily nested expressions if your regex parser supports recursion. The Ruby Way has an example of using recursive regexes to balance parentheses.
Once your regexes are getting that complicated, though, you're probably better off specifying things in terms of a BNF-based parser (or a recursive-descent parser, depending on the grammar).
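A small sketch of that in Python, assuming the third-party regex module (the stdlib re module has no recursion support); (?R) recurses into the whole pattern:
<code>
# Sketch: matching balanced parens with a recursive pattern.
# Requires the third-party "regex" module (pip install regex);
# the stdlib "re" module does not support recursion.
import regex

# (?R) recurses into the entire pattern, so the nesting can be
# arbitrarily deep.
balanced = regex.compile(r'\((?:[^()]|(?R))*\)')

print(bool(balanced.fullmatch('(a(b(c)d)e)')))  # True
print(bool(balanced.fullmatch('(a(b)c')))       # False: unbalanced
</code>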
The first tip is okay, but the latter 4 are just horrible. Any decent language will have libraries for parsing HTML or e-mails addresses. A regex is sure to come up short and be very fragile.
I once had to maintain some screen-scraping code that was written in Python using regular expressions. By the time I inherited it, half of the functionality no longer worked. It would have been much better off using a library like BeautifulSoup, both in terms of development time and maintainability.
BeautifulSoup alone takes care of REs 2 and 3, and there are standard libraries in Python that take care of 4 and 5. Why reinvent (less robustly, I might add) the wheel when a simple API already exists in many languages?
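For illustration, a minimal BeautifulSoup (bs4) sketch of the kind of extraction that tends to break when done with REs; the HTML and the class name here are invented for the example:
<code>
# Sketch of the BeautifulSoup (bs4) approach; the HTML and the
# class name "price" are made up for this example.
from bs4 import BeautifulSoup

html = '<div class="price">$19.99</div> <a href="/next">next</a>'
soup = BeautifulSoup(html, 'html.parser')

print(soup.find('div', class_='price').get_text())  # $19.99
print(soup.find('a')['href'])                       # /next
</code>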
I wouldn't use a single regexp for complete username validation -- if it fails, all you can display is a generic "username is not valid, it must obey rules X, Y & Z" message. I'd check min and max length separately and display an appropriate error message for that.
Also ignore leading/trailing spaces; or you'll end up puzzled why you have two "bob@example.org" users in your database even with appropriate database constraint, and bob mails you saying he can't login on the account he just paid for.
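A sketch of both points in Python (the length limits and allowed characters here are made-up placeholders): strip whitespace first, then check each rule on its own so the error message can say which rule failed:
<code>
# Made-up rules for illustration: 3-16 chars, letters/digits/underscore.
import re

def username_errors(raw):
    name = raw.strip()                    # ignore leading/trailing spaces
    errors = []
    if len(name) < 3:
        errors.append('username must be at least 3 characters')
    if len(name) > 16:
        errors.append('username must be at most 16 characters')
    if not re.fullmatch(r'[A-Za-z0-9_]*', name):
        errors.append('username may only contain letters, digits and _')
    return errors

print(username_errors('  bob  '))  # []: trimmed, then passes
print(username_errors('x!'))       # two specific messages
</code>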
5 Regular Expressions Every Web Programmer should know how to write... if he claims he knows regexps. The problem is that, in my experience, only a small minority of web developers really know regexps.
The value of regex-validating emails and URLs is limited. Validation is mostly useful for flagging completely malformed input, like people writing "asdf" or "I dont have an email, plz use fed-ex.".
Regex validation will not catch common typos (or made-up emails) because they tend to look like correct emails. So the intricate, syntactically correct email validation regexes are a solution in search of a problem.
If you really want to be sure that the email is correct, you have to send a confirmation mail. For your input validation, just check that there is a "@" somewhere.
People also try to look up the MX records for the domain to see if it has a valid mail server. This works 95% of the time--but breaks in certain cases leading to false positives.
As olavk said, just do basic validation (@ exists) and send off a confirmation e-mail. It works 100% of the time.
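A minimal sketch of that sanity check in Python (deliberately permissive; the confirmation e-mail does the real work):
<code>
# Minimal pre-confirmation sanity check (a sketch, not a validator):
# require one "@" with something on both sides, then let the
# confirmation e-mail prove deliverability.
def plausible_email(raw):
    addr = raw.strip()                    # trim spaces, per the comment above
    local, sep, domain = addr.partition('@')
    return bool(local and sep and domain)

print(plausible_email(' bob@example.org '))  # True
print(plausible_email('asdf'))               # False
</code>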
My URL (IRI) validation regex is a very complicated work in progress and depends on being post-processed by a Python function to handle Unicode characters outside the BMP. It has hundreds of lines of unit tests (not shown).
That is what I've done in the past with non-Python projects using ANTLR or other parser generator tools. I haven't done that yet because:
* I want to release the software that contains this validator and I want to limit the external build-time and (especially) run-time dependencies. So far, the application requires nothing except Python 2.4.
* The RE code is easier to unit test. When you use a parser generator you usually have to nominate start productions in your grammar. In order to unit test you have to add lots of extra start productions. That tends to substantially impact the readability, performance, and/or size of the generated code.
* I keep hearing that people are getting by with regular expressions to do complex stuff but traditionally I've always jumped directly to ANTLR and other similar systems. I wanted to try doing this just using regular expressions to get a good grasp of what those tools are buying me and to find out why so many people just end up using regular expressions. I found that, because this tool is just a simple validator/recognizer, those advanced tools don't really buy me much here. As you can see, the Python+RE code I posted directly mirrors the RFC 3987 grammar. If I switched to PLY or something similar, I wouldn't save anything in terms of LoC and readability wouldn't be that much improved.
* I've heard many times that it is best to minimize the amount of Python code that executes in the critical paths of a program; instead, we will get better performance by using native libraries (like the re module). Sometime I will probably try out a code generator as an experiment so that I can measure the performance of the re module vs. a bunch of generated Python code.
> I've heard many times that it is best to minimize the amount of Python code that executes in the critical paths of a program; instead, we will get better performance by using native libraries (like the re module).
Keep in mind that, while the 're' module itself may be written in C, it is in effect an interpreter for another language (regexps). You can easily build a regular expression that will happily backtrack and capture its way into a performance black hole, esp. since you can't trace the individual sub-patterns using normal Python profiling tools.
Benchmarking the two implementations, as you suggest, is of course the only way to be sure in your case.
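A toy demonstration of that black hole, assuming CPython's backtracking re engine: the nested quantifier in (a+)+$ forces exponentially many retries on an input that can't match:
<code>
# Toy demonstration of catastrophic backtracking in CPython's re:
# (a+)+$ on a string that almost matches retries exponentially.
import re
import time

pattern = re.compile(r'(a+)+$')

for n in (18, 22, 26):
    s = 'a' * n + 'b'                 # trailing 'b' guarantees failure
    t0 = time.perf_counter()
    pattern.match(s)
    print(n, round(time.perf_counter() - t0, 3), 'seconds')
# Each extra few characters multiplies the time: the engine itself
# is C code, but the number of steps explodes.
</code>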
1 - Python parser generators are often just a bunch of .py files, so you won't be adding any external dependency
2 - With PLY (and some others) you can use any non-terminal as the start symbol in test mode, and it won't have any impact on performance
4 - in the case of Python-based parser generators, the lexer is the weak part. Take PLY - its lexer is based on the re module, so you won't gain much there; you can write a manual lexer and boost your performance.
1. That is typically the case with parser generators. But, it is still a build-time dependency. Right now I don't really have any build-time processing except for unit tests and I want to keep "building" as simple as possible.
2. That is good to know. That indicates to me that PLY doesn't do a lot of optimizations, but maybe the optimizations that I am used to having for LL(k) parsing are not effective or necessary for (LA)LR parsing.
4. That indicates to me that the performance won't be any better than the RE-only approach. What isn't shown is that I compile the "iri" and "iri-reference" rules into two big regular expressions. AFAICT, doing one match with a big regular expression instead of a series of matches with small regular expressions will always be faster, especially if the RE engine does any (global) optimizations.
By the way, an IRI is a single token in my application so my parser is really just a lexer.
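To make the above concrete, here is a hypothetical sketch of composing one big compiled RE from named sub-patterns (far simpler than the RFC 3987 productions, and the patterns are made up):
<code>
# Hypothetical sketch of stitching small named productions into one
# compiled RE, so the whole token is matched in a single call.
import re

scheme = r'(?P<scheme>[a-z][a-z0-9+.-]*)'
host   = r'(?P<host>[^/?#\s]+)'
path   = r'(?P<path>/[^?#\s]*)?'

url_re = re.compile(scheme + '://' + host + path + r'$')

m = url_re.match('http://example.org/index.html')
print(m.group('scheme'), m.group('host'), m.group('path'))
# http example.org /index.html
</code>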
This is a terrible example. First off, you're using double-quoted syntax, but it's not in double quotes. Additionally, you're confusing when regular expression metacharacters need to be escaped because they're metacharacters and when they need to be escaped because they are in a double-quoted string. The PHP documentation is particularly terrible in this regard (telling you to put regular expressions in double-quoted strings rather than single-quoted, because PHP doesn't have a regular expression type).
[-\.\\w] means: match a dash, a dot, a backslash, or the character w.
Secondly, it's not anchored to the start or end of the input.
Thirdly, the LHS can be empty according to this regular expression.
Lastly, we've finally uncovered who it is that's keeping everyone from using + on the LHS to do sendmail style +folder references.
actually i didn't use PHP, the code i submitted is 'ported' using /regex/ ... i haven't tested it in PHP (i don't have it) ... but that's what i might use if i were forced to use PHP
the actual code is in newlisp:
<code>
(set 'p1 (regex-comp {[-\.\w]@\w+(\.\w+)+}))    ; compile the email pattern
(map (lambda (f)
       ; push every match ($0) onto E; 0x10000 marks p1 as pre-compiled
       (replace p1 (read-file f) (push $0 E) 0x10000))
     (directory "." "php|htm|html"))            ; every php/htm/html file
(set 'E (unique E))                             ; dedupe
(save "email.db" 'E)                            ; persist the collected list
</code>
my newlisp code is not used for validation, it's for scraping emails from ~30000 html pages totalling 870 MB
<pre>
~/dl $ find . | wc
29546 29548 1423727
~/dl $ du -h
870M
</pre>
I used more complex regexes (slow) but got only marginal improvement (< 0.1%) and a lot of noise
This code is pretty fast, almost an order of magnitude faster than more complex, complete email regexes
[-\.\w] means ... match - or \. or a word (a-zA-Z0-9) character ad nauseam (you can put more chars inside [] but the regex is slower)
i didn't use it for validation, so no anchoring needed ... but it's an easy modification
anyway, this is the code i use for the real world - i won't pretend that it's perfect for every corner case, but hey it works fast for me :D short code solving 90+% of cases
I don't know about the lisp code (where you use \w), but \\w matches a backslash and a w, as the first \ escapes the second. I wasn't saying you were using PHP, but I know the PHP docs were confusing on how to escape things because backslash is overloaded for string escaping and for regular expression escaping and character classes.
I can't see how a more complete (complex?) correct regular expression would produce MORE noise in your output. The very fact that you allow the empty string on the LHS means that you risk getting invalid addresses.
Continue to use that busted regular expression for parsing email addresses out of these 30,000 HTML files and please remove all my email addresses from the list you generate while you're at it.
don't worry, the html files are ad pages where people deliberately post their phone and email as contact info (public info)
the correct regex can indeed produce noise, consider this string in an ad where the poster uses '+' to join info:
"please contact the owner, her info: 4041234567+owner@gmail.com ... thx!"
the correct 'boilerplate' regex will happily take 4041234567+owner@gmail.com ... which is not what i want; thus, noise
the less pedantic regex i use will just take owner@gmail.com, which is what i want
similarly with the "?" char (legal in the local part), the correct regex will catch "sendIM?owner@gmail.com"; which again, is not what i want ...
the 'busted' regex will only take 'owner@gmail.com' ... which is what i want
that's the main reason for the omission of +, ? and other legal chars in the local part of my email regex, the second reason is speed
for email validation where what u get is just a string of less than 100 chars, it's fine to use a complete email validation regex; however, for email scraping of close to 1gb of data, boilerplate won't cut it
i'm willing to sacrifice completeness for speed
my regex doesn't allow an empty string, the char class there acts like '\S' (non-space), equivalent to [^ \t\n]
and i was wrong, '\w' is [a-zA-Z0-9_] ... i forgot '_' is included in metachar '\w'
thx, your criticism forced me to reopen my regex book
Well, sure, but when you're talking about regular expressions and don't give them a context (which you did in your PHP example above), you should be using regular expression escaping syntax, not one that is based on a context that isn't mentioned. PCRE are the same no matter which language or application uses the library, how the language serializes and represents regular expression literals is an aspect of the language, not of the regular expression engine.
Oddly, it's really unintuitive to specify a character class as you have above that is composed of the backslash and the w character in PHP. The PHP single quoted string '\\w' and '\w' are exactly the same, and end up being the same to the regular expression engine. If you want to put a backslash in your regular expression, you need to quadruple it.
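The same layering shows up in Python when raw strings aren't used, which may make the quadrupling less surprising (a small sketch):
<code>
# The regex that matches one literal backslash is \\ , and spelling
# that regex as an ordinary (non-raw) string doubles each backslash
# again, hence four in the source.
import re

s = 'C:\\temp'
print(re.search('\\\\', s).group())  # quadrupled in source, matches one backslash
print(re.search(r'\\', s).group())   # raw string: only doubled
</code>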