This is a terrible example. First off, you're using double-quoted syntax, but it's not in double quotes. Additionally, you're confusing when regular expression metacharacters need to be escaped because they're metacharacters with when they need to be escaped because they are in a double-quoted string. The PHP documentation is particularly terrible in this regard (telling you to put regular expressions in double-quoted strings rather than single-quoted ones, because PHP doesn't have a regular expression type).
[-\.\\w] means match a dash, a dot, a backslash and the w character.
Secondly, it's not anchored to the start or end of the input.
Thirdly, the LHS can be empty according to this regular expression.
Lastly, we've finally uncovered who it is that's keeping everyone from using + on the LHS to do sendmail style +folder references.
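To make the character-class point concrete, here is a small sketch in Python (chosen only for illustration; inside a character class its `re` module treats these escapes the same way PCRE does). The patterns are raw strings, so the regex engine receives exactly what is typed. The sample string and variable names are made up for the demo.

```python
import re

# Raw strings: the regex engine receives exactly these characters.
literal_class = re.compile(r"[-\.\\w]")  # dash, dot, backslash, the letter w
word_class = re.compile(r"[-\.\w]")      # dash, dot, any word character

# Hypothetical sample: a - b . c \ w _
sample = r"a-b.c\w_"

literal_hits = literal_class.findall(sample)
word_hits = word_class.findall(sample)

print(literal_hits)
print(word_hits)
```

The first class only picks up the dash, the dot, the backslash and the `w`; the second picks up every letter and the underscore but skips the backslash.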
actually i didn't use PHP, the code i submitted is 'ported' using /regex/ ... i haven't tested it in PHP (i don't have it) ... but that's what i might use if i were forced to use PHP
the actual code is in newlisp:
<code>
(set 'p1 (regex-comp {[-\.\w]+@\w+(\.\w+)+})) ;compile
(map (lambda (f)
(replace p1 (read-file f) (push $0 E) 0x10000)
) (directory "." "php|htm|html"))
(set 'E (unique E))
(save "email.db" 'E)
</code>
my newlisp code is not used for validation, it's for scraping emails from ~30000 html pages totalling 870 MB
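For readers who don't know newLISP, the same idea can be sketched roughly in Python. This is not the poster's code, just an approximation: the function name `scrape_emails` is hypothetical, the pattern assumes a `+` quantifier on the local-part class so that whole addresses are captured, and files are matched by suffix instead of by the `php|htm|html` name pattern.

```python
import re
from pathlib import Path

# Same shape as the newLISP pattern; the + after the class is an assumption.
EMAIL = re.compile(r"[-.\w]+@\w+(?:\.\w+)+")

def scrape_emails(root, suffixes=(".php", ".htm", ".html")):
    """Collect the unique address-like strings found in files directly under root."""
    found = set()
    for path in Path(root).iterdir():  # non-recursive, like (directory ".")
        if path.is_file() and path.suffix.lower() in suffixes:
            text = path.read_text(errors="ignore")  # tolerate odd encodings
            found.update(EMAIL.findall(text))
    return sorted(found)
```

Using a set plays the role of `(unique E)`; saving the result to a file is left out.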
<pre>
~/dl $ find . | wc
29546 29548 1423727
~/dl $ du -h
870M
</pre>
I tried more complex regexes (slow), but they gave only a marginal improvement (< 0.1%) and a lot of noise
This code is pretty fast, almost an order of magnitude faster than a more complex, complete email regex
[-\.\\w] means ... match - or \. or word (a-zA-Z0-9) character ad nauseum (can put more chars inside [] but the regex is slower)
i didn't use it for validation, so no anchoring needed ... but it's an easy modification
anyway, this is the code i use in the real world - i won't pretend that it's perfect for every corner case, but hey it works fast for me :D short code solving 90+% of cases
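The difference between scraping and validating with the same pattern can be sketched like this in Python. The function names are hypothetical and the pattern assumes a `+` on the local-part class; `re.fullmatch` behaves as if the pattern were anchored at both ends of the input.

```python
import re

# Assumed pattern: one or more local-part chars, @, then a dotted domain.
PATTERN = r"[-.\w]+@\w+(?:\.\w+)+"

def find_addresses(text):
    """Scraping: unanchored, picks up matches anywhere in the text."""
    return re.findall(PATTERN, text)

def is_whole_address(text):
    """Validation: the entire input has to match, nothing before or after."""
    return re.fullmatch(PATTERN, text) is not None
```

So `find_addresses("call owner@gmail.com today")` finds the address inside the sentence, while `is_whole_address` rejects that same sentence and accepts only the bare address.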
I don't know about the lisp code (where you use \w), but \\w matches a backslash and a w, as the first \ escapes the second. I wasn't saying you were using PHP, but I know the PHP docs were confusing on how to escape things because backslash is overloaded for string escaping and for regular expression escaping and character classes.
I can't see how a more complete (complex?) correct regular expression would produce MORE noise in your output. The very fact that you allow the empty string on the LHS means that you risk getting invalid addresses.
Continue to use that busted regular expression for parsing email addresses out of these 30,000 HTML files and please remove all my email addresses from the list you generate while you're at it.
don't worry, the htmls are ads pages where people deliberately put their phone and email so they can be contacted (public info)
the correct regex can indeed produce noise, consider this string in an ad where the poster uses '+' to join info:
"please contact the owner, her info: 4041234567+owner@gmail.com ... thx!"
the correct 'boilerplate' regex will happily take 4041234567+owner@gmail.com ... which is not what i want; thus, noise
the less pedantic regex i use will just take owner@gmail.com, which is what i want
similarly with the "?" char (legal in the local part), the correct regex will catch "sendIM?owner@gmail.com", which again is not what i want ...
the 'busted' regex will only take 'owner@gmail.com' ... which is what i want
that's the main reason for omitting +, ? and other legal chars from the local part of my email regex; the second reason is speed
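The two ad strings above can be run through a narrow and a broader local-part class to see the effect. This is a Python sketch, not the newLISP original; both patterns are simplified stand-ins (the broad one only adds `+` and `?`, two of the characters RFC 5322 allows in a local part), and the `+` quantifier on the class is assumed.

```python
import re

# Narrow local part: dash, dot and word characters only.
narrow = re.compile(r"[-.\w]+@\w+(?:\.\w+)+")
# Broader local part: additionally accepts + and ? (a simplified stand-in,
# not a complete RFC 5322 pattern).
broad = re.compile(r"[-.\w+?]+@\w+(?:\.\w+)+")

ads = [
    "please contact the owner, her info: 4041234567+owner@gmail.com ... thx!",
    "sendIM?owner@gmail.com",
]

narrow_hits = [narrow.search(ad).group(0) for ad in ads]
broad_hits = [broad.search(ad).group(0) for ad in ads]

print(narrow_hits)  # ['owner@gmail.com', 'owner@gmail.com']
print(broad_hits)   # ['4041234567+owner@gmail.com', 'sendIM?owner@gmail.com']
```

Whether the longer match counts as noise depends on the data: in these ads it does, but for someone who really uses `+folder` addressing the narrow pattern silently truncates a valid address.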
for email validation where what u get is just a string of less than 100 chars, it's fine to use a complete email validation regex; however, for email scraping of close to 1 GB of data, boilerplate won't cut it
i'm willing to sacrifice completeness for speed
my regex doesn't allow empty string, there's metachar '\s' for spaces, roughly equivalent to [ \t\r\n\f]
and i was wrong, '\w' is [a-zA-Z0-9_] ... i forgot '_' is included in metachar '\w'
thx, your criticism forced me to reopen my regex book
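For reference, a quick way to check what the shorthand classes cover is to run them over a few characters. This is a Python sketch; the `re.ASCII` flag is used so that `\w` and `\s` keep their ASCII meaning, which is the PCRE default, and the test strings are made up.

```python
import re

# re.ASCII restricts \w and \s to their ASCII definitions.
word = re.compile(r"\w", re.ASCII)
space = re.compile(r"\s", re.ASCII)

# \w keeps letters, digits and the underscore; it drops - . @ and the space.
word_chars = "".join(word.findall("aZ9_-.@ "))
# \s keeps whitespace only.
space_chars = space.findall("a b\tc\nd")

print(word_chars)
print(space_chars)
```

The uppercase forms `\W` and `\S` are the negated versions, which is where a `^` would show up if they were written out as bracket classes.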
Well, sure, but when you're talking about regular expressions and don't give them a context (which you did in your PHP example above), you should be using regular expression escaping syntax, not one that is based on a context that isn't mentioned. PCRE patterns are the same no matter which language or application uses the library; how the language serializes and represents regular expression literals is an aspect of the language, not of the regular expression engine.
Oddly, it's really unintuitive to specify a character class like the one above, composed of the backslash and the w character, in PHP. The PHP single-quoted strings '\\w' and '\w' are exactly the same, and end up being the same to the regular expression engine. If you want to put a literal backslash in your regular expression, you need to quadruple it.
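The two escaping layers are easier to see when they are pulled apart. The sketch below uses Python rather than PHP, so the string-literal rules are not identical, but the layering is the same: the string literal is decoded first, and the regex engine then interprets whatever characters are left.

```python
import re

# Layer 1: the string literal. In a normal (non-raw) string, "\\" is one backslash,
# so four typed backslashes become two characters.
two_backslashes = "\\\\"   # the regex engine receives: \\
raw_equivalent = r"\\"     # the same two characters, written as a raw string

# Layer 2: the regex engine. The pattern \\ matches one literal backslash.
backslash = re.compile(two_backslashes)

# "\\w" in a normal string is the two characters \w, i.e. the word-character
# class, not "a backslash followed by w".
word = re.compile("\\w")

print(two_backslashes == raw_equivalent)
print(backslash.findall("a\\b"))
print(word.findall("\\w"))
```

Raw strings exist in Python precisely to skip the first layer; PHP has no direct equivalent, which is why the quadrupled backslash keeps showing up there.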