Similarly, I wish regex engines had built-in “partial match” support, i.e. is there some suffix that can be added to this string that results in a match?
I wrote something like this in Haskell for doing path traversal (descend into the directory, or skip it?). But it seems like a straightforward API addition to expose valid/invalid states.
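For what it's worth, Python's third-party regex package (the one from PyPI, not the stdlib re) does expose something like this, if I remember its API right: passing partial=True reports whether the string so far could still be extended into a match. A rough sketch (the phone-number-ish pattern is just an example):

    import regex  # third-party "regex" package from PyPI, not the stdlib re

    # Sketch of the "could some suffix still complete a match?" question.
    pattern = regex.compile(r'\d{3}-\d{4}')

    for s in ['555', '555-12', '555-1234', 'x55']:
        m = pattern.fullmatch(s, partial=True)
        if m and m.partial:
            print(repr(s), '-> could still become a match')
        elif m:
            print(repr(s), '-> complete match')
        else:
            print(repr(s), '-> no suffix can fix this')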
The regex engine used by musl libc, TRE (https://github.com/laurikari/tre), supports approximate matching. musl doesn't seem to expose that capability, however.
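Tangentially, that same third-party Python regex package also does approximate matching, so you can experiment with the idea even where libc doesn't expose TRE's version of it. A small sketch, going from memory of its fuzzy syntax ({e<=2} means "allow up to two insert/delete/substitute errors"):

    import regex  # the same PyPI "regex" package as above

    # "(?:...){e<=2}" allows up to two errors when matching the group.
    m = regex.search(r'(?:approximate){e<=2}', 'an aproximate match')
    print(m.group())
    print(m.fuzzy_counts)  # (substitutions, insertions, deletions), if memory serves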
This sounds like a perfect use case for object-oriented programming. “I want something mostly like that thing that already exists, but with a slight variation.” Or hooks, if you’re one of those weird Lisp programmers.
Unfortunately, virtually all regex engines I’ve ever seen have been opaque black boxes with no way to modify their behavior at all.
On a tangentially related note, more and more I find myself wishing regex syntax wasn't such a hodgepodge of literal/non-literal characters. It always seems to end up in a mess of remembering what is and isn't escaped and which form is needed in which context. I keep thinking we really only need one special character, namely '\', to prefix all non-literals: \\ produces '\', \. is the dot match, \* is the star operator, etc. So instead of .foo* we'd have \.foo\*, which admittedly ends up more verbose in most situations, but still strikes me as clearer about what's intended.
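To make that concrete, here's a tiny hypothetical translator (Python; the name to_std and the examples are made up for illustration) from the proposed "only backslash is special" syntax into ordinary re/PCRE-style syntax:

    import re

    # In the proposed syntax every character is literal unless prefixed by '\':
    # \. is "any char", \* is "zero or more", \\ is a literal backslash, etc.
    # Assumes the pattern doesn't end with a lone backslash.
    def to_std(pattern: str) -> str:
        out = []
        chars = iter(pattern)
        for c in chars:
            if c == '\\':
                op = next(chars)  # the char after '\' names an operator or class
                # operators like \. \* \( become bare metacharacters; classes
                # like \d \w and the escaped backslash keep their backslash
                out.append('\\' + op if op == '\\' or op.isalnum() else op)
            else:
                out.append(re.escape(c))  # everything else is matched literally
        return ''.join(out)

    print(to_std(r'\.foo\*'))   # -> .foo*
    print(to_std('a.b'))        # -> a\.b  ('.' is literal in the proposed syntax)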
Point 2: Their syntax is vastly simpler because literal characters are quoted, and operators are not. For example, ^ no longer means three totally different things. See the critique at the end of this doc.
(In other words, it's similar to lex or re2c syntax)
One easy contribution is to translate it back to PCRE syntax, because that's a very common syntax that people care about. Right now it translates to ERE, which works with egrep, awk, and GNU sed --regexp-extended.
It should be a trivial change; printing the eggex AST in PCRE syntax is maybe a hundred lines of code or so (really the hard part is testing).
Eggex gives you a more EBNF-like syntax for regexes. CFGs are basically regular languages with recursion, so there's no reason for regexes to have a wildly different, Perl-like syntax. That's basically a historical accident.
Also, regular languages are straightforward and tractable, and you should use them where possible (aside from bad syntax, which Eggex is fixing).
In contrast, CFGs are useful, but if you start using them a lot, you will see they're not really "one thing" as far as programming/engineering is concerned. They sort of explode into a fractal of complexity -- there are many different subsets of CFGs, some of which can be parsed efficiently with particular algorithms.
So basically regexes have bad syntax but relatively good semantics (libc BRE/ERE doesn't have the Perl backtracking issue). CFGs don't have a consistently bad syntax, but the semantics are much more complex when you're talking about recognizing vs. generating. And of course in programming (rather than math), we want to recognize, not generate.
I find PCRE's rule easy enough: for a character you want taken literally, if it's punctuation, backslash it; if it's not, don't. Any unbackslashed punctuation (or backslashed non-punctuation) may be a metacharacter (a couple of quick examples below the footnote).
Compare to, say, POSIX or Vim[0] REs, where some punctuation characters are special with a backslash, others are special without, and I can never remember which is which.
[0] Regardless of the state of the "magic" option - the only way to get consistent behavior is to start every single RE with either \v (which works like PCRE) or \V (which works like your proposal).
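To make the PCRE-style rule above concrete, a few throwaway examples using Python's re, whose escaping behaves the same way for these cases:

    import re

    # Unbackslashed punctuation is a metacharacter; backslashed punctuation is literal.
    print(re.findall(r'.', 'a.1'))    # ['a', '.', '1']  ('.' = any character)
    print(re.findall(r'\.', 'a.1'))   # ['.']            (literal dot)

    # Plain letters are literal; backslashed letters may be metacharacters.
    print(re.findall(r'd', 'd1'))     # ['d']            (literal 'd')
    print(re.findall(r'\d', 'd1'))    # ['1']            ('\d' = any digit)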
You can get a fair distance with character classes[1] like `[[:digit:]]` and free-spacing[2] with `/x` or `(?x)`, where supported.
And/or yeah, build regexes out of descriptive fragments. It doesn't take much effort to make them readable; you don't have to have one giant blob of !(#&$()*. Plus then your regexes become relatively easily validated teaching tools for readers.
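For example, a sketch of the descriptive-fragments approach in Python (the fragment names and the ISO-date pattern are made up for illustration), using re.VERBOSE so whitespace and comments are allowed:

    import re

    YEAR  = r'(?P<year>\d{4})'
    MONTH = r'(?P<month>0[1-9]|1[0-2])'
    DAY   = r'(?P<day>[0-3]\d)'

    ISO_DATE = re.compile(rf"""
        ^
        {YEAR}  -    # four-digit year
        {MONTH} -    # two-digit month
        {DAY}        # two-digit day (loosely validated)
        $
    """, re.VERBOSE)

    print(ISO_DATE.match('2024-06-30').groupdict())
    # {'year': '2024', 'month': '06', 'day': '30'}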
There is a really easy solution to this: write regexes with combinators.
star (range 'a' 'z')
Naturally, it's trivial to have a `posix` combinator (or any other syntax, really), which lets you be compact when things are trivial and explicit when they are not, or even mix both. You also don't have to conflate parens for priority and capture anymore, and you can create high-level combinators.
For instance, this is real code to build a regex for parsing URLs (translated into a JS-ish syntax):
let cset = chars => compl(set(chars));
let prefix = (head, tail) => char(head) ++ rep(tail);
let scheme = rep(cset("/:?#")) ++ str("://");
let host = rep(cset("/:?#"));
let port = prefix(':', digit);
let path = rep(prefix('/', cset("/?#")));
let query = prefix('?', cset("#"));
let fragment = prefix('#', any);
let url = scheme ++ host ++ opt(port) ++ path ++ opt(query) ++ opt(fragment);
It's less compact, but it's still as fast and it's actually readable/maintainable.
It's all the advantages of Perl's regex, without having to suffer Perl's taste in syntax.
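If your language doesn't have a combinator library handy, the same style is easy to fake on top of a normal engine. A minimal sketch in Python (all the helper names here are invented; each combinator just emits ordinary re syntax, so the result compiles with the standard engine):

    import re

    lit   = lambda s: re.escape(s)                       # literal string
    seq   = lambda *ps: ''.join(ps)                      # concatenation
    rep   = lambda p: '(?:' + p + ')+'                   # one or more
    star  = lambda p: '(?:' + p + ')*'                   # zero or more
    opt   = lambda p: '(?:' + p + ')?'                   # zero or one
    cset  = lambda chars: '[^' + re.escape(chars) + ']'  # anything except these chars
    group = lambda name, p: f'(?P<{name}>{p})'           # named capture

    # Roughly the URL example above, expressed with these helpers:
    scheme = group('scheme', rep(cset('/:?#'))) + lit('://')
    host   = group('host', rep(cset('/:?#')))
    port   = opt(group('port', lit(':') + r'\d+'))
    path   = group('path', star(lit('/') + rep(cset('/?#'))))
    url    = re.compile(seq(scheme, host, port, path))

    print(url.match('https://example.com:8080/a/b').groupdict())
    # {'scheme': 'https', 'host': 'example.com', 'port': ':8080', 'path': '/a/b'}

Since each combinator wraps itself in a non-capturing group, precedence never bites you, and capture is explicit via group() rather than overloaded onto parens.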
That is my main problem too. And it doesn't get easier when you're using multiple layers, such as when you quote and escape for the shell to pass it to sed or grep.
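One way to dodge the layering problem entirely (a sketch; the pattern and filename here are invented) is to skip the shell and hand the regex to grep as a single argv element:

    import subprocess

    pattern = r'^\$[0-9]+\.[0-9]{2}$'   # dollar amounts like $3.50
    result = subprocess.run(
        ['grep', '-E', '--', pattern, 'prices.txt'],  # no shell, so no re-quoting
        capture_output=True, text=True,
    )
    print(result.stdout)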
Over time, I've found that using r'foo' with Python's re.match() or re.search() is more precise and less of a burden than I expected, compared to languages that integrate regexes directly.
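For example (trivial, but it shows why the raw string helps):

    import re

    # With a raw string the pattern reads exactly as the engine sees it:
    print(re.search(r'\bfoo\b', 'a foo b'))
    # Without r'' the backslashes have to be doubled:
    print(re.search('\\bfoo\\b', 'a foo b'))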