Similarly, I wish regex engines had built-in “partial match” support, i.e. is there some suffix that can be added to this string that results in a match?
I wrote something like this in Haskell for doing path traversal (descend into the directory, or skip it?). But it seems like a straightforward API addition to expose valid/invalid states.
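For what it's worth, Python's third-party regex package (the one from PyPI, not the stdlib re) does expose something like this, if I remember its API right: passing partial=True reports whether the string so far could still be extended into a match. A rough sketch (the phone-number-ish pattern is just an example):

    import regex  # third-party "regex" package from PyPI, not the stdlib re

    # Sketch of the "could some suffix still complete a match?" question.
    pattern = regex.compile(r'\d{3}-\d{4}')

    for s in ['555', '555-12', '555-1234', 'x55']:
        m = pattern.fullmatch(s, partial=True)
        if m and m.partial:
            print(repr(s), '-> could still become a match')
        elif m:
            print(repr(s), '-> complete match')
        else:
            print(repr(s), '-> no suffix can fix this')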
The regex engine used by musl libc, TRE (https://github.com/laurikari/tre), supports approximate matching. musl doesn't seem to expose that capability, however.
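Tangentially, that same third-party Python regex package also does approximate matching, so you can experiment with the idea even where libc doesn't expose TRE's version of it. A small sketch, going from memory of its fuzzy syntax ({e<=2} means "allow up to two insert/delete/substitute errors"):

    import regex  # the same PyPI "regex" package as above

    # "(?:...){e<=2}" allows up to two errors when matching the group.
    m = regex.search(r'(?:approximate){e<=2}', 'an aproximate match')
    print(m.group())
    print(m.fuzzy_counts)  # (substitutions, insertions, deletions), if memory serves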
This sounds like a perfect use case for object-oriented programming. “I want something mostly like that thing that already exists, but with a slight variation.” Or hooks, if you’re one of those weird Lisp programmers.
Unfortunately, virtually all regex engines I’ve ever seen have been opaque black boxes with no way to modify their behavior at all.
On a tangentially related note, more and more I find myself wishing regex syntax wasn't such a hodgepodge of literal/non-literal characters. It always seems to end up in a mess of remembering what is and isn't escaped and which form is needed in which context. I keep thinking we really only need one special character, namely '\', to prefix all non-literals: \\ produces '\', \. is the dot match, \* is the star operator, etc. So instead of .foo* we'd have \.foo\*, which admittedly ends up more verbose in most situations, but still strikes me as clearer about what's intended.
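To make that concrete, here's a tiny hypothetical translator (Python; the name to_std and the examples are made up for illustration) from the proposed "only backslash is special" syntax into ordinary re/PCRE-style syntax:

    import re

    # In the proposed syntax every character is literal unless prefixed by '\':
    # \. is "any char", \* is "zero or more", \\ is a literal backslash, etc.
    # Assumes the pattern doesn't end with a lone backslash.
    def to_std(pattern: str) -> str:
        out = []
        chars = iter(pattern)
        for c in chars:
            if c == '\\':
                op = next(chars)  # the char after '\' names an operator or class
                # operators like \. \* \( become bare metacharacters; classes
                # like \d \w and the escaped backslash keep their backslash
                out.append('\\' + op if op == '\\' or op.isalnum() else op)
            else:
                out.append(re.escape(c))  # everything else is matched literally
        return ''.join(out)

    print(to_std(r'\.foo\*'))   # -> .foo*
    print(to_std('a.b'))        # -> a\.b  ('.' is literal in the proposed syntax)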
Point 2: Their syntax is vastly simpler because literal characters are quoted, and operators are not. For example, ^ no longer means three totally different things. See the critique at the end of this doc.
(In other words, it's similar to lex or re2c syntax)
One easy contribution is to translate it back to PCRE syntax, because that's a very common syntax that people care about. Right now it translates to ERE, which works with egrep, awk, and GNU sed --regexp-extended.
It should be a trivial change; printing the eggex AST in PCRE syntax is maybe a hundred lines of code or so (really the hard part is testing).
Eggex gives you a more EBNF-like syntax for regexes. CFGs are basically regular languages with recursion, so there's no reason for regexes to have a wildly different, Perl-like syntax. That's basically a historical accident.
Also, regular languages are straightforward and tractable, and you should use them where possible (aside from bad syntax, which Eggex is fixing).
In contrast, CFGs are useful, but if you start using them a lot, you will see they're not really "one thing" as far as programming/engineering is concerned. They sort of explode into a fractal of complexity -- there are many different subsets of CFGs, some of which can be parsed efficiently with particular algorithms.
So basically regexes have bad syntax but relatively good semantics (libc BRE/ERE doesn't have the Perl backtracking issue). CFGs don't have a consistently bad syntax, but the semantics are much more complex when you're talking about recognizing vs. generating. And of course in programming (rather than math), we want to recognize, not generate.
I find PCRE's rule easy enough: for a character you want taken literally, if it's punctuation, backslash it; if it's not, don't. Any unbackslashed punctuation (or backslashed non-punctuation) may be a metacharacter (a couple of quick examples below the footnote).
Compare to, say, POSIX or Vim[0] REs, where some punctuation characters are special with a backslash, others are special without, and I can never remember which is which.
[0] Regardless of the state of the "magic" option - the only way to get consistent behavior is to start every single RE with either \v (which works like PCRE) or \V (which works like your proposal).
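To make the PCRE-style rule above concrete, a few throwaway examples using Python's re, whose escaping behaves the same way for these cases:

    import re

    # Unbackslashed punctuation is a metacharacter; backslashed punctuation is literal.
    print(re.findall(r'.', 'a.1'))    # ['a', '.', '1']  ('.' = any character)
    print(re.findall(r'\.', 'a.1'))   # ['.']            (literal dot)

    # Plain letters are literal; backslashed letters may be metacharacters.
    print(re.findall(r'd', 'd1'))     # ['d']            (literal 'd')
    print(re.findall(r'\d', 'd1'))    # ['1']            ('\d' = any digit)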
You can get a fair distance with character classes[1] like `[[:digit:]]` and free-spacing[2] with `/x` or `(?x)`, where supported.
And/or yeah, build regexes out of descriptive fragments. It doesn't take much effort to make them readable; you don't have to have one giant blob of !(#&$()*. Plus then your regexes become relatively easily validated teaching tools for readers.
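For example, a sketch of the descriptive-fragments approach in Python (the fragment names and the ISO-date pattern are made up for illustration), using re.VERBOSE so whitespace and comments are allowed:

    import re

    YEAR  = r'(?P<year>\d{4})'
    MONTH = r'(?P<month>0[1-9]|1[0-2])'
    DAY   = r'(?P<day>[0-3]\d)'

    ISO_DATE = re.compile(rf"""
        ^
        {YEAR}  -    # four-digit year
        {MONTH} -    # two-digit month
        {DAY}        # two-digit day (loosely validated)
        $
    """, re.VERBOSE)

    print(ISO_DATE.match('2024-06-30').groupdict())
    # {'year': '2024', 'month': '06', 'day': '30'}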
There is a really easy solution to this: write regexes with combinators.
star (range 'a' 'z')
Naturally, it's trivial to have a `posix` combinator (or any other syntax, really), which lets you be compact when things are trivial and explicit when they are not, or even mix both. You also don't have to conflate parens for priority and capture anymore, and you can create high-level combinators.
For instance, this is real code to build a regex for parsing URLs (translated into a JS-ish syntax):
let cset = chars => compl(set(chars));
let prefix = (head, tail) => char(head) ++ rep(tail);
let scheme = rep(cset("/:?#")) ++ str("://");
let host = rep(cset("/:?#"));
let port = prefix(':', digit);
let path = rep(prefix('/', cset("/?#")));
let query = prefix('?', cset("#"));
let fragment = prefix('#', any);
let url = scheme ++ host ++ opt(port) ++ path ++ opt(query) ++ opt(fragment);
It's less compact, but it's still as fast and it's actually readable/maintainable.
It's all the advantages of Perl's regex, without having to suffer Perl's taste in syntax.
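If your language doesn't have a combinator library handy, the same style is easy to fake on top of a normal engine. A minimal sketch in Python (all the helper names here are invented; each combinator just emits ordinary re syntax, so the result compiles with the standard engine):

    import re

    lit   = lambda s: re.escape(s)                       # literal string
    seq   = lambda *ps: ''.join(ps)                      # concatenation
    rep   = lambda p: '(?:' + p + ')+'                   # one or more
    star  = lambda p: '(?:' + p + ')*'                   # zero or more
    opt   = lambda p: '(?:' + p + ')?'                   # zero or one
    cset  = lambda chars: '[^' + re.escape(chars) + ']'  # anything except these chars
    group = lambda name, p: f'(?P<{name}>{p})'           # named capture

    # Roughly the URL example above, expressed with these helpers:
    scheme = group('scheme', rep(cset('/:?#'))) + lit('://')
    host   = group('host', rep(cset('/:?#')))
    port   = opt(group('port', lit(':') + r'\d+'))
    path   = group('path', star(lit('/') + rep(cset('/?#'))))
    url    = re.compile(seq(scheme, host, port, path))

    print(url.match('https://example.com:8080/a/b').groupdict())
    # {'scheme': 'https', 'host': 'example.com', 'port': ':8080', 'path': '/a/b'}

Since each combinator wraps itself in a non-capturing group, precedence never bites you, and capture is explicit via group() rather than overloaded onto parens.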
That is my main problem too. And it doesn't get easier when you're using multiple layers, such as when you quote and escape for the shell to pass it to sed or grep.
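One way to dodge the layering problem entirely (a sketch; the pattern and filename here are invented) is to skip the shell and hand the regex to grep as a single argv element:

    import subprocess

    pattern = r'^\$[0-9]+\.[0-9]{2}$'   # dollar amounts like $3.50
    result = subprocess.run(
        ['grep', '-E', '--', pattern, 'prices.txt'],  # no shell, so no re-quoting
        capture_output=True, text=True,
    )
    print(result.stdout)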
Over time, I've found that using r'foo' with Python's re.match() or re.search() is more precise and less of a burden than I expected, compared to languages that integrate regexes directly.
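For example (trivial, but it shows why the raw string helps):

    import re

    # With a raw string the pattern reads exactly as the engine sees it:
    print(re.search(r'\bfoo\b', 'a foo b'))
    # Without r'' the backslashes have to be doubled:
    print(re.search('\\bfoo\\b', 'a foo b'))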