Fair point about the parenthetical, though I would say both the engine and the language are crucial, depending on what role you're playing in the system.
The setup I think of: you have "domain experts" writing policies with regexes, software engineers who embed the engine in an application, and operators of the entire service. My use of regexes for "big data" followed that pattern, and I imagine the DPI applications of Hyperscan do too.
A lot of those rules are accumulated over time, are not easy to rewrite, and require testing. If you have a thousand regexes in one dialect painfully crafted over years, you can't just change the engine. It would break the whole application.
So that is why I emphasize the language. Backtracking engines are more common, and the language and the "code" written in that language have to be changed before you can change the engine. That happened at least partially at Google 10+ years ago, migrating from PCRE to RE2.
But yes, if you're an operator, you will be looking at it from the engine perspective. Obviously it is not enough to change the patterns themselves.
(The backreference point is interesting, although I'll probably leave it out of my mental model until someone implements it :) )
----
Also the point of Eggex is to treat regexes more abstractly, like a real language. Just like I don't write C++ code for GCC or Clang, I don't write regexes for a specific engine. I have ported regexes between engines many times.
The typical application I've worked on uses multiple engines. Hell a one line shell script using grep and awk uses multiple engines!!!
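To make that concrete, here's a made-up one-liner: grep defaults to POSIX BRE while awk speaks ERE, so even a trivial pipeline runs the "same" pattern through two engines with different escaping rules:

```shell
# Two regex engines in one pipeline: grep's BRE writes a group as \(...\)
# and an interval as \{2\}, while awk's ERE writes them as (...) and {2}.
printf 'abab1\nxx2\n' | grep '\(ab\)\{2\}' | awk '/(ab)(ab)[0-9]/ { print }'
```

The same logical pattern has to be spelled differently for each tool, which is exactly the portability problem.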
This is very true. It was something I admit I found more than a bit irritating; a number of Hyperscan DPI customers just started regarding Hyperscan as a "fact of nature" and would simply start coding around weaknesses or avoiding patterns that didn't work as well. I would much have preferred that they complain and have us fix things for them.
IMO there's still a lot of scope for a better regex engine; backtracking engines have a lot of expressiveness, for better or worse, and I'm not convinced they are as bad as people say. Systems like Hyperscan are somewhat niche in that they throw a lot of stuff overboard in pursuit of scale (lots of regexes at once), performance, and "streaming" (the ability to pause matching with a fixed-size amount of context and restart). Despite its hype, RE2 is not that great either: it's a couple of engines taped together with some fairly opinionated design policies, and it has become big enough to be a "fact of nature" at a big, influential company ("I guess that's how a regex engine works").
Hm I don't really think RE2 is even that influential (unfortunately IMO). Rust is the only new language whose engine borrows its design AFAIK. For example, I was disappointed to learn that Julia uses PCRE.
----
I think backtracking is firmly in the past because it causes too many problems in practice [1], but I think there's scope for a text matching engine that's built on top of regular languages, but more expressive.
As mentioned here and on lobste.rs recently, regular languages aren't expressive enough for syntax highlighting (or lexing most programming languages in their entirety).
But it doesn't take much to make them more expressive while maintaining these properties:
1. Linear time and constant space
2. Composition by OR
Backtracking fails at both properties. Backtracking is bad. Like you say with backreferences, there should be something more expressive than regular languages that maintains both properties (and it's likely to be useful).
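Both failures are easy to demonstrate with Python's `re`, a classic backtracker (a small sketch; the patterns are made up for illustration):

```python
import re
import time

# Property 2, composition by OR: many patterns can be OR'ed into one.
# An automata engine matches the union in a single linear-time pass;
# a backtracker retries each alternative at each input position.
patterns = ["foo[0-9]+", "bar", "qux+"]
union = re.compile("|".join(f"(?:{p})" for p in patterns))
assert union.search("xx qux yy")

# Property 1, linear time: backtrackers break it. Nested quantifiers
# like (a+)+ force ~2^n ways to split the a's on a non-matching input.
evil = re.compile(r"(a+)+$")
for n in (10, 16, 22):
    t0 = time.perf_counter()
    assert evil.match("a" * n + "b") is None
    print(n, f"{time.perf_counter() - t0:.3f}s")  # time roughly doubles per extra 'a'
```

An automata-based engine (RE2, re2c output) handles all of these in time linear in the input, regardless of the pattern.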
Example of tiny mechanism on top of regular languages:
The compile time and code size are linear with respect to the number of patterns. What was very surprising to me is that the compiled C code starts off slower than interpreters, but scales better as you increase the number of patterns.
(2) Write a WebAssembly backend for it. Compiling its C output to WebAssembly is not likely to be great, because it uses gotos on every line, and gotos in WebAssembly need the Relooper algorithm last time I checked.
(3) Now add a linear-time, constant-space VM on top. As a test of this property, it should be able to express LL(1) parsing and LALR(1) parsing. Something like "lexer modes" should also fall out trivially.
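To show what "lexer modes" means, here's a hypothetical sketch (not Oil or re2c code): each mode has its own set of regular patterns, and certain tokens switch the mode, which is how string literals and here-docs are typically lexed:

```python
import re

# Two modes: "code" and "string". A quote token flips between them.
# Each mode is just a list of (pattern, token name, next mode) rules.
MODES = {
    "code": [
        (re.compile(r"[a-z]+"), "ID", None),
        (re.compile(r"[0-9]+"), "NUM", None),
        (re.compile(r'"'), "QUOTE", "string"),
        (re.compile(r"\s+"), None, None),   # skip whitespace
    ],
    "string": [
        (re.compile(r'[^"]+'), "CHARS", None),
        (re.compile(r'"'), "QUOTE", "code"),
    ],
}

def tokenize(s):
    mode, pos, out = "code", 0, []
    while pos < len(s):
        for pat, tok, next_mode in MODES[mode]:
            m = pat.match(s, pos)
            if m:
                if tok:
                    out.append((tok, m.group()))
                if next_mode:
                    mode = next_mode
                pos = m.end()
                break
        else:
            raise ValueError(f"lex error at {pos}")
    return out

print(tokenize('x 12 "ab" y'))
```

Each mode on its own is a plain regular lexer, so the whole thing stays linear-time; the mode switch is the only extra mechanism.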
I believe this strategy saves a bunch of work, and has some desirable properties:
- re2c gives you an entire RE front end, handling captures and such in DFAs: https://arxiv.org/abs/1907.08837 . I'm sure re2c is slower than Hyperscan, but as you hint, I think it's also more general and widely used.
- wasm could be more palatable to embed than other VMs. There are multiple implementations of it.
- wasm avoids dealing with C code and assembly code. It has a few high quality JITs.
- The first version of wasm has a ton of limitations (lack of GC, rich interfaces to the host), but this application hits none of them. It's a perfect problem for it.
Actually one more thing I realized about this idea that would be wild...
Since re2c itself is written in C, it can be easily compiled to wasm (even if it has a few gotos). So then you get an engine that can compile user patterns at runtime, rather than having it be an offline process.
(I don't really think there is that much difference in constraints between the offline and online processes, as some have suggested. For example, I observed that re2c compile time is linear with respect to the number of constant strings over a large range in that benchmark.)
I have been thinking of embedding wasm in Oil, mainly to do reasonably fast string transformations. Because one goal of Oil is to provide an alternative to extremely tortured string transformation syntax like: https://github.com/dylanaraps/pure-bash-bible
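A few of the idioms in question, for illustration (standard bash parameter expansion; note the `${var//pat/rep}` form is a bashism, not POSIX):

```shell
# The pure-bash style of string transformation: terse but cryptic.
path=/usr/local/bin/python3
echo "${path##*/}"    # strip longest prefix through '/': python3
echo "${path%/*}"     # strip shortest suffix from '/':  /usr/local/bin
s='hello world'
echo "${s// /_}"      # replace all spaces:              hello_world
```

These work, but the syntax is write-only, which is the motivation for a more readable alternative.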
And Oil has the eggex language, that compiles to egrep syntax, and uses libc's engine.
But it would be cool if it could compile to DFAs with an embedded re2c compiler.
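As a toy illustration of the "one abstract pattern, multiple backends" idea (this is not Oil's actual code; the AST shape here is invented), the same tiny pattern tree can be compiled to ERE text, and a DFA backend could consume the identical tree:

```python
import re

# Invented mini-AST: ("lit", s), ("class", s), ("plus", node), ("seq", [nodes]).
def to_ere(node):
    kind = node[0]
    if kind == "lit":
        return re.escape(node[1])
    if kind == "class":
        return f"[{node[1]}]"
    if kind == "plus":
        return f"({to_ere(node[1])})+"
    if kind == "seq":
        return "".join(to_ere(child) for child in node[1])
    raise ValueError(kind)

# digit+ '.' digit+
ast = ("seq", [("plus", ("class", "0-9")),
               ("lit", "."),
               ("plus", ("class", "0-9"))])
ere = to_ere(ast)
print(ere)
assert re.fullmatch(ere, "3.14")
```

The point is that the engine choice becomes a backend detail rather than something baked into the pattern source.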
Then again, I was surprised that the compiled C code was not faster than an interpreter for small cases, so maybe this whole thing is too complex for the benefit... I guess I would have to do some up-front benchmarks to see if this is even worth it.
It could be that you need an extremely fast wasm engine to make it worthwhile, and all of those are tied to big JS engines and their very non-portable JITs.
Surprise, the "totally shocking" result there is due to BACKTRACKING :) Automata-based regex in wasm beats native browser regex because it doesn't backtrack.
I'd be interested in your feedback on it! I think it's very related to Lingua Franca but if you disagree I'm also interested in that :)
I think most "new language" projects are rightly met with skepticism, but the difference with Oil is that it emphasizes a transition path, and it's embedded in a significant piece of work (i.e. there are other projects like this, including ones from myself, but ironically I didn't want the extra dependency).
Although I guess the obvious criticism is that Eggex is almost entirely syntactic. If two regex engines have the same syntax but different semantics, it doesn't help you. But I have not found that to be a big issue in decades of using multiple engines, although admittedly it relates to the subset of regexes I use. Eggex is more like a union of engines than an intersection, so maybe that's something to change ...
----
Tangent: reminds me of this paper: https://www3.cs.stonybrook.edu/~dongyoon/papers/FSE-19-Lingu...