Fair point about the parenthetical, though I would say both the engine and the language are crucial, depending on what role you're playing in the system.
The setup I think of: you have "domain experts" writing policies with regexes, software engineers who embed the engine in an application, and operators of the entire service. My use of regexes for "big data" followed that pattern, and I imagine the DPI applications of Hyperscan do too.
A lot of those rules are accumulated over time, are not easy to rewrite, and require testing. If you have a thousand regexes in one dialect painfully crafted over years, you can't just change the engine. It would break the whole application.
So that is why I emphasize the language. Backtracking engines are more common, and the language and the "code" written in that language have to be changed before you can change the engine. That happened at least partially at Google 10+ years ago, migrating from PCRE to RE2.
But yes, if you're an operator, you will be looking at it from the engine perspective. Obviously it is not enough to change the patterns themselves.
(The backreference point is interesting, although I'll probably leave it out of my mental model until someone implements it :) )
----
Also the point of Eggex is to treat regexes more abstractly, like a real language. Just like I don't write C++ code for GCC or Clang, I don't write regexes for a specific engine. I have ported regexes between engines many times.
The typical application I've worked on uses multiple engines. Hell a one line shell script using grep and awk uses multiple engines!!!
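To make that concrete, here's a made-up one-liner: grep defaults to POSIX BRE while awk speaks ERE, so even a trivial pipeline runs the "same" pattern through two engines with different escaping rules:

```shell
# Two regex engines in one pipeline: grep's BRE writes a group as \(...\)
# and an interval as \{2\}, while awk's ERE writes them as (...) and {2}.
printf 'abab1\nxx2\n' | grep '\(ab\)\{2\}' | awk '/(ab)(ab)[0-9]/ { print }'
```

The same logical pattern has to be spelled differently for each tool, which is exactly the portability problem.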
This is very true. It was something I admit I found more than a bit irritating; a number of Hyperscan DPI customers just started regarding Hyperscan as a "fact of nature" and would simply start coding around weaknesses or avoiding patterns that didn't work as well. I would much have preferred that they complain and have us fix things for them.
IMO there's still a lot of scope for a better regex engine; backtracking engines have a lot of expressiveness, for better or worse, and I'm not convinced they are as bad as people say. Systems like Hyperscan are somewhat niche in that they throw a lot of stuff overboard in pursuit of scale (lots of regexes at once), performance, and "streaming" (the ability to pause matching with a fixed-size amount of context and restart). Despite its hype, RE2 is not that great either: it's a couple of engines taped together with some fairly opinionated design policies, and it has become big enough to be a "fact of nature" at a big, influential company ("I guess that's how a regex engine works").
Hm I don't really think RE2 is even that influential (unfortunately IMO). Rust is the only new language whose engine borrows its design AFAIK. For example, I was disappointed to learn that Julia uses PCRE.
----
I think backtracking is firmly in the past because it causes too many problems in practice [1], but I think there's scope for a text matching engine that's built on top of regular languages, but more expressive.
As mentioned here and on lobste.rs recently, regular languages aren't expressive enough for syntax highlighting (or lexing most programming languages in their entirety).
But it doesn't take much to make them more expressive while maintaining these properties:
1. Linear time and constant space
2. Composition by OR
Backtracking fails at both properties. Backtracking is bad. Like you say with backreferences, there should be something more expressive than regular languages that maintains both properties (and it's likely to be useful).
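Both failures are easy to demonstrate with Python's `re`, a classic backtracker (a small sketch; the patterns are made up for illustration):

```python
import re
import time

# Property 2, composition by OR: many patterns can be OR'ed into one.
# An automata engine matches the union in a single linear-time pass;
# a backtracker retries each alternative at each input position.
patterns = ["foo[0-9]+", "bar", "qux+"]
union = re.compile("|".join(f"(?:{p})" for p in patterns))
assert union.search("xx qux yy")

# Property 1, linear time: backtrackers break it. Nested quantifiers
# like (a+)+ force ~2^n ways to split the a's on a non-matching input.
evil = re.compile(r"(a+)+$")
for n in (10, 16, 22):
    t0 = time.perf_counter()
    assert evil.match("a" * n + "b") is None
    print(n, f"{time.perf_counter() - t0:.3f}s")  # time roughly doubles per extra 'a'
```

An automata-based engine (RE2, re2c output) handles all of these in time linear in the input, regardless of the pattern.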
Example of tiny mechanism on top of regular languages:
The compile time and code size are linear with respect to the number of patterns. What was very surprising to me is that the compiled C code starts off slower than interpreters, but scales better as you increase the number of patterns.
(2) Write a WebAssembly backend for it. Compiling its C output to WebAssembly is not likely to be great, because it uses gotos on every line, and gotos in WebAssembly need the Relooper algorithm last time I checked.
(3) Now add a linear-time, constant-space VM on top. As a test of this property, it should be able to express LL(1) parsing and LALR(1) parsing. Something like "lexer modes" should also fall out trivially.
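To show what "lexer modes" means, here's a hypothetical sketch (not Oil or re2c code): each mode has its own set of regular patterns, and certain tokens switch the mode, which is how string literals and here-docs are typically lexed:

```python
import re

# Two modes: "code" and "string". A quote token flips between them.
# Each mode is just a list of (pattern, token name, next mode) rules.
MODES = {
    "code": [
        (re.compile(r"[a-z]+"), "ID", None),
        (re.compile(r"[0-9]+"), "NUM", None),
        (re.compile(r'"'), "QUOTE", "string"),
        (re.compile(r"\s+"), None, None),   # skip whitespace
    ],
    "string": [
        (re.compile(r'[^"]+'), "CHARS", None),
        (re.compile(r'"'), "QUOTE", "code"),
    ],
}

def tokenize(s):
    mode, pos, out = "code", 0, []
    while pos < len(s):
        for pat, tok, next_mode in MODES[mode]:
            m = pat.match(s, pos)
            if m:
                if tok:
                    out.append((tok, m.group()))
                if next_mode:
                    mode = next_mode
                pos = m.end()
                break
        else:
            raise ValueError(f"lex error at {pos}")
    return out

print(tokenize('x 12 "ab" y'))
```

Each mode on its own is a plain regular lexer, so the whole thing stays linear-time; the mode switch is the only extra mechanism.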
I believe this strategy saves a bunch of work, and has some desirable properties:
- re2c gives you an entire RE front end, handling captures and such in DFAs: https://arxiv.org/abs/1907.08837 . I'm sure re2c is slower than Hyperscan, but as you hint, I think it's also more general and widely used.
- wasm could be more palatable to embed than other VMs. There are multiple implementations of it.
- wasm avoids dealing with C code and assembly code. It has a few high quality JITs.
- The first version of wasm has a ton of limitations (lack of GC, rich interfaces to the host), but this application hits none of them. It's a perfect problem for it.
Actually one more thing I realized about this idea that would be wild...
Since re2c itself is written in C, it can be easily compiled to wasm (even if it has a few gotos). So then you get an engine that can compile user patterns at runtime, rather than having it be an offline process.
(I don't really think there is that much difference in constraints between the offline and online processes, as some have suggested. For example, I observed that re2c compile time is linear with respect to the number of constant strings over a large range in that benchmark.)
I have been thinking of embedding wasm in Oil, mainly to do reasonably fast string transformations. Because one goal of Oil is to provide an alternative to extremely tortured string transformation syntax like: https://github.com/dylanaraps/pure-bash-bible
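A few of the idioms in question, for illustration (standard bash parameter expansion; note the `${var//pat/rep}` form is a bashism, not POSIX):

```shell
# The pure-bash style of string transformation: terse but cryptic.
path=/usr/local/bin/python3
echo "${path##*/}"    # strip longest prefix through '/': python3
echo "${path%/*}"     # strip shortest suffix from '/':  /usr/local/bin
s='hello world'
echo "${s// /_}"      # replace all spaces:              hello_world
```

These work, but the syntax is write-only, which is the motivation for a more readable alternative.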
And Oil has the eggex language, that compiles to egrep syntax, and uses libc's engine.
But it would be cool if it could compile to DFAs with an embedded re2c compiler.
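As a toy illustration of the "one abstract pattern, multiple backends" idea (this is not Oil's actual code; the AST shape here is invented), the same tiny pattern tree can be compiled to ERE text, and a DFA backend could consume the identical tree:

```python
import re

# Invented mini-AST: ("lit", s), ("class", s), ("plus", node), ("seq", [nodes]).
def to_ere(node):
    kind = node[0]
    if kind == "lit":
        return re.escape(node[1])
    if kind == "class":
        return f"[{node[1]}]"
    if kind == "plus":
        return f"({to_ere(node[1])})+"
    if kind == "seq":
        return "".join(to_ere(child) for child in node[1])
    raise ValueError(kind)

# digit+ '.' digit+
ast = ("seq", [("plus", ("class", "0-9")),
               ("lit", "."),
               ("plus", ("class", "0-9"))])
ere = to_ere(ast)
print(ere)
assert re.fullmatch(ere, "3.14")
```

The point is that the engine choice becomes a backend detail rather than something baked into the pattern source.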
Then again, I was surprised that the compiled C code was not faster than an interpreter for small cases, so maybe this whole thing is too complex for the benefit... I guess I would have to do some up-front benchmarks to see if this is even worth it.
It could be that you need an extremely fast wasm engine to make it worthwhile, and all of those are tied to big JS engines and their very non-portable JITs.
Surprise, the "totally shocking" result there is due to BACKTRACKING :) Automata-based regex in wasm beats native browser regex because it doesn't backtrack.
I'd be interested in your feedback on it! I think it's very related to Lingua Franca but if you disagree I'm also interested in that :)
I think most "new language" projects are rightly met with skepticism, but the difference with Oil is that it emphasizes a transition path, and it's embedded in a significant piece of work (i.e. there are other projects like this, including ones from myself, but ironically I didn't want the extra dependency).
Although I guess the obvious criticism is that Eggex is almost entirely syntactic. If two regex engines have the same syntax but different semantics, it doesn't help you. But I have not found that to be a big issue in decades of using multiple engines, although admittedly it relates to the subset of regexes I use. Eggex is more like a union of engines than an intersection, so maybe that's something to change ...
----
Tangent: reminds me of this paper: https://www3.cs.stonybrook.edu/~dongyoon/papers/FSE-19-Lingu...