
I've only given a quick read/skim, but this is extremely impressive. BurntSushi has a lot of great stuff out there, but the Rust regex crate is legendary. The fact that Rust has had such a performant and easy to use regular expression library for so long is a real treat for the community.

This article as a whole is also a treasure. There's a fairly limited number of people who have written a ton about regular expressions, but they all do so extremely well. Russ Cox's article series (as mentioned in this one) is really great. I used it in college to write a regular expression engine one summer after I had started to fall in love with the perfect cross between theory and practice that regular expressions are.

The changes for more in-depth testing here are also very interesting, especially for a crate that I'm sure is critical to a lot of the ecosystem, and I appreciate the writeup on such a deep-dive topic.

Are regular expressions hard to read? Sometimes they can be, especially if you haven't gone out of your way to take deep dives. Are they not perfect for a lot of parsing tasks? Sure, there are definitely things people shouldn't use regexes for that they do, like validating emails. At their core though, regular expressions are one of the most power-dense tools we have in pretty much every language. The insane stuff you can get done (sometimes in a way that would be frowned upon) in so little time is just incredible.

I'd love for there to be more content on regular expressions as well. Right now I'm only familiar with one book that really does a great job (at a practical level), and that's Mastering Regular Expressions by Jeffrey Friedl. On a theory level, a lot of compiler books will talk about them to varying degrees. The Dragon Book has some good content on regular expressions at an implementation level too.

Does anyone have any other book recommendations for regular expressions?

And if BurntSushi or any other regular expression engine contributors see this, I really appreciate the effort y'all put into building such powerful software.



https://www.cs.princeton.edu/courses/archive/fall19/cos226/l... and https://kean.blog/post/lets-build-regex are excellent introductions to implementing a (very) simplified regex engine: construct a nondeterministic finite state automaton for the regex, then perform a graph search on the resulting digraph; if the vertex corresponding to your end state is reachable, you have a match.

I think this exercise is valuable for anyone writing regexes to not only understand that there's less magic than one might think, but also to visualize a bunch of balls bouncing along an NFA - that bug you inevitably hit in production due to catastrophic backtracking now takes on a physical meaning!
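
To make the "balls bouncing along an NFA" picture concrete, here's a tiny hand-rolled sketch (not the course code or the regex crate's implementation; the NFA for "a*b" is built by hand rather than compiled from a pattern):

    use std::collections::HashSet;

    #[derive(Clone, Copy)]
    enum State {
        Byte { byte: u8, next: usize },      // consume `byte`, go to `next`
        Split { left: usize, right: usize }, // epsilon: go to both
        Match,                               // accepting state
    }

    // Follow epsilon (Split) edges so every reachable state becomes active.
    fn add_state(states: &[State], active: &mut HashSet<usize>, id: usize) {
        if !active.insert(id) {
            return;
        }
        if let State::Split { left, right } = states[id] {
            add_state(states, active, left);
            add_state(states, active, right);
        }
    }

    // Advance the whole set of "balls" one input byte at a time.
    fn nfa_match(states: &[State], input: &str) -> bool {
        let mut active = HashSet::new();
        add_state(states, &mut active, 0);
        for &b in input.as_bytes() {
            let mut next = HashSet::new();
            for &id in &active {
                if let State::Byte { byte, next: n } = states[id] {
                    if byte == b {
                        add_state(states, &mut next, n);
                    }
                }
            }
            active = next;
        }
        active.iter().any(|&id| matches!(states[id], State::Match))
    }

    fn main() {
        // Hand-built NFA for "a*b".
        let states = vec![
            State::Split { left: 1, right: 2 },  // 0: loop on 'a' or move on
            State::Byte { byte: b'a', next: 0 }, // 1: consume 'a', loop back
            State::Byte { byte: b'b', next: 3 }, // 2: consume 'b'
            State::Match,                        // 3: accept
        ];
        assert!(nfa_match(&states, "aaab"));
        assert!(!nfa_match(&states, "aaac"));
    }

Because all the active states advance in lockstep, the input is only scanned once, which is exactly why catastrophic backtracking can't happen in this style of engine.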

Separately re: the OP, https://github.com/rust-lang/regex/issues/822 (and specifically BurntSushi's comment at the very end of the issue) adds really useful context to the paragraph in the OP about niche APIs: https://blog.burntsushi.net/regex-internals/#problem-request... - searching with multiple regexes simultaneously against a text is both incredibly complex and incredibly useful, and I can't wait to see what the community comes up with for this pattern!
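
For anyone who hasn't bumped into it, the crate's existing RegexSet already handles the simplest version of multi-pattern search (it reports which patterns matched, but not where or with what captures), which gives a feel for why the richer APIs discussed in that issue are so tricky. A quick sketch:

    use regex::RegexSet;

    fn main() {
        let set = RegexSet::new([r"\bfoo\b", r"[0-9]+", r"^bar"]).unwrap();
        // Reports *which* patterns matched "foo 123", but not the offsets.
        let matched: Vec<usize> = set.matches("foo 123").into_iter().collect();
        println!("{:?}", matched); // [0, 1]
    }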


> Are regular expressions hard to read? Sometimes they can be, especially if you haven't gone out of your way to take deep dives. Are they not perfect for a lot of parsing tasks? Sure, there are definitely things people shouldn't use regexes for that they do, like validating emails. At their core though, regular expressions are one of the most power-dense tools we have in pretty much every language. The insane stuff you can get done (sometimes in a way that would be frowned upon) in so little time is just incredible.

The real crowning use case for regular expressions (at least in terms of parsing-like tasks) is when you've got formats with varied delimiters. So one format I was parsing this weekend was basically header:field1,field2,field3"data"hash (with a fixed number of fields). Or another format I work with is suite~split/test1,test2@opt1:opt2^hw1^hw2#flags1#flags2 (most of the elements of which are optional). Regular expressions shine here; primitives like split don't quite cut it for these formats.
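
For the first format, something like this is what I mean (the line and field contents here are made up):

    use regex::Regex;

    fn main() {
        // A made-up line in the header:field1,field2,field3"data"hash shape.
        let line = r#"hdr:a,b,c"payload"deadbeef"#;
        let re = Regex::new(r#"^([^:]+):([^,]+),([^,]+),([^"]+)"([^"]*)"(.+)$"#).unwrap();
        if let Some(caps) = re.captures(line) {
            // Groups 1..=6: header, field1, field2, field3, data, hash.
            println!("{} {} {} {}", &caps[1], &caps[2], &caps[5], &caps[6]);
        }
    }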

And I think this is also a major cause of why regular expressions tend to become unreadable quickly. If you look at parsing via regular expressions as a whole, there are basically three things being jammed into one expression: what the delimiters between fields are, what is valid in each field, and which fields are optional. These are three separate concerns, but most regex APIs are absolutely terrible at letting you separate them into separate steps (all you can do is provide the string that combines all of them).
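
About the closest you can get within a single expression is verbose mode ((?x) in the Rust regex crate), which at least lets you label the pieces, but it's still one combined string, which I think proves the point:

    use regex::Regex;

    fn main() {
        let re = Regex::new(r#"(?x)
            ^([^:]+)    :   # header, then the ':' delimiter
             ([^,]+)    ,   # field1
             ([^,]+)    ,   # field2
             ([^"]+)        # field3, terminated by the opening quote
            "([^"]*)"       # data, quoted
             (.+)$          # hash
        "#).unwrap();
        println!("{}", re.is_match(r#"hdr:a,b,c"payload"deadbeef"#));
    }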


> header:field1,field2,field3"data"hash (with a fixed number of fields).

In a language where regex is just right there, like Perl, I agree that the natural way to parse this is always a regex, although on the other hand in Perl the natural way to do almost everything is a regex so maybe Perl was a bad example.

In a language with split_once (and a proper string reference type) I actually rather like split_once here, splitting away header, and then field1, and then field2, and then field3, and then data, leaving just hash.

I guess this gets at your concern: writing it with split_once makes a lot of those answers explicit. Each delimiter is specified alongside what it's splitting, none of the fields are optional (unless we write that), and (if we write code to check) so is what's valid in each field.
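
A rough sketch of what I mean, against a made-up line in that shape:

    fn parse(line: &str) -> Option<(&str, &str, &str, &str, &str, &str)> {
        let (header, rest) = line.split_once(':')?;
        let (f1, rest) = rest.split_once(',')?;
        let (f2, rest) = rest.split_once(',')?;
        let (f3, rest) = rest.split_once('"')?;
        let (data, hash) = rest.split_once('"')?;
        Some((header, f1, f2, f3, data, hash))
    }

    fn main() {
        // Prints Some(("hdr", "a", "b", "c", "payload", "deadbeef")).
        println!("{:?}", parse(r#"hdr:a,b,c"payload"deadbeef"#));
    }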


Yeah, split_once is pretty handy, although chaining it together can get a little verbose. It would be nice to write this:

  let (y, m, d, h, min, s) = break_str!(s, year-month-day hour:minute:second)?;
instead of this:

  let (y, split) = s.split_once('-')?;
  let (m, split) = split.split_once('-')?;
  let (d, split) = split.split_once(' ')?;
  let (h, split) = split.split_once(':')?;
  let (min, s) = split.split_once(':')?;


I'll use this as an opportunity to plug the regex crate's new 'extract' API. :-)

    use regex::Regex;

    fn main() {
        println!("{:?}", extract("1973-01-05 09:30:00"));
    }

    fn extract(haystack: &str) -> Option<(&str, &str, &str, &str, &str, &str)> {
        let re = Regex::new(
            r"([0-9]{4})-([0-9]{2})-([0-9]{2}) ([0-9]{2}):([0-9]{2}):([0-9]{2})",
        ).unwrap();
        let (_, [y, m, d, h, min, s]) = re.captures(haystack)?.extract();
        Some((y, m, d, h, min, s))
    }
Output:

    Some(("1973", "01", "05", "09", "30", "00"))
That gets you pretty close to what you want here.

(The regex matches more than what is a valid date/time of course.)


Python (even in 2):

  >>> import re
  >>> pattern = re.compile(r"([0-9]{4})-([0-9]{2})-([0-9]{2}) ([0-9]{2}):([0-9]{2}):([0-9]{2})")
  >>> results = pattern.match('1973-01-05 09:30:00')
  >>> results.groups()
  ('1973', '01', '05', '09', '30', '00')
  >>> (y, m, d, h, min, s) = results.groups()
(`results` will be None if the regex didn't match)

Regexes are one of those things where, once you understand it (and capture groups in particular) and it's available in the language you're working in, string-splitting usually doesn't feel right anymore.


I believe all of the people in this thread understand regexes extremely well. :-)

There is a lot of reasonable room to disagree about when and where regexes should be used.


The thing that I think sometimes gets lost when people talk about the strengths and weaknesses of regular expression use is that it can matter quite a bit what the language in question is like. If you're using a language that compiles down to a low level, or is expected to have a fairly close mapping of expressions to the code that actually runs (C, C++, Java, etc.), then the choice is between a regular expression library, which often accomplishes what you want performantly enough that rolling your own optimized parser is hard to justify, and writing that parser anyway.

When using an interpreted language like Perl or Python, it's a slightly different story, since often there's no way to get performance as good from a custom parser written in that language as you can get from the optimized regular expression library included (if you use it competently). The overhead of the interpreted language can make some tasks impossible to do as quickly as the regular expression engine can. In a way, a regular expression library is to parsing what NumPy is to computation.

I remember a task a decade ago where I had to parse a bunch of very simplistic XML in Perl; it consisted of the equivalent of a bunch of records of dictionaries/hashes (solr), and I needed them as Perl hashes to work on them. I surveyed every XML parsing module available to me in Perl and couldn't quite get the performance I needed, given the file was a few GB and it needed to be done as quickly as possible multiple times a day. I implemented a simple outer loop to grab a working chunk of the file, a regex to split that chunk into individual record text chunks, and another regex to return each key/value as an alternating list I could assign directly to a hash. It was 5-6x faster than the fastest XML streaming solution I found that I could use as a Perl module (which were all just wrapping X libs).

Could I have coded a custom solution in C that parsed this specialized format faster? Undoubtedly, but I'd still be marshaling data through the C/Perl interface, so I would have taken a hit to get the data easily accessible to the rest of the codebase, which is less of an issue for regular expressions in Perl since they're a native capability of the language and deeply integrated. Using regular expressions as a parsing toolkit yielded very good results at a fraction of the time and effort, and honestly, in my opinion, that's what regular expressions are all about.
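
The shape of it, sketched here in Rust with the regex crate rather than the original Perl, and with made-up tag names since I don't have the real format handy:

    use regex::Regex;
    use std::collections::HashMap;

    fn main() {
        // A made-up stand-in for one working chunk of the file.
        let chunk = r#"<doc><str name="id">1</str><str name="title">foo</str></doc><doc><str name="id">2</str><str name="title">bar</str></doc>"#;
        // One regex splits the chunk into individual record chunks...
        let record_re = Regex::new(r"<doc>(.*?)</doc>").unwrap();
        // ...and another pulls each key/value pair out of a record.
        let field_re = Regex::new(r#"<str name="([^"]+)">([^<]*)</str>"#).unwrap();
        for rec in record_re.captures_iter(chunk) {
            let fields: HashMap<&str, &str> = field_re
                .captures_iter(rec.get(1).unwrap().as_str())
                .map(|c| (c.get(1).unwrap().as_str(), c.get(2).unwrap().as_str()))
                .collect();
            println!("{:?}", fields);
        }
    }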


Does this RegEx library utilize a JIT, just like the ones in most JavaScript implementations?

If not, then perhaps this is a case where JavaScript beats Rust.


Nope. Usually only backtrackers have a JIT (not strictly true, I believe there is a JIT of a Thompson NFA and there is also redgrep). This library is not a backtracker.

I would be interested in experimenting with a JIT in the future, primarily as a way to speed up capture group extraction. But this regex library uses a lazy DFA, which has excellent throughput characteristics compared to the normal backtracking interpreter (and perhaps even a backtracking JIT).

> If not, then perhaps this is a case where JavaScript beats Rust.

Haha. No. Follow the link to rebar to see performance comparisons. A JIT isn't everything. https://github.com/BurntSushi/rebar


Like you, I skimmed this, having done a bit of work on RegExes recently. I think the language in my case is using the PikeVM, by virtue of there being no error returned, unlike the other engines. So, within the limitations of the language and its copyright-protected status, I had to create some new RegEx functionality for myself to make it easier to use, as RegExes can be plain and simple voodoo.

I don't have the stats on how often the other engines would be used, but if most of the programming languages out there are using the PikeVM, then I can see why Google has not only written their own OS for their servers, but also saved a few more clock cycles by getting other engines into action for specific situations where the PikeVM would be too slow and/or heavy on the clock cycles. I know all too well how a few extra characters in the search string can drastically slow down the pattern matching.

The proverb "take care of the pennies and the pounds will take care of themselves" definitely applies to RegEx and clock cycles. Looking back at some conversations from the 90s, I suspect it's made some coders I know very rich when it comes to processing millions of records per second.


My biggest complaint is with the little variations in dialects. Especially how each dialect/context combination expects you to handle quoting or terminating the expression. It's so varied that I've given up on even trying to remember and just resign myself to googling an example every time I need one.


Oniguruma is one example where you can actually pick which variation to use. I think Rust's `regex` also exposes enough internals to make this possible.


What kind of content are you looking for? From a user POV, I think there is thankfully not much to learn about automata-based regexes like RE2 and rust/regex. The exception is if you're writing a performance-sensitive system entirely around regexes, like an intrusion detection system or search engine.

A regex is a very small concept with a very large syntax :) It's "just" literals, alternation, repetition, and concatenation. So it's mostly being able to see through the syntax IMO.

Nice animated diagrams in this recent post - https://news.ycombinator.com/item?id=35503089

(On the other hand, from the implementation POV, there is a wealth of material online, but you already know where to find that :) I still think someone should "cliff notes" the Cox RE2 articles, since the ideas are important but there is a lot of detail. There are also dozens of references in those articles for people to read; it was a comprehensive survey at the time.)

---

BTW Friedl's book covers backtracking engines, which makes sense because PCRE, Perl itself, Python, Java, JavaScript, etc. all use that style

But as far as I remember, most of the book is irrelevant for users of automata-based engines. It's a whole bunch of detail about optimizing backtracking, reordering your clauses, and lots of detail about engine-specific and even version-specific differences.

Personally I just ignore all of that, and it doesn't cause me a problem. I use regexes all the time in every language, and stick to the features that an automata-based engine can handle.

Once you go into the backtracking stuff, then you might want the Friedl book, and that's a huge waste of time IMO. It's way easier to do that kind of thing outside of a regex engine, with normal parsing techniques.

And honestly once you go outside, you'll find you don't even need to backtrack. Perl-style regexes can be an astoundingly inefficient way to write a pattern recognition algorithm.

---

Eggex is my attempt to separate the two kinds of things, since it's not obvious based on the syntax: https://www.oilshell.org/release/latest/doc/eggex.html

It makes the syntax smaller, more familiar, and more composable. It's closer to the classic lex tool (e.g. GNU flex). It's funny that lex has a way better syntax for regular expressions than Perl, but it's used probably 10,000x less often, because most people want an interpreter and not a compiler to C code. But lex has 90%-100% of what you need, and it's way simpler, with a nicer syntax.

Literals are quoted so you don't have the literal/metachar confusion, and you can reuse patterns by name. The manual is worth skimming:

https://westes.github.io/flex/manual/Simple-Examples.html#Si...

I use re2c though, not flex, and its manual is also worth skimming: https://re2c.org/manual/manual_c.html



