
This seems like a cool project.

This is meant as additional information, not criticism. I skimmed the transcript really fast, so if this is in there and I missed it, please correct me, but two things I think are helpful for people creating projects like this to be aware of:

- This video seems to combine the concepts of lexing and parsing. It is usually beneficial to separate these two steps and lex the input into tokens before passing to the parser.

- Go actually has a pure Go implementation of Yacc in the toolset, and I've used it in several projects to make parsers. Dealing with the Yacc file is often much easier than dealing with the code directly, since it takes care of writing the actual parser. There is a lot of boilerplate that goes into parsers; when you use Yacc, it "just works".

Edit: there are also some tools for writing lexers in Lex/Flex-like syntax (re2c comes to mind), but I've found hand-writing lexers to be effective in Go if your language doesn't have many different types of tokens.
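To make that concrete, here's a minimal sketch of the kind of hand-written lexer I mean, in Go. The token kinds and the tiny pipe-and-variable "language" are made up for illustration; the point is how little machinery you need when there are only a few token types:

    package main

    import (
        "fmt"
        "unicode"
    )

    type TokenKind int

    const (
        WORD TokenKind = iota // a bareword like "echo"
        VAR                   // a $variable
        PIPE                  // the "|" character
    )

    type Token struct {
        Kind TokenKind
        Text string
    }

    // lex splits the input into tokens, skipping whitespace between them.
    func lex(src string) []Token {
        var toks []Token
        rs := []rune(src)
        for i := 0; i < len(rs); {
            switch {
            case unicode.IsSpace(rs[i]):
                i++
            case rs[i] == '|':
                toks = append(toks, Token{PIPE, "|"})
                i++
            default:
                j := i
                for j < len(rs) && !unicode.IsSpace(rs[j]) && rs[j] != '|' {
                    j++
                }
                kind := WORD
                if rs[i] == '$' {
                    kind = VAR
                }
                toks = append(toks, Token{kind, string(rs[i:j])})
                i = j
            }
        }
        return toks
    }

    func main() {
        fmt.Println(lex("cat file | grep $pattern"))
    }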



Right, I may have forgotten to mention that lexerless parsers are somewhat unusual.

I didn't have much time in the talk to go into the reason, so here it is:

- You'll need a more complex lexer to parse a shell-like syntax. For example, one common thing you do with lexers is get rid of whitespace, but shell syntax is whitespace-sensitive: "a$x" and "a $x" (double quotes not part of the code) are different things. The first is a single word containing a string concatenation, the second is two separate words.

- If your parser backtracks a lot, lexing can improve performance: you're not going back characters, only tokens (and there are fewer tokens than characters). Elvish's parser doesn't backtrack. (It does use lookahead fairly liberally.)

Having a lexerless parser does mean that you have to constantly deal with whitespace everywhere, though, and it can get a bit annoying. But personally I like the conceptual simplicity and not having to deal with silly tokens like LBRACE, LPAREN, PIPE.
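As a rough illustration (a toy sketch, not the actual code in Elvish's parser), here is what a lexerless word parser can look like in Go. Because no lexer ever discards whitespace, whether two segments are adjacent decides whether they belong to the same word, so "a$x" parses as one word with two segments and "a $x" as two words:

    package main

    import (
        "fmt"
        "unicode"
    )

    // parseWord consumes one word starting at pos: bareword and $variable
    // segments glued together with no intervening whitespace.
    // (Byte indexing is fine here because the sketch only handles ASCII.)
    func parseWord(src string, pos int) (segments []string, next int) {
        for pos < len(src) && !unicode.IsSpace(rune(src[pos])) {
            start := pos
            if src[pos] == '$' {
                pos++
            }
            for pos < len(src) && !unicode.IsSpace(rune(src[pos])) && src[pos] != '$' {
                pos++
            }
            segments = append(segments, src[start:pos])
        }
        return segments, pos
    }

    // parseWords splits the input into words, where each word is a list of
    // adjacent segments.
    func parseWords(src string) [][]string {
        var words [][]string
        for pos := 0; pos < len(src); {
            if unicode.IsSpace(rune(src[pos])) {
                pos++
                continue
            }
            var word []string
            word, pos = parseWord(src, pos)
            words = append(words, word)
        }
        return words
    }

    func main() {
        fmt.Println(parseWords("a$x"))  // [[a $x]]   one word, two segments
        fmt.Println(parseWords("a $x")) // [[a] [$x]] two words
    }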

I have not used parser generators enough to comment about the benefits of using them compared to writing a parser by hand. The handwritten one works well so far :)


That example you gave could certainly be done in Lex/Flex, and I assume in other lexers/tokenizers as well. For instance, you would probably use states and have "$x" in the initial state evaluate to a different token type than "$x" in the string state.

But I do get your meaning. I've written a lot of tokenizers by hand as well, and sometimes the hand-written code can be easier to follow. Config files for grammars can get convoluted fast.

Again, I didn't mean it as criticism. But your talk title does start with "How to write a programming language and shell in Go", so given the title I think lexers/tokenizers are worth noting.


Yeah, ultimately there's an element of personal taste at play.

The authoritative tone of "how to write ..." is meant in jest, but obviously by doing that I risk being misunderstood. A more accurate title would be "how I wrote ...", but it's slightly boring, and I was trying hard to get my talk proposal accepted, you see :)


As someone who has given a handful of talks at conferences.. 100% relatable.


Shells have somewhat unusual parsing requirements. For example, "if" is a keyword when used as `if echo` but not in `echo if`.

So you either need to implement the lexer hack, or have a "string" token type which is disambiguated by the parser (which is what fish-shell does).

https://en.wikipedia.org/wiki/Lexer_hack


That's no problem in many modern lexers, as they usually have a "state": when you encounter "echo" you can switch to a new state, and that state may have different token parsing rules. So "if" in the "echo" state could be a string literal, whereas it may be a keyword in the initial state.

Lex/Flex mostly takes care of that for you, which is one of the benefits of using a well-worn lexer generator rather than rolling your own.
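Roughly, the idea in Go form (illustrative only; this is not how Lex/Flex generate their lexers, nor how fish-shell implements it): the lexer starts in a command-position state where "if" is a keyword, and once it has seen the command name it switches to an argument state where every word is just a string:

    package main

    import (
        "fmt"
        "strings"
    )

    type state int

    const (
        commandPos state = iota // expecting a command name or keyword
        argPos                  // expecting arguments: keywords become strings
    )

    type token struct {
        kind string // "KEYWORD" or "STRING"
        text string
    }

    func lexLine(line string) []token {
        keywords := map[string]bool{"if": true, "while": true, "then": true}
        st := commandPos
        var toks []token
        for _, word := range strings.Fields(line) {
            if st == commandPos && keywords[word] {
                toks = append(toks, token{"KEYWORD", word})
                // after a keyword like "if", the next word is again a command
                continue
            }
            toks = append(toks, token{"STRING", word})
            st = argPos // anything after the command name is an argument
        }
        return toks
    }

    func main() {
        fmt.Println(lexLine("if echo")) // [{KEYWORD if} {STRING echo}]
        fmt.Println(lexLine("echo if")) // [{STRING echo} {STRING if}]
    }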


Unless I'm missing something, this should not be an issue. The lexer could parse "if" as an IF token, and the parser could treat arguments as STRING || IF ( || other keywords… )


That seems like it'd get really awkward pretty quickly. "if" isn't unique in this regard; there are about a hundred shell builtins, and all of them can be used as an argument to a command. (For example, "echo then complete command while true history" is a valid shell command consisting entirely of names of builtins, and the only keyword in it is the leading "echo".)


You'd have to `|| EVERY_KEYWORD_IN_LANG`, and then if you ever add a keyword, you're updating that list there and anywhere else you've used it.

As the "Lexer hack" Wiki page says, this is only a problem if you're lexing in the first place. If you just parse the grammar, this isn't a problem.


The problem lies with shells' extensive usage of barewords. If you could eliminate the requirement for any bareword to be treated as a string, then parsing shell code would become much simpler... but also few people would want to use it, because nobody wants to write the following in their interactive shell:

    git "commit" "-am" "message"

    ls "-l"

etc.


> Dealing with the Yacc file is often much easier than dealing with the code directly, since it takes care of writing the actual parser. There is a lot of boilerplate that goes into parsers; when you use Yacc, it "just works".

Honestly, I think this is overstating the amount of boilerplate in a parser and overstating how well a parser generator "just works". I haven't used Yacc, so maybe it's better than ANTLR, but having tried ANTLR and written a few recursive descent parsers I've been pretty well cured of wanting to ever use a parser generator. ANTLR's generated code is verbose, the data structures are hard to work with, and error handling leaves a lot to be desired.

Parser boilerplate can be reduced to a large extent with a good set of helper methods (I often find myself referring back to the set used in Crafting Interpreters [0]), and what you get in exchange is full control over the data structure generated by the parser and over the error handling. For a language that you're serious about, that tradeoff is totally worth it.

[0] http://craftinginterpreters.com/
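For the curious, this is roughly the shape of those helpers translated to Go. The Token and TokenKind types here are hypothetical, and the method set is loosely modelled on the one in Crafting Interpreters rather than taken from any particular real parser:

    package parser

    import "fmt"

    type TokenKind int

    type Token struct {
        Kind TokenKind
        Text string
    }

    type Parser struct {
        tokens []Token
        pos    int
    }

    // peek returns the current token, or a zero Token at the end of input.
    func (p *Parser) peek() Token {
        if p.pos >= len(p.tokens) {
            return Token{}
        }
        return p.tokens[p.pos]
    }

    // check reports whether the current token has the given kind.
    func (p *Parser) check(kind TokenKind) bool {
        return p.pos < len(p.tokens) && p.tokens[p.pos].Kind == kind
    }

    // advance consumes and returns the current token.
    func (p *Parser) advance() Token {
        t := p.peek()
        p.pos++
        return t
    }

    // match consumes the current token if it has one of the given kinds.
    func (p *Parser) match(kinds ...TokenKind) bool {
        for _, k := range kinds {
            if p.check(k) {
                p.advance()
                return true
            }
        }
        return false
    }

    // expect consumes a token of the given kind or reports a syntax error.
    func (p *Parser) expect(kind TokenKind, msg string) (Token, error) {
        if p.check(kind) {
            return p.advance(), nil
        }
        return Token{}, fmt.Errorf("parse error at %q: %s", p.peek().Text, msg)
    }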


Maybe it's just my skill level, but I've used both hand-rolled recursive-descent and ANTLR for the same project (Thrift parser), and hoo boy I would never go back to recursive-descent for that. ANTLR shrank my code by an order of magnitude, and cleaned up some bugs too.

I'd be willing to believe that beyond a certain level of input complexity, ANTLR no longer pays for itself. In my experience, there exists a class of languages for which there's no better tool.


I would love to see the diff between the hand-rolled recursive-descent parser and the ANTLR syntax!

I certainly feel the amount of boilerplate in my hand-rolled recursive-descent parser is manageable. Of course it's not as succinct as an EBNF grammar:

- For example, you have to write an actual loop (with "for" and looping conditions) instead of just * for repetition

- The Go formatter demands a newline in most control flows

- Go is also not the most succinct language in general

So you do end up with many more lines of code. But at the end of the day, the structure of each parsing function is remarkably similar to a production rule, and for simpler ones I can mentally map between them pretty easily, with the added benefit of being able to insert code anywhere if I need something beyond old-school context-free parsing.
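As a made-up example of that mapping (hypothetical names, not the actual Elvish code): a rule like `pipeline = command { "|" command }` becomes a function where the `{ ... }` repetition is an explicit for loop, but the overall shape still follows the rule:

    package parser

    type Command struct{ Words []string }

    type Pipeline struct{ Commands []*Command }

    type Parser struct {
        words []string // pre-split input, just to keep the sketch small
        pos   int
    }

    // match consumes the next word if it equals lit.
    func (p *Parser) match(lit string) bool {
        if p.pos < len(p.words) && p.words[p.pos] == lit {
            p.pos++
            return true
        }
        return false
    }

    // parseCommand parses: command = word { word }
    func (p *Parser) parseCommand() *Command {
        cmd := &Command{}
        for p.pos < len(p.words) && p.words[p.pos] != "|" {
            cmd.Words = append(cmd.Words, p.words[p.pos])
            p.pos++
        }
        return cmd
    }

    // parsePipeline parses: pipeline = command { "|" command }
    func (p *Parser) parsePipeline() *Pipeline {
        pipe := &Pipeline{Commands: []*Command{p.parseCommand()}}
        for p.match("|") { // the { "|" command } repetition from the rule
            pipe.Commands = append(pipe.Commands, p.parseCommand())
        }
        return pipe
    }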


> This video seems to combine the concepts of lexing and parsing. It is usually beneficial to separate these two steps and lex the input into tokens before passing to the parser.

Historically, yes. In recent years combined lexer-parsers have outperformed dedicated lexer + dedicated parser combinations, and with modern tooling this isn't the janky mess it used to be. Some of the best tools out there are combined lexer-parsers.


> - This video seems to combine the concepts of lexing and parsing. It is usually beneficial to separate these two steps and lex the input into tokens before passing to the parser.

With traditional techniques, yes. But if you e.g. use parser combinators (which would admittedly be a bit unusual in Go), combining both steps is pretty common.
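As a sketch of what that can look like (hypothetical code, not any real Go combinator library): each parser is just a function from input to (matched text, rest, ok), so character-level matching and higher-level structure compose with the same combinators and there is no separate token stream:

    package main

    import (
        "fmt"
        "strings"
    )

    // Parser consumes a prefix of the input and returns the parsed text,
    // the remaining input, and whether it matched.
    type Parser func(in string) (out string, rest string, ok bool)

    // Lit matches a literal string.
    func Lit(s string) Parser {
        return func(in string) (string, string, bool) {
            if strings.HasPrefix(in, s) {
                return s, in[len(s):], true
            }
            return "", in, false
        }
    }

    // Many1 matches p one or more times, concatenating the results.
    func Many1(p Parser) Parser {
        return func(in string) (string, string, bool) {
            out, rest, ok := p(in)
            if !ok {
                return "", in, false
            }
            for {
                more, r, ok := p(rest)
                if !ok {
                    return out, rest, true
                }
                out, rest = out+more, r
            }
        }
    }

    // Seq matches p then q, returning both results joined.
    func Seq(p, q Parser) Parser {
        return func(in string) (string, string, bool) {
            a, rest, ok := p(in)
            if !ok {
                return "", in, false
            }
            b, rest, ok := q(rest)
            if !ok {
                return "", in, false
            }
            return a + b, rest, true
        }
    }

    func main() {
        letter := func(in string) (string, string, bool) {
            if len(in) > 0 && in[0] >= 'a' && in[0] <= 'z' {
                return in[:1], in[1:], true
            }
            return "", in, false
        }
        variable := Seq(Lit("$"), Many1(letter)) // "$" followed by letters
        fmt.Println(variable("$foo bar"))        // "$foo", " bar", true
    }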

> - Go actually has a pure Go implementation of Yacc in the toolset, and I've used it in several projects to make parsers. Dealing with the Yacc file is often much easier than dealing with the code directly, since it takes care of writing the actual parser. There is a lot of boilerplate that goes into parsers; when you use Yacc, it "just works".

You are right that it's best to avoid writing Go directly when you can. Just like Java folks (stereotypically) seemed to avoid writing Java at all costs and rather wrote XML config files to drive their logic.

Yacc (and lex) are otherwise not a good choice for specifying languages these days.



