Morgan's "Building an Optimizing Compiler" recommends a "spike" method: http://www.amazon.com/Building-Optimizing-Compiler-Bob-Morga.... The idea is that you write just enough of each stage, from the front-end to the back-end, to get one feature working. Then do the same for the next feature, and so on. That way, you have very early insight into how the stages fit together.
Ghuloum's scheme compiler tutorial essentially follows this method: http://scheme2006.cs.uchicago.edu/11-ghuloum.pdf. It starts with emitting x86 machine code for constants, then unary and binary primitives, local variables, conditionals, heap allocation, procedure calls, and finally closures. It helps that Scheme has little to no parser to speak of, but it is straightforward to extend the technique to a language that does.
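For a sense of how small that first step is, here's a rough Python sketch of the opening exercise: a "compiler" that accepts only integer constants and emits the x86 that returns them (the scheme_entry label and C-driver scaffolding follow the paper; the helper name is mine):

    # Step 1 of Ghuloum's tutorial: the whole language is integer
    # constants, and "compilation" emits x86 assembly as text.
    def compile_program(x):
        assert isinstance(x, int)      # constants only, for now
        return "\n".join([
            "    .text",
            "    .globl scheme_entry",
            "scheme_entry:",
            f"    movl ${x}, %eax",    # return value goes in %eax
            "    ret",
        ])

    print(compile_program(42))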
I'm convinced this is the only sane method when you are self-teaching. If you try to write an entire LALR grammar at once and then you screw up, you're in a world of hurt trying to figure out what went wrong.
I was a TA on a compilers course that worked backwards. I'd taken the course myself two years earlier when it was taught by a different lecturer and found it excellent, and most of my cohort agreed.
When reversed, however, most of my students found it extremely hard to wrap their heads around the first part of the course, i.e. code generation from an AST. They didn't have sufficient grounding in recursive data structures and algorithms. While this might be a problem with the degree program rather than the compilers course, I believe that learning to write a recursive descent parser (by hand) provides this grounding and makes the code generation bit much easier to grasp.
Unfortunately the new lecturer had also done away with teaching the students how to write parsers by hand, instead encouraging them to use parser generators. I thought that was tragic. One of the best moments of my programming life was writing a recursive descent parser for a non-trivial grammar in one sitting (about 500 lines of python), hitting run and having it just work first time. It was orgasmic.
> One of the best moments of my programming life was writing a recursive descent parser for a non-trivial grammar in one sitting (about 500 lines of python), hitting run and having it just work first time.
This was, in the grand scheme of things, likely offset by the hours of frustration most of the students went through while debugging their parsers, which did not work perfectly the first time.
People at Brown University simply cut the syntactic part away from compilation. They use s-exps and start working on semantics and compilation. IMO it is the best way. Syntax is an immense waste of time. The fun is in formal transformations, IR, tiles, register allocation and such.
Agree, grammars can be fun, but IMO syntax is the source of many wasted efforts and endless debates. Just use s-exps. Easy to parse, easy to manipulate, maybe not the easiest to read statically, but you can ask your ide/editor to query/transform it to check or gather data.
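To back up the "easy to parse" claim, here's a rough sketch of an s-expression reader in Python; the tokenizer is deliberately naive (no strings, no comments):

    # A minimal s-expression reader: tokenize, then one recursive function.
    def tokenize(src):
        return src.replace("(", " ( ").replace(")", " ) ").split()

    def read(tokens):
        tok = tokens.pop(0)
        if tok == "(":
            lst = []
            while tokens[0] != ")":
                lst.append(read(tokens))
            tokens.pop(0)              # drop the closing ")"
            return lst
        return int(tok) if tok.lstrip("-").isdigit() else tok

    print(read(tokenize("(+ 1 (* 2 3))")))   # ['+', 1, ['*', 2, 3]]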
My opinion is that a compiler is best understood as partial evaluation of an interpreter. So I'd start with an interpreter and give an example of compilation which simply concatenates the code parts of the interpreter without dispatch logic. This is what compilation is. The rest is optional optimizations.
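A toy illustration of that view, assuming a tuple-based AST: the interpreter dispatches on node shape at run time, while the "compiler" performs the same dispatch once, up front, and concatenates the residual code of each case:

    # Interpreter: dispatch on node shape at run time.
    def interp(node, env):
        if isinstance(node, int):
            return node
        if isinstance(node, str):
            return env[node]
        op, lhs, rhs = node            # binary ops only, for brevity
        a, b = interp(lhs, env), interp(rhs, env)
        return a + b if op == "+" else a * b

    # "Compiler": do the dispatch once and glue together the code
    # fragments of each case; the residual program has no dispatch.
    def compile_(node):
        if isinstance(node, int):
            return str(node)
        if isinstance(node, str):
            return f"env[{node!r}]"
        op, lhs, rhs = node
        return f"({compile_(lhs)} {op} {compile_(rhs)})"

    expr = ("+", "x", ("*", 2, 3))
    print(interp(expr, {"x": 4}))                   # 10
    print(compile_(expr))                           # (env['x'] + (2 * 3))
    print(eval(compile_(expr), {"env": {"x": 4}}))  # 10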
Similarly, I think parsing is best understood by implementing a parsing algorithm. While popular algorithms like LL and LR are rather complex, CYK algorithm is straightforward to implement, directly corresponding to the definition of context-free grammar. Other algorithms are optimizations.
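As a rough sense of its size, here's a CYK sketch in Python for a grammar already in Chomsky normal form (the grammar and its representation are invented for illustration):

    # CYK: rules are (nonterminal, rhs) pairs, where rhs is either a
    # terminal string or a pair of nonterminals (Chomsky normal form).
    def cyk(words, rules, start="S"):
        n = len(words)
        # table[i][l-1] = nonterminals deriving words[i:i+l]
        table = [[set() for _ in range(n)] for _ in range(n)]
        for i, w in enumerate(words):
            table[i][0] = {a for a, rhs in rules if rhs == w}
        for length in range(2, n + 1):
            for i in range(n - length + 1):
                for split in range(1, length):
                    left = table[i][split - 1]
                    right = table[i + split][length - split - 1]
                    for a, rhs in rules:
                        if isinstance(rhs, tuple) and rhs[0] in left and rhs[1] in right:
                            table[i][length - 1].add(a)
        return start in table[0][n - 1]

    rules = [("S", ("A", "B")), ("A", "a"), ("B", "b")]   # S -> AB, A -> a, B -> b
    print(cyk(["a", "b"], rules))   # True
    print(cyk(["b", "a"], rules))   # False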
For ordering, I think it is best to start with AST and its interpretation. And then parsing and lexing (in that order). And then compilation. And then extending the language with features. And then various optimizations.
If you want any chance of understanding real compilers, the "optional" optimizations and parsing algorithms are actually pretty important. I do think starting with an interpreter for an AST is just as good a method as starting with code generation, and code optimization can probably wait until the end, but teaching parsing should avoid any kind of CYK/LL/LR unless you're specifically teaching grammar theory.
Using a parser generator of some kind is possibly okay to teach the concept of a grammar, but it's terrible for error handling, efficiency, and simplicity. For most programming languages, nothing beats a plain old hand-written recursive descent parser (optionally with pratt parsing for expressions) on those metrics. And indeed, most compilers I've seen in production are written that way.
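For a feel of how little the Pratt part adds, here's a minimal sketch over a pre-split token list; the precedence table and the absence of error handling are deliberate simplifications:

    # Minimal Pratt parsing for binary expressions.
    PREC = {"+": 10, "-": 10, "*": 20, "/": 20}

    def parse_expr(tokens, min_prec=0):
        node = tokens.pop(0)           # atom: a number or a name
        while tokens and tokens[0] in PREC and PREC[tokens[0]] >= min_prec:
            op = tokens.pop(0)
            rhs = parse_expr(tokens, PREC[op] + 1)   # +1 => left-assoc
            node = (op, node, rhs)
        return node

    print(parse_expr("1 + 2 * 3 - 4".split()))
    # ('-', ('+', '1', ('*', '2', '3')), '4')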
I think that a lot of things are better taught from the metal upward than the other way around.
I had two courses on networking. The first taught the layers top to bottom; I learned the list of layers by heart to pass the course but didn't understand it. The other went bottom to top, and everything was clear.
Same thing with virtual functions: once I'd learned how they were implemented, I was ready to understand why they were useful, not before.
The only problem with going backwards is that you are working from a drive-by spec at each step. You can tell students "this is the instruction set for the code you need to generate" and then "this is the intermediate language for the language you will parse."
None of the work is motivated concretely by the work of the prior step.
Fine if you are designing the representations for them, but I'm wondering how they would design an AST for a language they haven't seen or haven't thought about how to parse.
It would be an interesting experiment. If you were going to have them design the internal representations, you could then have them move backward in a way that allows them to compile a family of languages. They could create something as general as the LLVM suite. :)
> I'm wondering how they would design an AST for a language they haven't seen or haven't thought about how to parse.
A third alternative would be to start in the "middle" with the AST, and go both forwards and backwards from there. :)
So, start with the AST as a given, or even have them design it, and use it to drive the design of the concrete syntax, static analysis, code generation, etc. At each phase, cover relevant concerns (i.e. "here's how to avoid shift-reduce conflicts in the concrete syntax").
I know nothing about teaching, so probably this wouldn't work, just speculation.
The motivation is the same as one gets from moving from C to C++; you start with a small language (in this case, assembly) and then build on top of it. I've tried this recently in my own course, and found that the morale boost from having something that actually ran and produced results that they understood significantly improved learning.
As for AST design, it is easier to design the AST first than you might think. Just ask the designers of Lisp.
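A small Python sketch of what AST-first design can look like, with an invented node set; the Lisp move is then to adopt this notation (or s-exps) as the concrete syntax as well:

    from dataclasses import dataclass

    # The language is just these node types; concrete syntax can be
    # decided -- or redesigned -- later.
    @dataclass
    class Num:
        value: int

    @dataclass
    class Var:
        name: str

    @dataclass
    class BinOp:
        op: str
        lhs: object
        rhs: object

    # The program (x + 1) * 2, written directly as an AST.
    prog = BinOp("*", BinOp("+", Var("x"), Num(1)), Num(2))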
I have another idea, based on the observation that traditional compiler building courses are like a "waterfall" and the OP's proposal is simply a "reverse waterfall."
What if the course was designed based on an "Agile," evolutionary methodology?
In that course, you would divide the coursework into four "scrums". In each scrum you do a little bit of everything (a little bit of lexing, a little bit of parsing, a little bit of syntax analysis, a little bit of code generation). In the first scrum you would also cover the architecture and a high-level overview of the "roadmap."
Of course, at the end of the first "scrum" you would have a very primitive compiler for a very primitive language and without a lot of error checking, but students would have been exposed to the end to end experience.
The second and subsequent "scrums" would only deepen each part of the compiler, and students would benefit from knowing the interactions across modules and simply learning more sophisticated techniques.
I remember, when building my first compiler in college, having no clue why we were learning about tokenization and state machines for so long when I really wanted to generate optimized code.
I hate the 'scrum' + 'agile' buzzwords, but if those are stripped out, I like the idea of starting off by building a crappy but end-to-end compiler for a language that does almost nothing, and making it nicer as the course moves along.
I understand that, I'm just unsure of what benefit having scrums would be in this context, or whether that term is even appropriate for what is essentially an instructor introducing the next assignment (which never needed a special name before.)
I've taught a compiler design course every other year for a while, and one of the big drawbacks is that the students don't have an intimate knowledge of machine/assembly language for any specific architecture. 20 years ago programmers understood how the underlying machine and OS worked in detail -- nowadays programmers/students are strongly insulated from these details.
Ideally I would want to teach a two-semester course starting with the backend first -- the first task would be to write several significant programs in assembly language (for some clean RISC machine) and grade them on how well they managed register allocation, stack frame / heap management, page alignment, etc. Then cover how to convert some SSA IR (e.g., LLVM) into that language.
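As a toy flavor of that second half, here's a sketch translating a made-up three-address SSA snippet into MIPS-flavored assembly, with a deliberately naive one-register-per-value allocation (none of this is real LLVM tooling):

    # Made-up SSA-style IR: (op, dst, src1, src2) tuples.
    ir = [
        ("add", "t0", "a", "b"),     # t0 = a + b
        ("mul", "t1", "t0", "c"),    # t1 = t0 * c
        ("ret", "t1"),
    ]

    regs = {}
    def reg(name):
        # assumption: unlimited registers $t0, $t1, ... -- exactly the
        # shortcut a real register allocator exists to remove
        return regs.setdefault(name, f"$t{len(regs)}")

    for instr in ir:
        if instr[0] == "ret":
            print(f"    move $v0, {reg(instr[1])}")
            print("    jr   $ra")
        else:
            op, dst, a, b = instr
            print(f"    {op}  {reg(dst)}, {reg(a)}, {reg(b)}")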
How do you translate into a language you don't understand?
At least at GT 10 years ago, a machine architecture course (starting with logic gates, moving on to MIPS assembly, then CPU architecture) was pretty early in the CS curriculum. Seems pretty essential, is it not common?
At Utah (graduated three months ago) we were required to take one class that used MIPS and another class that used x86 (the compilers class was optional). They were great, and I agree seem pretty essential.
What better way to teach someone assembly, etc, than to teach them how to convert code that they understand to code that the architecture can understand?
When I wrote my first compiler (really more of a transpiler), I didn't know any theory about it. I just knew how regexes worked, had an intuition that I couldn't just regex my way out of everything, and went at it.
Started with an interpreter. Eventually turned it into a compiler. Really fun project, ran into a bunch of weird issues that would have quickly been solved had I read any theory whatsoever about compilers.
But looking back at it now, after having taken an actual compilers course in college, it did sort of have all the stages. They just happened to all be mushed together and not very cleanly separated or very cleanly defined.
The grammar, for instance, was more like a bunch of if statements and function calls that kinda sorta made the compiler do what I wanted ... but corner cases kept cropping up ...
I kind of agree with the article. I did the Edinburgh course on Compiling Techniques, and I remember the first coursework being to use ANTLR to convert something like tinyc (or at least a very small subset of C) to an AST. Most of this was just fiddling around with the grammar file until the output seemed sensible. The second part was writing an AST -> JVM bytecode compiler.
Going back to front might have been better: we would have had more time for the 'hard' part (it would have fallen during the quieter first half of the semester), and it would have given us a better understanding of how the AST should look, since we would have had to write/modify examples of the AST to get the backend working properly.
Another advantage of working backwards is that it's more obvious that the abstract syntax can be designed before the concrete.
Given an abstract syntax, it's super-easy to create an unambiguous concrete one (it may be ugly and verbose, but can completely avoid the issues of operator precedence, shift-reduce conflicts, lookahead, ambiguity, top-down vs. bottom-up, etc. etc.). If concrete syntax turns out to be interesting, it's always possible to design a second one (for the same abstract syntax) that's much more usable, pretty, etc.
When the language is modified, the focus can be placed more on the abstract syntax, and less on concrete syntax.
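A sketch of the "ugly but unambiguous" route: print every node fully delimited in prefix form, so precedence, associativity, and lookahead never arise (the tuple AST encoding is invented for illustration):

    # Derive a verbose but unambiguous concrete syntax from the AST.
    def to_concrete(node):
        if isinstance(node, tuple):
            head, *args = node
            return "(" + " ".join([head] + [to_concrete(a) for a in args]) + ")"
        return str(node)

    ast = ("-", ("+", 1, ("*", 2, 3)), 4)
    print(to_concrete(ast))   # (- (+ 1 (* 2 3)) 4)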
I agree wholeheartedly with this. I go to Stanford and took the compilers course this past spring. Honestly, I almost dropped the class after the first 2 assignments because having to learn Bison and Flex to implement a lexer and parser taught me very little about what was actually happening under the hood and I spent most of my time figuring out the syntax. Static analysis and code gen were really cool though and I feel like those assignments taught me a lot more about what modern compilers are actually doing and how they can help optimize my code.
Just another data point, but I really enjoyed my course, even starting with the lexing/parsing. We did do everything from scratch though. I imagine learning to use bison/flex in a class isn't very rewarding.
Was it that you wanted to write your own custom recursive-descent parser, or you just wanted to use ANTLR or something? To me it makes sense that a compilers course would de-emphasize parsing.
Many compiler courses concentrate too much on the lexing and parsing parts of the front end, which I find quite daunting and boring. An alternative is to start the course from the AST.
The details are blurry but one thing I remember from my compilers course (Warsaw University, around 1995) was generation of simple stack-based machine code from an AST. Arithmetic, simple branching and loops were covered.
This was a very effective way to grasp code generation without getting bogged down in lexing/parsing, register assignment (the machine was stack-based) and so on.
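The arithmetic part of that exercise looks roughly like this sketch, targeting an invented stack machine:

    # Post-order traversal: operands first, operator last.
    def compile_stack(node, code):
        if isinstance(node, int):
            code.append(("PUSH", node))
        else:
            op, lhs, rhs = node
            compile_stack(lhs, code)
            compile_stack(rhs, code)
            code.append(("ADD" if op == "+" else "MUL",))
        return code

    def run(code):
        stack = []
        for instr in code:
            if instr[0] == "PUSH":
                stack.append(instr[1])
            else:
                b, a = stack.pop(), stack.pop()
                stack.append(a + b if instr[0] == "ADD" else a * b)
        return stack.pop()

    code = compile_stack(("+", 1, ("*", 2, 3)), [])
    print(code)   # [('PUSH', 1), ('PUSH', 2), ('PUSH', 3), ('MUL',), ('ADD',)]
    print(run(code))   # 7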
Sadly we later moved to the hard parts. Manual LALR(1) machine generation was probably the dullest grindwork in the whole CS curriculum :)
The way compilers works at CMU is by the end of the first lab you have a fully functional compiler lexing/parsing -> code generation for a subset of the C0 language. With each subsequent lab, you implement more and more features of the language. http://www.cs.cmu.edu/~fp/courses/15411-f13/assignments.html
This is definitely closer to what is described in this post than most compilers courses, but not entirely the same.
I agree with the post in that the current common method of starting with _just_ parsing/lexing is wrong. It's a shame that the first part of the compiler you write in these courses is, in my opinion, the most boring and least relevant to compiler theory. CMU's method of giving you the whole, although fuzzy, picture within the first two weeks wasn't viewed as wrong by me, my lab partner, or anyone else I knew who took the compilers course. In fact, everyone I knew who successfully completed the course highly recommended it to everyone they knew. This is a pretty good measure of success for a course.
I think the reason books/courses spend so much time on lexing/parsing is that finite automata, etc, are very broadly applicable concepts, more so than the later material.
The parsing/lexing part is the only part I've made direct use of in real projects. Everything else provided very important background knowledge that has occasionally proved useful, but little that applies to anything other than writing a compiler.
It seems like it would be difficult to combine this teaching approach with the nanopass approach. At least, I can't immediately see how it would work or feel to do that.
When I was there (and I assume still) the UCSC compiler courses did it the forward way. It was a fantastic experience; I don't know how it would compare to this flipped version (for which I can certainly see an argument).