Some advice to those who, like the OP, are encountering these topics for the first time: try to gain the historical perspective; you'll understand everything much better. Find the sources of the original C compilers, marvel that they are probably smaller than Yacc and Lex themselves (at least that's how I remember them), and then realize that neither Yacc nor Lex was ever needed for those C compilers. You learn about Yacc and Lex in school more for their "educational value" than because they are the easiest tools for building a C compiler or parser.
For the original authors, the cost of not having a context-free grammar was just a few extra lines of code. Their goal was certainly not academic "parser" purity. Parsing was one of the less interesting parts of building UNIX and C some 40 years ago.
But seriously, the mystery and ignorance that continue to surround Yacc to this day need to be fought. If you're using Yacc, you're doing it wrong. As far as parser generators go, it is totally obsolete, and even when it wasn't, it was the wrong choice for almost all parsing problems. I learned this the day someone recoded a Yacc-based parser I wrote in recursive-descent style, and it came out to fewer lines of code than my Yacc specification.
If you need to write a really simple, or a really complicated, parser, recursive descent is usually the way to go. You might try Henry Baker's META (http://home.pipeline.com/~hbaker1/Prag-Parse.html) as a first alternative. If you really, really feel the need for a parser generator, take a look at parsing expression grammars/packrat parsers: http://pdos.csail.mit.edu/~baford/packrat/
But the real lesson here is to design your syntax to be easy to parse and manipulate in the first place. C's grammar is the clearest testament to how little thought Dennis Ritchie put into the design of C.
I have an interest in making parsers, but all mine so far are either hacks or made with ANTLR. Does this fall under "doing it wrong"? (I presume the answer is yes) How would you recommend I go about learning how to do it right?
ANTLR is one of the parser generators that makes Yacc obsolete. So you're doing something right. (I personally think ANTLR is too complicated.)
I've tried working with other parser generators, but I always keep coming back to hand-coded recursive descent. The biggest advantages IMO are that hand-coded parsers are easier to debug, and they can be made to report errors in the input much better than generated ones.
Just don't "make parsers." Make something that does something, better than some other software does it. The parser is just one piece of what you make. Once you have the goal and need the parser piece, use whatever you like, but produce the results you wanted. That's how you really learn something.
Thanks for the condescending advice... one of the last parsers I wrote was for a website with 125,000 active users, they are far from the be-all and end-all for me.
I agree completely about Lex & Yacc, and this is the conclusion I came to at the end of the article. Yacc (leaving Lex out for a moment, since it's a different story) is being taught for its interesting educational value, but once in the wild, you just find that most real-world compilers don't use it.
Regarding historical perspective, unfortunately it isn't so easy to gain. I actually consulted a lot of comp.compiler discussions from the early 1990s, but most links in them point to various non-existent FTP sites :-/
Do yourself a favor and search the net for the oldest preserved sources of C compilers and UNIX. Then take a look at them, especially at how concise they are. I'd enjoy reading your post about that experience. :)
So using yacc and lex to do C compilation is a newish idea. Using yacc makes the hard part harder and the easy part easier (dgc).
What is missing from the discussion in this generally good article is that it isn't so much a "hack to the lexer": you put the information (scope, type) into the symbol table, and the lexer pushes out to yacc a symbol with the token type attached (variable, typename, etc.).
As another commenter here says, C compilers weren't originally done with yacc or other parser generator, they were recursive descent.
Does the parser really need to know this information? XX yy; only means one thing: declare( type: XX, variable: yy). Making that work is the job of something else down the pipeline that knows about the concept of types and symbol tables.
As an example, Emacs syntax-highlights both "YY xx" and "int xx" the same way.
Fair enough. Would the problem be fixed if the C standard said that x (without spaces) had to be dereferencing and that x y (with spaces) had to be multiplication?
Yeah, if you could purely lexically distinguish dereferencing from multiplication, either because they used a different symbol, or had different mandatory whitespace rules, then there'd be no ambiguity, at least in this example.
That's basically what C++ did by requiring the space in:
Foo<Bar<Baz> > var;
to make it easy to lexically distinguish the '>>' right-shift operator from the '> >' sequence of two successive template-parameter closing symbols.
C++0x is changing that, though, due to the unpopularity of making programmers accommodate what looks like a parser-implementation hack.
That's not a syntactic ambiguity. Typically (when you're not trying to compress multiple passes, be clever and fast etc.), identifiers are not resolved during parsing, so it doesn't matter whether x denotes a type or a variable when the sizeof operator is applied.
You can write "A * B" and, depending on the nature of A, that is either a declaration of B as a pointer to type A, or a multiplication of A and B. You could say that such a multiplication would be 'void' (as in 'not assigned to anything'), and I could come up with an even more complicated example, like "A * B();", which is either a function declaration or a multiplication of A by a call to a function named B() with side effects, say. But that's not the point: if the parser has to apply heuristics like that to parse the language properly, the language is already context-dependent, or at least not LR(0) or LALR(1).
I think it does matter, although my example fails to make clear why. Type identifiers have different syntactic requirements from regular identifiers (the parens in sizeof(typename) are required, but the parens in sizeof(varname) are optional).
Yes, but now you're getting deeper into requirements for context-sensitive grammar; go far enough, and you might as well go all Van Wijngaarden. A context-free parse is free to create an AST like (sizeof (parens (ident "x"))) for one, and (sizeof (ident "x")) for the other, and disambiguate based on the symbol table lookup of "x" later.
A complete parser has to be strict. Emacs's syntax-highlighting, as we all know, is very heuristic and approximate. While it does a good job for the common cases, it's easy to break it with more freakish examples. A real compiler can't allow that.
To parse this correctly, the parser has to receive a "TYPE-NAME" token for XX and an "IDENTIFIER" token for yy. If the parser receives an IDENTIFIER token for XX, this will result in a parse error.
To be more precise, "a context-free parser has to receive a TYPE-NAME token...". A hand-coded parser would be able to parse things just fine if both XX and yy were IDENTIFIERs.
Or, to put it another way, you resort to uglifying the lexer to overcome deficiencies in the parser.
But "XX XX;" (the example was given in the article) is valid, and the parser needs to know the difference.
The article says that this code is "evil", but I disagree. There is a valid pattern that I use quite often (I described it in an article for ACCU a few years ago, too). It's an example from C++ but I think I could apply it to C as well. Consider (with a modification of the example I used in that article):
PirateShip s;
printf("Name of the first cannon: %s\n", s.Cannons.cannon1);
which is very readable and idiomatic (once you get used to it ;) )
More abstract, it provides an idiomatic way to group collections of objects/structs at the source code/syntactic level.
(One could name the struct and the member differently, I realize; the benefit of using the same name shows up when you define constant members in the inner struct: you can then reference constants and properties by the same name and never have to think about which one to use.)
(Also, again, I use this from C++; there may be things that are different in C, but AFAIK this part is the same.)
The parser needs to know about it iff you want your parser to disambiguate certain scenarios. An alternative approach is to handle the ambiguous scenarios generically in a specific parser construct (e.g. build a special AST node), and resolve it later after you've collected type information.
I've been using an LALR(1) parser generator by Paul Mann http://highperware.com/ which gets around this context sensitivity problem. I've had to hack it a bit to handle scopes, but this wasn't a major problem.
It generates an amazing lexer/parser, which can do something like a million lines of code a second. But he hasn't open sourced the parser generator nor ported it from Windows. I've been trying to convince him to do so, and he came oh so close recently.
He's been trying to figure out how to make money from it if he Open Sources it. As it is, everyone seems to be totally ignoring it despite its sophistication compared with flex/bison.
Here is how to specify a sample grammar to handle the above statement with a new product called LRSTAR. Note that this grammar has no conflicts and is LALR(1). Here is the grammar: