The context sensitivity of C’s grammar (thegreenplace.net)
73 points by yan on May 2, 2011 | 43 comments


Cool to see this posted here. Just for completeness: this article is part 2 of the one here: http://eli.thegreenplace.net/2007/11/24/the-context-sensitiv..., so if the issue interests you, start with that one.


Thanks for the great blog; I always make sure to read it first in my feed reader.


Some advice to those who, like the OP, are encountering these topics for the first time: try to gain the historical perspective; you'll understand everything much better. Find the sources of the original C compilers, marvel that they are probably smaller than Yacc and Lex themselves (at least that's how I remember them), and then understand that neither Yacc nor Lex was ever needed for those C compilers. You learn about Yacc and Lex in school more for their "educational value" than because they are the easiest tools for making a C compiler or parser.

When the original authors wrote the compiler, the consequence of not having a context-free grammar was just a few extra lines of code. Their goal was certainly not academic "parser" purity. Parsing was one of the less interesting parts of building UNIX and C some 40 years ago.


Amateurs use parser-generators.

Professionals write recursive-descent parsers.

Real men code finite state machines by hand: http://galileo.phys.virginia.edu/classes/551.jvn.fall01/fsm....

But seriously, the mystery and ignorance that continue to surround Yacc to this day need to be fought. If you're using Yacc, you're doing it wrong. As far as parser-generators go, it is totally obsolete, and even when it wasn't, it was the wrong choice for almost all parsing problems. I learned this one day when someone recoded a Yacc-based parser I wrote in recursive-descent style, and it came out as fewer lines of code than my specification.

If you need to write a really simple, or a really complicated, parser, recursive-descent is usually the way to go. You might try Henry Baker's META (http://home.pipeline.com/~hbaker1/Prag-Parse.html) as a first alternative. If you really, really feel the need for a parser-generator, take a look at parsing expressions grammars/packrat parsers: http://pdos.csail.mit.edu/~baford/packrat/

But the real lesson here is to design your syntax to be easy to parse and manipulate in the first place. C's grammar is the clearest testament to how little thought Dennis Ritchie put into the design of C.
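To make the recursive-descent recommendation concrete, here's a minimal sketch for a toy grammar of integer expressions with '+', '*' and parentheses. Everything in it is illustrative, not taken from any real compiler; the point is just that each grammar rule maps to one plain C function:

    #include <ctype.h>
    #include <stdio.h>
    #include <stdlib.h>

    static const char *p;           /* cursor into the input string */

    static long expr(void);         /* forward declaration: grammar is recursive */

    static void skipws(void) { while (isspace((unsigned char)*p)) p++; }

    /* primary := NUMBER | '(' expr ')' */
    static long primary(void) {
        skipws();
        if (*p == '(') {
            p++;                    /* consume '(' */
            long v = expr();
            skipws();
            if (*p == ')') p++;     /* consume ')'; real code would diagnose errors */
            return v;
        }
        char *end;
        long v = strtol(p, &end, 10);
        p = end;
        return v;
    }

    /* term := primary ('*' primary)* */
    static long term(void) {
        long v = primary();
        for (skipws(); *p == '*'; skipws()) {
            p++;                    /* consume '*' */
            v *= primary();
        }
        return v;
    }

    /* expr := term ('+' term)* */
    static long expr(void) {
        long v = term();
        for (skipws(); *p == '+'; skipws()) {
            p++;                    /* consume '+' */
            v += term();
        }
        return v;
    }

    int main(void) {
        p = "2 + 3 * (4 + 5)";
        printf("%ld\n", expr());    /* prints 29 */
        return 0;
    }

Precedence falls out of the call structure (term binds tighter than expr), which is exactly the kind of thing that's easy to see and debug in this style.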


I have an interest in making parsers, but all mine so far are either hacks or made with ANTLR. Does this fall under "doing it wrong"? (I presume the answer is yes) How would you recommend I go about learning how to do it right?


ANTLR is one of the parser generators that makes Yacc obsolete. So you're doing something right. (I personally think ANTLR is too complicated.)

I've tried working with other parser generators, but I always come back to hand-coded recursive descent. The biggest advantage IMO is that hand-written parsers are both easier to debug and can be made to report errors in the input much better than generated ones.


Just don't "make parsers." Make something that does something, better than some other software. A parser is just one piece of what you make. Once you have the goal, use whatever you like to make the parser piece, but produce the results you wanted. That's how you really learn something.


Thanks for the condescending advice... one of the last parsers I wrote was for a website with 125,000 active users; parsers are far from the be-all and end-all for me.


I agree completely about Lex & Yacc, and this is the conclusion I came to at the end of the article. Yacc (leaving Lex out for a moment, since it's a different story) is being taught for its interesting educational value, but once in the wild, you just find that most real-world compilers don't use it.

Regarding historical perspective, unfortunately it isn't so easy to gain. I actually consulted a lot of comp.compiler discussions from the early 1990s, but most links in them point to various non-existent FTP sites :-/


Do yourself a favor and search the net for the oldest preserved sources of C compilers and UNIX. Then take a look at them, especially at how concise they are. I'd enjoy reading your post about that experience. :)


Since you appear to be knowledgeable in these matters, perhaps you could point me to such a source :)

FWIW, I did mention 'tcc' in the article, the "tiny C compiler" with rather compact and clean source code, probably smaller than the code for Bison.


The full Unix V6 source is here as a tarball:

http://unixarchive.tliquest.net/PDP-11/Distributions/researc...

In a 500 KB gzipped tar you get at least the whole of Unix, the whole standard library, and the whole C compiler. For the book that explains the sources, see:

Book: Lions' Commentary on the Sixth Edition of UNIX

http://news.ycombinator.com/item?id=2506176


Primeval C: two very early compilers

http://news.ycombinator.com/item?id=2506032


Great, thanks!


So using yacc and lex to do C compilation is a newish idea. Using yacc makes the hard part harder and the easy part easier (dgc).

What is missing from the discussion in a generally good article here is that it isn't so much a "hack to the lexer" as that you put information in the symbol table (scope, type) and push out to yacc a token with the type attached (variable, typename, etc.).

As another commenter here says, C compilers weren't originally done with yacc or other parser generator, they were recursive descent.
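In sketch form (the token names and the symbol-table call here are hypothetical, just to show the shape of the technique):

    /* Hypothetical token codes, as a yacc grammar would declare them. */
    enum { IDENTIFIER = 258, TYPE_NAME = 259 };

    /* Hypothetical symbol-table query; the parser's semantic actions
       fill the table in (with scope and type info) as typedef
       declarations are reduced. */
    extern int symtab_is_typedef(const char *name);

    /* The lexer calls this after scanning an identifier's spelling,
       so yacc receives a token whose type is already attached. */
    int classify_identifier(const char *spelling)
    {
        return symtab_is_typedef(spelling) ? TYPE_NAME : IDENTIFIER;
    }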


Does the parser really need to know this information? XX yy; only means one thing: declare( type: XX, variable: yy). Making that work is the job of something else down the pipeline that knows about the concept of types and symbol tables.

As an example, Emacs syntax-highlights both "YY xx" and "int xx" the same way.


Yes, the parser needs to know about this information. Example:

    (A) * B
This is either A multiplied by B, or type A casting the dereferenced value of B: http://en.wikipedia.org/wiki/The_lexer_hack
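For instance, two alternative surrounding contexts (not one compilable unit) flip the meaning of the very same characters:

    /* With these declarations in scope, "(A) * B" is a multiplication: */
    int A, B;

    /* ...but with these instead, "(A) * B" casts the dereferenced B to
       type A, because A now names a type: */
    typedef int A;
    int *B;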


Fair enough. Would the problem be fixed if the C standard said that "*x" (without spaces) had to be dereferencing and that "x * y" (with spaces) had to be multiplication?


Yeah, if you could purely lexically distinguish dereferencing from multiplication, either because they used a different symbol, or had different mandatory whitespace rules, then there'd be no ambiguity, at least in this example.

That's basically what C++ did by requiring the space in:

   Foo<Bar<Baz> > var;
to make it easy to lexically distinguish the '>>' right-shift operator from the '> >' sequence of two successive template-parameter closing symbols.

C++0x is changing that, though, due to the unpopularity of making programmers accommodate what looks like a parser-implementation hack.


It was as if a million C programmers cried out and were suddenly silenced.

Significant whitespace is evil, at least in the context of C where it is not significant anywhere else.


Whitespace is significant in C. Consider the difference between "f o o" and "foo" :)


Idon'tknowwhatyou'retalkingabout! :)


You like your Commodore 64 BASIC (or Ye Olde Fortran)?


No, consider "int x; sizeof(x);" versus "typedef int x; sizeof(x);".


That's not a syntactic ambiguity. Typically (when you're not trying to compress multiple passes, be clever and fast etc.), identifiers are not resolved during parsing, so it doesn't matter whether x denotes a type or a variable when the sizeof operator is applied.


You can write "A * B;" and, depending on the nature of A, that is either a declaration of B as a pointer to type A, or a multiplication of A and B. You could say that such a multiplication would be 'void' (as in 'not assigned to anything'), and I could come up with an even more complicated example, like "A * B();", where this is either a function declaration or a multiplication of A by a function named B() with side effects, let's say. But that's not the point: if the parser has to do heuristics like that to parse the language properly, it is already context-dependent, or at least not LR(0) or LALR(1).


I was talking about sizeof.


I think it does matter, although my example fails to make clear why. Type identifiers have different syntactic requirements to regular identifiers (the parens in sizeof(typename) are required, but the parens in sizeof(varname) are optional).


Yes, but now you're getting deeper into requirements for context-sensitive grammar; go far enough, and you might as well go all Van Wijngaarden. A context-free parse is free to create an AST like (sizeof (parens (ident "x"))) for one, and (sizeof (ident "x")) for the other, and disambiguate based on the symbol table lookup of "x" later.
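To make the parens point concrete, here's a tiny compilable illustration (assuming a C99 compiler for %zu):

    #include <stdio.h>

    int main(void) {
        int x = 0;
        printf("%zu\n", sizeof x);      /* expression operand: parens optional */
        printf("%zu\n", sizeof (x));    /* (x) is just a parenthesized expression */
        printf("%zu\n", sizeof (int));  /* type-name operand: parens required */
        /* printf("%zu\n", sizeof int);    syntax error */
        return 0;
    }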


A complete parser has to be strict. Emacs's syntax-highlighting, as we all know, is very heuristic and approximate. While it does a good job for the common cases, it's easy to break it with more freakish examples. A real compiler can't allow that.

To parse this correctly, the parser has to receive a "TYPE-NAME" token for XX and an "IDENTIFIER" token for yy. If the parser receives an IDENTIFIER token for XX, this will result in a parse error.


To be more precise, "a context-free parser has to receive a TYPE-NAME token...". A hand-coded parser would be able to parse things just fine if both XX and yy were IDENTIFIERs.

Or, to put it another way, you resort to uglifying the lexer to overcome deficiencies in the parser.


Alternately, you can implement such context-sensitive features in another pass after the parse.


Indeed, which is exactly the point of the article this discussion refers to.


But "XX XX;" (the example was given in the article) is valid, and the parser needs to know the difference.

The article says that this code is "evil", but I disagree. There is a valid pattern that I use quite often (I described it in an article for ACCU a few years ago, too). It's an example from C++ but I think I could apply it to C as well. Consider (with a modification of the example I used in that article):

    struct PirateShip {
        struct Cannons {
            char* cannon1;
            char* cannon2;
        } Cannons;
    };

Now you can use

    PirateShip s;
    printf("Name of the first cannon: %s\n", s.Cannons.cannon1);

which is very readable and idiomatic (once you get used to it ;) )

More abstractly, it provides an idiomatic way to group collections of objects/structs at the source-code/syntactic level.

(One could name the struct and the member differently, I realize; the benefit of using the same name shows up when you define constant members in the inner struct: that way you can reference constants and properties by the same name, and you never have to think about which one to use.)

(Also, again, I use this from C++; there may be things that are different in C, but AFAIK this part is the same.)


I'm not sure how you use "XX XX" in that code sample. Do you refer to the "struct Foo {...} Foo;" idiom? That's quite different, though :)


I think it is, though I may be wrong. It's the same as (trying to preserve formatting this time):

    struct PirateShip {
        struct PirateShip {
            char* cannon1;
            char* cannon2;
        };
        PirateShip PirateShip;
    };
Am I misunderstanding?


I've seen similar code being used, but IMHO it's confusing.


The parser needs to know about it iff you want your parser to disambiguate certain scenarios. An alternative approach is to handle the ambiguous scenarios generically in a specific parser construct (e.g. build a special AST node), and resolve it later after you've collected type information.
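For example (the node shape here is made up, purely to illustrate the approach):

    /* Hypothetical AST node for the ambiguous statement "A * B;":
       the parser records both candidate readings and moves on. */
    struct ambiguous_stmt {
        const char *lhs;    /* "A": a type name or a variable?  */
        const char *rhs;    /* "B": a declarator or an operand? */
        enum { UNRESOLVED, DECLARATION, MULTIPLICATION } kind;
    };

    /* A later pass, armed with collected typedef information,
       settles the question. */
    void resolve(struct ambiguous_stmt *s, int lhs_is_typedef)
    {
        s->kind = lhs_is_typedef ? DECLARATION : MULTIPLICATION;
    }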


I've been using an LALR(1) parser generator by Paul Mann (http://highperware.com/) which gets around this context-sensitivity problem. I've had to hack it a bit to handle scopes, but this wasn't a major problem.

It generates an amazing lexer/parser, which can do something like a million lines of code a second. But he hasn't open-sourced the parser generator or ported it off Windows. I've been trying to convince him to do so, and he came oh so close recently.

He's been trying to figure out how to make money from it if he open-sources it. As it is, everyone seems to be totally ignoring it, despite its sophistication compared with flex/bison.


Ah, the binaries are unavailable at present. He's working on a new version and says it'll be up in a week or so.


Apparently they're back now!


Consider this context-sensitive C code:

    typedef unsigned int uint, uint * uintptr;

Here is how to specify a sample grammar that handles the above statement with a new product called LRSTAR. Note that this grammar has no conflicts and is LALR(1):

<identifier> => lookup ()

Declaration -> VarDecl1 /','... ';' -> typedef VarDecl2 /','... ';'

VarDecl1 -> Type... <identifier>

VarDecl2 -> Type... <identifier> => defterm(2,{typedef})

Type -> SimpleType... -> Type '*'

SimpleType -> char -> int -> short -> unsigned -> {typedef}

You can download LRSTAR at http://HighperWare.com.


The LRSTAR binaries are up now with sample projects at http://HighperWare.com



