Some advice to those who, like the OP, are encountering these topics for the first time: try to gain the historical perspective; you'll understand everything much better. Find the sources of the original C compilers, marvel that they are probably smaller than Yacc and Lex themselves (at least that's how I remember them), and then realize that neither Yacc nor Lex was ever needed for those C compilers. You learn about Yacc and Lex in school more for their "educational value" than because they are the easiest tools for building a C compiler or parser.
For the original authors, the cost of not having a context-free grammar was just a few extra lines of code. Their goal was certainly not academic "parser" purity. Parsing was one of the less interesting parts of building UNIX and C some 40 years ago.
But seriously, the mystery and ignorance that continue to surround Yacc to this day need to be fought. If you're using Yacc, you're doing it wrong. As far as parser generators go, it is totally obsolete, and even when it wasn't, it was the wrong choice for almost all parsing problems. I learned this the day someone recoded a Yacc-based parser I wrote in recursive-descent style, and it came out to fewer lines of code than my Yacc specification.
If you need to write a really simple, or a really complicated, parser, recursive descent is usually the way to go. You might try Henry Baker's META (http://home.pipeline.com/~hbaker1/Prag-Parse.html) as a first alternative. If you really, really feel the need for a parser generator, take a look at parsing expression grammars/packrat parsers: http://pdos.csail.mit.edu/~baford/packrat/
But the real lesson here is to design your syntax to be easy to parse and manipulate in the first place. C's grammar is the clearest testament to how little thought Dennis Ritchie put into the design of C.
I have an interest in making parsers, but all mine so far are either hacks or made with ANTLR. Does this fall under "doing it wrong"? (I presume the answer is yes) How would you recommend I go about learning how to do it right?
ANTLR is one of the parser generators that makes Yacc obsolete. So you're doing something right. (I personally think ANTLR is too complicated.)
I've tried working with other parser generators, but I always keep coming back to hand-coded recursive descent. The biggest advantages IMO are that hand-coded parsers are easier to debug, and they can be made to report errors in the input much better than generated ones.
Just don't "make parsers." Make something that does something, better than some other software does it. The parser is just one piece of what you make. Once you have the goal and need the parser piece, use whatever you like, but produce the results you wanted. That's how you really learn something.
Thanks for the condescending advice... one of the last parsers I wrote was for a website with 125,000 active users, they are far from the be-all and end-all for me.
I agree completely about Lex & Yacc, and this is the conclusion I came to at the end of the article. Yacc (leaving Lex out for a moment, since it's a different story) is being taught for its interesting educational value, but once in the wild, you just find that most real-world compilers don't use it.
Regarding historical perspective, unfortunately it isn't so easy to gain. I actually consulted a lot of comp.compiler discussions from the early 1990s, but most links in them point to various non-existent FTP sites :-/
Do yourself a favor and search the net for the oldest preserved sources of C compilers and UNIX. Then take a look at them, especially at how concise they are. I'd enjoy reading your post about that experience. :)
So using yacc and lex to do C compilation is a newish idea. Using yacc makes the hard part harder and the easy part easier (dgc).
What is missing from the discussion in this generally good article is that it isn't so much a "hack to the lexer": you put the information (scope, type) into the symbol table, and the lexer pushes out to yacc a symbol with the token type attached (variable, typename, etc.).
As another commenter here says, C compilers weren't originally done with yacc or other parser generator, they were recursive descent.
Does the parser really need to know this information? XX yy; only means one thing: declare( type: XX, variable: yy). Making that work is the job of something else down the pipeline that knows about the concept of types and symbol tables.
As an example, Emacs syntax-highlights both "YY xx" and "int xx" the same way.
Fair enough. Would the problem be fixed if the C standard said that x (without spaces) had to be dereferencing and that x y (with spaces) had to be multiplication?
Yeah, if you could purely lexically distinguish dereferencing from multiplication, either because they used a different symbol, or had different mandatory whitespace rules, then there'd be no ambiguity, at least in this example.
That's basically what C++ did by requiring the space in:
Foo<Bar<Baz> > var;
to make it easy to lexically distinguish the '>>' right-shift operator from the '> >' sequence of two successive template-parameter closing symbols.
C++0x is changing that, though, due to the unpopularity of making programmers accommodate what looks like a parser-implementation hack.
That's not a syntactic ambiguity. Typically (when you're not trying to compress multiple passes, be clever and fast etc.), identifiers are not resolved during parsing, so it doesn't matter whether x denotes a type or a variable when the sizeof operator is applied.
You can write "A * B" and, depending on the nature of A, that is either a declaration of B as a pointer to type A, or a multiplication of A and B. You could say that such a multiplication would be 'void' (as in 'not assigned to anything'), and I could come up with an even more complicated example, like "A * B();", which is either a function declaration or a multiplication of A by a call to a function named B() with side effects, say. But that's not the point: if the parser has to apply heuristics like that to parse the language properly, the language is already context-dependent, or at least not LR(0) or LALR(1).
I think it does matter, although my example fails to make clear why. Type identifiers have different syntactic requirements from regular identifiers (the parens in sizeof(typename) are required, but the parens in sizeof(varname) are optional).
Yes, but now you're getting deeper into requirements for context-sensitive grammar; go far enough, and you might as well go all Van Wijngaarden. A context-free parse is free to create an AST like (sizeof (parens (ident "x"))) for one, and (sizeof (ident "x")) for the other, and disambiguate based on the symbol table lookup of "x" later.
A complete parser has to be strict. Emacs's syntax-highlighting, as we all know, is very heuristic and approximate. While it does a good job for the common cases, it's easy to break it with more freakish examples. A real compiler can't allow that.
To parse this correctly, the parser has to receive a "TYPE-NAME" token for XX and an "IDENTIFIER" token for yy. If the parser receives an IDENTIFIER token for XX, this will result in a parse error.
To be more precise, "a context-free parser has to receive a TYPE-NAME token...". A hand-coded parser would be able to parse things just fine if both XX and yy were IDENTIFIERs.
Or, to put it another way, you resort to uglifying the lexer to overcome deficiencies in the parser.
But "XX XX;" (the example was given in the article) is valid, and the parser needs to know the difference.
The article says that this code is "evil", but I disagree. There is a valid pattern that I use quite often (I described it in an article for ACCU a few years ago, too). It's an example from C++ but I think I could apply it to C as well. Consider (with a modification of the example I used in that article):
PirateShip s;
printf("Name of the first cannon: %s\n", s.Cannons.cannon1);
which is very readable and idiomatic (once you get used to it ;) )
More abstract, it provides an idiomatic way to group collections of objects/structs at the source code/syntactic level.
(One could name the struct and the member differently, I realize; the benefit of using the same name shows up when you define constant members in the inner struct: you can then reference constants and properties by the same name and never have to think about which one to use.)
(Also, again, I use this from C++; there may be things that are different in C, but AFAIK this part is the same.)
The parser needs to know about it iff you want your parser to disambiguate certain scenarios. An alternative approach is to handle the ambiguous scenarios generically in a specific parser construct (e.g. build a special AST node), and resolve it later after you've collected type information.
I've been using an LALR(1) parser generator by Paul Mann http://highperware.com/ which gets around this context sensitivity problem. I've had to hack it a bit to handle scopes, but this wasn't a major problem.
It generates an amazing lexer/parser, which can do something like a million lines of code a second. But he hasn't open sourced the parser generator nor ported it from Windows. I've been trying to convince him to do so, and he came oh so close recently.
He's been trying to figure out how to make money from it if he Open Sources it. As it is, everyone seems to be totally ignoring it despite its sophistication compared with flex/bison.
Here is how to specify a sample grammar to handle the above statement with a new product called LRSTAR. Note that this grammar has no conflicts and is LALR(1). Here is the grammar: