I've worked with a sufficiently smart compiler before: the HLSL (shader) compiler for the Xbox 360. It truly is a grand experience. The general recommendation was to not try to outsmart the compiler. Write everything as straightforwardly as possible. Just watch out for work done on values that influence the result but aren't really necessary (e.g., multiplying by an input that will always be 1).
There were several occasions where I'd write out some elegant math and think "Yeah, that's pretty. But, it would run 30% faster if I transposed everything. But, then I'd have to hand-interleave the internals of all these functions..." Then I'd check the assembly output and Bam! the compiler was already producing exactly what I had in mind!
The downside was that it took several minutes to compile <100-line sources that produce <1000-instruction programs. That's with a language completely specialized around small-scale linear algebra that maps very directly to hardware completely specialized around small-scale linear algebra; with branching highly discouraged, all function calls expected to inline away completely, no recursion, no pointers, no heap, no imports, tiny-if-any headers, and a tiny standard lib.
It's my understanding that most of the smartness of the 360's HLSL compiler came from using O(n^3) analysis algorithms that were simply impractical at a larger scale. Our game had thousands of shader programs that were mechanically specialized for different situations. As a result, recompiling all of them after changing the tiny shared header took many hours. "Thankfully" the C++ of the game took so long to compile that we were already using Xoreax Incredibuild to distribute the compilation. We adapted that setup to distribute shader compilation and brought the turnaround time down to 15 minutes by leeching cycles from several dozen of my co-workers' machines.
You can do pretty well for analyses where n is the number of variables instead of the number of program points / expressions. In those cases, the typical trick is to shorten the height of the tracked lattice, as you only get in trouble in a few places.
To make that concrete, control-flow analysis (which answers "which possible functions could I call through this variable?") is cubic in general. But if you cut it off to tracking only N separate functions and then give up to ANY when you hit that N, you can cut off the few bad cases (e.g., all of the bindings to the function passed to map, which you will probably just inline later anyway) that generate the cubic behavior and keep your execution time well-behaved, even for whole-program compilers, at a very small loss of precision in practice. A rough sketch of that cutoff follows.
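Here is a minimal sketch in C (my illustration, not from the original comment; the names and the value of N are made up) of a bounded callee set that widens to ANY once it overflows:

    /* A bounded "possible callees" set: track at most MAX_TRACKED distinct
       functions per variable, and widen to ANY (top of the lattice) past that. */
    #include <stdbool.h>
    #include <stddef.h>

    #define MAX_TRACKED 4                  /* the "N" from the comment above */

    typedef struct {
        bool any;                          /* widened: could be any function */
        size_t count;
        const void *funcs[MAX_TRACKED];    /* distinct callees seen (e.g., IR nodes) */
    } callee_set;

    /* Join one more possible callee into the set, widening on overflow. */
    void callee_set_add(callee_set *s, const void *fn) {
        if (s->any)
            return;                        /* already at the top of the lattice */
        for (size_t i = 0; i < s->count; i++)
            if (s->funcs[i] == fn)
                return;                    /* already tracked */
        if (s->count == MAX_TRACKED) {
            s->any = true;                 /* give up: bounded lattice height keeps
                                              the analysis cheap, at a small
                                              precision cost */
            s->count = 0;
        } else {
            s->funcs[s->count++] = fn;
        }
    }

Each join is constant-time, which is the point: the analysis can no longer blow up on a variable that ends up bound to hundreds of different functions.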
If your language has pointers or aliases, precise call graphs are exceedingly expensive (for large programs), because you need a precise pointer or alias analysis. Do you have a link to a publication about the technique you describe?
There are meatier technical presentations around about the formal bounds, but I always recommend Manuel Serrano's early work, which is much more accessible than more modern presentations (my own included):
A good survey of many of the tricks used in practice (and the formal techniques, as well) is available in Jan Midtgaard's tome, which was finally published in the last year or so:
> Nice, but part of the reason the C version is slow is that you are using abs instead of fabs. abs takes an int as an argument (this should be a compiler warning!) – passing a double other than zero always returns a large number. The C version is basically iterating till the error is below double’s resolution.
> Excellent point! Making the switch to fabs brings the gcc-inline speed down almost exactly to what Stalin is getting.
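For readers who want to see the pitfall concretely, here is a small C sketch (my illustration, not the article's actual benchmark) of a Newton-style loop where abs vs fabs matters; compile with -lm:

    /* abs() (from <stdlib.h>) takes an int, so a double error term gets
       converted before the comparison; fabs() (from <math.h>) is the correct
       absolute value for doubles. Illustration only. */
    #include <math.h>
    #include <stdio.h>

    double my_sqrt(double x) {
        double guess = x;
        /* With abs() here the termination test would no longer mean
           "relative error below 1e-12"; fabs() keeps it a double comparison. */
        while (fabs(guess * guess - x) > 1e-12 * x)
            guess = 0.5 * (guess + x / guess);
        return guess;
    }

    int main(void) {
        printf("%.15f\n", my_sqrt(2.0));
        return 0;
    }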
Part of the reason compilers don't do really thorough analysis and optimization is that it makes the compiles take a long time.
But we can make that not matter if we rethink the relationship between the compiler and the programmer. What if you got the first version of your binary very quickly, but then you could let the compiler run arbitrarily long, and watch it continue to produce better and better versions?
What if the compiler could compile not just source code, but diffs of source code? It could apply your one-line change without redoing its exhaustive analysis of all the things that you couldn't have affected.
Compilers try hard to avoid whole-program analysis, but I think that's misguided. Spend the cycles, they're getting cheaper all the time. If you can update your analysis incrementally as the programmer works, you're golden.
The diff thing sounds good, but there are many layers of optimizations in between the top and the bottom, and so a small change can cascade into a tremendous number of changes. Also, holding the entire tree of intermediate data, at each step, uses a lot of memory.
That said, check out Julia: for any function, you can see how it looks at each step of compilation, down to assembly: source, optimized source, type-inferred source (for given input types; it is a dynamic language), LLVM IR, and assembly. All visible and introspectable at runtime from either the REPL or scripts.
I am currently working with an in-house language that does only whole-program analysis. If you have to be restricted to one or the other, you want separate compilation; the compile times are just so horrible otherwise. (Optional whole-program-analysis compilation is fine, though.)
Only in the strictest sense. The mere presence of the switches does nothing on its own to improve the workflow. You need a development environment that is aware of why you're running the compiler: basic syntax checking, deeper static analysis using expensive heuristics, unoptimized code generation for quick turnaround time on correctness tests, or optimized code generation for performance testing.
> Doubtless the first difference that jumps out at you is that one of these programs is written using a beautiful, familiar surface syntax that makes you feel right at home, while the other uses an ugly, foreign notation that makes your eyes glaze over.
I love that sentence. A sharp-witted summary of the recent imperative vs. functional discussion.
I love it too. Having grown up on assembly and C, I have no doubt which of these programs makes me feel right at home and which makes my eyes glaze over :)
But OTOH I recently worked on porting some ancient non-modular imperative code to work on newer systems and I have to agree that imperative programs can be evil.
For people who are interested in learning about compiler optimizations (they are a wonderful area of research in their own right!), I'm throwing out a list here so you can google them (if you know more, please reply to this comment! I'd love to learn more about them :D). A small C sketch of a couple of the expression-level ones follows the list.
Expression related optimizations:
Constant Folding,
Constant Propagation,
Global Propagation,
Strength Reduction,
Common Subexpression Elimination,
Partial Redundancy Elimination,
Induction Variable Elimination,
Reassociation
Control flow related optimizations:
Code block re-ordering (by frequency),
Branch prediction (by static analysis/feedback guided),
Code hoisting/sinking (to optimize CPU pipeline),
Automatic Inlining,
Tail Call Optimization
Code generation related optimizations:
Register allocation (NP-complete),
Peephole optimization,
Superoptimization (no one really does this yet, but cool nonetheless)
Analysis (interprocedural/intraprocedural):
Alias Analysis (flow/context (in)sensitive),
Points-to Analysis,
Escape Analysis,
Def-use chain (du-chain) Analysis,
Live Variables Analysis,
Memory Access Pattern Analysis (to guide memory layout optimizations),
Available Expressions/Copies Analysis,
Loop Dependency Analysis,
Control Flow Analysis,
Dominator data flow Analysis,
Globals Analysis
Miscellaneous:
Static Single Assignment,
Data Flow Analysis (in, out, transfer, meet!),
Worklist Algorithm (for Data Flow Analysis!),
Symbolic Execution/Abstract Interpretation
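As promised above, here is a hedged before/after sketch in C (my illustration, not the commenter's) of two of the expression-level entries, constant folding/propagation and strength reduction:

    /* What the programmer writes: */
    int sum_scaled(int n) {
        int scale = 4 * 16;          /* constant folding: 4 * 16 becomes 64 */
        int total = 0;
        for (int i = 0; i < n; i++)
            total += i * scale;      /* strength reduction target: the multiply
                                        can become an addition carried across
                                        iterations */
        return total;
    }

    /* Roughly what the optimizer produces: */
    int sum_scaled_optimized(int n) {
        int total = 0;
        int step = 0;                /* holds i * 64, maintained incrementally */
        for (int i = 0; i < n; i++) {
            total += step;
            step += 64;              /* folded constant; multiply reduced to add */
        }
        return total;
    }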
Writing straightforward code is the best thing you can do for a compiler. It can tell exactly what you want to do and do its best to generate the fastest/most memory-efficient code possible. Help the compiler help you!
Also, compilers usually come with several different levels of optimization, and you can tune each of these optimizations like you would a racing car. Read the manual!
It's a test bed for aggressive control-flow-based optimization for Scheme. It's not a practical production compiler: too slow, and it requires whole-program analysis.
Compilers are getting smarter over time. Better analyses and clever optimization schemes are being designed and implemented. In my opinion though, as a compiler writer, the best route is to make languages easier to analyze. Things like purity and referential transparency (e.g., as seen in Haskell) make code analysis and transformation much simpler. They enable things like pattern-matching type transformations on code. If the compiler has to reason about pointers, invisible side-effects and alias analysis, it greatly limits the kinds of things that can be inferred about a program.
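As a small C-flavored illustration of that last point (my example, not the commenter's): when the compiler cannot prove that pointers don't alias, it has to assume the worst, and restrict is one way to hand it the missing fact.

    /* In sum_twice, out may alias a or b, so the store to out[0] might change
       *a or *b and both must be reloaded for the second line. With restrict the
       programmer promises there is no aliasing, and the value can be reused. */
    void sum_twice(int *a, int *b, int *out) {
        out[0] = *a + *b;
        out[1] = *a + *b;   /* must reload *a and *b after the store above */
    }

    void sum_twice_restrict(int *restrict a, int *restrict b, int *restrict out) {
        out[0] = *a + *b;
        out[1] = *a + *b;   /* compiler may reuse the sum computed above */
    }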
This article was worth it to me if only for the following observation:
And there we have a clue as to why, in modern computing, modular code is not high-performance code. The very advantage of modularity is the source of its performance cost: modularity is the ability to do things other than what you did. Modularity is the ability to reuse the same idea in many different ways and in many different places.
See also Rust's emphasis on zero-cost abstractions [1]. The current compiler doesn't emphasize specialization (though it does specialize generics), but the ability to abstract in source code without leaving any trace in the actual compiled binary is a design goal.
The first encounter I had with automatic differentiation was in a simulation language called PROSE, circa 1974. I never actually used it, but I came across a manual for it, which I found quite fascinating. As I recall, it ran on the CDC 6000 series -- at the time I was doing numerical physics on a 6400, and PROSE was being presented as an alternative to the ubiquitous FORTRAN.
PROSE seems to have been completely forgotten; I can find nothing on it now. But I distinctly recall that it did AD.
Everything in computer science was invented before 1980 :-)
Wow, that's quite a coincidence, but before I get into that, I like the tone on HN much, much better today. There were several language-related posts, all were reasonably well received. None with the frequent hostility that I have come to see on HN in the recent past. No one came here to piss and rain smugly on another's voluntary piece of work with "Repeat after me, no one gives a flying f%$#! about your language, where are the libraries, where are the users. Don't waste your time on pointless exercises".
That said, the coincidence that I mentioned is entirely personal. After a hiatus, I am getting my toes wet again in this language that I find quite interesting. It's a whole-program-analyzed, aggressively inlined, functional (but not purely so), ML-like type-checked and type-inferred language with generics that interacts quite effortlessly with C++, without the need for any FFI-like library. It has coroutines and preemptive threads.
I think of it as doing to C++ what Scala does to Java or F# does to C#. It is really performant and achieves the speed via aggressive inlining, tail calls, whole-program analysis, and what can be called opportunistic but indeterminate laziness. I have not quite grokked its model yet; one thing that I want to get a handle on is being able to reason about what triggers garbage collection and what doesn't. According to the author of the language, garbage collection can be mostly avoided, but I am not yet good at this aspect; it's just been a few days that I have actually used the language (as opposed to reading about it).
It's statically type-checked but feels unusually flexible in one respect: one can move functions and types freely between and across different translation units (for refactoring) without the need for forward declarations, because the linkage semantics are permutation-independent.
Personally, the way I want to write that program involves destructuring. All the cons, car and cdr crap makes my eyes glaze over almost as bad as the js.
More on topic, it seems like at least in the example given, inlining of functions followed by standard optimization would give good results, and I thought we were pretty decent at inlining. Is that just because the example is so simple, and the point is that that ability needs to be more general?
I don't know if the author stopped at that level on purpose. That's the basic SICP abstraction layer. And I believe types and then pattern matching were another abstraction on top of that to get literal symmetry, but to my eyes at this point they're two faces of the same coin.
I agree.
I really like Racket because of its powerful match form, which can do more than Haskell's pattern matching, for comparison. It's like very powerful ViewPatterns.
Well, inlining plus subsequent optimization is exactly what he's talking about. He calls it "specialization." As soon as you start to do that really aggressively, you run into all the issues he brings up.
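A hedged C sketch (my example, not the article's) of what inlining plus subsequent optimization buys: once a generic helper is inlined at a call site with known arguments, constant propagation and dead-code elimination specialize it, which also covers the "multiplying by an input that will always be 1" case mentioned at the top of the thread.

    /* A generic helper: applies a scale and an optional offset. */
    static inline double transform(double x, double scale, int add_offset,
                                   double offset) {
        double y = x * scale;
        if (add_offset)
            y += offset;
        return y;
    }

    double caller(double x) {
        /* After inlining, scale == 1.0 and add_offset == 0 are visible to the
           optimizer: the dead branch and (typically) the multiply by 1.0 are
           removed, and the call specializes down to just "return x;". */
        return transform(x, 1.0, 0, 123.0);
    }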
> Doubtless the first difference that jumps out at you is that one of these programs is written using a beautiful, familiar surface syntax that makes you feel right at home, while the other uses an ugly, foreign notation that makes your eyes glaze over
But I'm not sure all of us would agree which program is beautiful/familiar and which is ugly/foreign...
I read that as somewhat tongue-in-cheek. I mean, I'm a lisp fan too, but it's not like the syntax is beautiful. (It's just "not as bad". Or that's what we tell ourselves.)
Well. This made me install Firefox 23, to get asm.js. I got a speed-up of x5 on his Mandelbrot code. Wow.
(I had Firefox 20 before. I checked a few websites and didn't notice any speed-up in general.)
PS: of course you can write modular code in JS too, with a complex lib etc (maybe not quite as nice as scheme). Would it be compiled to be as efficient?
Not unless the compiler did some magic to remove the heap allocations, which IIRC none of the JS engines do yet. LuaJIT, however, will; see http://wiki.luajit.org/Allocation-Sinking-Optimization . I believe the Dart VM does this too, so it might not be far off for JS engines.
Still, other semantics might slow the code down compared to asm.js, but the heap allocations will be the biggest factor.
I know little of the field and nothing of the state of the art, but I wonder how smarter compilers would combine with the statistical analysis that profile-guided optimization, the JVM, and branch-prediction logic do. Do the improvements they bring add up perfectly, or is smarter compilation more effective for dumber CPUs/runtimes?
I'm actually a little disappointed that a mature compiler like javac doesn't seem to have features such as escape analysis or the ability to compile enum types to primitive values (compiling them to a single byte would be enough for most constants). I might be wrong about these specific cases, but my overall impression is that compilers like the main one for Java aren't very adventurous when it comes to compiler optimizations. They sometimes seem pretty primitive, even. It's understandable if optimizations would add a lot of time to the compilation, but I don't see why they can't be added as compiler flags at the least, to be used for shipping software.
I have never hacked on compilers myself, so I might be totally off the mark on all of this. And the specific things I've mentioned might be much harder than they seem at a cursory glance. But this post is more of an inquiry into the matter than a definitive statement about the status quo.
Keep in mind that javac is only half of the compiler, the JVM is the other half. I'm not sure you need escape analysis when you compile down to JVM bytecode. Could you show an example?
I'm talking about stuff like having a method where you make an object - let's say comprising two ints, a coordinate - and the compiler figures out whether it can instead solve the same problem with, say, just two stack-allocated ints instead of one heap-allocated object. Maybe the programmer wanted to make an object because it looked more readable than making two ints.
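A hedged C analogue of that case (my illustration; in Java the pair would come from new, here the heap allocation is explicit, but the idea is the same): if the object never escapes the method, escape analysis lets the compiler replace it with plain locals (scalar replacement).

    #include <stdlib.h>

    struct coord { int x; int y; };

    /* As written: heap-allocates a coordinate that never escapes.
       (No NULL check; illustration only.) */
    int manhattan(int x, int y) {
        struct coord *c = malloc(sizeof *c);
        c->x = x;
        c->y = y;
        int d = abs(c->x) + abs(c->y);   /* ints, so abs() is the right call here */
        free(c);
        return d;
    }

    /* What escape analysis plus scalar replacement would reduce it to: */
    int manhattan_scalar(int x, int y) {
        int cx = x, cy = y;              /* the "object" is now two stack values */
        return abs(cx) + abs(cy);
    }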