I've worked with a sufficiently smart compiler before: the HLSL (shader) compiler for the Xbox 360. It truly is a grand experience. The general recommendation was to not try to outsmart the compiler. Write everything as straightforwardly as possible. Just watch out for work done on values that influence the result but aren't really necessary (e.g., multiplying by an input that will always be 1).
There were several occasions where I'd write out some elegant math and think "Yeah, that's pretty. But, it would run 30% faster if I transposed everything. But, then I'd have to hand-interleave the internals of all these functions..." Then I'd check the assembly output and Bam! the compiler was already producing exactly what I had in mind!
The downside was that it took several minutes to compile <100-line sources that produce <1000-instruction programs. That's with a language completely specialized around small-scale linear algebra that maps very directly to hardware completely specialized around small-scale linear algebra; with branching highly discouraged, all function calls expected to inline away completely, no recursion, no pointers, no heap, no imports, tiny-if-any headers, and a tiny standard lib.
It's my understanding that most of the smartness of the 360's HLSL compiler came from using O(n^3) analysis algorithms that were simply impractical at a larger scale. Our game had thousands of shader programs that were mechanically specialized for different situations. As a result, recompiling all of them after changing the tiny shared header took many hours. "Thankfully" the C++ of the game took so long to compile that we were already using Xoreax Incredibuild to distribute the compilation. We adapted that setup to distribute shader compilation and brought the turnaround time down to 15 minutes by leeching cycles from several dozen of my co-workers' machines.
You can do pretty well for analyses where n is the number of variables instead of the number of program points / expressions. In those cases, the typical trick is to shorten the height of the tracked lattice, as you only get in trouble in a few places.
To make that concrete, control-flow analysis (which answers "which possible functions could I call through this variable?") is cubic in general. But if you cut it off to tracking only N separate functions and then give up to ANY when you hit that N, you can cut off the few bad cases (e.g., all of the bindings to the function passed to map, which you will probably just inline later anyway) that generate the cubic behavior and keep your execution time well-behaved, even for whole-program compilers, at a very small loss of precision in practice. A rough sketch of that cutoff follows.
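Here is a minimal sketch in C (my illustration, not from the original comment; the names and the value of N are made up) of a bounded callee set that widens to ANY once it overflows:

    /* A bounded "possible callees" set: track at most MAX_TRACKED distinct
       functions per variable, and widen to ANY (top of the lattice) past that. */
    #include <stdbool.h>
    #include <stddef.h>

    #define MAX_TRACKED 4                  /* the "N" from the comment above */

    typedef struct {
        bool any;                          /* widened: could be any function */
        size_t count;
        const void *funcs[MAX_TRACKED];    /* distinct callees seen (e.g., IR nodes) */
    } callee_set;

    /* Join one more possible callee into the set, widening on overflow. */
    void callee_set_add(callee_set *s, const void *fn) {
        if (s->any)
            return;                        /* already at the top of the lattice */
        for (size_t i = 0; i < s->count; i++)
            if (s->funcs[i] == fn)
                return;                    /* already tracked */
        if (s->count == MAX_TRACKED) {
            s->any = true;                 /* give up: bounded lattice height keeps
                                              the analysis cheap, at a small
                                              precision cost */
            s->count = 0;
        } else {
            s->funcs[s->count++] = fn;
        }
    }

Each join is constant-time, which is the point: the analysis can no longer blow up on a variable that ends up bound to hundreds of different functions.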
If your language has pointers or aliases, precise call graphs are exceedingly expensive (for large programs), because you need a precise pointer or alias analysis. Do you have a link to a publication about the technique you describe?
There are meatier technical presentations around about the formal bounds, but I always recommend Manuel Serrano's early work, which is much more accessible than more modern presentations (my own included):
A good survey of many of the tricks used in practice (and the formal techniques, as well) is available in Jan Midtgaard's tome, which was finally published in the last year or so:
> Nice, but part of the reason the C version is slow is that you are using abs instead of fabs. abs takes an int as an argument (this should be a compiler warning!) – passing a double other than zero always returns a large number. The C version is basically iterating till the error is below double’s resolution.
> Excellent point! Making the switch to fabs brings the gcc-inline speed down almost exactly to what Stalin is getting.
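For readers who want to see the pitfall concretely, here is a small C sketch (my illustration, not the article's actual benchmark) of a Newton-style loop where abs vs fabs matters; compile with -lm:

    /* abs() (from <stdlib.h>) takes an int, so a double error term gets
       converted before the comparison; fabs() (from <math.h>) is the correct
       absolute value for doubles. Illustration only. */
    #include <math.h>
    #include <stdio.h>

    double my_sqrt(double x) {
        double guess = x;
        /* With abs() here the termination test would no longer mean
           "relative error below 1e-12"; fabs() keeps it a double comparison. */
        while (fabs(guess * guess - x) > 1e-12 * x)
            guess = 0.5 * (guess + x / guess);
        return guess;
    }

    int main(void) {
        printf("%.15f\n", my_sqrt(2.0));
        return 0;
    }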
Part of the reason compilers don't do really thorough analysis and optimization is that it makes the compiles take a long time.
But we can make that not matter if we rethink the relationship between the compiler and the programmer. What if you got the first version of your binary very quickly, but then you could let the compiler run arbitrarily long, and watch it continue to produce better and better versions?
What if the compiler could compile not just source code, but diffs of source code? It could apply your one-line change without redoing its exhaustive analysis of all the things that you couldn't have affected.
Compilers try hard to avoid whole-program analysis, but I think that's misguided. Spend the cycles, they're getting cheaper all the time. If you can update your analysis incrementally as the programmer works, you're golden.
The diff thing sounds good, but there are many layers of optimizations in between the top and the bottom, and so a small change can cascade into a tremendous number of changes. Also, holding the entire tree of intermediate data, at each step, uses a lot of memory.
That said, check out Julia: for any function, you can see how it looks at each step of compilation, down to assembly: source, optimized source, type-inferred source (for given input types; it is a dynamic language), LLVM IR, and assembly. All visible and introspectable at runtime from either the REPL or scripts.
I am currently working with an in-house language that does only whole-program analysis. If you have to be restricted to one or the other, you want separate compilation; the compile times are just so horrible otherwise. (Optional whole-program-analysis compilation is fine, though.)
Only in the strictest sense. The mere presence of the switches does nothing on its own to improve the workflow. You need a development environment that is aware of why you're running the compiler: basic syntax checking, deeper static analysis using expensive heuristics, unoptimized code generation for quick turnaround time on correctness tests, or optimized code generation for performance testing.
> Doubtless the first difference that jumps out at you is that one of these programs is written using a beautiful, familiar surface syntax that makes you feel right at home, while the other uses an ugly, foreign notation that makes your eyes glaze over.
I love that sentence. A sharp-witted summary of the recent imperative vs. functional discussion.
I love it too. Having grown up on assembly and C, I have no doubt which of these programs makes me feel right at home and which makes my eyes glaze over :)
But OTOH I recently worked on porting some ancient non-modular imperative code to work on newer systems and I have to agree that imperative programs can be evil.
For people who are interested in learning about compiler optimizations (they are a wonderful area of research in their own right!), I'm throwing out a list here so you can google them (if you know more, please reply to this comment! I'd love to learn more about them :D). A small C sketch of a couple of the expression-level ones follows the list.
Expression related optimizations:
Constant Folding,
Constant Propagation,
Global Propagation,
Strength Reduction,
Common Subexpression Elimination,
Partial Redundancy Elimination,
Induction Variable Elimination,
Reassociation
Control flow related optimizations:
Code block re-ordering (by frequency),
Branch prediction (by static analysis/feedback guided),
Code hoisting/sinking (to optimize CPU pipeline),
Automatic Inlining,
Tail Call Optimization
Code generation related optimizations:
Register allocation (NP-complete),
Peephole optimization,
Superoptimization (no one really does this yet, but cool nonetheless)
Analysis (interprocedural/intraprocedural):
Alias Analysis (flow/context (in)sensitive),
Points-to Analysis,
Escape Analysis,
Def-use chain (du-chain) Analysis,
Live Variables Analysis,
Memory Access Pattern Analysis (to guide memory layout optimizations),
Available Expressions/Copies Analysis,
Loop Dependency Analysis,
Control Flow Analysis,
Dominator data flow Analysis,
Globals Analysis
Miscellaneous:
Static Single Assignment,
Data Flow Analysis (in, out, transfer, meet!),
Worklist Algorithm (for Data Flow Analysis!),
Symbolic Execution/Abstract Interpretation
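As promised above, here is a hedged before/after sketch in C (my illustration, not the commenter's) of two of the expression-level entries, constant folding/propagation and strength reduction:

    /* What the programmer writes: */
    int sum_scaled(int n) {
        int scale = 4 * 16;          /* constant folding: 4 * 16 becomes 64 */
        int total = 0;
        for (int i = 0; i < n; i++)
            total += i * scale;      /* strength reduction target: the multiply
                                        can become an addition carried across
                                        iterations */
        return total;
    }

    /* Roughly what the optimizer produces: */
    int sum_scaled_optimized(int n) {
        int total = 0;
        int step = 0;                /* holds i * 64, maintained incrementally */
        for (int i = 0; i < n; i++) {
            total += step;
            step += 64;              /* folded constant; multiply reduced to add */
        }
        return total;
    }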
Writing straightforward code is the best thing you can do for a compiler. It can tell exactly what you want to do and do its best to generate the fastest/most memory-efficient code possible. Help the compiler help you!
Also, compilers usually come with several different levels of optimization, and you can tune each of these optimizations like you would a racing car. Read the manual!
It's a test bed for aggressive control-flow-based optimization for Scheme. It's not a practical production compiler: too slow, and it requires whole-program analysis.
Compilers are getting smarter over time. Better analyses and clever optimization schemes are being designed and implemented. In my opinion though, as a compiler writer, the best route is to make languages easier to analyze. Things like purity and referential transparency (e.g., as seen in Haskell) make code analysis and transformation much simpler. They enable things like pattern-matching type transformations on code. If the compiler has to reason about pointers, invisible side-effects and alias analysis, it greatly limits the kinds of things that can be inferred about a program.
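As a small C-flavored illustration of that last point (my example, not the commenter's): when the compiler cannot prove that pointers don't alias, it has to assume the worst, and restrict is one way to hand it the missing fact.

    /* In sum_twice, out may alias a or b, so the store to out[0] might change
       *a or *b and both must be reloaded for the second line. With restrict the
       programmer promises there is no aliasing, and the value can be reused. */
    void sum_twice(int *a, int *b, int *out) {
        out[0] = *a + *b;
        out[1] = *a + *b;   /* must reload *a and *b after the store above */
    }

    void sum_twice_restrict(int *restrict a, int *restrict b, int *restrict out) {
        out[0] = *a + *b;
        out[1] = *a + *b;   /* compiler may reuse the sum computed above */
    }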
This article was worth it to me if only for the following observation:
And there we have a clue as to why, in modern computing, modular code is not high-performance code. The very advantage of modularity is the source of its performance cost: modularity is the ability to do things other than what you did. Modularity is the ability to reuse the same idea in many different ways and in many different places.
See also Rust's emphasis on zero-cost abstractions [1]. The current compiler doesn't emphasize specialization (though it does specialize generics), but the ability to abstract in source code without leaving any trace in the actual compiled binary is a design goal.
The first encounter I had with automatic differentiation was in a simulation language called PROSE, circa 1974. I never actually used it, but I came across a manual for it, which I found quite fascinating. As I recall, it ran on the CDC 6000 series -- at the time I was doing numerical physics on a 6400, and PROSE was being presented as an alternative to the ubiquitous FORTRAN.
PROSE seems to have been completely forgotten; I can find nothing on it now. But I distinctly recall that it did AD.
Everything in computer science was invented before 1980 :-)
Wow, that's quite a coincidence, but before I get into that, I like the tone on HN much, much better today. There were several language-related posts, all were reasonably well received. None with the frequent hostility that I have come to see on HN in the recent past. No one came here to piss and rain smugly on another's voluntary piece of work with "Repeat after me, no one gives a flying f%$#! about your language, where are the libraries, where are the users. Don't waste your time on pointless exercises".
That said, the coincidence that I mentioned is entirely personal. After a hiatus, I am getting my toes wet again in this language that I find quite interesting. It's a whole-program-analyzed, aggressively inlined, functional (but not purely so), ML-like type-checked and type-inferred language with generics that interacts quite effortlessly with C++, without the need for any FFI-like library. It has coroutines and preemptive threads.
I think of it as doing to C++ what Scala does to Java or F# does to C#. It is really performant and achieves the speed via aggressive inlining, tail calls, whole-program analysis, and what can be called opportunistic but indeterminate laziness. I have not quite grokked its model yet; one thing that I want to get a handle on is being able to reason about what triggers garbage collection and what doesn't. According to the author of the language, garbage collection can be mostly avoided, but I am not yet good at this aspect; it's just been a few days that I have actually used the language (as opposed to reading about it).
It's statically type-checked but feels unusually flexible in one respect: one can move functions and types freely between and across different translation units (for refactoring) without the need for forward declarations, because the linkage semantics are permutation-independent.
Personally, the way I want to write that program involves destructuring. All the cons, car and cdr crap makes my eyes glaze over almost as bad as the js.
More on topic, it seems like at least in the example given, inlining of functions followed by standard optimization would give good results, and I thought we were pretty decent at inlining. Is that just because the example is so simple, and the point is that that ability needs to be more general?
I don't know if the author stopped at that level on purpose. That's the basic SICP abstraction layer. And I believe types and then pattern matching were another abstraction on top of that to get literal symmetry, but to my eyes at this point they're two faces of the same coin.
I agree.
I really like Racket because of its powerful match form, which can do more than Haskell's pattern matching, for comparison. It's like very powerful ViewPatterns.
Well, inlining plus subsequent optimization is exactly what he's talking about. He calls it "specialization." As soon as you start to do that really aggressively, you run into all the issues he brings up.
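A hedged C sketch (my example, not the article's) of what inlining plus subsequent optimization buys: once a generic helper is inlined at a call site with known arguments, constant propagation and dead-code elimination specialize it, which also covers the "multiplying by an input that will always be 1" case mentioned at the top of the thread.

    /* A generic helper: applies a scale and an optional offset. */
    static inline double transform(double x, double scale, int add_offset,
                                   double offset) {
        double y = x * scale;
        if (add_offset)
            y += offset;
        return y;
    }

    double caller(double x) {
        /* After inlining, scale == 1.0 and add_offset == 0 are visible to the
           optimizer: the dead branch and (typically) the multiply by 1.0 are
           removed, and the call specializes down to just "return x;". */
        return transform(x, 1.0, 0, 123.0);
    }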
> Doubtless the first difference that jumps out at you is that one of these programs is written using a beautiful, familiar surface syntax that makes you feel right at home, while the other uses an ugly, foreign notation that makes your eyes glaze over
But I'm not sure all of us would agree which program is beautiful/familiar and which is ugly/foreign...
I read that as somewhat tongue-in-cheek. I mean, I'm a lisp fan too, but it's not like the syntax is beautiful. (It's just "not as bad". Or that's what we tell ourselves.)
Well. This made me install Firefox 23, to get asm.js. I got a speed-up of x5 on his Mandelbrot code. Wow.
(I had Firefox 20 before. I checked a few websites and didn't notice any speed-up in general.)
PS: of course you can write modular code in JS too, with a complex lib etc (maybe not quite as nice as scheme). Would it be compiled to be as efficient?
Not unless the compiler did some magic to remove the heap allocations, which IIRC none of the JS engines do yet. LuaJIT, however, will; see http://wiki.luajit.org/Allocation-Sinking-Optimization . I believe the Dart VM does this too, so it might not be far off for JS engines.
Still, other semantics might slow the code down compared to asm.js, but the heap allocations will be the biggest factor.
I know little of the field and nothing of the state of the art, but I wonder how smarter compilers would combine with the statistical analysis that profile-guided optimization, the JVM, and branch-prediction logic do. Do the improvements they bring add up perfectly, or is smarter compilation more effective for dumber CPUs/runtimes?
I'm actually a little disappointed that a mature compiler like javac doesn't seem to have features such as escape analysis or the ability to compile enum types to primitive values (compiling them to a single byte would be enough for most constants). I might be wrong about these specific cases, but my overall impression is that compilers like the main one for Java aren't very adventurous when it comes to compiler optimizations. They sometimes seem pretty primitive, even. It's understandable if optimizations would add a lot of time to the compilation, but I don't see why they can't be added as compiler flags at the least, to be used for shipping software.
I have never hacked on compilers myself, so I might be totally off the mark on all of this. And the specific things I've mentioned might be much harder than they seem at a cursory glance. But this post is more of an inquiry into the matter than a definitive statement about the status quo.
Keep in mind that javac is only half of the compiler, the JVM is the other half. I'm not sure you need escape analysis when you compile down to JVM bytecode. Could you show an example?
I'm talking about stuff like having a method where you make an object - let's say comprising two ints, a coordinate - and the compiler figures out whether it can instead solve the same problem with, say, just two stack-allocated ints instead of one heap-allocated object. Maybe the programmer wanted to make an object because it looked more readable than making two ints.
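A hedged C analogue of that case (my illustration; in Java the pair would come from new, here the heap allocation is explicit, but the idea is the same): if the object never escapes the method, escape analysis lets the compiler replace it with plain locals (scalar replacement).

    #include <stdlib.h>

    struct coord { int x; int y; };

    /* As written: heap-allocates a coordinate that never escapes.
       (No NULL check; illustration only.) */
    int manhattan(int x, int y) {
        struct coord *c = malloc(sizeof *c);
        c->x = x;
        c->y = y;
        int d = abs(c->x) + abs(c->y);   /* ints, so abs() is the right call here */
        free(c);
        return d;
    }

    /* What escape analysis plus scalar replacement would reduce it to: */
    int manhattan_scalar(int x, int y) {
        int cx = x, cy = y;              /* the "object" is now two stack values */
        return abs(cx) + abs(cy);
    }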