What Is the Minimal Set of Optimizations Needed for Zero-Cost Abstraction?

modeless · on Aug 7, 2020

The reason compilation time is a problem is that compiler architecture is stuck in the 1970s, at least for systems programming languages. Compilers make you wait while they repeat the same work over and over again, thousands of times. Trillions, if you count the work globally. How many times has <vector> been parsed over the years?

In a world where compiler architecture had actually advanced, you would download a precompiled binary alongside the source, identical to what you would have produced with a clean build, and that binary would be updated incrementally as you type code. If you made a change affecting a large part of the binary that couldn't be applied immediately, JIT techniques would be used to allow you to run and test the program anyway before it finished compiling.

There is no fundamental reason why anyone should ever have to wait for a compiler. And if you didn't have to wait, then it would free the compiler to spend potentially much more time doing optimizations, actually improving the final binary.

The zapcc project shows a bit of the potential for improvement in build times, though it's just scratching the surface. https://github.com/yrnkrn/zapcc

pas · on Aug 7, 2020

A bit tangential, but

- an ex-Scala core contributor's lament about why compilers have to be incremental: https://www.youtube.com/watch?v=TS1lpKBMkgg&t=42m

- a Rust core team member talking about responsive compilers: https://www.youtube.com/watch?v=N6b44kMS6OM (basically the "low-level" implementation of the talk at the previous link)

sixstringtheory · on Aug 7, 2020

It seems to me that this is what linkers and static/dynamic libs are for. Why does it also then have to exist in the compiler?

I suppose there's the problem of optimization across module boundaries, but if you solve that problem for compilers, might as well solve it for linkers.

What you're describing sounds like my current best practice for developing iOS apps. I have multiple static libraries for separate logical code units that don't have to be recompiled when unrelated code is changed. This is for C, C++, ObjC and Swift. All iOS package managers have the option to download compiled binaries, as well.

api · on Aug 7, 2020

There's a lot of much lower hanging fruit. As you state: there is no reason the source has to be parsed from scratch every single time. A little memo-ization of intermediate data structures when the underlying source has not changed could go a long way.

AFAIK Microsoft's toolchains do this but they don't do it well. It's a frequent source of headaches.

benrbray · on Aug 7, 2020

I think this is an area where practice has lagged quite far behind theory (for entirely reasonable practical reasons). I've seen quite a few papers relating to incremental parsing, compilation, etc. (e.g. [1,2]) but afaik very few build systems actually take advantage.

[1] https://arxiv.org/abs/1312.0658 [2] https://blog.functorial.com/posts/2018-04-08-Incrementally-I...

sumtechguy · on Aug 7, 2020

In some ways it can be the language fault too

#define TURNONFEAT 1 #include <xyz.h>

or #define TURNONFEAT 0 #include <xyz.h>

That can create wildly different includes. The upcoming modules feature in C++ may change that. But unfortunately at this point we get to live with a recompile.

Pre compiled headers took some of the sting out of it. It would be nice if they could say 'in this case at this point these flags existed and I used them I will hold onto that for this header'. That way you compile it once and the rest of the project can change things in and out. The pch way seems more 'i know I will not change the flags ever'. It was a good start. Maybe a new object type and it could even live on between compiles?

mshockwave · on Aug 7, 2020

dunno why you got downvote, I thought your point was pretty legit. IMAO, we might even need to change common binary file formats (e.g. ELF, COFF) to accommodate this sort of feature

mncharity · on Aug 7, 2020

In addition to improved local compilation, the constraint space for compiler design is also changed by cloud compilation. Distributed deterministic compilation like [1] would permit community-level caching. Distribution reduces the importance of peak computes - if prompt compilation requires a hundred cores, that could be ok. Community caching reduces the importance of worst-case performance - if some rich type analysis of a standard library takes days, that might be tolerable if it only needs to happen once. If optimization decisions can become a discrete things-on-the-side, like current execution traces for jit, then there's a new opportunity for human-compiler collaboration - "lay out the data in memory like so, when the cpu cache has this shape". I'm looking forward to the ferment of hobby languages starting to explore this new space. What would you do differently in designing a language, if the language community was served by a single giant instance of a compiler? How might that change the economics of language development? PLAAS?

[1] https://github.com/StanfordSNR/gg Sort of checksums-and-argument-lists to make gcc deterministic, as for a cache, but farmed out so `make -j100` runs on amazon Lambda.

greggman3 · on Aug 8, 2020

You lost me.... <vector> isn't executable code. It's generic instructions for how to generate new code given a type and at least in C++ that code is not and can not be shared because what is generated depends on the type itself.

cjfd · on Aug 8, 2020

Well if vector<SomeType> is used in many cpp files it is compiled separately for each and every one of them and then the linker throws away these gigantic amounts of code duplication between the various object files. Also vector<shared_ptr<T>> really only needs to be compiled once for all variations of T. I don't think much or even anything is to be gained by having copies of this code in the executable for every T.

pjmlp · on Aug 8, 2020

Only if not using external templates, or there is a missing declaration for SomeType.

geocar · on Aug 8, 2020

When a C++ compiler sees a statement like template<class T>... it updates its internal state to be able to recognise the constructions this statement is intended to enable.

If you are able (somehow) to encapsulate this state, and save and load it faster than you can regenerate it, that’s a win. You can call it caching or memoisation or “incremental compilation” if you really enjoy that sort of thing, but this is such a basic concept that if you cannot understand it, you are going to have a lot of other problems.

msla · on Aug 8, 2020

> If you are able (somehow) to encapsulate this state, and save and load it faster than you can regenerate it, that’s a win. You can call it caching or memoisation or “incremental compilation” if you really enjoy that sort of thing, but this is such a basic concept that if you cannot understand it, you are going to have a lot of other problems.

If you're going to be condescending like that, it would be nice if you could provide a concrete example of how to incrementally compile vector<T> in the general case in such a way that it's useful and reasonable to redistribute (that is, not tied to a specific architecture or compiler) instead of just waving your hands.

geocar · on Aug 10, 2020

Yes, it would be nice for you if I did things for you for free, however it would not be nice for me.

If you're writing a compiler and want specific advice about how to do this, by all means reach out via email and we'll work something out.

msla · on Aug 21, 2020

You've yet to demonstrate you have the knowledge to do anything useful.

qznc · on Aug 7, 2020

So the image-based persistence of Smalltalk and Common Lisp should be revived? I don't know what the downsides are.

TeMPOraL · on Aug 8, 2020

My experience from Common Lisp development:

The main downside is that it's easy to make your image out of sync with your code. That is, the state of the image during development (global variables and their values, functions, classes, environment settings) may end up different than what you'd have if you restarted the image and reloaded the code into it - which can lead to an unexpected cascade of bugs the next time you restart the image.

In practice, you can guard against it with unit and integration tests, as well as just reading warnings the compiler gives you.

Still, I don't think being image-based will help C++ as long as templates work the way they do - templates are code generators, so each time you change one, you have to recompile all its potential users. The way CL implementations tend to work, "building the image" is similar to C++'s linking process - i.e. whenever a Lisp file is compiled, the compilation output is cached, so it can be loaded quicker if the file didn't change. But if you change a widely-used macro, you end up having to recompile almost everything from scratch, which is similar to C++ and changes to header files.

pjmlp · on Aug 8, 2020

Yes, that is what IDEs actually make with their AST databases, and what Energize C++ and VC++ v4.0 tried to bring into mainstream.

klyrs · on Aug 8, 2020

> and that binary would be updated incrementally as you type code

Are you joking here? While I'm typing is the worst time to compile code. Much of the time, my code contains syntax errors because, for example, I'm in the middle of writing a line of code. I even write such "bad" code to the filesystem several times before deciding to compile, so I'd be quite annoyed if my computer churned on every file-save.

salawat · on Aug 8, 2020

You seem to be suffering from magical thinking.

Where does this blob come from, how is it that it can exist when you haven't written the code yet?

I know that the Internet makes it seem as if data access is automatic and magic; however, fundamentally, computing is a physical process. Bits don't shuffle by magic, and they can't even be targeted for download to avoid compiling before compiling and checking the result against some sort of sample or list for whether or not the chunk of code already exists. No avoidance of compilation achieved.

Compilation/interpretation is not something you're ever going to make go away unless you like programming with straight 0's and 1's.

doonesbury · on Aug 7, 2020

Maybe making precompiled headers for all the boost/std stuff is needed? That is, it's around but not often used is my point.

pjmlp · on Aug 8, 2020

It is used a lot in Windows compilers since Windows 3.0 days.

Apparently other platforms never bothered that much into making work properly.

pjmlp · on Aug 8, 2020

The exception being Ada compilers, but then again most don't get to use them.

platz · on Aug 8, 2020

also sounds like Unison

amelius · on Aug 7, 2020

One problem is that Rust and C++ are not the best languages to build incremental compilers in, because these languages are not purely functional and on top of that the developer has to think about memory management all the time (in Rust the type-system limits the developer because of memory management which is more or less the same thing).

secondcoming · on Aug 7, 2020

This argument makes no sense

heavenlyblue · on Aug 8, 2020

It only doesn’t make sense because the author forgot to mention monads!

haimez · on Aug 7, 2020

GPT-3

dropit_sphere · on Aug 8, 2020

I suspect "Please do not imply that another commenter is a neural net" will be one of the next items added to HN's guidelines.

fanf2 · on Aug 7, 2020

My list would be:

- Inlining

- Common subexpression elimination

- Dead code elimination

- Monomorphization

I get the impression that 2 and 3 come out of an SSA middle end fairly straightforwardly, along with the article’s copy propagation and dead store elimination.

The biggie is monomorphization, which is part of the language in C++. The alternative is uniform representation of generic values, which implies boxing and indirection instead. I understand Rust does not hide representations enough for the compiler to be able to choose whether or not to monomorphize. Interestingly, the Go generics discussion seems to be heading towards a design where the compiler does have this option.

mshockwave · on Aug 8, 2020

+1 for Monomorphization. Though more and more code base try to avoid virtual dispatching by either not using it at all or using template tricks instead

pornel · on Aug 8, 2020

OTOH monomorphisation multiplies amount of code to optimize. For large functions that is problematic.

Rust's standard library uses a pattern (written manually) of "hoisting" generic code out of large functions, e.g.

    fn generic<P: AsRef<Path>>(path: P) {
       non_generic(path.as_ref());
    }

    fn non_generic(path: &Path) {
      // lots of code
    }

That's cheaper to compile and reduces binary size compared to having one large generic function monomorphized for every AsRef<Path> type.

It'd be great to have this as a compiler optimization to automatically have "half-monomorphic" functions.

matt_d · on Aug 8, 2020

FWIW, this technique has been known in the C++ community as "hoisting", too (often applied to templates, particularly container class templates, but also smart pointers); cf. "Designing and Coding Reusable C++" by Carroll and Ellis, 1995 (http://cpptips.com/hoisting).

"Given this, what techniques can be used to reduce template instantiation time? One technique is called "hoisting." This is a generalization of the "wrappers for pointer containers" technique that showed up as a tip within the past week or so. The idea is quite simple: when writing a template class, consider each method in turn. If the method does not depend upon the template parameters, split the template class into a non-template base and a template derived class. Move (hoist) these parameter independent methods up into the base class. Note that with experience, you will begin to recognize opportunities for "hoisting" even in cases where methods initially appear to depend upon template parameters."

See also "thin template" (demonstrating the technique for containers, together with a Symbian OS adoption example, https://en.wikibooks.org/wiki/More_C%2B%2B_Idioms/Thin_Templ...) as well as "Minimizing Dependencies within Generic Classes for Faster and Smaller Programs", Dan Tsafrir, Robert W. Wisniewski, David F. Bacon, and Bjarne Stroustrup (ACM OOPSLA 2009). https://www.stroustrup.com/SCARY.pdf

"Reducing bloat by replacing inner classes with aliases can be further generalized to also apply to member methods of generic classes, which, like nested types, might uselessly depend on certain type parameters simply because they reside within a generic class’s scope. (Again, causing the compiler to uselessly generate many identical or nearly-identical instantiations of the same method.) To solve this problem we propose a “generalized hoisting” design paradigm, which decomposes a generic class into a hierarchy that eliminates unneeded dependencies."

cesarb · on Aug 8, 2020

I have seen a similar optimization in GHC, where it's called "worker-wrapper", and used to reduce the cost of Haskell allocating everything on the heap. It splits each function into a "wrapper" which unboxes all arguments and boxes the return value, and a "worker" which actually does all the work on the unboxed values. The "wrapper" can then be inlined, which allows the compiler to remove unnecessary boxing (boxing a value only to immediately unbox it again).

heavenlyblue · on Aug 8, 2020

Shouldn’t that be pretty easy to automate in the compiler?

davidcuddeback · on Aug 7, 2020

> I understand Rust does not hide representations enough for the compiler to be able to choose whether or not to monomorphize.

Any chance you could expand on that?

fanf2 · on Aug 7, 2020

I don't know enough Rust to explain clearly or give examples :-) But like C++, Rust makes it explicit whether an object is boxed or not and the compiler doesn't get a choice about the memory layout, so it can't switch between monomorphization or uniform representation depending on the optimization level.

necubi · on Aug 7, 2020

Yeah in Rust it's not up to the compiler whether generic code is monomorphized or dynamic dispatched, it's part of the language.

A monomorphized function looks like this:

    fn f1<T: Display>(v: &T) -> String {
        format!("{}", v)
    }

A dynamic dispatched function looks like:

    fn f2(v: &dyn Display) -> String {
        format!("{}", v)
    }

Note that we've added the dyn keyword, which tells rust to treat this as a trait object and use dynamic dispatch.

This is actually separate from boxing (these examples are just using normal unboxed references), although typically you would use trait objects with boxing as unboxed trait objects are awkward to work with.

See https://godbolt.org/z/PvM3ox for an example of the compilation: we get two versions of f1 specialized for u32 and u64, but just one for f2.

davidcuddeback · on Aug 7, 2020

I'm so used to choosing between monomorphization and dynamic dispatch myself that I never thought of the possibility of a compiler being able to choose. Is the point to avoid monomorphization during dev to have faster compile times, and then enable monomorphization for release builds? That's a neat idea.

That makes me wonder what Java does these days. I haven't used Java much since about v1.6. Java took a different approach than C++ and Rust. I remember when generics were introduced, they were basically syntactic sugar for type casts (for backwards compatibility). That implies that Java at the time didn't monomorphize generics. I think the language has evolved a lot since those days, so I wonder if it does any monomorphization today.

kjeetgill · on Aug 8, 2020

Java is slightly more complicated. It can make assumptions about the code (even if it can't prove they always hold!) and specialize some things but not others.

Generally, the jit can specialize the code making assumptions about the real type under interfaces/polymorphism, which types the generic parameters will be, or even which target virtual dispatch will always hit.

But for the objects themselves, it never specializes layout in any way (short of removing an allocation entirely).

... I think. I'm no expert in this, ask chrisseaton.

toolslive · on Aug 7, 2020

- Should there be a reference to C++/Rust in the title ?

> Tail calls: I can't think of a Rust or C++ abstraction that relies on TCO to be zero-cost.

This is an interesting one. Doesn't the absence of TCO mean that there are a lot of (recursive) abstractions that cannot ever be zero cost ? First thing that comes to mind is processing of composites (for example serialization of a protobuf type definition). There are others.

roca · on Aug 7, 2020

In my experience, Rust and C++ developers seldom use tail recursion with the expectation that it's zero-cost. In almost all cases it's much more natural to use a loop in those languages.

swsieber · on Aug 7, 2020

Can't one fake TCO by hoisting things out into a loop?

> Tail calls: I can't think of a Rust or C++ abstraction that relies on TCO to be zero-cost.

Rust doesn't have TCO, so it's hard to rely on something that doesn't exist (unless it's on nightly and I missed it).

steveklabnik · on Aug 7, 2020

Rust doesn't guarantee TCO, but LLVM may decide to do it if it wishes.

Gibbon1 · on Aug 7, 2020

Lack of a guarantee says you should use explicit loops.

majewsky · on Aug 7, 2020

That's assuming the tail call is visible in human-level source code. You could also have a situation where A tail-calls B, B tail-calls C and C tail-calls A. If B and C get inlined, you now have the opportunity to convert A's self-tail-call into a loop. Refactoring A manually may not be desirable, e.g. if inlining B and C cause significant code duplication, or if they live in entirely different libraries.

heavenlyblue · on Aug 8, 2020

This is obviously a mistake if that happened, are you proposing for the compilers to “fix” this bug automatically until someone changes one of these functions to not do a tail call and it explodes out of the blue?

majewsky · on Aug 9, 2020

Why would that be a mistake?

steveklabnik · on Aug 8, 2020

I agree. Not sure why you're downvoted.

nicoburns · on Aug 7, 2020

Yeah, I thought this was a bizarre one to exclude. Especially as I can't imagine it has a particularly large effect on compile time. At least not if it's enabled by a keyword.

toolslive · on Aug 8, 2020

What I think happened is rationalization: the early versions of the rust compiler omitted TCO. "It doesn't fit with the rest of it. We might revisit this later". Later, it turned out too difficult to retrofit, and now the stance is "you don't need it".

https://dev.to/seanchen1991/the-story-of-tail-call-optimizat...

CyberDildonics · on Aug 8, 2020

Why use recursion for iteration if the language already makes iteration easy?

Kinrany · on Aug 7, 2020

Aggressive inlining keeps coming up.

There's Spiral [0]: a language where inlining is the default and also a first-class concept, much like immutability in new popular languages like Rust.

As a bonus, the commit messages contain a whole journal worth of notes.

[0]: https://github.com/mrakgr/The-Spiral-Language

Animats · on Aug 8, 2020

The expensive optimizations are the ones that require non-local analysis.

Register allocation is classically a win for compile speed. It takes less compiler work to put values in registers than to emit all the loads and stores and pushes and pops for putting them on the stack.

blaisio · on Aug 7, 2020

Looking at it from this angle is a very interesting idea. Languages like Go have to avoid a lot of abstractions to avoid increasing compile time. If we can narrow down the specific optimizations required, we can make better decisions about the abstractions that could be supported.

proverbialbunny · on Aug 7, 2020

Speaking of zero-cost abstractions. How many languages have this feature? Do only C++ and Rust support zero-cost abstractions?

benrbray · on Aug 8, 2020

I don't know how truly "zero-cost" the compiler implementations are, but the premise of functional programming languages like Haskell seems to be that programmers should be able to freely think in abstractions like map, fold, monads, etc.. without worrying about performance.

proverbialbunny · on Aug 8, 2020

The benefit of zero cost is creating types that are not native, yet have the speed of native types.

Eg, I do a lot of financial programming, so I need a decimal type, which is like a float, but base 10 so math with no floating point errors.

When I turn on compiler optimizations on zero cost languages, I get roughly a 15-20x speedup, depending on what it is. Interestingly, languages without zero cost abstractions, but are still quite fast, eg Java, run at the speed of the zero cost languages with optimizations turned off.

Zero cost is not helpful in most use cases, but in my situation it gives quite a large speed boost. Sadly, I feel like I'm locked in to using Rust (or C++) and don't have much of a choice. I suspect Haskell is going to be quite a bit slower for creating custom types unfortunately.

benrbray · on Aug 8, 2020

Interesting, thanks for your perspective! Regarding speed, I suspect there might be more to the story since so many finance firms [1] are known to use Haskell internally. Perhaps they're not using it for anything speed-critical, though.

[1] https://wiki.haskell.org/Haskell_in_industry

proverbialbunny · on Aug 10, 2020

Keep in mind finance is not quantitative finance. (edit: I just realized I didn't specify above. I'm doing quant work.)

In finance (and quant work) you don't want a mistake. You want what you code to have provability, so you know what you write is what you get. This is super important. Speed in pure finance is not important.

In quantitative finance the common practice today is companies roll out their own programming language that has provability like Haskell, but is closer to the speed of C++. These languages tend to be a big closer to lisp than Haskell, but it depends from company to company.

andoma · on Aug 7, 2020

gcc have -Og which turns on optimization steps that can be enabled without sacrificing debugability.

slavik81 · on Aug 7, 2020

In practice, I find I usually have to switch to -O0, as nearly all my variables are <optimized out> with -Og.

pm215 · on Aug 7, 2020

That's been my experience too. Unfortunately the gcc devs essentially poisoned their own well there -- I'm now much less likely to want to go back and try -Og because of my initial bad experience. If they'd started with "always good debug and very few optimisations" and only added in optimisations when they didn't break the debug illusion then the user-experience would have been much better (always at least as good as -O0 when debugging, and gradually gets faster as you use newer gcc versions).

ndesaulniers · on Aug 7, 2020

> actually configure the Rust compiler with a promising set of optimizations and evaluate the results.

LLVM lets you configure an optimization pipeline of passes. I'd hack up my own -O flag and pipeline. It would be cool to see these suggestions tested out.

steveklabnik · on Aug 7, 2020

You don't need code changes to try this out, see

* https://doc.rust-lang.org/stable/rustc/codegen-options/index...

ndesaulniers · on Aug 8, 2020

Oh cool! Happy to see you land on your feet btw, new gig looks cool!

steveklabnik · on Aug 9, 2020

Thank you! It's been really really great :)

ncmncm · on Aug 7, 2020

I think common sub-expression elimination is a part of hoisting out of loops, which seems pretty important.

I have seen Gcc dismantle one loop into two adjacent loops, the first to pre-compute a table for the second loop. My jaw was on the floor.

roca · on Aug 7, 2020

Can you give an example where CSE/hoisting is necessary to make an abstraction zero-cost? I mean: give an example of a zero-cost abstraction that takes some hand-written code and packages it up nicely, and then show that the compiler can't compile it back down to the hand-written code without CSE/hoisting.

fanf2 · on Aug 8, 2020

The usual example is something like a safe array where the bounds are checked on every access. The bounds check is a common subexpression in code that accesses the array more than once. If the abstraction includes a safety check on access that doesn’t depend on the loop variable (eg if empty arrays are NULL) then that can be hoisted.

brandmeyer · on Aug 7, 2020

"copy propagation" sounds like he's got two things mixed up: Copy elision and constant propagation. Each one of these are important for zero-cost abstractions, IMO.

roca · on Aug 7, 2020

I mean this: https://en.wikipedia.org/wiki/Copy_propagation

brandmeyer · on Aug 11, 2020

With respect to zero-cost abstractions, its copy elision that matters. operator= is just another function in C++, for example. Only in a handful of specific circumstances is the compiler allowed to elide calls to operator=.