This is a really long and informative article, but I would propose a change to the title here: "Generics can make your Go code slower" reads like the expected outcome, whereas the conclusion of the article leans more towards "Generics don't always make your code slower". The article also enumerates some good ways to use generics, along with some anti-patterns.
Is it the expected outcome? I was under the initial impression that the author also noted:
> Overall, this may have been a bit of a disappointment to those who expected to use Generics as a powerful option to optimize Go code, as it is done in other systems languages.
where the implementation would smartly inline code and have performance no worse than doing so manually. I quite appreciated the call to attention that there's a non-obvious embedded footgun.
(As a side note, this design choice is quite interesting, and I appreciate the author diving into their breakdown and thoughts on it!)
That's only 99% of the story. :) Having too many specializations of a C++ template can lead to code bloat, which can degrade cache locality, which can degrade performance.
You're definitely right. While it's not a particularly common problem, it does exist; one thing I'd really like to see enter the compiler world is an optimization step to use vtable dispatch (or something akin to Rust's enum_dispatch, since all concrete types should be knowable at compile time) in these cases.
I expect it would require a fair amount of tuning to become useful, but it could be based on something analogous to the function inliner's cost model, along with the number of calls per type. It could possibly be most useful as a PGO-style step, where real-world call frequency with each concrete type is considered.
enum dispatch in Rust is one of my favorite tricks. Most of the time you have a limited number of implementations, and enum dispatch is often more performant and even less limiting (than, say, trait objects).
I'm a huge fan. It's very little work to use, as long as all variants can be known to the author. And as long as you aren't in a situation where uncommon variants drastically inflate the size of your common variants, it's a performance win, often a big one, compared to a boxed trait object.
Even when you have to box a variant to avoid inflating the size of the whole enum, that's still an improvement over a `dyn Trait` - it involves half as much pointer chasing.
It'd be cool to see this added as a compiler optimization - even for cases where the author of an interface can't possibly know all variants (e.g. you have a `pub fn` that accepts a `&dyn MyTrait`), the compiler can.
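Since the article is about Go, here's a rough Go analog of the enum-dispatch idea: when the set of concrete types is closed and known, a type switch lets the compiler make direct (and potentially inlined) calls instead of going through the interface's method table. This is only an illustrative sketch; `Shape`, `Circle`, and `Square` are invented names.

```go
package main

import "fmt"

type Shape interface{ Area() float64 }

type Circle struct{ R float64 }
type Square struct{ S float64 }

func (c Circle) Area() float64 { return 3.14159 * c.R * c.R }
func (s Square) Area() float64 { return s.S * s.S }

// areaDynamic uses ordinary interface dispatch (a vtable-style indirect call).
func areaDynamic(s Shape) float64 { return s.Area() }

// areaSwitch mimics enum dispatch: with a closed set of concrete types,
// each branch makes a direct, inlinable call on the concrete type.
func areaSwitch(s Shape) float64 {
	switch v := s.(type) {
	case Circle:
		return v.Area() // direct call on Circle
	case Square:
		return v.Area() // direct call on Square
	default:
		return s.Area() // fallback: dynamic dispatch
	}
}

func main() {
	fmt.Println(areaDynamic(Square{S: 2}), areaSwitch(Square{S: 2}))
}
```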
That's fair. I guess if you need the functionality in your program, you need the functionality: the codegen approach doesn't matter that much. And like pjmlp said, LTO can make a difference too. Thanks for your thoughts, these kinds of exchanges make me smarter. :)
It's still zero cost compared to what you would have done without them - copy and paste the code.
That's what zero-cost abstraction means - it doesn't mean that whatever you're writing has no cost, it means the abstraction has no extra cost compared to what you would have to do manually without it.
There are no true zero-cost abstractions under all circumstances. In the general case they make things faster, but I've personally made C++ code faster by un-templating it, both to relieve I$ pressure and to let the compiler make smarter optimizations when it has less code to deal with. Optimizer passes effectively have a finite window they can look at, because of the complexity class of a lot of optimizer algorithms.
C++ can suffer negative performance from template bloat in two ways:

1. Templated symbol names are gigantic. This can significantly impact program link and load times, in addition to the inflated binary size.

2. Identical code is duplicated for every type: for example, the methods of std::vector<int> and std::vector<unsigned int> should compile to the same instructions. There are linker flags that allow some deduplication, but those have their own drawbacks. Another trick is to actively use void pointers for the parts of the code that do not need to know the type, allowing one implementation to be reused behind a type-safe, template-based API.
> There are linker flags that allow some deduplication but those have their own drawbacks
As long as you use --icf=safe I don't see any drawback, and most of the time it results in almost identical reductions to --icf=all, since not many real programs compare addresses of functions.
I, along with everyone in the embedded space, have been using separate function sections forever for --gc-sections and I would be very surprised if they really cause any bloat and duplication at runtime. Do you mean bloat for intermediate files?
It may be limited to intermediate files. I assumed the downside would be bigger, since it is not a default and the description mentioned that some things may not be merged as well.
At runtime, maybe (although that's also not 100% true) - but I've seen a big project go from being compiled in 10 minutes in our CI to hours, due to the introduction of large features heavily relying on templates. The fix was installing a k8s cluster to run the Jenkins build jobs distributed on bare-metal nodes; this wasn't exactly zero-cost.
I think his point was that they definitely won't make it faster (more abstraction means more indirection), so the expectation from most (myself included) would be that using them incurs a performance penalty - maybe not directly via their implementation, but via how they end up being used more broadly.
Using templates in C++ can make code faster, though. Because you can write the same routine with more abstraction and less indirection.
I've used C++ templates effectively as a code generator to layer multiple levels of abstractions into completely customized code throughout the abstraction.
> Using templates in C++ can make code faster, though. Because you can write the same routine with more abstraction and less indirection
If we are talking about the same code with generics vs. without, one would expect similar or worse performance, depending on implementation details. Think adding two ints vs. adding two Ts: depending on the implementation of generics, you're either adding indirection or you're not.
If we are talking about leveraging generics to write different code that is more efficient, code that is perhaps infeasible without generics, then yes, totally get what you are saying. I, and I think parent, were referencing the former however, which is maybe not the most helpful way of comparing things :)
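To make the "add two ints vs. add two Ts" comparison concrete, here is a hedged Go sketch (function names are invented). Whether the generic version costs anything extra over the concrete one depends entirely on how generics are implemented:

```go
package main

import "fmt"

// addInts is the concrete version: it compiles to a plain integer add.
func addInts(a, b int) int { return a + b }

// addT is the generic equivalent. Under full monomorphization (C++ style)
// it compiles to the same instructions as addInts. Under Go's gcshape
// stenciling, value types like int still get their own specialization,
// while pointer-shaped types share one copy and can pay dictionary
// indirection - the footgun the article describes.
func addT[T int | int64 | float64](a, b T) T { return a + b }

func main() {
	fmt.Println(addInts(2, 3), addT(2, 3), addT(2.5, 0.5))
}
```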
> I've used C++ templates effectively as a code generator to layer multiple levels of abstractions into completely customized code throughout the abstraction.
Yeah, I've done the same to inline matrix operations for lidar data processing. Templates are pretty neat since they are completely expanded at compile time. I've yet to look into the implementation details of Go's generics, but since Go has had code generation built in for a while, and it creates static binaries, I imagine it is a very similar system.
EDIT: After reading the part of the post that goes into detail on Go's implementation of generics, it is very similar, but differs when there is indirection on the input types.
We don't know of a way to implement generic types without (vtable dispatch + boxing) cost AND without monomorphization cost. Some languages do the former, some the latter, and some a combination of the two.

Monomorphization:

* code bloat

* slow compiles

* debug builds may be slow (especially C++)

Dynamic dispatch & boxing (usually both are needed):

* not zero cost
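In Go terms, the two strategies sit side by side in the language itself: an interface parameter buys boxing plus dynamic dispatch, while a generic parameter buys per-gcshape specialization at the cost of extra generated code. A small sketch (the `ID` type and `describe*` functions are made up for illustration):

```go
package main

import "fmt"

type Stringer interface{ String() string }

type ID int

func (i ID) String() string { return fmt.Sprintf("id-%d", int(i)) }

// describeBoxed takes an interface value: the argument is boxed and the
// String call goes through dynamic (vtable-style) dispatch.
func describeBoxed(s Stringer) string { return "boxed: " + s.String() }

// describeGeneric is stenciled per gcshape: for a concrete value type the
// call can be made directly, at the cost of extra generated code.
func describeGeneric[S Stringer](s S) string { return "generic: " + s.String() }

func main() {
	fmt.Println(describeBoxed(ID(7)))
	fmt.Println(describeGeneric(ID(7)))
}
```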
"Zero-cost" in that context refers to runtime performance. It always refers to runtime performance.
And code bloat, as I've said elsewhere, is vastly overblown as a problem. Another commenter pointed out that link-time optimization removes most of the bloat. The rest is customized code that's optimized per-instantiation.
Slow compiles are an issue with C++ templates. They're literally a Turing-complete code-generation language of their own, and they can perform complex calculations at compile time, so yes, they tend to make compiles take longer when you're using them extensively. But the point I was making was about runtime performance. That's also why C++ compilers often perform incremental compilation, which can limit the development-time cost.
Debug builds can simply be slow in C++ with or without templates. C++ templates really don't affect debug build runtime performance in any material fashion; writing the code out customized for each given type should have identical performance to the template-generated version of the code, unless there's some obscure corner case I'm not considering.
Rust has the same problem, although to a lesser extent. Monomorphization works well with judicious use. The C++ STL is not written like that: it depends on countless layers of inlining to work well. Rust libraries aren't much better in this regard.
LTO removes some code bloat, but LTO itself takes more time. Until ThinLTO's summary pass (or the equivalent pass in GCC's WHOPR), at least, middle-end and early IR optimizations still have to happen, and Go wants to avoid that. I think that's a fine design choice. In Go's design, they have decided virtual calls aren't a cost they care about anyway; pre-1.18 Go heavily used interfaces, and that's not going to change.
> writing the code out customized for each given type should have identical performance to the template-generated version of the code
In theory yeah, but templates tend to generate more instantiations than strictly what you'd write by hand.
Interestingly the original title and your proposed title imply, to me, the opposite of what I think they imply to you. This suggestion is really unclear.