This is a really long and informative article, but I would propose a change to the title here: "Generics can make your Go code slower" reads like the expected outcome, whereas the conclusion of the article leans more towards "Generics don't always make your code slower". The article also enumerates some good ways to use generics, along with some anti-patterns.
Is it the expected outcome? I was under the initial impression that the author also noted:
> Overall, this may have been a bit of a disappointment to those who expected to use Generics as a powerful option to optimize Go code, as it is done in other systems languages.
where the implementation would smartly inline code and have performance no worse than doing so manually. I quite appreciated the call to attention that there's a non-obvious embedded footgun.
(As a side note, this design choice is quite interesting, and I appreciate the author diving into their breakdown and thoughts on it!)
That's only 99% of the story. :) Having too many specializations of a C++ template can lead to code bloat, which can degrade cache locality, which can degrade performance.
You're definitely right. While it's not a particularly common problem, it does exist; one thing I'd really like to see enter the compiler world is an optimization step to use vtable dispatch (or something akin to Rust's enum_dispatch, since all concrete types should be knowable at compile time) in these cases.
I expect it would require a fair amount of tuning to become useful, but it could be based on something analogous to the function inliner's cost model, along with the number of calls per type. It could possibly be most useful as a PGO-style step, where real-world call frequency with each concrete type is considered.
enum dispatch in Rust is one of my favorite tricks. Most of the time you have a limited number of implementations, and enum dispatch is often more performant and even less limiting (than, say, trait objects).
I'm a huge fan. It's very little work to use, as long as all variants can be known to the author. And as long as you aren't in a situation where uncommon variants drastically inflate the size of your common variants, it's a performance win, often a big one, compared to a boxed trait object.
Even when you have to box a variant to avoid inflating the size of the whole enum, that's still an improvement over a `dyn Trait` - it involves half as much pointer chasing.
It'd be cool to see this added as a compiler optimization - even for cases where the author of an interface can't possibly know all variants (e.g. you have a `pub fn` that accepts a `&dyn MyTrait`), the compiler can.
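Since the article is about Go, here's a rough Go analog of the enum-dispatch idea: when the set of concrete types is closed and known, a type switch lets the compiler make direct (and potentially inlined) calls instead of going through the interface's method table. This is only an illustrative sketch; `Shape`, `Circle`, and `Square` are invented names.

```go
package main

import "fmt"

type Shape interface{ Area() float64 }

type Circle struct{ R float64 }
type Square struct{ S float64 }

func (c Circle) Area() float64 { return 3.14159 * c.R * c.R }
func (s Square) Area() float64 { return s.S * s.S }

// areaDynamic uses ordinary interface dispatch (a vtable-style indirect call).
func areaDynamic(s Shape) float64 { return s.Area() }

// areaSwitch mimics enum dispatch: with a closed set of concrete types,
// each branch makes a direct, inlinable call on the concrete type.
func areaSwitch(s Shape) float64 {
	switch v := s.(type) {
	case Circle:
		return v.Area() // direct call on Circle
	case Square:
		return v.Area() // direct call on Square
	default:
		return s.Area() // fallback: dynamic dispatch
	}
}

func main() {
	fmt.Println(areaDynamic(Square{S: 2}), areaSwitch(Square{S: 2}))
}
```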
That's fair. I guess if you need the functionality in your program, you need the functionality: the codegen approach doesn't matter that much. And like pjmlp said, LTO can make a difference too. Thanks for your thoughts, these kinds of exchanges make me smarter. :)
It's still zero cost compared to what you would have done without them - copy and paste the code.
That's what zero-cost abstraction means - it doesn't mean that whatever you're writing has no cost, it means the abstraction has no extra cost compared to what you would have to do manually without it.
There are no true zero-cost abstractions under all circumstances. In the general case they make things faster, but I've personally made C++ code faster by un-templating it, both to relieve I$ pressure and to let the compiler make smarter optimizations when it has less code to deal with. Optimizer passes effectively have a finite window they can look at, because of the complexity class of a lot of optimizer algorithms.
C++ can suffer negative performance from template bloat in two ways:

1. Templated symbol names are gigantic. This can significantly impact program link and load times, in addition to the inflated binary size.

2. Identical code is duplicated for every type: for example, the methods of std::vector<int> and std::vector<unsigned int> should compile to the same instructions. There are linker flags that allow some deduplication, but those have their own drawbacks. Another trick is to actively use void pointers for the parts of the code that do not need to know the type, allowing one implementation to be reused behind a type-safe, template-based API.
> There are linker flags that allow some deduplication but those have their own drawbacks
As long as you use --icf=safe I don't see any drawback, and most of the time it results in almost identical reductions to --icf=all, since not many real programs compare addresses of functions.
I, along with everyone in the embedded space, have been using separate function sections forever for --gc-sections and I would be very surprised if they really cause any bloat and duplication at runtime. Do you mean bloat for intermediate files?
It may be limited to intermediate files. I assumed the downside would be bigger, since it is not a default and the description mentioned that some things may not be merged as well.
At runtime, maybe (although that's also not 100% true) - but I've seen a big project go from being compiled in 10 minutes in our CI to hours, due to the introduction of large features heavily relying on templates. The fix was installing a k8s cluster to run the Jenkins build jobs distributed on bare-metal nodes; this wasn't exactly zero-cost.
I think his point was that they definitely won't make it faster (more abstraction means more indirection), so the expectation from most (myself included) would be that using them incurs a performance penalty - maybe not directly via their implementation, but via how they end up being used more broadly.
Using templates in C++ can make code faster, though. Because you can write the same routine with more abstraction and less indirection.
I've used C++ templates effectively as a code generator to layer multiple levels of abstractions into completely customized code throughout the abstraction.
> Using templates in C++ can make code faster, though. Because you can write the same routine with more abstraction and less indirection
If we are talking about the same code with generics vs. without, one would expect similar or worse performance, depending on implementation details. Think adding two ints vs. adding two Ts: depending on the implementation of generics, you're either adding indirection or you're not.
If we are talking about leveraging generics to write different code that is more efficient, code that is perhaps infeasible without generics, then yes, totally get what you are saying. I, and I think parent, were referencing the former however, which is maybe not the most helpful way of comparing things :)
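To make the "add two ints vs. add two Ts" comparison concrete, here is a hedged Go sketch (function names are invented). Whether the generic version costs anything extra over the concrete one depends entirely on how generics are implemented:

```go
package main

import "fmt"

// addInts is the concrete version: it compiles to a plain integer add.
func addInts(a, b int) int { return a + b }

// addT is the generic equivalent. Under full monomorphization (C++ style)
// it compiles to the same instructions as addInts. Under Go's gcshape
// stenciling, value types like int still get their own specialization,
// while pointer-shaped types share one copy and can pay dictionary
// indirection - the footgun the article describes.
func addT[T int | int64 | float64](a, b T) T { return a + b }

func main() {
	fmt.Println(addInts(2, 3), addT(2, 3), addT(2.5, 0.5))
}
```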
> I've used C++ templates effectively as a code generator to layer multiple levels of abstractions into completely customized code throughout the abstraction.
Yeah, I've done the same to inline matrix operations for lidar data processing. Templates are pretty neat since they are completely expanded at compile time. I've yet to look into the implementation details of Go's generics, but since Go has had code generation built in for a while, and it creates static binaries, I imagine it is a very similar system.
EDIT: After reading the part of the post that goes into detail on Go's implementation of generics, it is very similar, but differs when there is indirection on the input types.
We don't know of a way to implement generic types without (vtable dispatch + boxing) cost AND without monomorphization cost. Some languages do the former, some the latter, and some a combination of the two.

Monomorphization:

* code bloat

* slow compiles

* debug builds may be slow (especially C++)

Dynamic dispatch & boxing (usually both are needed):

* not zero cost
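In Go terms, the two strategies sit side by side in the language itself: an interface parameter buys boxing plus dynamic dispatch, while a generic parameter buys per-gcshape specialization at the cost of extra generated code. A small sketch (the `ID` type and `describe*` functions are made up for illustration):

```go
package main

import "fmt"

type Stringer interface{ String() string }

type ID int

func (i ID) String() string { return fmt.Sprintf("id-%d", int(i)) }

// describeBoxed takes an interface value: the argument is boxed and the
// String call goes through dynamic (vtable-style) dispatch.
func describeBoxed(s Stringer) string { return "boxed: " + s.String() }

// describeGeneric is stenciled per gcshape: for a concrete value type the
// call can be made directly, at the cost of extra generated code.
func describeGeneric[S Stringer](s S) string { return "generic: " + s.String() }

func main() {
	fmt.Println(describeBoxed(ID(7)))
	fmt.Println(describeGeneric(ID(7)))
}
```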
"Zero-cost" in that context refers to runtime performance. It always refers to runtime performance.
And code bloat, as I've said elsewhere, is vastly overblown as a problem. Another commenter pointed out that link-time optimization removes most of the bloat. The rest is customized code that's optimized per-instantiation.
Slow compiles are an issue with C++ templates. They're literally a Turing-complete code-generation language of their own, and they can perform complex calculations at compile time, so yes, they tend to make compiles take longer when you're using them extensively. But the point I was making was about runtime performance. That's also why C++ compilers often perform incremental compilation, which can limit the development-time cost.
Debug builds can simply be slow in C++ with or without templates. C++ templates really don't affect debug build runtime performance in any material fashion; writing the code out customized for each given type should have identical performance to the template-generated version of the code, unless there's some obscure corner case I'm not considering.
Rust has the same problem, although to a lesser extent. Monomorphization works well with judicious use. The C++ STL is not written like that: it depends on countless layers of inlining to work well. Rust libraries aren't much better in this regard.
LTO removes some code bloat, but LTO itself takes more time. Until ThinLTO's summary pass (or the equivalent pass in GCC's WHOPR), at least, middle-end and early IR optimizations still have to happen, and Go wants to avoid that. I think that's a fine design choice. In Go's design, they have decided virtual calls aren't a cost they care about anyway; pre-1.18 Go heavily used interfaces, and that's not going to change.
> writing the code out customized for each given type should have identical performance to the template-generated version of the code
In theory yeah, but templates tend to generate more instantiations than strictly what you'd write by hand.
Interestingly the original title and your proposed title imply, to me, the opposite of what I think they imply to you. This suggestion is really unclear.