Print(“lol”) doubled the speed of my Go function (medium.com/ludirehak)
255 points by ludiludi on Aug 24, 2023 | 126 comments


I think this is a case of mis-assigned blame (on the tool’s part, not the author’s). My semi-educated guesswork follows:

Looking at the disassembly screenshots in the article, the total runtime for the benchmark doesn’t appear to have decreased by very much. “Time per op” has decreased by half in max_lol(), but the total number of ops being performed has likely increased, too. Specifically, extra work was done “for free” (as was shown in min_max).

This experiment is showing us that the compiler is in fact doing exactly what we want - maximizing throughput in the face of a stall by pipelining!

In this experiment, maxV is potentially being written to with each iteration of the loop. Valid execution of the next iteration requires us to wait on that updated value of maxV. This comparison and write takes longer than just running an instruction - it’s a stall point.

In the first profile, the compare instruction gets full credit for all the time the CPU is stalled waiting on that value to be written - there’s nothing else it can be doing at the time.

In the other profiles, we see a more “honest” picture of how long the comparison takes. After the compare instruction is done, we move on to other things which DON’T rely on knowing maxV - printing “lol”, or doing the other compare and write for minV.

I propose that the processor is doing our added work on every iteration of the loop, no matter what (and why not? Nothing else would be happening in that time). That BLT instruction isn’t making things faster, it’s just deciding whether to throw away the results of our extra work or keep them.

Throughput is important, but not always the same thing as speed. It’s good to keep that in mind with metrics like ops/time, particularly if the benchmarking tool tries to blame stuff like cache misses or other stalls on a (relatively) innocent compare instruction!


Yes, this part seems wrong

> Following standard practice, I use the benchstat tool to compare their speeds

That tool (at least as used here) would be suitable for comparing execution of the same code between two processors with the same architecture. For comparing different programs on the same architecture, you need a different tool that focuses on total execution time.
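For readers unfamiliar with the workflow: you run each benchmark repeatedly with something like `go test -bench=. -count=10 > old.txt` (and again into new.txt for the variant), then `benchstat old.txt new.txt`. benchstat compares time per op, not total runtime, which is exactly why extra work done "for free" inside each op can show up as a per-op speedup. A minimal sketch of such a benchmark (names and input are hypothetical, not from the article):

    package max_test

    import "testing"

    // Hypothetical stand-in for the function under test.
    func findMax(nums []int) int {
        maxV := nums[0]
        for _, v := range nums {
            if v > maxV {
                maxV = v
            }
        }
        return maxV
    }

    var sink int // global sink so the result isn't optimized away

    func BenchmarkFindMax(b *testing.B) {
        nums := make([]int, 1_000_000)
        for i := range nums {
            nums[i] = i // strictly increasing input, as in the article
        }
        b.ResetTimer()
        for i := 0; i < b.N; i++ {
            sink = findMax(nums)
        }
    }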


I read it and I still don't get it, can someone (re-)explain what the presence of the print() is doing that is helpful for branch prediction (or any other aspect of the CPU)?

Update: It seems to be the conditional move, see https://news.ycombinator.com/item?id=37245325


I read it three or four times. It's never explained. If the print("lol") version has a branch-less-than, what does the regular version have? It must either be a branch or a conditional move, but we aren't shown that part of the assembly. You can't reach any conclusions about why one version is faster if you don't know what you're comparing to.
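For what it's worth, here is a rough reconstruction in Go of the two variants being discussed, pieced together from the thread (not copied from the article), so you can see where the branch vs. conditional-move decision comes from:

    // Plain version: the compiler can turn the if into a conditional move.
    func maxPlain(nums []int) int {
        maxV := nums[0]
        for _, v := range nums {
            if v > maxV {
                maxV = v
                continue // pointless here, as discussed further down
            }
        }
        return maxV
    }

    // "lol" version: the function call in the fall-through path forces a real
    // branch; for a strictly increasing input the print never executes.
    func maxLol(nums []int) int {
        maxV := nums[0]
        for _, v := range nums {
            if v > maxV {
                maxV = v
                continue // skips the print below
            }
            print("lol")
        }
        return maxV
    }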


The high-level view is that adding code that never gets executed causes the compiler to emit code that the CPU predicts better. IDK if this is the compiler assuming that the `print()` call is cold or the branch predictor getting luckier by chance, but basically this tickles the CPU in the right way to get better performance.

It seems that this is mostly luck in a strange situation. And of course if you ever hit the `print()` it will be way slower than not. You can probably do better by adding something like a `__builtin_expect(...)` intrinsic in the right place to be more explicit about what the goal is here.


I'm in school, so this may be oversimplified, but if the processor/assembly code is predicting the next result, it gets the result faster. The processor only does this prediction with conditional branches. The extra if for printing or finding the min invokes the prediction with the accuracies stated.


No, this is not true.

There is branch prediction around the length of loops. This is a case where the processor is not able to accurately predict how long it needs to stay in the loop. The BLT instruction changes the prediction model, causing the processor to be more likely to assume the loop will continue.

Honestly, though, worrying about this level of optimization is generally silly. If you're looping through an array often enough that optimizing the code this way is worth your time, you should use a data structure that automatically maintains the max (and min) values for fast retrieval.
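For the insert-only case, that can be as simple as updating the running max as values arrive (a minimal sketch, not from the thread); if elements can also be removed you'd reach for a heap or an ordered container instead:

    // MaxTracker keeps the maximum of an insert-only stream, so retrieval is
    // O(1) and no scan over the data is ever needed.
    type MaxTracker struct {
        max int
        set bool
    }

    func (t *MaxTracker) Add(v int) {
        if !t.set || v > t.max {
            t.max, t.set = v, true
        }
    }

    // Max returns the current maximum and whether any value has been added yet.
    func (t *MaxTracker) Max() (int, bool) {
        return t.max, t.set
    }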


> The processor only does this prediction with conditional branches

This sounds... wrong? Unless ARM64 is designed in an absurd way?

I'd love to see the full disassembly; something seems funny here. If it was x86 I would say it's a conditional move causing this, but I don't know what's going on on ARM.


It's interesting that you say conditional move here.

I am confused by this behaviour, and although I definitely don't know what the answer is here, the non-lol version does have a CSEL (https://developer.arm.com/documentation/dui0802/b/CSEL) which is totally missing from the lol version.

Non-lol https://godbolt.org/z/ds1raTYc9

lol https://godbolt.org/z/c3afrb6bG


Ah there you go, that's the conditional move on ARM. Yeah, those are slow.


"Conditional move is slow" is not the right takeaway. (In fact, conditional move can be much /faster/ than a branch in certain circumstances (e.g. hard-to-predict branches, like binary search).)

The reason why this code is slow is because the conditional move is on the critical path for resolving a loop-carried data dependency. In other words, to execute the assignment `maxV = (v if (v > maxV) else maxV)`, the CPU has to test v > maxV. But in order to make this test, the CPU has to wait until the conditional move in the previous loop iteration finishes (to know what the new value of maxV is). So the overall execution time of the loop is worse, because each iteration depends on the previous iteration. (This is analogous to why traversing a linked list is slower than an array.)

However, for the branch version, there is no data dependency on the previous loop, so the CPU really can (speculatively) execute multiple loop iterations in parallel. The next loop iteration can start before the previous finishes. And with the large reorder buffers on modern CPUs, maybe up to 4-8 loop iterations can execute in parallel!

On the other hand, conditional moves would be much faster for a loop like:

  bigs := make([]int, len(numsA))
  for idx := range numsA {
    if numsA[idx] > numsB[idx] {
      bigs[idx] = numsA[idx]
    } else {
      bigs[idx] = numsB[idx]
    }
  }
There is no loop-carried data dependency, so the conditional moves can resolve in parallel. The above code with a conditional move is likely to be much faster than code with an equivalent branch. (And even if branches are perfectly predicted, the conditional move version will only be slightly slower, not 2X slower.)


But why does the print statement matter in how the compiler decides this? In both cases, the execution still depends on the maxV value calculated in the previous iteration.


The version with the print statement requires the branch because it's conditionally executed; the 'continue' statement in the if-block will skip it if the loop finds a new maximum.

Since the function-with-if version requires a branch, there's no advantage (regardless of predictability) in using a conditional move for the assignment to maximum.


So basically the print prevents an optimization, that so happens to hinder performance in this particular case?


How exactly does that change anything? It's being run on an increasing array, so the data is assigned every loop. Surely the comparison creates a dependency on the value of the variable which was assigned in the last loop.


For branch instructions but not for conditional move instructions, the processor will happily execute the next instruction before the values are actually ready, then invalidate the work it has done later if it turns out retrospectively to have been wrong. This is called branch prediction.


Technically my understanding of terminology is that the branch predictor is responsible for determining which branch to take when speculative execution encounters a branch. Speculative execution is the thing that will invalidate work if the branch predictor was wrong.

Also my understanding based on reading this thread is that a conditional move is harder to speculate across because of the data dependency carried across loop iterations, preventing speculation. With a conditional branch, speculation can keep executing the next loop and most of the time the branch predictor will guess correctly how to execute the next iteration for this problem.


I wouldn't say it's harder to speculate the result of a conditional move, more that processors don't do it. Or even stronger, that a conditional move instruction is intentionally opting out of prediction for one reason or another.


For a strictly increasing array, assuming the branch predictor always predicts correctly (guesses taken), the CPU does not have to wait for the test to resolve (v > maxV) before it executes the assignment (maxV = v). So effectively the CPU is just running

  maxV = v
  maxV = v
  maxV = v
  maxV = v
over and over (with tests running in parallel as well, but those matter less, because they are mostly just "checking" whether the branch predictor guessed correctly).

Then, in that above sequence, there is no /read/ of maxV after each write, so there is no true data dependency. (We don't need to wait for the last write to finish before we write again, unlike the read after write case.)


So would you say in the general case, i.e. the compiler obviously doesn't know how predictable a branch may be, using conditional move is a poor choice when writing to an address which is repeatedly written to inside the loop?

What I am trying to get a feel for, is whether this represents a bad choice of assembly in general for this kind of loop.


Nit: If a variable/object is only being written, that’s fine. There is a data dependency here because there is a read after a write.

I don’t have enough compiler knowledge to be able to definitively answer your question, but I think it would be a reasonable heuristic to favor emitting a branch over a cmov in cases where the cmov participates in a loop-carried data dependency.

The other advantage of emitting a branch is that branch prediction can better adapt to runtime conditions. Branch prediction tends to struggle when data is perfectly random, but most data is not. So in general it’s also reasonable for compilers to default to emitting branches.


Yeah, after reading the blog post I felt reasonably confident that this is a compiler bug (in terms of perf).

Hopefully someone with a deeper understanding can verify or discredit this here. I'm curious to see the fix for this (I assume this will result in a fix).


I don't know if I would classify it as a bug, but definitely a heuristic that could use tuning.

At least (at a high level) in LLVM, branches and cmovs are represented with the exact same construct, and one of the codegen passes looks at the condition and the two sides and heuristically determines whether to emit a branch or a cmov.

I don't know how Go codegen works, but I assume they do something similar.


Confused, why do you think this is a compiler bug? You should definitely expect a predictable branch to be faster than a conditional move, and the insertion of a conditional move in the original is totally sensible (albeit not intuitive if you haven't seen it).


> and the insertion of a conditional move in the original is totally sensible (albeit not intuitive if you haven't seen it).

Would you mind expanding? If the conditional move isn't needed, and given that it's costly in terms of perf, how is it “totally sensible” to have one here?


The performance of branches is data-dependent. The performance of conditional moves is data-independent. In this case the branches are predictable, so they perform better here. In general, though, the compiler has no idea whether that's the case, so it makes sense to insert a conditional move to avoid branch misprediction penalties.


But still, why the difference? With or without the "lol", the data dependence is identical, the predictability is identical. Yet with the "lol" it optimises differently.

My best guess is that it's because of the added cost to the other branch: printing is expensive, so the compiler really doesn't want to do that, and would prefer to incorrectly predict not-printing than to incorrectly predict printing.

If that tradeoff is what causes the difference in behaviour here, then I understand. But I don't like it, because it's ridiculous and we shouldn't have to con the compiler into doing what's right. If we're going to have to do this, I'd rather have the language provide a way for us to make explicit what we expect here.


When compiling code like

  if a > b
    b = a
    continue
  else
    continue
you can replace the entire if-else with a cmov. But if there is a function call in one of the prongs and not in the other, you cannot.


Ah, so it's the function call (which could just as easily have been something other than print), that forces the compiler to use a normally-less-optimal branch rather than the normally-more-optimal cmov. Only in this particular use case, cmov is actually less optimal, but the compiler doesn't know that.


This is a very good answer. Thanks.



Thank you.


As to branch prediction, for anyone interested: https://stackoverflow.com/a/11227902


Processor "optimizations" can produce surprising effects. The problem is these optimizations are not programmatically accessible to C (or most modern programming languages) given their simple memory model. Deterministic performance is not easy to obtain. My view is to not bother with such tricks unless absolutely necessary (and be prepared that your changes may actually pessimize performance on a future processor or a compatible processor by a different vendor).

If you are interested in this sort of thing, check out comp.arch!


I've been researching this pretty deeply for the last few years, and I've come to the conclusion that, without a complete redesign, most popular programming languages cannot have direct control of these optimizations in an ergonomic manner.

The reason I think this is that most languages target C or LLVM, and C and LLVM have a fundamentally lossy compilation process.

To get around this, you'd need a hodgepodge of compiler directives, or take a completely different approach.

I found a cool project that uses a "Tower of IRs" that can re-establish source-to-binary provenance, which seems to me to be on the right track:

https://github.com/trailofbits/vast

I'd definitely like to see the compilation process be more transparent and easy to work with.


I would agree, but it's hard to argue with a factor 2 performance boost.

But these kinds of tricks feel like we need to con the compiler into optimising this correctly, which is of course ridiculous. What we probably need instead are if-statements where we can tell the compiler which branch is most likely.

Something like:

  if v > maxV predict true
    maxV = v
    continue


It's only factor 2 with an increasing array though. At which point you can just take the last element, that's way faster.

So really you end up having to make assumptions about the input to get the performance boost.


The text does point out that the branch is somewhat predictable even for a random array. In that case, the odds of having seen the array-maximum increase as you scan through the array. For example, on a random array I would predict the first iteration to take the branch 1/2 of the time, but the last iteration to take the branch only 1/N of the time.

The CPU's branch predictor won't be able to perform that kind of algorithmic analysis, but patterns like the above also work reasonably well for simpler heuristics like 'predict the same outcome as the last time the branch was taken'.
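To put a rough number on that intuition (a sketch, not from the thread): for n distinct values in random order, element i is a new maximum with probability 1/i, so the expected number of taken "update the max" branches over the whole scan is the harmonic number H(n), roughly ln(n):

    // expectedMaxUpdates returns H(n), the expected number of times the
    // "new maximum" branch is taken when scanning n distinct values in
    // uniformly random order.
    func expectedMaxUpdates(n int) float64 {
        sum := 0.0
        for i := 1; i <= n; i++ {
            sum += 1.0 / float64(i)
        }
        return sum // e.g. about 14.4 for n = 1,000,000
    }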


In the Linux kernel, there are unlikely() and likely() macros which indicate to the compiler whether or not a condition is likely using __builtin_expect (which then influences the output assembly into producing code that should make the branch predictor do the right thing more of the time).

Unfortunately, the issue here is that the performance depends on the input, and so such hints wouldn't help (unless you knew a priori you were dealing with mostly-sorted data). Presumably the min-max (and lol) versions perform worse for descending arrays?


It's nice using these to mark less-likely, but latency-sensitive, paths, which is something that profile-guided optimization can't do.


No explanation whatsoever. Why is the branch predictor not "invoked" in the first version of the function?


Because it's most likely using a conditional move.


But I don't see the post investigating this at all. Yes, most likely that is what is going on, but I don't understand what the point of the OP post is if the real reason for the difference in branch predictor behavior is not explained.


Why would an unconditional print have any effect on whether the branch predictor is invoked or not? The if statement is there in both cases, so branch prediction should kick in for both. I didn't find an explanation for this behaviour in the article.


The Go compiler might have a heuristic where a branch with IO is considered cold compared to a branch without.

In the original, it essentially faces

    If <cond>:
        Mov
    Else:
        Noop
Considering these branches unpredictable, it generates a CMOV.

With

    If <cond>:
        Mov
    Else:
        Print
It now considers the first branch hot and the second cold, and thus branch prediction valuable, and generates a branch instead.

Turns out that for this use case the first choice is a misfire, as the branch is extremely predictable, so all the conditional move does is create an unnecessary data dependency.

It’s not necessarily the wrong choice in general: for unpredictable branches a cmov will usually be a huge gain. It incurs a cycle or two of latency but saves 15+ cycles of penalty on a mispredicted branch (which, if the prediction only works half the time, averages 7.5+ cycles per iteration).

You can find older posts which demonstrate that side of the coin e.g. https://owen.cafe/posts/six-times-faster-than-c/


> Turns out that for this use case the first choice is a misfire, as the branch is extremely predictable

Happens to be extremely predictable for this data. In general, over all possible inputs, it’s not extremely predictable.

If you assume all inputs are different (not something the compiler can assume, of course) the probability of having to update the max value goes down from 1 for the first iteration to 1/n for the last, so, possibly, the loop should be split into two halves. Go through the start of the sequence assuming the value needs updating more often than not, and switch to one where it doesn’t at some point.

For truly large inputs you could even add heuristics looking at how much room there is above the current maximum (in the limit, if you’ve found MAX_INT, you don’t have to look further)

Sorting programs used to have all kinds of such heuristics (and/or command line arguments) trying to detect whether input data already is mostly sorted/mostly reverse sorted, how many different keys there are, etc. to attempt avoiding hitting worst case behavior, but I think that’s somewhat of a lost art


Thank you. I was struggling to understand what was going on after reading the article. Your explanation is much better.


It has to make a decision to jump over the print sequence or not, whereas without the print it can eliminate the branch with a conditional move and only has to predict the loop's back-edge branch. It just makes a bad guess as to whether branch prediction or a conditional move will be faster, as explained above by tylerhou.


I'm a noob. Looking at the disasm: https://godbolt.org/z/766aPTPc3 It turns a CMOVQLT into a JLT. Is the blog saying CMOVQLT doesn't get branch prediction? I don't get it.


Your disasm is for x86-64. The benchmarks in the blog were run on an M1 MacBook Pro, which is an ARM64.


Sorry. My bad. But looking at ARM64 https://godbolt.org/z/YEjGKce1Y the difference is CSEL vs. BLT. The question still stands. Does CSEL get no branch prediction?


It appears so. https://developer.arm.com/documentation/102374/0101/Program-...

"So far, we have seen examples that use branches to handle decisions. The A64 instruction set also provides conditional select instructions. In many cases, these instructions can be used as an alternative to branches."

Seems CSEL is not a branch.


CSEL is a conditional move; conditional moves are usually used to avoid branches (and branch prediction) entirely.

Here it turns out to be a very bad choice, because it creates unnecessary data dependencies. But a cmov would be a fine choice if the branch were impossible to predict (e.g. if the input were random, though even then I’d expect the running maximum to mostly creep up, so the branch should be predicted as mostly not taken).


Why is there a "continue" at all in the first code sample?

Edit to add: does removing it make any difference?


The inclusion of ”continue” in the non-lol version is pointless and obscures the actual reason for the difference: the addition of the non-pointless ”continue” in the lol version.

As other comments point out, this construct can be replaced by a cmov instruction:

  if a > b:
    b = a
The following construct however, cannot be replaced by cmov:

  if a > b:
    b = a
    continue
Only by first eliminating the pointless "continue" is this replacement valid. But by including it, you can make it look like it's the print("lol") that makes the difference, which is only true lexically.


According to the Godbolt compiler explorer, removing the `continue` makes no difference to the generated assembly.

https://godbolt.org/z/ds1raTYc9

https://godbolt.org/z/rbWsxM83b

The `print("lol")` output looks remarkably different.

https://godbolt.org/z/c3afrb6bG


Good question. As you can see in the comment in the github repo, it has no effect. https://github.com/ludi317/max/blob/master/blog/max_test.go#...

It is there only to match the continue in the second code sample, where it is needed.


Thanks! In that case, I have to say I'm surprised. I assumed the code generated for the loop would have an instruction that branches, so adding another branching instruction could only hurt (edit: not necessarily a lot), but apparently my intuition is wrong.

I'm curious if the performance difference noted in the article happens on Intel/AMD as well...


unrelated: a few months ago you were asking about small language runtimes https://docs.micropython.org/en/latest/


It is there to skip the print("lol") in the second version if the condition is true. Since the array is sorted in ascending order, the condition will be true for every value, and that print is never executed.


I was curious what this strange assembly language was, as it looked like neither Arm nor x86.

Apparently the Go toolchain has its own assembly language which partially abstracts away some architectural differences: https://go.dev/doc/asm

I wonder what the advantages are? It feels like as soon as you move away from the basics, the architecture-specific differences will negate most usefulness of the abstraction.


I guess it's for historical reasons. As the document you linked states, "The assembler is based on the input style of the Plan 9 assemblers". It's important to know that at least two of the "founding fathers" of Go (Rob Pike and Ken Thompson) are ex-Bell Labs guys and were involved with Plan 9. The Plan 9 compiler toolchain was available, they were familiar with it, so that's what they used for Go. Some parts of the toolchain (the linker, I think) have been swapped out in the meantime, but the assembly format has stayed.

EDIT: found the document talking about changing the linker: https://docs.google.com/document/d/1D13QhciikbdLtaI67U6Ble5d... . Favorite quote:

> The original linker was also simpler than it is now and its implementation fit in one Turing award winner’s head, so there’s little abstraction or modularity. Unfortunately, as the linker grew and evolved, it retained its lack of structure, and our sole Turing award winner retired.

...which is referring to Ken Thompson I guess.


It's not just historical, it's more "the same justification as back then".


Rob Pike's talk The Design of the Go Assembler from GopherCon 2016: https://www.youtube.com/watch?v=KINIAgRpkDA


The explanation is not convincing.

My guess is some kind of measurement error or one of the "load bearing nop" phenomena. By that I mean the alignment of instructions (esp branch targets?) can dramatically affect performance and compilers apparently have rather simplistic models for this or don't consider it at all.


Does Go have any facility for providing hints to the optimiser (like how some C compilers support #pragmas) that could cause the branch-predicted instruction to be used rather than the slower one?


Seems like the answer is no[1] and profile-guided optimization is recommended instead, https://go.dev/doc/pgo. I would be curious to see if pgo helps with the author's use case.
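Rough sketch of what that could look like, under the assumption that you collect a CPU profile from a representative run and (with Go 1.21+) save it as default.pgo next to the main package so a plain `go build` picks it up; the workload function here is hypothetical:

    package main

    import (
        "os"
        "runtime/pprof"
    )

    func main() {
        f, err := os.Create("default.pgo")
        if err != nil {
            panic(err)
        }
        defer f.Close()

        // Record a CPU profile while exercising the hot paths.
        if err := pprof.StartCPUProfile(f); err != nil {
            panic(err)
        }
        defer pprof.StopCPUProfile()

        runWorkload() // hypothetical: whatever representative work you care about
    }

    func runWorkload() {
        // ... exercise the code paths you want the optimizer to see ...
    }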

[1] https://groups.google.com/g/golang-nuts/c/1erdKe3aV5k


Ah thanks! That's interesting but a bit weird to me. That response sounds a little bit like someone who feels like they shouldn't do something and is thinking on-the-fly for reasons they can use to justify that feeling.

> We don't want to complicate the language

So I can understand if this complicates the implementation, but I don't know if totally optional pragmas or annotations complicate the language itself. Like, C has this but I don't think people say "Ah, C is alright but the pragmas are a bit confusing and make things complicated".

> experience shows that programmers are often mistaken as to whether branches are likely or not

Your average programmer may mess that up, but those who would give optimisation hints aren't quite your average programmer. And insisting on introducing PGO to your build process (so build, run-with-profile, rebuild-with-profile) for some cases where someone isn't mistaken as to whether branches are likely (or whether some loops run minimum X times, etc) feels a bit needless.

Please remember though that I'm neither a Go programmer nor contributor so I'm really just an outsider looking in, it could be that this is a total non-issue or is really low-priority.


Not giving you nice things because "Your average programmer may mess that up" is the whole philosophy of Go though


Yeah I know they don't tend to want to just hand you over low-level access, like an `asm(...)` statement in C (or maybe like intrinsics). But pragmas tend to be a bit less explicitly "mess it up". Like in the case we're talking about, a loop performs slightly worse by default and it might be nice to nudge the compiler in the right direction. Just as Go isn't "messing [it] up" by selecting a slightly suboptimal instruction here, you're not "messing [it] up" if you annotate a loop slightly incorrectly.


Go can have unexpected performance differences way higher up in the stack.

Ask me about that one time I optimized code that was deadlocking because the Go compiler managed to not insert a Gosched[1] call into a loop transforming data that took ~30 minutes or so. The solution could've been to call Gosched, but optimizing the loop to a few seconds turned out to be easier.

I assume the inverse - the Go compiler adding too many Goscheds - can happen too. It's not that expensive - testing a condition - but if you do that a few million times, things add up.

[1]: https://pkg.go.dev/runtime#Gosched
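For anyone curious, the shape of that kind of workaround is roughly this (a sketch with hypothetical data and per-item work, and largely unnecessary on modern Go versions with a preemptive scheduler):

    package main

    import (
        "fmt"
        "runtime"
    )

    func transform(items []int) {
        for i := range items {
            items[i] *= 2 // stand-in for the real per-item work
            if i%100_000 == 0 {
                runtime.Gosched() // yield so other goroutines aren't starved
            }
        }
    }

    func main() {
        data := make([]int, 10_000_000)
        transform(data)
        fmt.Println("done")
    }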


The Go scheduler is now (well, for years) preemptive.


There’s a Go course that was really good about this level of nuance. He talks a lot about mechanical sympathy and how to dig into details like this. I think it’s called Ultimate Go?


I tried to come up with the most efficient implementation of this rather simple function that I could think of with pure Go without going down to SIMD Assembly: https://go.dev/play/p/zHFxwvWOoeT

-32.31% geomean across the different tests looks rather great. Any ideas how to make it even faster?
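One direction worth trying (a sketch, not benchmarked against yours): break the loop-carried dependency discussed elsewhere in the thread by keeping several independent running maxima and merging them at the end; whether it actually wins depends on what the compiler emits for the comparisons.

    func maxUnrolled(nums []int) int {
        m0, m1, m2, m3 := nums[0], nums[0], nums[0], nums[0]
        i := 0
        // Four independent "lanes", so one comparison doesn't have to wait on
        // the result of the previous iteration.
        for ; i+4 <= len(nums); i += 4 {
            if nums[i] > m0 {
                m0 = nums[i]
            }
            if nums[i+1] > m1 {
                m1 = nums[i+1]
            }
            if nums[i+2] > m2 {
                m2 = nums[i+2]
            }
            if nums[i+3] > m3 {
                m3 = nums[i+3]
            }
        }
        // Handle the remainder, then merge the lanes.
        for ; i < len(nums); i++ {
            if nums[i] > m0 {
                m0 = nums[i]
            }
        }
        for _, m := range [...]int{m1, m2, m3} {
            if m > m0 {
                m0 = m
            }
        }
        return m0
    }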


Most languages have a `max` function, so the core of the loop could be written with just something like: `maxV = max(maxV, v)`

That could be entirely branchless, right?
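In Go specifically, 1.21 added a built-in max, so the loop can be written as below (a sketch; whether the compiler lowers it to a branch, a conditional select, or a dedicated max instruction depends on the target):

    func maxOf(nums []int) int {
        maxV := nums[0]
        for _, v := range nums[1:] {
            maxV = max(maxV, v) // built-in max, Go 1.21+
        }
        return maxV
    }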


A max function still compiles down to some kind of branch


I thought there were specific assembly instructions for this kind of thing, such as MAXSS in x86 [1], plus vector variants like SSE4 PMAXSD. Presumably it's possible the CPU can handle those with special branchless logic, depending on the compiler and CPU implementation. I guess you'd have to know about the CPU internals to know if the instruction is truly branchless, but it is branchless in the sense there is no conditional jump made in the assembly instructions.

[1] https://stackoverflow.com/questions/40196817/what-is-the-ins...


clang compiles all three of these functions to use max instructions (https://godbolt.org/z/9z7hGfdhq).

    #include <algorithm>
    
    using std::max;
    
    int max_array_func(int values[], size_t values_count)
    {
        int max_value = values[0];
        for (size_t j = 0; j < values_count; j++)
        {
            max_value = max(max_value, values[j]);
        }
        return max_value;
    }
    
    int max_array_bittwiddling(int values[], size_t values_count)
    {
        int max_value = values[0];
        for (size_t j = 0; j < values_count; j++)
        {
            int x = max_value;
            int y = values[j];
            // http://graphics.stanford.edu/~seander/bithacks.html#IntegerMinOrMax
            max_value = x ^ ((x ^ y) & -(x < y));
        }
        return max_value;
    }
    
    int max_array_branch(int values[], size_t values_count)
    {
        int max_value = values[0];
        for (size_t j = 0; j < values_count; j++)
        {
            if (values[j] > max_value)
            {
                max_value = values[j];
            }
        }
        return max_value;
    }


Kind of tangential, but who are these people who are so comfortable with disassembling a high level language binary, reading assembly, and then making statements about branch prediction and other such low level esoterica? I've only ever met people like that maybe two or three times in my career, and yet it seems like every other blog post I read in certain language circles everyone is some kind of ASM and Reverse Engineering expert.


They're a dying breed. We're forgetting how to look under the hood and understand "why something works".

Case in point, I'm slowly being replaced by Salesforce muppets for all my projects at work. They're little code monkeys with amazon-ebook-type knowledge, projects cost 20x more, and I look like the mad scientist for speaking the truth. The products are worse by every possible metric; I'm not crazy. The politics at play is the reason why I'm losing ground, not logic.

Cabinet designers are being replaced by Ikea flat pack artists in the software world. All we can do is stand by and watch.

And in regards to this blog, when Medium eventually goes, that knowledge will go too. Blogs have died, personal websites as well, and their ability to be found in Google is almost non-existent.

Sorry I don't have anything more positive to add, except maybe that they're still there, slowly being alienated by the modern tech world!


> We're forgetting how to look under the hood and understand "why something works".

Partly because that's often not what we're supposed to do; the stuff under the hood "just works" and we're meant to use it to write features, not worry about optimising the stuff that happens under the hood.

And partly it's because the stuff under the hood is increasingly weird and bizarre. Branch prediction is weird, and I still don't understand why that extra print statement changes the branch prediction. Why does it predict `v > maxV` is true when the alternative is to print something, but it doesn't predict that when the alternative is to do nothing?

Is it because printing is expensive, and therefore the branch predictor is going to strongly prefer avoiding that? It's weird that we'd basically have to deceive our code into compiling into a more performant form.

I don't want to have to second guess the compiler.


In my world "just works" means "we blew away all of the security controls and best practice to get this thing hobbling across the finish line."

I see COTS products using ldap memberof queries without LDAP_MATCHING_RULE_IN_CHAIN and stating definitively in their documentation that nested groups are bad (despite decades of best practice).

I see product documentation recommending authenticating against LDAP instead of kerberos, despite the underlying libraries having full kerberos support.

I see sslverify: no, and flags to ignore SSH TOFU warnings, and recommendations to avoid SSH gssapi-keyex (WHY?????), and security approached by buying ever more products creating ever more complexity.

Yes, things "just work" in a horrible, 'youre stuck with your vendors forever' sort of way that results in lengthy outages every 6 months due to mounting, intractable technical debt. But things don't have to be this way, you just need people who are willing to ask "why" or "is that necessary" or "can it be better".


> I don't want to have to second guess the compiler.

No one wants to have to second guess the compiler. And it's almost never the compiler, until it is. As a grad student, I remember someone else in our lab spending a long time on a particular issue until he tracked it down to a not yet reported bug in gcc.

99% of the time you can rely on the compiler (or the language runtime, or popular library X), but we also need to be able to identify the rare cases where the lower levels of the stack are broken and how to either fix or work around them.


> not worry about optimising the stuff that happens under the hood.

no one is asking you to do this optimization all the time. But you need to know what to do and how, if you are ever tasked with it. Then as you get more senior, you will also have to be able to judge whether it is possible, or worth the effort to devote to the optimization, should such a scenario arise.

> second guess the compiler.

you aren't second-guessing it, you're examining the output and understanding it, then potentially recognizing when or if the compiler is outputting bad stuff. That's not second-guessing.


> I still don't understand why that extra print statement changes the branch prediction.

The article is under-informing you: they should have shown the full disassembly for both loops, not just the bit in question. I suspect the explanation as given is slightly wrong, but there's no way to disprove it without replicating the setup myself.


> We're forgetting how to look under the hood and understand "why something works".

But under the hood there is a hood. And under that hood there is another hood. And under that hood there is a brand new car whose hood you don't know how to open, and so on.

I cannot devote my life to know everything or I won't be able to provide for my family.


If you don't understand "why something works" then you will certainly not be able to answer "why it doesn't work".

Every high level library is an abstraction and abstractions are leaky.

I'm not saying you have to be able to write assembly by hand. I'm saying you should understand why your code runs 10x slower in docker.


I'm interested in what's under the hood but after working full time scrubbing the bonnet, my motivation is mostly gone.


You don't need to know every little detail, but having at least a basic understanding of how the machinery works (and keeping that knowledge somewhat up to date), and also being able to verify that the machine actually does what you expect it to do, is useful for all programmers, because it lets you write code that works in harmony with the hardware instead of fighting it.

(besides, learning to read a bit of assembly code never hurt anyone)


"It's hoods, all the way down..."


No. Flat out wrong. There is only a small amount of fundamental understanding that needs to be acquired to actually know what your code is actually doing.

Knowing how things work fundamentally is what matters the most, as everything else can be inferred and deduced from that. This is universally true. When you know how something works, you can build up on that.

There's no value for a programmer below how processors work with registers, caches and bus accesses. The most fundamental interface between the programmer and the computer are the registers, caches and interrupts, which make everything else happen.

Understanding how these are being used practically and properly covers a LOT of ground to build up upon.

On the other side you have the people who are fully dependent on compilers to do their job for them. They don't write code for the machine, they write code for the tool.

They don't actually know what they're doing, they have no idea how it works and they believe that the compiler optimizes it all properly anyway, as if it's magic, with no fucking clue what the fuck they're actually doing, while patting themselves on the back for living in dependency.

Why do programmers need garbage collection? Because people have no idea how not to need it. Why are pointers considered bad? Because some assholes decided that telling everyone they're too complex to use was a smart idea instead of teaching them properly, which, of course, it fucking wasn't.

If people actually knew how to do proper memory management, we'd have had far fewer security issues from the get-go, instead of requiring patches for software that was written in a stupid way, ignorant of memory mapping, pages, etc ... you know, fundamental knowledge and understanding of what's actually happening.

It is considered brilliant to use mmap for faster, more memory efficient file access ... which is absolutely ridiculous, because it exposes how little programmers actually know about the fundamentals!

Using mmap for sharing information between processes and reading and writing files has been my go-to since forever, simply because every other way is literally doing it wrong.

Anyone saying that not every programmer needs to know all of this is missing the point. Of course you don't need to, but the status quo is built upon exactly that line of thinking.

Are you better off being able to, at least moderately, repair your car yourself, or are you better off depending on someone else to do it? The answer is obvious, and the same applies to literally everything.

Less dependency should always be the top priority, because less dependency leads to better understanding and increased freedom ... but we're living in a world that's doing the exact opposite.

Gonna stop writing now, because there's no end to this.
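On the mmap point a couple of paragraphs up, a minimal Unix-only sketch of mapping a file for reading in Go (hypothetical file name, error handling kept crude):

    package main

    import (
        "fmt"
        "os"
        "syscall"
    )

    func main() {
        f, err := os.Open("data.bin") // hypothetical input file
        if err != nil {
            panic(err)
        }
        defer f.Close()

        fi, err := f.Stat()
        if err != nil {
            panic(err)
        }

        // Map the whole file read-only into the process's address space.
        data, err := syscall.Mmap(int(f.Fd()), 0, int(fi.Size()),
            syscall.PROT_READ, syscall.MAP_SHARED)
        if err != nil {
            panic(err)
        }
        defer syscall.Munmap(data)

        fmt.Println(len(data), "bytes mapped")
    }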


I never had the chance to work at the lower levels you seem to have worked with; in any case I don't disagree in principle with your points (and understand some part of the fundamentals you mention), but less dependency leads to reinventing the wheel, which is completely unsustainable in the long run, especially if you need to maximize the value you bring to the table. It has been like this for every field; what do we call this, industrialization?


> Cabinet designers are being replaced by Ikea flat pack artists in the software world

That analogy doesn't really work if the new projects "cost 20x more" upfront, unless you mean long-term associated costs.


It’s what evolution looks like.

My grouchy mother in law also laments that they teach typing in school and not handwriting.

Lamentable? I guess. But it’s not realistic to hope people just learn more every generation.

Despite the tradition of “kids these days with their frameworks and libraries!” complaining, you want to see this evolution.


> They're a dying breed. We're forgetting how to look under the hood and understand "why something works".

Well, layering abstraction is the most effective way to deal with complexity. Our brain is too limited to know everything.

The people who understand what's "under the hood" probably won't know much about what's "above the hood", like writing a website or mobile app. Therefore, we need experts at every layer.


This feels semi-normal to me... just have the curiosity to ask "why?" and the bias-to-action to move to "I'm going to find out".

You encounter far far more dead-ends than anyone ever admits, and every unsolved mystery is a mild nerd snipe, an open case; years from now you'll see someone else explain something and realise it answers that question from years prior.

For me, the hard bit is not over-indexing on this... you learn things, but biasing too much toward them is a sure-fire way to over-engineer or increase complexity to the point where something is now worse because of what you know. But once in a while that tiny thing you learned years before is a 20% savings across the board with an associated performance increase, and everyone wondering how on Earth you could possibly have made those jumps.

Also related... incidents. "Why" and "I'm going to find out" is the best way these things don't recur in future. A high degree of observation and understanding is a happy engineer life as it can improve what can often be the most stressful parts of the work (on-call, etc).

That XKCD comic about everyone learning something for the first time factors too... there is stuff you know that others do not, share it.


I remember someone saying the difference between Physics and Computer Science is that in CS we are the masters of the universe - there are no laws of Physics that bind us.

For me that means that in our world of computers there is infinite curiosities to discover. (Not that the same isn't true for the natural world too)


> "Why" and "I'm going to find out"

And it drives me nuts dealing with people who don't think this way. I'm not a jerk about it, but my personality is 'let's see what we can figure out about the why'.


https://imgs.xkcd.com/comics/ten_thousand.png for those who haven't heard of it.


I always felt this one was a bit smug when applied to occupations (which it often is). People paid to do a job should know the basics of that occupation.


Canonical link: <https://xkcd.com/1053/>


My first machine was a TRS-80. My first large program was a compiler from TRS-80 BASIC to Z80. I subsequently disassembled the ROM in the machine to figure out how things worked.

These skills stay with you, and if you read articles like this then you can keep broadly up-to-date with the insanity that is current CPUs. Things like pipelines, branch prediction, and different levels of caching are optimisations that you can learn about as you go.

If you're an auto-didact web developer then you never have the opportunity to learn these skills, or the need to do so.

I know a lot of people who are comfortable with doing this, but in my case it's a generational thing. If you want to do it then you can. It's not hard, it's just a different skill from those you already have, though sufficiently related that you wouldn't be starting from scratch.

But starting with modern CPUs can be hard. Learning the basics from older, simpler CPUs can help. Doing some kind of embedded programming might be the way to get started, or working on an emulator.

As always, YMMV.


It's odd to me to see people who don't want to understand the next level from any current level. A front end person who doesn't care how the backend works, or a backend person who doesn't care how the database works, etc... Obviously there is only so much time in the day, but being comfortable digging down to whatever level is necessary to debug something is a valuable skill.

Not sure if it's still asked, but it reminds me of a classic interview question: "When a user presses a button on a web page, describe in as much detail as possible what happens."


A compiler from what to Z80? (I think there’s a word missing in what you wrote.) I started on a Model I and I’m curious about your compiler. I miss some aspects of those old machines. It was great, as a learner, that the ROM and a DOS like L-DOS was small enough to fit in your head. It can be intimidating how large everything is today.


Now edited ... thank you.

The compiler was from a limited subset of TRS-80 BASIC to Z80. It was written in its own subset, so it could compile itself.

I had 16K of RAM, and that had to fit the BASIC source, the compiled version, and then that compiled version compiled the BASIC source to another compiled version in a different location, which could then be saved on tape.

Interesting times, and so foreign to today's experiences that it's hard for people to follow the details. I could write it up, but I'm pretty sure no one would derive any value from reading it. I sometimes wonder if the hand-written syntax flow charts and first draft are still somewhere in my piles of undiscarded papers. I doubt it ... it was 1979 when I was doing it, and I've moved continents since then.


That sounds like a valuable exercise. I wrote a word processor. There was an issue of 80 Micro that had a simple buffer in assembly (I think they called it Scripy) and I wrote some BASIC around it for various other functions, so that I could keep a journal and write essays for school. A compiler sounds ambitious in comparison.


I think you'll find that the deeper you go into "traditional" computer scientists, the more you'll find the problem-solvers, hackers and tinkerers that post these types of blogs. Especially in odd cases where a random print statement doubles your profiled performance.

That being said, of all the people at all of the tech companies I've worked at, maybe ~5% of them had this sort of mentality and drive to execute on it.


I don't think I'd call "branch prediction" as "low level esoterica". It is a basic fact about how CPUs are implemented since many decades now. I learnt these things in my university coursework. Any module on CPU or computer system architecture is going to teach you all this stuff. But I'm sure you could learn these things from books on this topic too.


I didn't, and frankly, half of the articles I read about it make me think branch prediction is a bug. I mean, I know it's meant to improve performance, which is great, but it has to make assumptions about what's going to happen before it knows it, and those assumptions are going to be wrong. How wrong? How can we con it into making better assumptions? Suddenly programming becomes about second guessing the compiler.

And remember Spectre and Meltdown? Security vulnerabilities caused by branch prediction. If I recall correctly, the pipeline was executing code it wasn't meant to execute because it's executing it before it knows the result of the check that decides if it has to execute it.

Programming is a lot easier if the actual control flow is as linear as I'm writing it.

My broad takeaway of the whole ordeal is that I'm basically avoiding if-statements these days. I feel like I can't trust them anymore.


Without branch prediction and speculative execution, performance would always be as bad as with a misprediction. So every correct prediction is a win, and that’s why it’s being done.

> My broad takeaway of the whole ordeal is that I'm basically avoiding if-statements these days.

That seems like an overreaction. For most applications, cache locality and using the right algorithms is more important, if performance is an issue at all.


> it has to make assumptions about what's going to happen before it knows it, and those assumptions are going to be wrong. How wrong

Speculative/optimistic techniques are not limited to branch prediction, you encounter various forms of it pretty much everywhere, without knowing it.

* Your hard drives have a read-ahead buffer, fetching future data in advance, just because most reads are immediately followed by another one for the data just after.

* Your CPU instructions are prefetched, because most instructions are _not_ jumps, so 99% of the time you will just execute the instruction right after.

* Instruction reordering where your compiler will decide that future instructions not affected by previous instructions could be run in advance.

* Overall any kind of caching is some form of optimistic technique. You are preparing future results.

If you think about it, optimistic/speculative techniques are ubiquitous, and used even at _high_ abstraction levels.

The famous python mantra of "better ask for forgiveness than permission" embodies that spirit. It encourages a coding style of "try: do() except: nope()" rather than "if check: do()".

Standing "against optimistic/speculative techniques" is standing against transactions, rollbacks, caches. It's just not a viable line of thinking IMHO.


You shouldn’t be thinking about the branch predictor this much. The predictor is pretty good at guessing; unless you need to eke out every last shred of performance, you can ignore it.


> My broad takeaway of the whole ordeal is that I'm basically avoiding if-statements these days. I feel like I can't trust them anymore.

Such broad "optimization rules" usually don't make much sense on any compiler created in the last 25 years. An optimizing compiler will analyse and optimize your program's control flow at a very high level, not just naively insert a conditional branch for each if (and that's also true in "low level" languages like C)

If you actually care about such things, you really need to look at the compiler output and incrementally tweak your code (but then other compilers or even compile options might destroy such tweaks again), there hardly is a single optimal source code solution across CPU architectures or even just different CPU models of the same architecture (just don't do any obviously stupid things which make the life of the compiler harder than need be)


....And then feel OK resorting to ChatGPT for the explanation.

Seriously that threw me, and maybe it makes sense in this context but it seems strange for someone with such an apparent depth of technical knowledge leaning on an LLM for anything.


It seems they didn’t want to bother coming up with their own explanation, but then why not just link to https://en.wikipedia.org/wiki/Branch_predictor?


pasting LLM answers is sort of a tongue-in-cheek rhetorical flourish these days


> but who are these people who are so comfortable with disassembling a high level language binary, reading assembly, and then making statements about branch prediction and other such low level esoterica?

I don't see any assembly here. The analysis is done using a profiler, a very common tool available for most programming languages.

https://timotijhof.net/wp-content/uploads/2020_profiling_fig...


There's some bias as topics like these are often on top of HN. Extrapolate the 2 or 3 people you've met to all the programmers in the world - that's how you get some amazing in depth blog posts every other week.


There are many ways you can get to this point (e.g. I just sort of picked this kind of thing up), but an example of a course which is designed precisely to give you these skills is Casey Muratori's "Performance-Aware Programming".


These people genuinely care about their craft to the point they ask: what happened. They then look into it.

People who tend to care tend to talk about what they care about. So, you wind up getting a lot of blogs that sort of self-select to do this kind of stuff. And everyone feeds off that energy and gets better :)


There are many engineers that can do this, it's just that writing javascript pays 3 times or more so the focus goes there :)

If you visit #c / #asm on any popular IRC network, you'll find a lot of skilled people that can do this routinely.


This topic, among others, was covered by a Computer Architecture class back at my university. That class is mandatory for all CS students. Though I have to add that it's never been brought up since I started working.


I think it depends on what niche you're in. I'm in the storage business, and while not everyone, quite a few of the people in my group would not be afraid of running a disassembler, or talking about branch prediction or unaligned integer access. `likely` and `unlikely` are very common in the code, and so on. AFAIK linux kernel people are very similar, and I assume a few other niches of system programming as well.


The HN bubble doesn't have only disadvantages.


It’s not all that uncommon outside of web stuff


People who maintain a blog and whose blog posts are widely circulated are already highly selected. Conversely, people not comfortable with that stuff will also tend to be less comfortable writing a technical blog.

Also, if you read the other subthreads, there are a number of points to be criticized about this writeup.



