Too much locality for stores to forward (pvk.ca)
86 points by signa11 on Feb 2, 2020 | 17 comments


This is a great little article. I find too many programmers don't understand the value of batches, at every level of code. Batching is super important, both to amortize fixed costs and to reduce serialized blocking for "IO" (in this case, defining IO as "operations which fetch data from a slower data source than the one upon which we're operating").

There's another mistake programmers are prone to make after learning about the importance of batching. If batches of ten are good, why not a hundred? Why not a million? Lots of people assume that bigger is always better. That isn't the case either. There are a few reasons why overly large batches also cause problems. For streaming work, large batches hurt the latency of initial results. Access locality can also suffer when batches are too big. If your cache can fit a hundred of your objects and you work in batches of a thousand, every pass over the batch has to fetch each object from main memory again. If you worked in units of a hundred instead, you would fetch each object from main memory only once.

I find there are a couple of good rules of thumb for batching. Batches should be as small as possible, while still being large enough to solve the issues I mentioned earlier. I like to target batch sizes large enough that the fixed costs are no more than about 5% of the total cost of executing the batch. For CPU-local work like the work described in this article, you would pick the batch size such that operation 0 and its associated fetches have finished on batch element 0 before you loop back to it to begin operation 1. In network situations, where tracking batch completion and context switching is less expensive, this doesn't make as much sense. Just make it so that multiple batches can run concurrently, and issue multiple network requests to handle each one.
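
For the cache side of it, here's a rough sketch of the shape I mean (made-up names and sizes, sketched in C; nothing from the article): run every operation over a cache-sized batch before moving on, so each object is pulled in from main memory once per batch rather than once per pass over the whole data set.

    #include <stddef.h>

    typedef struct { long id; double value; } item;   /* made-up element type */

    void op1(item *x);   /* two passes whose fixed costs we want to amortize */
    void op2(item *x);

    #define BATCH 128    /* picked so one batch stays cache-resident */

    void process(item *items, size_t n) {
        for (size_t i = 0; i < n; i += BATCH) {
            size_t end = (i + BATCH < n) ? i + BATCH : n;
            /* both passes touch the same cache-resident batch, so each item
               comes in from main memory once per batch, not once per pass
               over the whole array */
            for (size_t j = i; j < end; j++) op1(&items[j]);
            for (size_t j = i; j < end; j++) op2(&items[j]);
        }
    }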


At work I hit this issue just the other week. Customers reported performance issues. The issue was that they were pushing more data than we expected, but then again we should have expected that...

Some intricate core code had to do something like 10-15 queries per invoice line. Each is quick, but it adds up when you have 10k lines. My coworker had written the code to work one line at a time.

Now, since this code was called from other parts of our program, where it might operate on just a single line, I couldn't just cache willy-nilly inside the module. If I did, I'd have to introduce some calls to invalidate the caches or whatnot. Easy to miss, easy to screw up, and easy to end up with stale data.

However, if the API had instead been batch oriented from the start, it would still work fine on a single line, but it would also allow more optimizations for actual batches. The cases where it was called for a single line at a time were when the user was punching through, so being sub-optimal there would not be an issue.
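
In rough terms, the shape I'd have wanted looks like this (sketched in C just for illustration; names and types are made up, not our actual code): one batch entry point that can do its queries set-wise, and the single-line case handled as a batch of one.

    #include <stddef.h>

    typedef struct { long line_id; /* ... */ } invoice_line;   /* hypothetical */

    /* batch entry point: can build one set-based query for all the lines
       (e.g. "... WHERE line_id IN (...)") instead of 10-15 queries per line */
    void load_lines_details(invoice_line *lines, size_t n);

    /* the single-line case is just a batch of one, so the interactive
       "one line at a time" path keeps working unchanged */
    static inline void load_line_details(invoice_line *line) {
        load_lines_details(line, 1);
    }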

Gonna be fun trying to disentangle this and try to make it batch friendly...


Seems like a good option would be to implement a second version that has the batch API, but leave the line-oriented one as is for use by the callers that expect it to behave in a per-line way.


Yeah, still won't be easy to integrate the batch optimizations now, since all the code in this module is written with the single-line-at-a-time assumption.


It seems to me that this is an area where something like reactive streams would excel. The interface is streaming by default, and the code dealing with the database could decide to buffer the incoming stream to an optimal amount for creating a query (of course only buffering until a maximum latency is reached). This way neither the producer nor the consumer needs to be aware of the exact buffer requirements, and the whole thing can be optimised using back pressure.
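
Something like this, minus the back-pressure part (a rough C sketch with made-up names; it also only checks the deadline when a new item arrives, which a real implementation would handle with a timer):

    #include <stddef.h>
    #include <time.h>

    #define BATCH_MAX   128     /* "optimal amount for creating a query" */
    #define MAX_WAIT_MS 50      /* maximum latency added by buffering */

    typedef struct { long line_id; } line;            /* hypothetical item */
    int  next_line(line *out);                        /* hypothetical source, 0 at end */
    void run_batch_query(const line *buf, size_t n);  /* hypothetical sink */

    static long ms_since(struct timespec start) {
        struct timespec now;
        clock_gettime(CLOCK_MONOTONIC, &now);
        return (now.tv_sec - start.tv_sec) * 1000
             + (now.tv_nsec - start.tv_nsec) / 1000000;
    }

    void consume(void) {
        line buf[BATCH_MAX];
        size_t n = 0;
        struct timespec first = {0};
        line item;

        while (next_line(&item)) {
            if (n == 0)
                clock_gettime(CLOCK_MONOTONIC, &first);
            buf[n++] = item;
            /* flush when the buffer is full or the oldest buffered
               item has waited long enough */
            if (n == BATCH_MAX || ms_since(first) >= MAX_WAIT_MS) {
                run_batch_query(buf, n);
                n = 0;
            }
        }
        if (n > 0)
            run_batch_query(buf, n);   /* flush whatever is left */
    }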


Would be interesting to try; sadly there isn't any library support in the language we use.


That opens the door to a host of long-term maintenance issues. It might still be worthwhile, but it merits a serious discussion at a minimum.


If the single-line one delegates to the batch one, it should solve most of them.


Starting with the innermost functions, especially those with no external callers, pluralize the function name and change the argument to be a list/array. Leave a thunking function at the old name that wraps the request and unwraps the response.

Lather, rinse, repeat until the entire module has been fixed, and then start pushing it into the slowest callers.
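
Concretely, each step leaves something like this behind (hypothetical names, sketched in C): the pluralized version takes and returns arrays, and the thunk at the old name wraps its single argument into a one-element batch and unwraps the single result.

    #include <stddef.h>

    typedef struct { long line_id; } line_req;    /* hypothetical request */
    typedef struct { double amount; } line_res;   /* hypothetical response */

    /* new, pluralized version: one call handles a whole batch */
    void compute_totals(const line_req *reqs, size_t n, line_res *out);

    /* thunk left at the old name: wraps the request in a batch of one
       and unwraps the response, so existing callers don't change */
    static inline line_res compute_total(line_req req) {
        line_res res;
        compute_totals(&req, 1, &res);
        return res;
    }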


Backtrace hires thoughtful people. (Disclosure: they didn't hire me. :-) It seems to come from the top: Samy Al Bahra is thoughtful.

Backtrace provides what was, to me, a surprising service: they aggregate crash dumps and stack tracebacks, in bulk, from their customers' customers' failed programs, so that the causative bug can be identified and fixed.


> In fact, perf showed that the query server as a whole was spending 4% of its CPU time on one instruction in that loop:

Looks like 40% to me, actually. I think there's a typo here. EDIT: The tweet is clearer: it's 4% of the total CPU time for the whole program. The 40% in that line is only for that particular function (I presume the function is taking ~10% of total CPU time, and 40% of that is on that single instruction).

------

> The first thing to note is that instruction-level profiling tends to put the blame on the instruction following the one that triggered a sampling interrupt.

Hmm, in my experience with AMD-based hardware profiling, AMD CPUs may place the blame as far back as ~200 instructions (pathological worst-case) behind the instruction that was interrupted.

This is most obvious with AMD branch mispredictions. A branch mispredict can, by definition, only happen on branch instructions (a cmp/jmp pair, since cmp/jmp fuse into a single micro-op on both AMD and Intel). But you'll find mispredictions "blamed" on all sorts of non-branching instructions on AMD hardware. It's a known limitation of AMD's hardware performance counters.

In the first case, the movdqu is almost certainly where the 40% of the time is being spent... later on there is...

     9.91 |       lea        (%r12,%r12,4),%rax
     0.64 |       prefetcht0 (%rdx,%rax,8)
    17.04 |       cmp        %rcx,0x28(%rsp)
I can't imagine that the prefetcht0 instruction is taking up 17% of the time of the function. I expect that the 17% here must be something else. In particular, Agner Fog's manuals state that PREFETCHNTA/T0/T1/T2 has a latency of 1 and a throughput of 2 instructions per clock: a very fast instruction under Agner Fog's test case on both Intel Skylake and AMD Zen.

AMD has "instruction-based sampling" (IBS), a difficult-to-understand but more accurate profiling methodology. Intel has PEBS, precise event-based sampling, which helps narrow down the specific instruction to "blame".

This blog post doesn't say which CPU it's using, but unfortunately, these little details do matter. Ultimately, I'd investigate more before concluding the prefetcht0 instruction is actually the problem.

If store forwarding might be an issue, then the LD_BLOCKS.STORE_FORWARD event on Intel Skylake would be a performance counter to check.

--------------

> and accelerate random accesses to the hash table with software prefetching.

Does prefetching actually help in this case? Have you tested the code without prefetching?

Without prefetching, a MODERN processor will NOT "stall" on the load instruction. A modern processor will execute out-of-order, branch predict the next iteration (it's an inner loop, so it's probably always taken), and then effectively achieve the same throughput anyway.

I wouldn't necessarily be "surprised" if prefetching helps in this case... but it's not necessarily a given.


> Without prefetching, a MODERN processor will NOT "stall" on the load instruction. A modern processor will execute out-of-order, branch predict the next iteration (it's an inner loop, so it's probably always taken), and then effectively achieve the same throughput anyway.

It's a hash-table insertion. That's usually poorly predictable.

The post explicitly talks about out of order execution, and why it's of limited help here. See the second "Hypothesis" paragraph. It then talks about how to change the code to allow it to utilize out of order execution.


> The post explicitly talks about out of order execution...

I understand that. But nothing in the post demonstrates whether prefetching was useful or not.

> It's a hash-table insertion. That's usually poorly predictable.

If he is always inserting 8 elements, the CPU might be able to figure out that it needs to insert 8 elements in that loop.

The hash-table insertion itself will fail to be predicted. But there's nothing the programmer can really do about that. However, by having a constant 8-element "insertion batch", you provide an easily predicted loop to the CPU. There will almost always be 8 elements inserted into the hash table, and CPUs can accurately predict constant-sized loops of under 20 iterations or so... maybe 30 iterations depending on how recent your CPU is.

These 8x insertions WILL be predicted. That means your CPU will be OoO executing these 8x insertions. Each insertion may be unpredictable from a branch-prediction perspective (rehash 0x, 1x, or 2x? Depends on the hash table, probably unpredictable), but that doesn't change the dependency graph, so I expect the CPU to OoO the next iteration anyway.
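
To be concrete, the pattern I'm describing looks roughly like this (just a sketch with a made-up table layout and helpers, not the article's actual code, using GCC/Clang's __builtin_prefetch): hash a fixed-size batch of keys and prefetch each destination slot in one trivially predicted loop, then do the insertions once the lines have had time to arrive.

    #include <stddef.h>
    #include <stdint.h>

    #define BATCH 8

    typedef struct { uint64_t key; uint64_t val; } slot;   /* made-up table layout */

    uint64_t hash(uint64_t key);               /* assumed hash function */
    void slot_insert(slot *s, uint64_t key);   /* assumed per-slot insert/rehash */

    void insert_batch(slot *table, size_t mask, const uint64_t *keys) {
        slot *dst[BATCH];

        /* constant-trip-count loop: trivially predicted, and it issues all
           the prefetches before any insertion touches the table */
        for (size_t i = 0; i < BATCH; i++) {
            dst[i] = &table[hash(keys[i]) & mask];
            __builtin_prefetch(dst[i], 1, 3);   /* prefetch for write */
        }

        /* by the time we loop back here, the prefetched lines have had a
           while to arrive, so the unpredictable part of the insertion is
           at least not also waiting on main memory */
        for (size_t i = 0; i < BATCH; i++)
            slot_insert(dst[i], keys[i]);
    }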

--------------

In any case, I think prefetching is an open question still. I think this is a good blog post, but I'm curious if the author did any prefetching experiments.


> Each insertion may be unpredictable from a branch-prediction perspective (rehash 0x, 1x, or 2x? Depends on the hash table, probably unpredictable), but that doesn't change the dependency graph, so I expect the CPU to OoO the next iteration anyway.

This does matter, actually. As soon as the first branch inside an individual item's insertion is mispredicted, all the following instructions will be discarded and must be re-executed. The out-of-order logic in the CPU cannot tell that there is a subset of inflight instructions that will be executed regardless of the outcome of the branch.

The re-execution is likely cheaper though, because the earlier speculative execution ensures that the relevant caches are already populated.


Replying to myself, because it seems like a comment in the blog answers my first question.

> Hmm, in my experience with AMD-based hardware profiling, AMD CPUs may place the blame as far back as ~200 instructions (pathological worst-case) behind the instruction that was interrupted.

pkhuong, in this comment (http://disq.us/p/271zzg1), suggests that he's using an Intel CPU. As such...

> For example, page 18/37 of Intel's manual volume 3B (https://www.intel.com/conte... states that when an instruction causes a PEBS assist, the PEBS record will point to the next instruction: "The return instruction pointer (RIP) reported in the PEBS record will point to the instruction after (+1) the instruction that causes the PEBS assist. The machine state reported in the PEBS record is the machine state after the instruction that causes the PEBS assist is retired."

Since PEBS is activated, we can be assured that the instruction samples are just one instruction away from the triggering instruction.

--------

The discussion includes a link to this blog post that further discusses this "skid" problem: http://sandsoftwaresound.net/perf/perf-tutorial-hot-spots/

> The combination of these two factors is called skid and it affects the attribution and distribution of samples in the final profile. In the presence of skid, samples are attributed to the general neighborhood around performance culprits such as long latency memory load operations. The cpu-clock event cannot be precisely attributed to slow instructions — just the hot code region containing the culprit.

It's important to know if you're running an AMD or Intel CPU! The hardware performance counters are implemented differently between the manufacturers! So this low-level profiling information should always have the CPU type close by, for clarity.


> Hmm, in my experience with AMD-based hardware profiling, AMD CPUs may place the blame as far back as ~200 instructions (pathological worst-case) behind the instruction that was interrupted.

> Since PEBS is activated, we can be assured that the instruction samples are just one instruction away from the triggering instruction.

I don't think PEBS is activated here (technically, it's AMD so there is no PEBS, but let's talk about IBS or whatever the AMD equivalent is) - Paul never mentions PEBS except when quoting an Intel manual, and in any case PEBS (or the AMD equivalent) should put it on the right instruction, not the next one.

What's going on there is that when you want to sample "time" or any closely related thing like "cycles", the concept of skid doesn't really apply. Skid mostly applies when some other type of event is tracked by the PMU and then triggers an interrupt, and it can take many cycles for that interrupt to actually occur.

With time though, that doesn't really matter. Sure, the Linux timer fires (e.g., for task-clock based sampling) or the PMU fires (e.g., for cycles-based sampling), and then some time later the interrupt takes effect, but this later point is also a totally valid sampling point and tends to interrupt slow instructions (or the "one after" slow instructions). That is, if you are sampling every 10,000 us, your samples are just as good if they occur at 10,035 us or whatever, rather than at exactly 10,000 us: in the case that you care about sampling instructions.

This reasoning doesn't apply for other types of events, where the place the process is interrupted isn't "just as good" as the place the event fired, because the event isn't instruction/time based.

So for instruction/time sampling, PEBS doesn't provide much benefit, except by removing the 1-instruction skid ("skid" is not really the right word, IMO; it's just a consequence of the fact that CPUs generally only raise interrupts at the end of an instruction, so slow instructions complete and the next instruction is indicated).

More details at [1].

[1] https://travisdowns.github.io/blog/2019/08/20/interrupts.htm...


I just started reading, but why does it say "(semi)group" instead of "semigroup / monoid"? I've never heard of inversion being used to perform a fold.



