Hacker News
Getting 4 bytes or a full cache line: same speed or not? (lemire.me)
39 points by ingve on Aug 1, 2018 | 18 comments


I don’t like the article.

The code is not portable to other compilers/OSes, so I can't even run it here on Windows. The code uses inline assembly a lot, i.e. the optimizer likely failed to achieve much, and for code like this optimizations are very important.

The reason for the inline assembly is also worth mentioning: it's the rdtsc instruction. Mainstream CPUs have had dynamic frequency for the last decade. Recent CPUs return rdtsc values unrelated to the number of instructions executed; it's a counter running at some constant frequency. The code uses AVX, so the CPU is at least Sandy Bridge or an AMD equivalent. Now in 2018, __asm rdtsc is no longer a good method for performance measurements.


> The code uses inline assembly a lot, i.e. optimizer likely failed to achieve much, and for code like this optimizations are very important.

Yes, optimizations are important, but not for the question the author tries to ask. To answer his question he needs to control optimizations, to have both vectorized and non-vectorized code. The simplest way to control optimizations is to turn them off and do them by hand.

> Recent CPUs return rdtsc values unrelated to the number of instructions executed; it's a counter running at some constant frequency.

So this counter is the best fit for measuring the speed of code. What's the point of counting the number of instructions executed when different instructions take different times to execute? If rdtsc counts clock ticks, then it fits best. It is an OS-independent way to measure time. It is compiler-dependent, but it works with gcc and clang, the compilers you can get for free. Alternatively, it is not so hard to rewrite two non-empty __asm statements into whatever asm syntax your compiler understands. There are a total of 6 instructions.


> it works with gcc and clang, the compilers you can get for free

When I’m building for Windows, I usually use Visual C++ 2017 community edition, the compiler I got for free.

> into any asm syntax your compiler understands

When building 64 bit binaries, my compiler doesn’t understand any syntax. The portable way to produce that instruction is __rdtsc() intrinsic, it’s supported in all major compilers just in different headers, x86intrin.h for gcc/clang, intrin.h for msvc.
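For reference, a minimal portable wrapper might look like this (a sketch; the function name is mine):

```cpp
#include <cstdint>
#ifdef _MSC_VER
#include <intrin.h>     // MSVC exposes __rdtsc here
#else
#include <x86intrin.h>  // GCC/Clang expose __rdtsc here
#endif

// Read the time-stamp counter through the portable intrinsic,
// instead of compiler-specific inline assembly.
inline uint64_t readTsc() {
    return __rdtsc();
}
```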

> There are a total of 6 instructions.

Right, but I’m not sure assembly, inline or not, has good compatibility with advanced optimization techniques implemented by all modern compilers. It’s just too low level: harder to inline, impossible to move stuff across registers…


> When I’m building for Windows, I usually use Visual C++ 2017 community edition, the compiler I got for free.

Is it an obstacle to get one more free compiler?

> When building 64 bit binaries, my compiler doesn’t understand any syntax.

Is it really? I'm surprised.

> The portable way to produce that instruction is __rdtsc() intrinsic, it’s supported in all major compilers just in different headers, x86intrin.h for gcc/clang, intrin.h for msvc.

I'm glad to learn that. Thank you.

> Right, but I’m not sure assembly, inline or not, has good compatibility with advanced optimization techniques implemented by all modern compilers. It’s just too low level: harder to inline, impossible to move stuff across registers…

It's not relevant: this code does not rely on compiler optimizations.


> Is it an obstacle to get one more free compiler?

I have both gcc and clang installed, but they're for Linux. I'm using Windows as my main OS, and I prefer running native binaries, e.g. because of the tools for debugging & profiling.

> It's not relevant: this code does not rely on compiler optimizations.

Right, and that’s GCC’s problem: MSVC-produced code is optimized, and is free from the described issue.

You know how I found that? I’ve made a version portable across compilers, built with VC++ and run on Windows: https://github.com/Const-me/MemoryAccessCosts I don’t have Intel compiler but my expectation is it’ll work even better.


> Right, and that’s GCC’s problem: MSVC-produced code is optimized, and is free from the described issue.

> You know how I found that? I’ve made a version portable across compilers, built with VC++ and run on Windows: https://github.com/Const-me/MemoryAccessCosts I don’t have Intel compiler but my expectation is it’ll work even better.

Yes! You have found that optimization doesn't allow you to make the comparison. The code is free of the described issue, so you cannot measure the effect this issue has on speed. As I understand from the asm listings, you are now comparing not 1 iteration with 16 iterations, but 1 iteration with some vectorized code. The original article covers this case, with hand-crafted vectorized code.

Do you see? When optimization is turned on you cannot compare 16 iterations with 1 iteration or with vectorized code. It is the basics of the experimental method: you need to control the independent variable and measure the dependent one. When using compiler optimizations this is much harder; you would need to check the generated code.

Turn off vectorization of code in msvc, and hopefully you'll get the same results.


> you are now comparing not 1 iteration with 16 iterations

GCC made code with zero iterations, it unrolled the complete inner loop and produced 16 scalar add instructions.

MSVC made code with two iterations, each iteration containing two paddd SSE2 instructions, which stand for _mm_add_epi32.

> Turn off vectorization of code in msvc

I can’t. When I try, the compiler says “Command line warning D9002: ignoring unknown option '/arch:IA32'”

SSE2 is a requirement for AMD64 CPUs; that’s why the MS compiler doesn’t support scalar code anymore. And the data size is too large for 32-bit builds: there’s not enough address space for an 8GB vector.


> GCC made code with zero iterations, it unrolled the complete inner loop and produced 16 scalar add instructions.

It doesn't really matter. It's the amount of work needed for the experiment.

> I can’t. When I try, the compiler says “Command line warning D9002: ignoring unknown option '/arch:IA32'”

I have zero experience with MSVC, but I believe you are doing it wrong. You don't need to change the arch; you need control over optimizations. For example, gcc has options -ftree-vectorize, -ftree-loop-vectorize, -fno-tree-vectorize and so on. You need to find docs on compiler optimization techniques and on the options that let you control them.

> MS compiler doesn’t support scalar code anymore.

Strange, but ok. If we return to the starting question "why didn't the author write the code portably", then we now know the answer. There is no way to write scalar code with MSVC. MSVC doesn't allow this kind of experiment, so there is no reason to make such tests portable to MSVC.


> You don't need to change the arch

See the pics: http://const.me/tmp/vc/codegen.png http://const.me/tmp/vc/optimize.png

> You need to find docs on compiler optimization techniques

Indeed, #pragma loop(no_vector) in the source code before the inner loop did the trick. The results are very close to GCC.
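For anyone else trying this, the pragma goes right before the loop (a sketch; other compilers just warn about the unknown pragma and ignore it):

```cpp
// Keep the inner loop scalar under MSVC without changing /arch.
int sum16(const int *p) {
    int s = 0;
#pragma loop(no_vector)  // MSVC-specific; ignored (with a warning) elsewhere
    for (int i = 0; i < 16; i++)
        s += p[i];
    return s;
}
```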


Often you can turn off frequency scaling so that rdtsc is still useful.


You don't need to use rdtsc for this... You can use any library/OS-specific timing functions. Timing accuracy doesn't really matter because you can just do more iterations.

Since it's outside the main loop, it shouldn't affect compiler optimizations.
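A minimal sketch of that approach with std::chrono (the function name is mine):

```cpp
#include <chrono>

// Time a workload with a portable steady clock; running many iterations
// inside one measurement amortizes the timer overhead away.
template <typename F>
double nanosPerCall(F f, int iterations) {
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; i++)
        f();
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::nano>(stop - start).count()
        / iterations;
}
```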


I'd say if you're interested in performance at this level, it is useful to know AT&T inline asm syntax on top of the simpler Microsoft inline asm syntax.


To me, the takeaway is that there is a ~10x performance difference between the near-worst case memory access pattern (330 cycles) compared to the best using SIMD (41 cycles).

A potential order of magnitude improvement hiding in the memory access patterns of your code makes this worthwhile information for a lot of devs.


Additionally, once you've waited for a byte, you've done 90% of the wait for the 63 adjacent bytes. Software becomes memory bound when it peeks at 4 bytes here, 4 bytes there in zillions of unpredictable cache lines. So, instead of waiting 41 cycles to process 64 bytes, it waits 300 cycles to process 4.

It helps to think of the * and -> operators in C++ as "int cache64BytesAndLoad4FromCache(int *p);" That's when you start to really feel "int x = a->b->c->d; a = a->next;" deep in your gut.
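A literal sketch of that mental model (the function is hypothetical, not a real API):

```cpp
// Pretend every 4-byte dereference is really: fetch the whole 64-byte
// cache line containing p, then extract 4 bytes from it.
int cache64BytesAndLoad4FromCache(const int *p) {
    // The hardware pays for the full line; the 4-byte extract is nearly free.
    return *p;
}
```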


When 'percache' is higher, there will be more instructions executed between calls to pick_address_of_random_cache_line().

That means as the out-of-order processor "runs ahead" while waiting for the first main-memory request, it will make fewer requests to main memory before it runs out of space in its reorder buffers, and so it won't be able to overlap so many main-memory requests.

I think this is the main effect they're seeing. It would fit in with SIMD code giving better results, and it explains why the effect mostly goes away when they make each load depend on the previous calculation.
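The difference can be sketched like this (hypothetical helper functions; the chained version serializes the loads):

```cpp
#include <cstddef>

// Independent loads: every address is known up front, so an out-of-order
// CPU can have many cache-line fetches in flight at once.
long sumIndependent(const int *data, const size_t *idx, size_t n) {
    long s = 0;
    for (size_t i = 0; i < n; i++)
        s += data[idx[i]];
    return s;
}

// Dependent loads: each address comes from the previous loaded value, so
// the fetches serialize and the memory-level parallelism disappears.
long sumChained(const int *data, size_t start, size_t n) {
    long s = 0;
    size_t j = start;
    for (size_t i = 0; i < n; i++) {
        s += data[j];
        j = (size_t)data[j] % n;  // next index depends on this load
    }
    return s;
}
```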


No it’s not that. The CPU runs fine, the OP was screwed up by GCC optimizer. With MSVC, there’s very small difference when testing with 8GB vectors, only 8% between 1 and 16.

Here’s my portable version, scalar-only, C++, and slightly more optimized: https://github.com/Const-me/MemoryAccessCosts

The results are in result-gcc.txt and result-msvc.txt in that repo. They’re from the same PC, I’m using gcc 5.4.0 running in WSL.


I think those results just show that if you persuade the autovectoriser to work you'll get the same result as for manual SIMD code, as you'd expect.


Initially, I didn’t know the reason is automatic vectorization. I’ve found the reason and wrote the readme after that comment: https://github.com/Const-me/MemoryAccessCosts/commits/master



