> a human would know, for instance, that (s)he wants to multiply 50 times
A human would turn the for loop that counts up into a do-while loop that counts down? That's doable on the C level, and in that case GCC also generates less ceremony:
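The rewrite under discussion would look something like this at the C level (a sketch, since the exact code isn't reproduced in the thread; names are assumptions, and note the reversed summation order):

```c
/* Original shape: a for loop counting up. */
float dot_product(const float *a, const float *b, int n) {
    float sum = 0.0f;
    for (int i = 0; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}

/* Rewritten shape: a do-while counting down (assumes n >= 1).
   The elements are summed in the opposite order, so the float
   result can differ from the for-loop version. */
float dot_product_down(const float *a, const float *b, int n) {
    float sum = 0.0f;
    do {
        n--;
        sum += a[n] * b[n];
    } while (n > 0);
    return sum;
}
```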
But of course if you do this you have changed the implied original spec, and you have reassociated a floating-point computation, computing different results. At that point you might as well pass -ffast-math to GCC, and it will beat your code by 8x due to vectorization.
> But I'll leave it in there because this handily beats the compiler in both size and speed
Not on my machine. On arrays of 1048576 (2 ^ 20) elements I didn't measure any difference between my first variant, the do-while variant and yours. On arrays of 1024 elements I still didn't measure a difference between the two compiled versions, but yours is about 1-2% slower. On 128 elements do-while looks a bit slower than for, and your variant is more than 20% slower.
Here are some numbers for n = 128, calling each function 10000000 times. dot_product is the original, dot_product_2 uses do-while, dot_product_3 is your code (with an extra move of edx to rax at the start of the function, of course).
(Edit: This seems to be due to your use of subss instead of pxor to zero xmm0 at the beginning. Removing that gets your code back up to only 1-2% slower than the for loop version.)
> even with the compiler often trying to cheat with nopw instructions to warm up the data cache
What?
> Now compare that with the garbage GCC generated. At -O3 no less! What a bunch of garbage!
You still haven't pointed out what exactly you think is garbage. Which line? Which instruction? Or don't you like that it generates some metadata?
> you forced me to write code for a processor architecture I do not even know
I didn't force you to do anything, I specifically asked you to provide an example of your choice. You could have chosen a piece of code where you knew in advance GCC would do badly. You could have chosen an architecture that you know well and/or one where you know that GCC's backend hasn't had as much work as x86-64. You chose not to do this. I didn't force you to do anything.
> And you doubted me
Based on the data it looks like I was right to doubt you.
"A human would turn the for loop that counts up into a do-while loop that counts down?"
Of course an experienced coder would do that; that is the entire point: a compiler will never be able to generate code as short and as fast as a human can. An experienced coder can see opportunities that a program with hard-coded logic, such as a compiler, is incapable of seeing; if it could, we would have true artificial intelligence, with a neural network as dense as a human brain's or denser.
"That's doable on the C level,"
For some bizarre reason unbeknownst to me, you just cannot seem to get your head around writing a program in assembler from start to finish. An experienced assembler coder looking for performance would not use a high level language like C because it would be a waste of her or his time, since they already know from experience that they can easily beat the hard-coded program logic of a compiler (and if they did use one, they would choose Fortran over C). Programming in a high-level language only makes sense when the trade-off of losing performance is acceptable in the name of portability. Even then, there are situations where a portable program in a high level language has hand-written, per-processor assembler code (OpenSSL comes to mind). Why do you think that is?
"What?"
Some versions of GCC for the intel 80x86 family of processors will generate one or more nopw instructions, the idea being to give the processor time to prime the data cache. It's a cheap gimmick of desperate compiler constructors, more of a gamble really, because cache games depend heavily on a very particular processor model within a processor family.
"At that point you might as well pass -ffast-math to GCC, and it will beat your code by 8x due to vectorization."
I could have vectorized this myself (that was actually my first thought) but chose not to, because I didn't want to spend the time researching how to do that on the intel 80x86 family of processors; I could also have unrolled the loop four to eight times for even more of a performance gain, but chose not to because I believe I made my point. Now you are just being purposely obtuse, because it seems you just don't want to accept that compilers still don't generate code as fast as a human coder does: you chose a dot product because you likely knew the compiler would generate pretty fast code for it, but one isolated example doesn't really prove anything. If you do comparisons over decades like I have, you will see how silly it is to argue that compilers generate faster code than coders.
Also, -ffast-math can generate code which will run the fastest only on that particular processor (and might generate instructions which will not run on other processors of the same family), but the far worse danger is that -ffast-math won't generate IEEE-754 compliant results. When one needs very high precision floating point arithmetic, -ffast-math cannot guarantee correct results. For these two reasons, -ffast-math is useless in real life applications. And indeed, wouldn't you know it:
-ffast-math
Sets the options -fno-math-errno,
-funsafe-math-optimizations, -ffinite-math-only,
-fno-rounding-math, -fno-signaling-nans and
-fcx-limited-range.
This option causes the preprocessor macro
"__FAST_MATH__" to be defined.
This option is not turned on by any -O option besides
-Ofast since it can result in incorrect output for
programs that depend on an exact implementation of IEEE
or ISO rules/specifications for math functions. It may,
however, yield faster code for programs that do not
require the guarantees of these specifications.
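The reassociation point is easy to demonstrate: float addition is not associative, and reassociating a reduction (which is exactly what vectorization under -ffast-math does) changes the result. A minimal example:

```c
/* (a + b) + c and a + (b + c) are different computations in float:
   the rounding happens at different intermediate values. */
float add_left(float a, float b, float c)  { return (a + b) + c; }
float add_right(float a, float b, float c) { return a + (b + c); }
```

With a = 2^24 (the point where the gap between consecutive floats becomes 2) and b = c = 1.0f, the left association rounds both increments away and returns 16777216.0f, while the right association returns 16777218.0f.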
"(Edit: This seems to be due to your use of subss instead of pxor to zero xmm0 at the beginning. Removing that gets your code back up to only 1-2% slower than the for loop version.)"
There are two ways to do it fast: eor (or what the idiotic intel architecture calls "xor"), or the sub instruction. On the Motorola MC680x0 family, both run at the same speed. Until now, I had never in my entire life written a single line of assembler code for the intel 80x86 family of processors. What you saw is the first assembler code I ever wrote for that platform; I even had to look up which registers to use for floating point and how to do floating point arithmetic on that family of processors, and my hard target was less than 15 minutes of total time wasted on a "someone is wrong on the Internet" (https://www.xkcd.com/386/) discussion. If I could do that in under 15 minutes for a processor family I know literally nothing about, imagine what a real coder experienced in writing 64-bit assembler code for the intel 80x86 family of processors can do.
"You still haven't pointed out what exactly you think is garbage. Which line? Which instruction? Or don't you like that it generates some metadata?"
I haven't, because I was on a mobile telephone and it was just too much of a bother to quote and reply to everything I wanted to address. I had to log into a workstation to reply to this, and I resent that too. I don't like all of the above, both the code and the metadata the compiler generated, because it's machine-generated nonsense which jumps back and forth and is hard to read and comprehend. Assembler code written by a human isn't. Now, since I already resent what I'm doing, I might as well go all the way. Let's pick apart the compiler generated examples in reverse order.
movslq %edx, %rdx
movslq isn't even necessary, but the compiler, having hard-coded logic and being just a program, generated it anyway. A waste of CPU cycles.
.p2align 4,,10
.p2align 3
More assembler directives, because a compiler cannot figure out whether such alignment padding is necessary or not. My code didn't need them, because I knew it didn't.
leal 1(%rdx), %eax
testl %eax, %eax
Again, nonsense: two instructions wasted on what could amount to a single subq. This isn't more efficient.
rep ret
Why, when a ret will do? Obviously in my code it worked, but the compiler felt that it had to generate this nonsense, which on top of being nonsense is harder to read and understand, and isn't any faster.
Next example:
testl %edx, %edx
jle .L4
leal -1(%rdx), %eax
testl is a total waste of processor cycles here. What purpose does it serve? Then more cycles are wasted on jumping to an instruction to clear %xmm0, presumably to be able to utilize the instruction pipeline, but all the jump does is waste cycles.
xorl %eax, %eax
Why xorl %eax, %eax in a loop when it could be done once? More waste of processor cycles. That code is extremely inefficient; it gives up readability to bank on a specific feature of intel 80x86, hoping that the instruction cache and pipelining will offset the losses. Idiotic in the extreme, considering a human could get the same performance with far less and simpler code!
"Based on the data it looks like I was right to doubt you."
It might look that way to you now, but if you choose to acquire more experience in coding in assembler, you will eventually realize that I'm correct. You just need more experience, and from that will come insight. Run more tests on more algorithms; study the intel / AMD architecture manuals. Try to construct your own compiler, which I think will really make you realize why a program cannot beat a human. Shame only that you have to deal with such a shitty family of processors, but even that insight will be extremely valuable.
> Even then, there are situations where a portable program in a high level language has hand-written, per-processor assembler code (OpenSSL comes to mind). Why do you think that is?
It makes sense at times, in specific contexts. Which is a far cry from your "compile anything with gcc -S -O3 and behold the extra code which a human would never write" (emphasis mine).
> nopw instructions, the idea being to give the processor time to prime the data cache.
I'm still only 95% sure what you mean by "priming the data cache", but if you mean prefetching, (a) nop doesn't prefetch; and (b) there are prefetch instructions for prefetching. And in any case prefetching wouldn't be cheating.
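For comparison, actual software prefetching in GCC looks like this (a sketch using `__builtin_prefetch`; the 16-element prefetch distance is an arbitrary assumption, since useful distances are microarchitecture-specific):

```c
/* Prefetch ahead of the loads; unlike a nop, this actually issues a
   prefetch hint. Prefetching past the end of the array is harmless in
   practice: prefetch instructions do not fault. */
float dot_product_prefetch(const float *a, const float *b, int n) {
    float sum = 0.0f;
    for (int i = 0; i < n; i++) {
        __builtin_prefetch(&a[i + 16], 0, 1);  /* rw=0 (read), low locality */
        __builtin_prefetch(&b[i + 16], 0, 1);
        sum += a[i] * b[i];
    }
    return sum;
}
```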
> If you do more comparisons over decades like I have
Compilers have gotten much better over the last few decades. Maybe the things you "know" are "always" true were true at some point in time and aren't true anymore.
> the far worse danger is that -ffast-math won't generate IEEE-754 compliant results
Yes, that's what I said. You took my original function and wrote a version that didn't generate IEEE-754 compliant results. If you're OK with changing the semantics, you should use -ffast-math and let the compiler vectorize.
> There are two ways to do it fast: eor (or what the idiotic intel architecture calls "xor"), or the sub instruction
For whatever it's worth, I think sub would be incorrect if the original bit pattern in the register were a NaN.
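That point can be checked directly from C: under IEEE-754, x - x is NaN (not zero) when x is NaN or infinite, whereas XORing a register's bits with themselves yields zero for every bit pattern. A small check:

```c
#include <math.h>
#include <stdint.h>
#include <string.h>

/* Does x - x compare equal to zero? False for NaN and infinities. */
int sub_zeroes(float x) {
    float r = x - x;
    return r == 0.0f;
}

/* XOR of the raw bit pattern with itself is zero for any input. */
int xor_zeroes(float x) {
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);   /* read the bit pattern */
    return (bits ^ bits) == 0;
}
```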
> two instructions wasted with what could amount to a single subq. This isn't more efficient.
And yet, somehow, this version of the code runs faster than your version.
> testl is a total waste of processor cycles here. Which purpose does it serve? Then more cycles are wasted on jumping to an instruction to clear %xmm0
That's the code for testing whether the for loop should ever be entered. If n is less than or equal to zero (that's what's being tested by testl/jle), the function returns zero. Which is why xmm0 (the return register) needs to be zeroed in that case.
> why xorl %eax in a loop when it could be done once?
It's not in a loop. It's the "i = 0" setup code before the loop.
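Put back in C terms, the quoted instructions map onto the source like this (a sketch, since the full function isn't reproduced at this point in the thread):

```c
/* testl %edx, %edx / jle .L4 implements the n <= 0 guard, the branch
   target zeroes xmm0 so the function still returns 0.0f, and
   xorl %eax, %eax is the one-time i = 0 setup before the loop. */
float dot_product(const float *a, const float *b, int n) {
    float sum = 0.0f;            /* xmm0 must hold 0.0f on the early-out path */
    if (n <= 0)
        return sum;
    for (int i = 0; i < n; i++)  /* i = 0 happens once, not per iteration */
        sum += a[i] * b[i];
    return sum;
}
```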
You have made it very clear that you don't feel qualified to write highly optimized x86-64 code. Neither are you qualified to judge the quality of x86-64 code if you can't tell what is inside a loop and what isn't.
Two last points before I drop this thread:
> you will see how silly it was trying to argue that compilers generate faster code than coders
I didn't argue that. I argued that your assertion "compile anything with gcc -S -O3 and behold the extra code which a human would never write" (emphasis mine, again) was incorrect. That doesn't mean that I think that gcc will always, or even sometimes, beat human coders. But it can match them very very often.
> you chose a dot product because you likely knew that the compiler would generate pretty fast code [...] More waste of processor cycles. [...] Idiotic in the extreme
You're contradicting yourself. And you are calling the many GCC developers and Intel/AMD microarchitecture experts idiots, people who very likely have pored over every single instruction of this very code and decided that this is the way it should be written for maximum performance.
I wrote nopw and you are writing about nop. Are you doing this on purpose? nop doesn't prefetch, but on the intel family of processors nopw has a side-effect of prefetching.
"Yes, that's what I said. You took my original function and wrote a version that didn't generate IEEE-754 compliant results."
Turns out, so did the GCC compiler, at least the one I have, so I'd say your point is moot.
Truth of the matter is, you picked a really bad example: to solve it correctly, one would have to implement at least a portion of the algorithms in the GNU multiple precision library ("GMP"). I suspect your picking a floating point example was no accident.
"I think sub would be incorrect if the original bit pattern in the register were a NaN."
Even NaN has to be represented by a bit pattern, so subtracting that bit pattern from the register will yield zero.
"That's the code for testing whether the for loop should ever be entered."
And here we come back to my point: if you were coding this from scratch in assembler, you wouldn't write a generic function, and you'd know that n will never be zero. And the reason why you'd never write a generic function is because they lose you speed and increase code size. But a compiler cannot know that and cannot optimize for such a situation. It's just a dumb program.
"You have made it very clear that you don't feel qualified to write highly optimized x86-64 code. Neither are you qualified to judge the quality of x86-64 code if you can't tell what is inside a loop and what isn't."
I spent 30 seconds looking at assembler code for a processor family I have never coded on. I spent less than 15 minutes writing a piece of optimized assembler code for that family, using GNU as, an assembler I had never written code in. Now you judge me on misinterpreting one clumsily generated compiler instruction. By what logic, considering I was able to do all of this in under 15 minutes, am I not qualified? I'm very pleased with myself; given the time budget, an unknown processor and an unknown assembler, I think I did very well. We will have to disagree, vehemently if you please.
I stand by my assertion that a compiler will never be able to beat a human at generating fast, optimized code, nor will it ever be capable of generating smaller code. In addition I don't hold the GCC developers in high regard, considering how notoriously bad their compilers are when compared to say, intel or Sun Studio ones. Even the Microsoft compilers beat GCC in generating code which runs faster. In fact, pretty much every compiler beats GCC in performance, which means that people working on those GCC compilers aren't good enough. GCC's only undisputed strength is in the vast support of different processors. There, they are #1, but everywhere else they're last. The GCC developers just don't have what it takes to be the best in that business.
"And yet, somehow, this version of the code runs faster than your version."
I don't know that; you ran code which I wrote blindly; I was not even able to reproduce your output with my GCC. That it runs faster is just your assertion. Based on my experience, I have no reason to believe that.
"I hope you have a wonderful day."
As a matter of fact, I am about to go create a SVR4 OS package of GCC 9.2.0, which I patched and managed to bootstrap after a week's worth of work on Solaris 10 on sparc, so yes, I will have a wonderful day enjoying the fruits of my labors. I wish you a wonderful day as well.