Throwing multiple cores at the problem is not what I would call beating decades of optimized C. The author only utilizes multiple cores when he realizes the overhead of Haskell will not allow him to actually win. The author even admits the C program was more efficient. You can multithread a C program as well, at which point it would retake the title.
It's sad to see this kind of Stockholm syndrome. The author is fighting hard against the Haskell compiler to fit a square peg in a round hole. Programming languages are just tools, not religions. Programming languages are fake abstractions of the machine. When one doesn't work for you, you should have no problem switching to another. Loyalty to what is literally an arbitrary man-made mental model, at the expense of your real-life productivity, makes zero sense.
Funny how different people read this differently. I read it as a playful exploration of how far you can push your square peg into that round hole. No one uses Haskell as a faster C in real life; that would not make much sense, and that's not where the strength of the language lies.
There seems to be an unhealthy obsession with beating C in some corner of the Haskell community. It makes even less sense nowadays when some of the most popular languages out there are Javascript and Python, clearly C-grade performance is not required anymore by the vast majority of applications.
I actually wrote a comment (in 2013, but the page I'm referencing doesn't appear to have changed one bit) where I shared my experience looking at haskell.org's introduction, which was filled with FUD regarding C and claims about how Haskell was so much better and faster.
> There seems to be an unhealthy obsession with beating C in some corner of the Haskell community.
I think it's about the same for pretty much any language in the "statically typed, compiled" camp. You'll see "faster than C" claims for Rust and Go as well, for example.
That's true. I guess what makes Haskell stand out to me is that at this point it's fairly clear the assertion is mostly wrong (i.e. idiomatic Haskell is not faster than idiomatic C at most algorithmic tasks, and is often significantly slower). Instead of doing the reasonable thing and saying "additional safety and expressiveness are more important than raw performance for most applications" (something I'd completely agree with personally), they double down with these meaningless, unfair benchmarks.
Meanwhile Rust is genuinely competitive with C performance-wise and thanks to generics and things like more aggressive inlining can actually equal or even beat C without requiring esoteric micro-optimizations. Of course the drawback is that writing Rust code is significantly more complicated and more restrictive.
I've immersed myself in the Haskell community for the past couple of years and while I'd agree that a few posts try to race a Haskell implementation against a C version, these are in the extreme minority.
I feel that you have mis-characterised a whole community, very few of whom care to make Haskell faster than C and overwhelmingly share your view that "C-grade performance is not required".
Further, in the posts I've read of people comparing to C, if anything C is held up as the benchmark as a mark of respect - not as something to cheat against, not to try to say that Haskell is better. Indeed Haskell's FFI out to C is really nice, and demonstrates the pragmatic side of this ivory-tower language :)
This post first embraces its clickbait title and then highlights that even though the Haskell version may be faster on multicore systems (at least for ASCII; the Unicode version may be incorrect wrt Unicode spaces), in practical situations without caching this may not provide any benefit. You're acting as if OP submitted his binary to Apple, *BSD and all the Linux distros with a pull request saying his is better! It's a fun post golfing some Haskell performance issues.
C-like performance is required more than ever: we have to have monster multicore machines with gigabytes of memory just to run any software. That's absolutely ridiculous.
I have noticed this is a very human pattern. Things have their strengths, and they've got their weaknesses. But if your identity is bound too much to your favourite thing, you have to bend reality in order to make the weakness disappear.
Or as Feynman put it:
The first principle is that you must not fool yourself — and you are the easiest person to fool.
The article starts directly with a warning that the title is a bit provocative.
The main point here is clearly not to sell the existing tool on the sole consideration of binary execution performance. As the article says, with Haskell you can produce more reliable software, thanks to strong typing, and use higher abstraction levels. Even with a significant loss in performance, this can be a good trade-off (depending on the project type, of course).
So the point here is that you can have decent performance with a state-of-the-art "high-level garbage-collected runtime-based language".
Also, when counting words the C code is actually handling multibyte sequences, which is one of the biggest sources of improvement for the Haskell code (8s to 3.5s).
Having to use a syntax extension explicitly to forgo laziness feels like cheating to me... But then I have never used Haskell so perhaps that's actually normal.
If you're talking about BangPatterns, that is completely normal in Haskell.
Pretty much any non-trivial Haskell program (and many trivial ones too) will have !'s in various places to force something to be strict, and to use the ! syntax you do need to add a pragma. Those pragmas are what let Haskell evolve new features without breaking backward compatibility, so Haskell programs tend to have a lot of them. Perhaps one day we'll have a new standard (it's been 19 years since the last one) which will include some of the pragmas by default. In the meantime, it's possible to turn the common ones on project-wide and just forget about them.
Haskell is a lazy-by-default language and the most practical way to disable laziness is that syntax extension.
You can do it without the extension, but it'll turn ugly really quick (lots of uses of the "seq" function everywhere).
In general, the Haskell community is very tolerant of language extensions (people will recommend you enable about a dozen of them in every project), so this is pretty standard.
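To make the comparison concrete, here's a minimal sketch (my own toy example, not code from the article) of the same strict accumulator spelled both ways — with plain `seq` from the standard library, and with the BangPatterns extension:

```haskell
{-# LANGUAGE BangPatterns #-}

-- A strict left fold over a list, written two ways. Without forcing,
-- the accumulator would build an ever-growing chain of (0 + 1 + 2 + ...)
-- thunks before anything is evaluated.

-- With the standard `seq` function: force acc' before recursing.
sumSeq :: [Int] -> Int
sumSeq = go 0
  where
    go acc []       = acc
    go acc (x : xs) = let acc' = acc + x in acc' `seq` go acc' xs

-- With a bang pattern: the ! forces the argument on every call.
sumBang :: [Int] -> Int
sumBang = go 0
  where
    go !acc []       = acc
    go !acc (x : xs) = go (acc + x) xs
```

Both compute the same result; the bang-pattern version is just the nicer spelling of the same forcing.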
> Having to use a syntax extension explicitly to forgo laziness feels like cheating to me... But then I have never used Haskell so perhaps that's actually normal.
The `seq` function achieves the same end and is in the standard library. There's nothing cheating about it. The bang patterns extension just provides some nicer syntax for it.
The Haskell version is provably going to work without issue when parallelized (due to the monoid laws), but I doubt the concurrency primitives in C are as trivially usable, and I expect no sane person would want to write that version.
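For readers who haven't seen it, the kind of monoid being referred to can be sketched roughly like this (a from-memory reconstruction of the article's "flux" idea, not its exact code):

```haskell
import Data.Char (isSpace)

-- A chunk's word count, plus whether the chunk starts/ends mid-word.
-- When two adjacent chunks are merged and one ends inside a word that
-- the next one continues, that boundary word was counted twice.
data Flux = Empty | Flux !Bool !Int !Bool
  deriving (Eq, Show)

instance Semigroup Flux where
  Empty <> x = x
  x <> Empty = x
  Flux l n r <> Flux l' n' r'
    | r && l'   = Flux l (n + n' - 1) r'  -- merge the split word
    | otherwise = Flux l (n + n')     r'

instance Monoid Flux where
  mempty = Empty

flux :: Char -> Flux
flux c
  | isSpace c = Flux False 0 False
  | otherwise = Flux True  1 True

wordCount :: String -> Int
wordCount s = case foldMap flux s of
  Empty      -> 0
  Flux _ n _ -> n
```

Because <> is associative, the foldMap can be split across chunks and the partial results recombined in any grouping, which is exactly what makes the parallel version safe.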
Why? This is not a hard problem and there's nothing wrong with parallelism in C. I'd much rather work on parallelizing the C code than deal with any amount of Haskell. I suspect there are more people who would agree with me than the total number of people happy to write Haskell.
Whatever the cost in C, Haskell is incurring the same cost, unless it is using green threads.
(You could use green threads in C - or something similar, though to be honest counting words in a file is a bit of an annoying problem for parallelism)
Haskell does use green threads. The thing is, GHC-compiled code runs on a runtime system that owns one OS thread per core (when started with +RTS -N), and the runtime schedules its lightweight green threads across those OS threads.
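As a toy illustration of how cheap those green threads are (assuming GHC; compile with -threaded and run with +RTS -N<cores> to spread them over OS threads):

```haskell
import Control.Concurrent (forkIO, newEmptyMVar, putMVar, takeMVar)

-- Fork one lightweight (green) thread per task; the runtime multiplexes
-- all of them over however many OS threads it was given. Each thread
-- reports its result through its own MVar.
parSquares :: Int -> IO Int
parSquares n = do
  vars <- mapM spawn [1 .. n]
  sum <$> mapM takeMVar vars
  where
    spawn i = do
      v <- newEmptyMVar
      _ <- forkIO (putMVar v (i * i))
      pure v
```

Forking thousands of these is routine; they are scheduler-managed closures, not OS threads.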
Beating Decades of Optimized C with 27 Lines of Ocaml:
    type t = { mutable words: int; mutable chars: int; mutable lines: int;
               mutable in_word: bool; mutable finished: bool };;

    let () =
      match Core.Sys.argv with
      | [| prog_name |] -> Core.Printf.eprintf "Usage: %s file1 file2 ...\n" prog_name
      | _ -> (
          let args = Core.Array.slice Core.Sys.argv 1 @@ Core.Array.length Core.Sys.argv in
          let buf_size = 65536 in (* 64 KB -> Caml IO buffer size *)
          let buf = Core.Bytes.create buf_size in
          Core.Array.fold args ~init:() ~f:(fun _ file ->
            Core.In_channel.with_file file ~f:(fun in_ch ->
              let c = { words = 0; chars = 0; lines = 0; in_word = false; finished = false } in
              let set_words () = if c.in_word then c.words <- c.words + 1 in
              while not c.finished do
                let len = Core.In_channel.input in_ch ~buf ~pos:0 ~len:buf_size |> Core.Int.to_int in
                if len > 0 then (
                  for i = 0 to len - 1 do
                    match Core.Caml.Bytes.get buf i with
                    | ' ' | '\t' | '\r' -> (c.chars <- c.chars + 1; set_words (); c.in_word <- false)
                    | '\n' -> (c.chars <- c.chars + 1; set_words (); c.in_word <- false; c.lines <- c.lines + 1)
                    | _ -> (c.chars <- c.chars + 1; c.in_word <- true)
                  done
                ) else ( c.finished <- true )
              done;
              set_words ();
              Core.Printf.printf "%s -> lines: %d, words: %d, characters: %d\n"
                file c.lines c.words c.chars)))
    ;;
Testing:
    $ ls -lh test_file.txt
    -rw-r--r-- 1 user group 508M Oct 16 12:01 test_file.txt

    $ time wc test_file.txt
     4863460 54621760 532480000 test_file.txt

    real    0m3.368s
    user    0m3.137s
    sys     0m0.195s

    # OCaml version:
    $ time ./wcl.native test_file.txt
    test_file.txt -> lines: 4863460, words: 54621760, characters: 532480000

    real    0m2.480s
    user    0m2.345s
    sys     0m0.123s
While I wouldn't be surprised that OCaml is fast, I'd like to be sure your disk cache was warm in both cases. I'd test it myself, but I don't have OCaml set up on my work computer.
I set it up on my computer and used it to count the lines of 250 files; it is indeed faster than my system's (Ubuntu 18.04) wc, but it doesn't compute the total. I don't think this should change the timings much, though.
A best of five run of the system's wc takes 0.042s, vs 0.028s for the OCaml version.
I was going to say that it is unreadable, but after spending a bit of time trying to understand it (as someone who knows no OCaml), it actually is not. The only things I'm unsure about are the ~ symbol and <-. What do they do?
edit: <- is just mutating assignment. I thought that ~ introduced named arguments, but that doesn't seem right.
Not really, the initialization of the JVM is fast(er), but initializing the Clojure runtime took a lot of time. At least in 1.7, not sure if it got better in the meantime.
I could start and run Java programs way faster than just starting the Clojure REPL.
What a clickbait title indeed. The end version still eats three times more memory, but worse, it just doesn't work in the general case.
The way I generally use "wc" is inside a somewhat complex command line, with commands preceding it. As in "feeding characters" to it.
There is a reason why "wc" is not multithreaded: it just can't. It must work sequentially, because in the general case the input of "wc" cannot be skipped over.
This is one of the two big assumptions that are made by the author ("wc works only for files, so we can lseek") -- the second, identified by the author, being that the underlying hardware and filesystem must support concurrent access to the same location efficiently.
> There is a reason why "wc" is not multithreaded: it just can't. It must work sequentially, because in the general case the input of "wc" cannot be skipped over.
Eh what? Word counting is embarrassingly parallel, at least for files.
You start up k threads each working on its own chunk of the file, assuming that there is a word boundary at the start of its chunk. Then you sum up the total word count but inspect whether the chunk boundaries actually were word boundaries, and subtract 1 each time this was violated.
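A sequential sketch of that scheme (the chunkWords calls are independent, so each could run on its own thread; the names are mine):

```haskell
import Data.Char (isSpace)

-- Words in a single chunk, counted as if the chunk began at a word boundary.
chunkWords :: String -> Int
chunkWords = length . words

-- Sum the per-chunk counts, then subtract one for every chunk boundary
-- that actually fell inside a word (end of one chunk and start of the
-- next are both non-space), since that word was counted twice.
combine :: [String] -> Int
combine = fixup . filter (not . null)
  where
    fixup chunks = sum (map chunkWords chunks) - overcounts
      where
        overcounts =
          length
            [ ()
            | (a, b) <- zip chunks (drop 1 chunks)
            , not (isSpace (last a))
            , not (isSpace (head b))
            ]
```

A word spanning three or more chunks is handled too: each interior boundary subtracts one, collapsing the fragments back to a single word.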
> Eh what? Word counting is embarrassingly parallel, at least for files.
wc does not make the presumption that it is working on a file that can be seeked into (and, more importantly, re-wound). in c++ terms, it behaves as though the source it is reading from is an "input iterator"[1].
if you are piping into wc, then you'll never be able to parallelise it. word counting a file is the edge case here, not the general use.
You don't need to seek: Spin up workers, have them wait on a mutex, and read blocks. If the producer is slow, then you won't need to feed multiple workers. If it's fast, it's a matter of finding the tradeoff between block size vs. the processing time per block.
i expect you're being downvoted because it's likely that the cost of deciding which thread to assign work to is an order of magnitude more expensive than accumulating the required counters.
in any case: you're right, it is of course possible. what i should have said was:
> if you are piping into wc, then you'll never see a performance gain from any attempts at parallelisation.
Well, I'm back at 1 point now. In any case, the amount of work needed to assign to a thread is trivial. For a small number of threads, it at most requires a couple of mutex operations and a couple of load/stores to distribute each block. Given the block size can be set arbitrarily high, the overhead of work assignment per byte can be made arbitrarily low. The biggest issue is that you won't know the input size, and for small input sizes there will be a cutoff point where multiple threads adds cost. But for small input sizes, performance won't be an issue anyway.
EDIT: Actually, the simplest way of assigning work is probably just to have each thread try to acquire a mutex, call read(), and then release the mutex and do their work.
Last week he had "Learning Haskell is no harder than learning any other programming language"[1]. This week we have:
> The program takes more than a few minutes and quickly spikes up to more than 3 GB of memory! What's gone wrong? Well, we used the strict version of foldl (indicated by the trailing tick '); BUT it's only strict up to "Weak Head Normal Form" (WHNF), which means it'll be strict in the structure of the tuple accumulator, but not the actual values!
> 1957 - John Backus and IBM create FORTRAN. There's nothing funny about IBM or FORTRAN. It is a syntax error to write FORTRAN while not wearing a blue tie.
I believe that in the cited article the author was simply pointing out that it is easy to learn the syntax and basic semantics of the language.
It is extremely hard to write high-performance Haskell, just as it is hard to write high-performance C. Additionally, I think it is foolish to claim that "Haskell is no harder to learn than any other language". There are certainly misconceptions that need to be debunked (like the idea that you need to know category theory to learn Haskell), but it is quite difficult to grasp the evaluation semantics of Haskell, and pretending it is easy and natural will simply drive more users away from the language.
Huge Haskell fanboy here. True high performance? Fair enough, that's hard everywhere. But I think it is unlikely a novice C programmer would accidentally balloon their memory like this, while the simplest Haskell program a novice might write has abysmal performance.
I think it is easier to write clear, succinct code in Haskell and easier to write decently fast code in C. Neither is impossible in either.
The previous article spoke about learning enough Haskell to write useful programs. It didn’t speak about learning enough Haskell to do performance optimization. I don’t think there exists a language where it’s enough to learn the surface language in order to write optimized code.
Also, the previous article recommends that newcomers use the "Strict" language extension, which would avoid the exploding memory usage of the second version ("simple-fold-wc").
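The failure mode under discussion, in miniature (my own toy example): foldl' forces the accumulator only to WHNF, i.e. to the pair constructor, so the numbers inside keep growing as thunks unless you bang them yourself, or enable the Strict extension, which adds the bangs for you:

```haskell
{-# LANGUAGE BangPatterns #-}
import Data.List (foldl')

-- foldl' forces the accumulator to WHNF on every step. For a pair,
-- WHNF is just the (,) constructor, so (s + x) and (n + 1) accumulate
-- as unevaluated thunks: a space leak on large inputs.
leakySumCount :: [Double] -> (Double, Double)
leakySumCount = foldl' (\(s, n) x -> (s + x, n + 1)) (0, 0)

-- Banging the components (or compiling with {-# LANGUAGE Strict #-})
-- keeps the accumulator fully evaluated at every step.
strictSumCount :: [Double] -> (Double, Double)
strictSumCount = foldl' step (0, 0)
  where
    step (!s, !n) x = (s + x, n + 1)
```

Both produce the same answer; they differ only in how much memory the intermediate state occupies.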
You're implicitly assuming that there are only two classes of problems: ones where performance doesn't matter at all, no matter how awful, and ones where one is willing to expend unlimited amounts of wizard-level optimization effort to achieve maximum performance.
On the contrary, in practice it is important that the obvious code a programmer would write should be reasonably efficient (let's say, within an order of magnitude in time and space of the optimal single-core solution). For this very simple problem, the article makes it clear that this is not true in Haskell. In contrast, I believe it would be true in Rust and even Python (without having written the programs, but using BufReader::lines() and String::split() in Rust and the lines iterator and split() in Python).
It's also a problem for Haskell that the first thing you do for performance is to disable one of the key features that distinguishes Haskell from other languages.
Rust is generally fast by default. My favourite example is the collections API, where chains of map/filter/reduce operations are inlined and fused so that the collection is traversed only once.
With Haskell these things are possible, but not efficient by default (the default Haskell library uses linked lists instead of vectors, for example).
"I don’t think there exists a language where it’s enough to learn the surface language in order to write optimized code."
Good news, there is indeed such a language: Fortran. Modern Fortran syntax is similar to Python (but without indentation restrictions) and the compiler generates wicked fast machine code.
I don't think the problem here is necessarily with the language.
Because Haskell is not just a language - it's a whole culture and by learning it you have to adopt that culture. That's the degree of mental shift you have to go through to really understand it.
Ultimately you don't become a Haskell programmer, but a Haskell person.
Exactly. Do people complain about skiing being “hard” compared to walking? Or cycling for that matter? Nope. (Well kids will complain about falling off their bike once or twice)
Is skiing even hard? Crawling -> Walking is arguably a lot harder.
Duff's device is a low-level optimization that can even hurt modern processors due to branch prediction. It's a niche trick from the olden days.
It is very strange at first sight that a[index] is the same as index[a], but you will never see the second form used in practice. When you are learning, you can write your code using normal array access syntax and solve problems. Then you can learn to use pointers and dereference them. And then one day you might realize that x[y] is *(x+y) and thus y[x], and that it looks funny. But that will be a long time after writing `wc`.
The alignment doesn't matter. `index` is not an offset in bytes, it's just an index, a number. The position in memory of the variable also doesn't matter as it is not used, only the value.
`index[arr]` means `*(index + arr)`.
Which is exactly the same thing as arr[index], no matter how you look at it. The type of arr is taken into account for pointer arithmetic.
Whatever works with `arr[index]` will also work with `index[arr]`.
You can also use a constant, for example `2[arr]`.
yes, I realized this after asking the question :). I must admit I never thought of it this way.
Thinking more about this, this might fail if the compiler uses post-incrementing loads/stores which update the pointer addresses too. Or will it not use them if I am using index[arr]?
Let's say the load/store updates the pointer too. So after the first execution, ptr will be index+arr in both cases, but during the next iteration the former will be ptr[index] while the latter will be ptr[arr].
Unless you try hard to provoke one, there is no unaligned access. Adding a pointer and an integer in C always displaces the pointer by elements, not by bytes. So if p is a pointer to some type T and is aligned, then p + 1 will be aligned as well.
You need to cast to another pointer type, do arithmetic on that, then cast back to produce an unaligned access.
No, it is strange, because C decides to dereference according to the datatype of "a" in both cases, and not according to how the user syntactically declares what is to be understood as array and index.
In other words, if a is char[]/char* and index is int, C does
*((char*)a + (int)index) and *((int)index + (char*)a)
and not as one could expect
*((char*)a + (int)index) and *((int*)index + (char)a)
I always understood it as some kind of compiler-level macro, where a[b] is expanded into *(a + b) and then normal arithmetic rules take over, in which case there's no strangeness as far as I can tell (i.e. it doesn't matter which of a or b is the pointer and which is the integer because C defines both "int + ptr" and "ptr + int" as doing the same thing).
But frankly it's a bad choice if we're talking about C's "esoteric quirks" because both this and Duff's Device are things that you effectively never encounter in the wild (it's the "Buffalo buffalo Buffalo buffalo buffalo buffalo Buffalo buffalo" of the C language, they're interesting to know about but not of much practical use).
There are much better "features" of C to single out, for instance the two million different meanings for the "static" keyword, the fact that arrays declared as function parameters do not behave like arrays but like pointers (hey, you can put "static" in there too now! But it still doesn't make it work like a proper array), the many surprising undefined behaviors regarding integer overflow, aliasing rules etc...
Adding a pointer and an index yields a pointer no matter which side the pointer is on, i.e. operator+ is commutative. It would be extremely surprising if switching the sides of the sum implicitly converted an index to a pointer.
You could argue that operator[] should not allow a non-pointer on the LHS, and you would probably be right, but as it stands now, it is just syntactic sugar for operator+ plus dereferencing the whole expression.
Understanding pointer dereferencing and its meaning is no more difficult than understanding lazy versus strict evaluation. I would argue that lazy vs. strict is the easier concept to understand.
The reason a lot of people think that haskell's lazy/strict behaviour is quirky is only because they are not familiar with functional programming concepts.
Nobody bats an eye when an article talks about cache lines and how you organize data in a multi-dimensional array (zig-zagging one way is faster than the alternative). But this info is just as difficult to have learnt, 'cept that as a seasoned C programmer, you're likely to have encountered it already.
>understanding pointer dereferencing and its meaning is no different than understanding lazy evaluation and strict evaluation in terms of difficulty. I would argue that lazy and strict is easier a concept to understand.
Not even close. Pointers and dereferencing are something we do all the time in the real world (e.g. a pointer is like the relation between word and thing) and have easy-to-understand examples (the classic storage box which holds a value, with the pointer being the number of the box, etc). Still, tons of programmers can't get their heads around them...
Lazy/strict is far more challenging, and whereas pointers have no mystery after understanding the basic mechanism, telling how a lazy program behaves and spends memory etc is nigh impossible above a certain complexity -- heck, this is a common lament of seasoned Haskell programmers.
>Nobody bats an eye when an article talks about cache lines and how you organize data in a multi-dimensional array (zig-zagging one way is faster than the alternative).
Many bat an eye. Heck, many don't even get pointers, much less cache lines and optimal access/sharing patterns...
> Nobody bats an eye when an article talks about cache lines and how you organize data in a multi-dimensional array
the vast majority of programmers won't have any idea about any of that. Arguably (not by me BTW) they don't need.
edit: also, caches are a necessity of modern (and not-so-modern) computer architectures; every program has to deal with them. Laziness is sort of a self-inflicted weakness of Haskell.
edit2: I can't write any Haskell, but I can read a bit of it (like python, it makes for excellent pseudocode I think). Sometimes I fail to understand how some algo could possibly work, but then 'ah, right, laziness'. I'm sure it comes natural after a while, but it is definitely a barrier.
>Nobody bats an eye when an article talks about cache lines and how you organize data in a multi-dimensional array (zig-zagging one way is faster than the alternative).
I don't follow; this is not really about programming languages and more about optimizing for a given architecture. Surely this problem would also arise if you're trying to optimize Haskell code? Or, for that matter, assembly, Go, Pascal or any language that lets you have low-level control over the memory layout of your data.
Strange, but unsurprising after seeing the examples, IMO. Knowing that the memory spike in Haskell is due to the strict form of foldl (as indicated by the tick) only being strict up to "weak head normal form" seems quite a bit more esoteric to me.
But maybe I am biased due to learning more about how computer implementations fundamentally operate (in which case the only surprising thing about a[b] == b[a] is that the C standard allows it) than FP theory and how that abstraction can be baked down into something that the computer can actually execute efficiently.
And Haskell does this.
This is over-engineering for performance, at a different level. It's the same in other languages when you go deep into performance engineering; it just depends how deep, and whether the algorithm fits a functional or imperative model better.
There are a couple special cases for a line count and a total character count, but it's essentially a couple of nested loops.
I agree with you that going deep into performance engineering brings out a lot of esoteric edge cases, but in this case for the C program it wasn't even necessary to go down that path. Basic concepts in C outran or at least matched something heavily optimized in Haskell.
I'm a big fan of higher-level languages, and use them almost exclusively myself in my day job, but there's something to be said for basic constructs that just run quickly out of the box. It may be harder to optimize the C code, but you're also less likely to need to do so.
Disclaimer: old timer (not sure if you whippersnappers do the same anymore).
The few times that I really needed performance out of a particular C/C++ function, I'd enable the compiler's output to generate ASM, and then see what was being generated, and either rework the code (to keep it readable for the next guy), or optimise directly with ASM, and often an iterative mix of the two. Loop unrolling (as in Duff's device) was a common one for array iteration performance improvements. I don't see a lot of "optimized C" in the wc impl. I'd imagine that wc optimised for use in a GPU would be pretty spectacular, even beyond using a few ordinary cores like the Haskell result did... but as you say, it's pretty unnecessary in this case.
A lot of that boils down to picking an algorithm that defers as much work as possible until you know you need to do it. I suppose that's one of lazy evaluation's claims to fame, but I think the level reached in grep would be difficult to achieve in a more modern functional language.
That's a quirk of the standard library. I don't think you need to know this in order to have practical knowledge of the language.
Today I saw somebody implementing C++'s std::is_integral in C# and the attempt to generate random values of generic type. It wasn't pretty, but stuff like this does not make C# a hard language.
No, laziness has nothing to do with iterating over the file multiple times. If you perform three separate operations on the same file (as the first, naive version does), you will iterate over the file three times regardless of the evaluation strategy.
The high memory usage is related to the lazy evaluation of the arguments to the “go” function. When these arguments are evaluated strictly, iterating over the file three times would increase running time by a factor of three, but the memory usage would still be low (i.e. roughly constant rather than proportional to the file size).
In order to avoid iterating over the file three times, he writes the second version, which does require understanding laziness (in order to avoid high memory usage).
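Concretely, the single-pass shape looks something like this (my sketch, not the article's code):

```haskell
{-# LANGUAGE BangPatterns #-}
import Data.Char (isSpace)

-- One traversal accumulating newlines, words and characters with strict
-- counters, instead of running `lines`, `words` and `length` separately
-- (which would walk the input three times).
wcOnePass :: String -> (Int, Int, Int)
wcOnePass = go 0 0 0 False
  where
    go !l !w !c _ [] = (l, w, c)
    go !l !w !c inWord (x : xs)
      | x == '\n' = go (l + 1) w (c + 1) False xs
      | isSpace x = go l w (c + 1) False xs
      | inWord    = go l w (c + 1) True xs
      | otherwise = go l (w + 1) (c + 1) True xs
```

The bangs on the counters are exactly what keeps this from leaking: without them, each step would just stack up (+1) thunks.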
That has nothing to do with the programming language, but with tools/libraries.
Haskell is one of the simplest programming languages, but judging it by that statement is like judging Java by Spring's AbstractSingletonProxyFactoryBean, or Ruby by all of Rails' monkey patching and magic, etc.
How is it so? Lazy evaluation (and thus the concept of a (weak-head) normal form, and the possibility of space leaks) is pretty fundamental to Haskell. It's also impossible to write generic folds that evaluate further than WHNF (i.e. without adding an NFData constraint, which comes from outside the standard library). And just making everything strict may lead to worse performance or even non-termination in some cases, anyway.
Writing Haskell is almost trivial in practice. You just start with the magic fifty line {-# LANGUAGE ... #-} incantation to fast-forward to 2017, then add 150 libraries that you’ve blessed by trial and error to your cabal file, and then just write down a single type signature whose inhabitant is Refl to your program. In fact if your program is longer than your import list you’re clearly doing Haskell all wrong.
To be fair to the C version: the linked code [1] doesn't really look like "decades of optimized C" to me. More like a straightforward implementation, which likely has not seen many updates for decades.
It could probably gain performance by playing around with input buffer sizes and trying to make use of vectorized instructions (the compiler might try to auto-vectorize the loops, but there is no guarantee).
> We're still orders of magnitude away from wc unfortunately
Just a reminder that “orders of magnitude” means something specific, rather than “quite a bit”, and 3 is not “orders of magnitude” more than 0.3, and neither is 5.48 “orders of magnitude” more than 1.86
I grabbed the source linked from the article, made it run on Linux, compiled it with gcc -O3, and it beats my system's wc by a factor of almost 2x. C is faster than C!
This is on Ubuntu 18.04 on x86_64. If the author's wc also came precompiled (which it seems to be), they cannot make any valid comparison at all. If manually compiling on their machine with maximum optimization would also get them a 2x speedup, their fancy multicore Haskell would still be slower than C... and that's fine.
I find these "beating C with X" attempts and articles funny; they've been popular since the dawn of computing.
Although all of them are trying hard to be a little bit faster (and shorter) than C code, this is usually the price they have to pay:
1. Abysmal compilation time.
2. Huge binaries. Sure, C static binaries aren't small, but much smaller than equivalent Go programs.
3. Complex language. C is still simple language (sans edgy cases). No need to dive into type theory, learn about functors and so on just to write something.
4. Complex implementations (compilers, garbage collectors...). Not something you can hack over the weekend.
5. Complex distribution. C natural environment is Unix/Linux and libc/glibc is always there. Everything else requires additional (usually tons of) libraries or static linking.
6. Language instability. Because of 3), these languages are always evolving. You can read and compile C code from 20-30 years ago, and I expect the same for at least the next 10 years.
Very interesting, and a simple enough use-case that a Haskell novice can get the gist of it. It's very informative of how real Haskell development works; it's a brass-tacks kind of example which you don't necessarily see a lot of in the Haskell world.
Every ”X new thing I made on a weekend is faster than Y old thing” is inevitably a half-implemented version of the old thing. Of course executing fewer instructions is going to be faster. You could also make a benchmark-winning web server that just write()s to a socket, skipping all sanitization and error handling.
Yes, the C is much too readable. Also I wonder how fresh it is - the Macs have been known to run some really old versions of GPL tools due to license issues with newer versions.
For a start, getc() isn't a simple macro accessing the read buffer anymore, but a function which always acquires/releases a lock (even in a single-threaded program). Changing it to getc_unlocked() usually makes a big difference.
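A minimal sketch of that pattern (assuming POSIX stdio, where flockfile()/getc_unlocked() are available; the function name is illustrative): take the stream lock once, then use the unlocked getter in the hot loop.

```c
#include <stdio.h>

/* Count bytes in a stream. getc() locks the FILE on every call; taking
 * the lock once with flockfile() and using getc_unlocked() inside the
 * loop avoids that per-character locking overhead. */
static long count_bytes(FILE *fp)
{
    long n = 0;
    flockfile(fp);                      /* one lock for the whole loop */
    while (getc_unlocked(fp) != EOF)
        n++;
    funlockfile(fp);
    return n;
}
```

The explicit flockfile()/funlockfile() pair keeps the loop correct even if other threads touch the same FILE.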
In general terms, how is it possible to beat really optimized C code? In particular, what property of Haskell or any other functional language would make that possible (in theory or practice).
The C code is not particularly optimized, but more importantly, C being a low-level language makes writing multi-threaded programs much more work. wc, though, cannot make use of multi-threading regardless, because it must be able to work on streams, not just seekable files as the Haskell code here requires.
C is a high level language with capability of low level access: we regularly made fun of C programmers in high school (like my math teacher) because they didn't know how to code in assembler and had to write abstract high level code in C, in order for the compiler to do it for them. The biggest brouhaha was always disassembly of "optimized" machine code generated by the C compiler, full of needless stack twiddling and housekeeping code, wasting hundreds or even thousands of CPU cycles on what amounts to useless abstractions. That's nonsense one finds only in a high level language. But it's always super for a good brouhaha!
We've long since moved on from an era when anything above assembly could be considered high level.
And the performance difference between idiomatic C code compiled with -O3 (to inline functions and minimize the stack twiddling and housekeeping code) and handwritten straightforward assembly is low enough as to be negligible. Not so with, say, Haskell vs C.
Neither is true. Well, the first might be true for you in your mind, but that doesn't change the fact that C is a high level language modelling an abstract system, and providing that model through its standard libraries (which is what makes it possible to write portable code).
And the second with -O3 could not be more wrong, and I can prove it: compile anything with gcc -S -O3 and behold the extra code which a human would never write because it's unnecessary.
Considering I wrote above exactly what to do (gcc -S), I believe you are not making this request in good faith; rather, you seek to test whether I know what I'm writing about. However, had you done what I told you to do, you wouldn't have asked. That's why I think you are actually trolling me.
This is how a human would code the same thing. A human would know, for instance, that (s)he wants to multiply 50 times, passing that in %rax (that would be your "n" in C), something a compiler can never know; that is one of many reasons (and there are many!) why a compiler can never, ever beat a human at coding in assembler:
...potentially, depending on how 80x86 processors work, the Z flag might be set when subq $1, %rax reaches 0 and the cmpl $0, %rax could be optimized away, making the code even shorter (and faster). But I'll leave it in there because this handily beats the compiler in both size and speed, even with the compiler often trying to cheat with nopw instructions to warm up the data cache. Now compare that with the garbage GCC generated. At -O3 no less! What a bunch of garbage!
On a personal note, you forced me to write code for a processor architecture I do not even know (80x86 family of processors in 64-bit mode), at all, so you just assumed my primary system is powered by that intel / AMD garbage of processors, and I don't appreciate that one bit. And you doubted me, which I appreciate even less.
> a human would know, for instance, that (s)he wants to multiply 50 times
A human would turn the for loop that counts up into a do-while loop that counts down? That's doable on the C level, and in that case GCC also generates less ceremony:
But of course if you do this you have changed the implied original spec, and you have reassociated a floating-point computation, computing different results. At that point you might as well pass -ffast-math to GCC, and it will beat your code by 8x due to vectorization.
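To make the loop transformation under discussion concrete, here is a hedged sketch of the two C-level variants (the names and signature are assumptions, not the thread's original code): the counted-down do-while accumulates the sum in reverse order, which is exactly the floating-point reassociation mentioned above.

```c
/* Forward for loop: the "implied original spec". */
float dot_product(const float *a, const float *b, int n)
{
    float sum = 0.0f;
    for (int i = 0; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}

/* Counting down with do-while, as an assembler-minded coder might.
 * Note the sum is accumulated back-to-front: with floating point this
 * can produce a different result, i.e. the spec has changed. */
float dot_product_countdown(const float *a, const float *b, int n)
{
    float sum = 0.0f;
    if (n > 0) {
        do {
            n--;
            sum += a[n] * b[n];
        } while (n > 0);
    }
    return sum;
}
```

For small integer-valued inputs both return the same value; for long arrays of general floats they can legitimately differ.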
> But I'll leave it in there because this handily beats the compiler in both size and speed
Not on my machine. On arrays of 1048576 (2 ^ 20) elements I didn't measure any difference between my first variant, the do-while variant and yours. On arrays of 1024 elements I still didn't measure a difference between the two compiled versions, but yours is about 1-2% slower. On 128 elements do-while looks a bit slower than for, and your variant is more than 20% slower.
Here are some numbers for n = 128, calling each function 10000000 times. dot_product is the original, dot_product_2 uses do-while, dot_product_3 is your code (with an extra move of edx to rax at the start of the function, of course).
(Edit: This seems to be due to your use of subss instead of pxor to zero xmm0 at the beginning. Removing that gets your code back up to only 1-2% slower than the for loop version.)
> even with the compiler often trying to cheat with nopw instructions to warm up the data cache
What?
> Now compare that with the garbage GCC generated. At -O3 no less! What a bunch of garbage!
You still haven't pointed out what exactly you think is garbage. Which line? Which instruction? Or don't you like that it generates some metadata?
> you forced me to write code for a processor architecture I do not even know
I didn't force you to do anything, I specifically asked you to provide an example of your choice. You could have chosen a piece of code where you knew in advance GCC would do badly. You could have chosen an architecture that you know well and/or one where you know that GCC's backend hasn't had as much work as x86-64. You chose not to do this. I didn't force you to do anything.
> And you doubted me
Based on the data it looks like I was right to doubt you.
"A human would turn the for loop that counts up into a do-while loop that counts down?"
Of course an experienced coder would do that; that's the entire point of why a compiler will never be able to generate code as short and as fast as a human would. An experienced coder can see opportunities that a program with hard-coded logic, such as a compiler, is incapable of seeing; if it could, we would have true artificial intelligence, with a neural network as dense as the human brain's or denser.
"That's doable on the C level,"
For some bizarre reason unbeknownst to me, you just cannot seem to get your head around writing a program in assembler from start to finish. An experienced assembler coder looking for performance would not use a high level language like C, because it would be a waste of her or his time, since they already know from experience that they can easily beat the hard-coded program logic of a compiler (and if they did use one, they would choose Fortran over C). Programming in a high-level language only makes sense when the trade-off of losing performance is acceptable in the name of portability. Even then, there are situations where a portable program in a high level language has hand-written, per-processor assembler code (OpenSSL comes to mind). Why do you think that is?
"What?"
Some versions of GCC for intel 80x86 family of processors will generate one or more nopw instructions, the idea being to give the processor time to prime the data cache. It's a cheap gimmick of desperate compiler constructors, more of a gamble really, because cache games heavily depend on a very particular processor model in a processor family.
"At that point you might as well pass -ffast-math to GCC, and it will beat your code by 8x due to vectorization."
I could have vectorized this myself (that was actually the first thought I had) but chose not to, because I didn't want to spend the time researching how to do that on the intel 80x86 family of processors; I could have also unrolled the loop four to eight times for even more of a performance gain, but chose not to because I believe I made my point. Now you are just being purposely obtuse, because it seems you just don't want to accept that code from high level compilers still isn't as fast as a human coder's: you chose a dot product because you likely knew that the compiler would generate pretty fast code, but that one isolated example doesn't really prove anything. If you do more comparisons over decades like I have, you will see how silly it is to argue that compilers generate faster code than coders.
Also, -ffast-math can generate code which will run the fastest only on that particular processor (and might generate instructions which will not run on other processors of the same family), but the far worse danger is that -ffast-math won't generate IEEE-754 compliant results. When one needs very high precision floating point arithmetic, -ffast-math cannot guarantee correct results. For these two reasons, -ffast-math is useless in real life applications. And indeed, wouldn't you know it:
-ffast-math
    Sets the options -fno-math-errno, -funsafe-math-optimizations,
    -ffinite-math-only, -fno-rounding-math, -fno-signaling-nans and
    -fcx-limited-range.

    This option causes the preprocessor macro "__FAST_MATH__" to be
    defined.

    This option is not turned on by any -O option besides -Ofast since
    it can result in incorrect output for programs that depend on an
    exact implementation of IEEE or ISO rules/specifications for math
    functions. It may, however, yield faster code for programs that do
    not require the guarantees of these specifications.
"(Edit: This seems to be due to your use of subss instead of pxor to zero xmm0 at the beginning. Removing that gets your code back up to only 1-2% slower than the for loop version.)"
There are two ways to do it fast: eor (or what the idiotic intel architecture calls "xor"), or the sub instruction. On the Motorola MC680x0 family, both run at the same speed. Up until that time, I had never in my entire life written a single line of assembler code for the intel 80x86 family of processors. What you saw is the first assembler code I ever wrote for that platform; I even had to look up which registers to use for floating point and how to do floating point arithmetic on that family of processors, and my hard target was less than 15 minutes of total time wasted on a "someone is wrong on the Internet" (https://www.xkcd.com/386/) discussion. If I could do that in under 15 minutes for a processor family I literally know nothing about, imagine what a real coder experienced in writing 64-bit assembler code for the intel 80x86 family of processors can do.
"You still haven't pointed out what exactly you think is garbage. Which line? Which instruction? Or don't you like that it generates some metadata?"
I haven't because I was on a mobile telephone and it was just too much of a bother to quote and reply to everything I wanted to address. I had to log into a workstation to reply to this and I resent that too. I don't like all of the above, the code and the metadata the compiler generated, because it's machine generated nonsense which jumps back and forth and is hard to read and comprehend. Assembler code written by a human isn't. Now, since I already resent what I'm doing, I might as well go all the way. Let's pick apart the compiler generated examples in reverse order.
movslq %edx, %rdx
movslq isn't even necessary, but the compiler, having hard-coded logic and being just a program, generated it anyway. A waste of CPU cycles.
.p2align 4,,10
.p2align 3
more assembler directives because a compiler cannot figure out whether such packing is necessary or not. My code didn't need them because I know that I don't.
leal 1(%rdx), %eax
testl %eax, %eax
again, nonsense: two instructions wasted on what could amount to a single subq. This isn't more efficient.
rep ret
Why, when a ret will do? Obviously in my code it worked, but the compiler felt that it had to generate this nonsense, which on top of being nonsense is harder to read and understand, and isn't any faster.
Next example:
testl %edx, %edx
jle .L4
leal -1(%rdx), %eax
testl is a total waste of processor cycles here. Which purpose does it serve? Then more cycles are wasted on jumping to an instruction to clear %xmm0, presumably to be able to utilize the instruction pipeline, but all the jump does is waste cycles.
xorl %eax, %eax
why xorl %eax in a loop when it could be done once? More waste of processor cycles. That code is extremely inefficient, and hopes to trade off readability for banking on a specific feature of intel 80x86, hoping that the instruction cache and pipelining will offset the losses. Idiotic in the extreme considering a human could get the same performance with far less and simpler code!
"Based on the data it looks like I was right to doubt you."
It might look that way to you now, but if you choose to acquire more experience in coding in assembler, you will eventually realize that I'm correct. You just need more experience, and from that will come insight. Run more tests on more algorithms; study the intel / AMD architecture manuals. Try to construct your own compiler, which I think will really make you realize why a program cannot beat a human. Shame only that you have to deal with such a shitty family of processors, but even that insight will be extremely valuable.
> Even then, there are situations where a portable program in a high level language has hand-written, per-processor assembler code (OpenSSL comes to mind). Why do you think that is?
It makes sense at times, in specific contexts. Which is a far cry from your "compile anything with gcc -S -O3 and behold the extra code which a human would never write" (emphasis mine).
> nopw instructions, the idea being to give the processor time to prime the data cache.
I'm still only 95% sure what you mean by "priming the data cache", but if you mean prefetching, (a) nop doesn't prefetch; and (b) there are prefetch instructions for prefetching. And in any case prefetching wouldn't be cheating.
> If you do more comparisons over decades like I have
Compilers have gotten much better over the last few decades. Maybe the things you "know" are "always" true were true at some point in time and aren't true anymore.
> the far worse danger is that -ffast-math won't generate IEEE-754 compliant results
Yes, that's what I said. You took my original function and wrote a version that didn't generate IEEE-754 compliant results. If you're OK with changing the semantics, you should use -ffast-math and let the compiler vectorize.
> There are two ways to do it fast: eor (or what the idiotic intel architecture calls "xor"), or the sub instruction
For whatever it's worth, I think sub would be incorrect if the original bit pattern in the register were a NaN.
> two instructions wasted with what could amount to a single subq. This isn't more efficient.
And yet, somehow, this version of the code runs faster than your version.
> testl is a total waste of processor cycles here. Which purpose does it serve? Then more cycles are wasted on jumping to an instruction to clear %xmm0
That's the code for testing whether the for loop should ever be entered. If n is less than or equal to zero (that's what's being tested by testl/jle), the function returns zero. Which is why xmm0 (the return register) needs to be zeroed in that case.
> why xorl %eax in a loop when it could be done once?
It's not in a loop. It's the "i = 0" setup code before the loop.
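In C terms, the prologue being argued about is just the zero-trip-count guard plus the loop-counter setup. A rough sketch (the function is a stand-in, not the exact code from the thread; the comments map instructions to source approximately):

```c
/* What the compiler's prologue encodes for a dot product. */
float dot_product_guarded(const float *a, const float *b, int n)
{
    if (n <= 0)          /* roughly: testl %edx, %edx ; jle .L4 */
        return 0.0f;     /* the path that only clears %xmm0 and returns */
    float sum = 0.0f;
    /* the "i = 0" before the loop: roughly xorl %eax, %eax -- emitted
     * once, not per iteration */
    for (int i = 0; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}
```

Nothing here is redundant: without the guard, a caller passing n == 0 would get an undefined return value.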
You have made it very clear that you don't feel qualified to write highly optimized x86-64 code. Neither are you qualified to judge the quality of x86-64 code if you can't tell what is inside a loop and what isn't.
Two last points before I drop this thread:
> you will see how silly it was trying to argue that compilers generate faster code than coders
I didn't argue that. I argued that your assertion "compile anything with gcc -S -O3 and behold the extra code which a human would never write" (emphasis mine, again) was incorrect. That doesn't mean that I think that gcc will always, or even sometimes, beat human coders. But it can match them very very often.
> you chose a dot product because you likely knew that the compiler would generate pretty fast code [...] More waste of processor cycles. [...] Idiotic in the extreme
You're contradicting yourself. And you are calling idiots the many GCC developers and Intel/AMD microarchitecture experts who very likely have pored over every single instruction of this very code and decided that this is the way it should be written for maximum performance.
I wrote nopw and you are writing about nop. Are you doing this on purpose? nop doesn't, but on the intel family of processors nopw has a side-effect of prefetching.
"Yes, that's what I said. You took my original function and wrote a version that didn't generate IEEE-754 compliant results."
Turns out, so did the GCC compiler, at least the one I have, so I'd say your point is moot.
Truth of the matter is, you picked a really bad example: to solve it correctly, one would have to implement at least a portion of the algorithms in the GNU multiple precision library ("GMP"). I suspect your picking a floating point example was no accident.
"I think sub would be incorrect if the original bit pattern in the register were a NaN."
Even NaN has to be represented by a bit pattern, so subtracting that bit pattern from the register will yield zero.
"That's the code for testing whether the for loop should ever be entered."
And here we come back to my point: if you were coding this from scratch in assembler, you wouldn't write a generic function, and you'd know that n will never be zero. And the reason why you'd never write a generic function is because they lose you speed and increase code size. But a compiler cannot know that and cannot optimize for such a situation. It's just a dumb program.
"You have made it very clear that you don't feel qualified to write highly optimized x86-64 code. Neither are you qualified to judge the quality of x86-64 code if you can't tell what is inside a loop and what isn't."
I spent 30 seconds looking at assembler code for a processor family I have never coded on. I spent less than 15 minutes writing a piece of optimized assembler code for that family, using GNU as, an assembler I had never written code in. Now you judge me on misinterpreting one clumsily generated compiler instruction. By which logic, considering I was able to do all of this in under 15 minutes, am I not qualified? I'm very pleased with myself; for the time budget, an unknown processor and an unknown assembler, I think I did very well. We will have to disagree, vehemently if you please.
I stand by my assertion that a compiler will never be able to beat a human at generating fast, optimized code, nor will it ever be capable of generating smaller code. In addition I don't hold the GCC developers in high regard, considering how notoriously bad their compilers are when compared to say, intel or Sun Studio ones. Even the Microsoft compilers beat GCC in generating code which runs faster. In fact, pretty much every compiler beats GCC in performance, which means that people working on those GCC compilers aren't good enough. GCC's only undisputed strength is in the vast support of different processors. There, they are #1, but everywhere else they're last. The GCC developers just don't have what it takes to be the best in that business.
"And yet, somehow, this version of the code runs faster than your version."
I don't know that; you ran code which I wrote blindly; I was not even able to reproduce your output with my GCC. That it runs faster is just your assertion. Based on my experience, I have no reason to believe that.
"I hope you have a wonderful day."
As a matter of fact, I am about to go create a SVR4 OS package of GCC 9.2.0 which I patched and managed to bootstrap after a week worth of work on Solaris 10 on sparc, so yes I will have a wonderful day enjoying the fruits of my labors. I wish you a wonderful day as well.
You are always at the mercy of the compiler, unless you use intrinsics or assembly directly (and then it's debatable whether what you are writing is really "C"). C compilers are pretty good, but there are things they cannot do.
For example, ISPC[1] is a low-level language that is very similar to C, but which makes it straightforward for the compiler to use vector instructions very aggressively - far more than any existing C compiler would be able to.
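As a rough illustration of that gap, this is plain C, not ISPC: a C compiler may only auto-vectorize a loop like the one below after proving (or being told via restrict, as here) that the pointers don't alias, whereas an ISPC-style language makes vector execution the default programming model. The function name is illustrative.

```c
/* A loop a C compiler can typically auto-vectorize at -O3, but only
 * because restrict promises that y and x don't overlap. Without that
 * promise it must either prove non-aliasing or emit a runtime check. */
void saxpy(float *restrict y, const float *restrict x, float a, int n)
{
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```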
In theory, C and assembly will give you the fastest possible implementation. However, in practice, we're just not smart enough to write this code without introducing bugs. High level languages allow us to specify what we want the program to do in a concise form and then allow the compiler and runtime to optimize it for us. This works to varying degrees in reality.
A simple example is that pure functions can always be inlined and memoized.
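As a hand-rolled illustration of the memoization point (a sketch in C; in a pure language a compiler or library can do this systematically, because purity guarantees the cached result is always valid):

```c
#include <stdint.h>

/* fib(n) depends only on n -- it is pure -- so its results can be
 * cached and reused. This is a manual version of the memoization a
 * runtime is free to perform for provably pure functions. */
static int64_t cache[64];   /* 0 means "not computed"; fib(n) > 0 for n >= 2 */

int64_t fib(int n)
{
    if (n < 2)
        return n;
    if (cache[n] == 0)      /* compute each value at most once */
        cache[n] = fib(n - 1) + fib(n - 2);
    return cache[n];
}
```

With the cache, fib(50) takes linear work instead of exponential.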
So, if you can write research-level stuff, invent some new category-theoretical structures that warrant a scientific paper, and know the exact language extension to enable from the multitude of exotic optimizations available in GHC, then yes, that's perfectly simple and accessible to everyone!
it is merely a monoid with a special property (which is a logical property of concurrently counting words: "count the number of times a given invariant has changed from the start to the end of a sequence").
He wrote the monoid directly rather than importing the library, and it's a rather unimportant small detail.
He didn't invent a new research topic that nobody has seen before. You would've implicitly done this same thing if you wrote a map/reduce algorithm to count words across different machines.
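For the curious, that word-count monoid can be sketched even in C: each chunk reduces to a small summary that combines associatively, which is what lets chunks be counted in parallel and merged in any grouping, map/reduce style. The names below are illustrative, not from the article.

```c
#include <ctype.h>
#include <stddef.h>

/* Per-chunk summary: word count plus whether the chunk starts/ends
 * mid-word, so adjacent summaries can be merged correctly. */
typedef struct {
    int words;       /* words seen within the chunk                  */
    int left_word;   /* chunk begins inside a word                   */
    int right_word;  /* chunk ends inside a word                     */
    int empty;       /* identity-element flag (zero-length chunk)    */
} Counts;

static const Counts EMPTY = {0, 0, 0, 1};

static Counts count_chunk(const char *s, size_t len)
{
    Counts c = {0, 0, 0, 1};
    for (size_t i = 0; i < len; i++) {
        int in_word = !isspace((unsigned char)s[i]);
        if (c.empty) {
            c = (Counts){in_word, in_word, in_word, 0};
        } else {
            if (in_word && !c.right_word)
                c.words++;           /* a new word starts here */
            c.right_word = in_word;
        }
    }
    return c;
}

/* Associative merge: if a word straddles the seam, the two halves
 * were counted twice, so subtract one. */
static Counts combine(Counts a, Counts b)
{
    if (a.empty) return b;
    if (b.empty) return a;
    Counts c = {a.words + b.words, a.left_word, b.right_word, 0};
    if (a.right_word && b.left_word)
        c.words--;
    return c;
}
```

Because combine is associative with EMPTY as identity, any split of the input into chunks yields the same total.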
As a few have posted, this isn't decades of optimised C vs 80 lines of Haskell, this is basic C vs basic Haskell, and C is orders of magnitude faster.
The entire source for wc is here [0]; it's 900 lines of very readable C, and it does more than count words: it includes options, arguments, usage, licenses. When I strip everything back, it's only about 300 lines of very readable C.
In POSIX, a text file is a file consisting of zero or more lines. A line is a sequence of non-null characters, shorter than a system defined maximum and terminated by a line feed character.
So a file that completely matches the string "one two\nthree" is by POSIX standards not a text file. Of course, wc doesn't necessarily operate on text files so you could interpret the result as a "complete POSIX line count".
Why wouldn't they? It's a very good programming language that's underrepresented in the market. I don't get why people get so upset over language evangelists. Is it that you already know the languages they advertise are good, or that you don't want to know?
It's that, by their own demonstration, the language is nowhere near as good as they say it is. What does "underrepresented" even mean, in this context? It is not used much because it is a PITA to get tolerable performance, where the C code -- and, moreso, the C++ code -- written in the most naive way is near optimal.
I am not going to bother demonstrating the parallel C++ version, but you already know it will be only a little longer than the naive one, and quite a lot faster than the Haskell one.
You're right and Haskell doesn't really deserve more representation. I've built a couple of small web services, a network protocol and a compiler in Haskell, and for those cases it was super suitable, god forbid I'd have to do those things in C++. But definitely, there's a whole bunch of programming languages that would have been as suitable or better for those (except the compiler, Haskell really is the perfect compiler language).
I meant underrepresentation more generally. I appreciated Ruby evangelism back in 2006, when adopting Ruby really improved the lives of web developers. And I appreciate Rust evangelism now, because it improves the lives of zero-overhead developers. At some point those languages could be at their "optimal" representation though. And not all representation is fair: I still think it's insane that people are advocating Rust for web development, but I also think it's insane to use Javascript for backend web development, and here we are with half the industry mired in shit.
Haskell's focus is arguably making correct software, not the fastest implementations. Such intensive focus on correctness is not usual. In that light I do feel it's somewhat under-represented.
Ada/SPARK is underrepresented as well. Correctness? You most definitely got it. Performance? Yep. Easy-to-use language constructs for concurrency? Yep. And so on and so on.
This is interesting and also unsurprising. Haskell is an abstract, high-level language and so doesn't dictate irrelevant machine details as C does. The only reason C is considered efficient at all is due to what was effectively a meme from forty years ago; there have been lifetimes of human effort wasted making C fast, only for it to be trivially beaten by someone using better tools. A proper analogy would be beating the world's fastest runner in a mile-long dash by using a bicycle, or a cart being more efficient than carrying items on one's head; the main difference is that people haven't been deluded into thinking humans are faster or better at carrying cargo than a creation of mankind.