Beating Decades of Optimized C with 80 Lines of Haskell

Sysreq1 · on Oct 16, 2019

Throwing multiple cores at the problem is not what I would call beating decades of optimized C. The author only utilizes multiple cores when he realizes the overhead of Haskell will not allow him to actually win. Author even admits the C program was more efficient. You can multi thread a C program as well, at which point it would retake the title.

zelly · on Oct 16, 2019

It's sad to see this kind of Stockholm syndrome. The author is fighting hard against the Haskell compiler to fit a square peg in a round hole. Programming languages are just tools, not religions. Programming languages are fake abstractions of the machine. When one doesn't work for you, you should have no problem switching to another. Loyalty to literally an arbitrary man-made mental model at the expense of your real life productivity makes zero sense.

pwm · on Oct 16, 2019

Funny how different people read this differently. I read it as a playful exploration of how far can you push your square peg into that round hole. No one uses Haskell as a faster C irl, that would not make much sense, that’s not where the strength of the language lies.

simias · on Oct 16, 2019

There seems to be an unhealthy obsession with beating C in some corner of the Haskell community. It makes even less sense nowadays when some of the most popular languages out there are Javascript and Python, clearly C-grade performance is not required anymore by the vast majority of applications.

I actually wrote a comment (in 2013, but the page I'm referencing doesn't appear to have changed one bit) where I shared my experience looking at haskell.org's introduction that was filled with FUD regarding C and how Haskell was so much better and faster:

https://news.ycombinator.com/item?id=5090808

Maybe if they had actually tried to teach me the language instead of that bad faith "used car salesman" tactic I'd be writing Haskell these days.

tom_mellior · on Oct 16, 2019

> There seems to be an unhealthy obsession with beating C in some corner of the Haskell community.

I think it's about the same for pretty much any language in the "statically typed, compiled" camp. You'll see "faster than C" claims for Rust and Go as well, for example.

simias · on Oct 16, 2019

That's true. I guess what makes Haskell stand out to me is that at this point it's fairly clear that the assertion is mostly wrong (i.e. idiomatic Haskell is not faster than idiomatic C at most algorithmic tasks, and often significantly slower) and instead of doing the reasonable thing and saying "additional safety and expressiveness is more important than raw performance for most applications" (something I'd completely agree with personally) they double-down with these meaningless, unfair benchmarks.

Meanwhile Rust is genuinely competitive with C performance-wise and thanks to generics and things like more aggressive inlining can actually equal or even beat C without requiring esoteric micro-optimizations. Of course the drawback is that writing Rust code is significantly more complicated and more restrictive.

Annatar · on Oct 16, 2019

And the really funny part for us C practitioners is, we know we'll rarely be as fast as machine code generated from Fortran.

Ahri · on Oct 16, 2019

I've immersed myself in the Haskell community for the past couple of years and while I'd agree that a few posts try to race a Haskell implementation against a C version, these are in the extreme minority.

I feel that you have mis-characterised a whole community, very few of whom care to make Haskell faster than C and overwhelmingly share your view that "C-grade performance is not required".

Further, in the posts I've read of people comparing to C, if anything C is held up as the benchmark as a mark of respect - not as something to cheat against, not to try to say that Haskell is better. Indeed Haskell's FFI out to C is really nice, and demonstrates the pragmatic side of this ivory-tower language :)

This post first embraces its clickbait article and then highlights that even though the Haskell version may be faster on multicore systems (at least for ascii - the unicode version may be incorrect wrt. unicode spaces), but that in practical situations without caching this may not provide any benefit. You're acting as if OP submitted his binary to Apple, *BSD and all the Linux distros with a pull request saying his is better! It's a fun post golfing some Haskell performance issues.

Annatar · on Oct 16, 2019

C-like performance is required more than ever: we have to have monster multicore machines with gigabytes of memory just to run any software. That's absolutely ridiculous.

MrBuddyCasino · on Oct 16, 2019

I have noticed this is a very human pattern. Things have their strengths, and they've got their weaknesses. But if your identity is bound too much to your favourite thing, you have to bend reality in order to make the weakness disappear.

Or as Feynman put it:

The first principle is that you must not fool yourself — and you are the easiest person to fool.

psychoslave · on Oct 16, 2019

The article starts directly with a warning that the title is a bit provocative.

The main point here is clearly not to sell the existing tool on the sole consideration of binary execution performances. As the article say, with Haskell you can produce more reliable software, thanks to a strong typing, and use higher abstraction levels. Even with a significant loss in performance, this can be a good trade off (depending on the project type of course).

So the point here is that you can have descent performance with a state of the art "high-level garbage-collected runtime-based language".

erichocean · on Oct 16, 2019

> is a bit provocative

"is wildly incorrect" would be more accurate.

psychoslave · on Oct 16, 2019

That sounds like a rather common strategy for making provocative statements, doesn't it?

bonzini · on Oct 16, 2019

Also, when counting words the C code is actually handling multibyte sequences, which is one of the biggest sources of improvement for the Haskell code (8s to 3.5s).

Having to use a syntax extension explicitly to forgo laziness feels like cheating to me... But then I have never used Haskell so perhaps that's actually normal.

_pvxk · on Oct 16, 2019

If you're talking about BANGPATTERNS, that is completely normal in Haskell.

Pretty much any non-trivial Haskell program (and many trivial ones too) will have !'s in various places to force something to be strict, and to use the ! syntax you do need to add a pragma. Those pragmas are what let Haskell evolve new features without breaking backward compatibility, so Haskell programs tend to have a lot of them. Perhaps one day we'll have a new standard (it's been 19 years since the last one) which will include some of the pragmas by default. In the meanwhile, it's possible to turn the common ones on project-wide and just forget about them.

klodolph · on Oct 16, 2019

> …it's been 19 years since the last one…

Does Haskell 2010 not count?

_pvxk · on Oct 16, 2019

doh! 9 of course, thanks

pcstl · on Oct 16, 2019

Haskell is a lazy-by-default language and the most practical way to disable laziness is that syntax extension.

You can do it without the extension, but it'll turn ugly really quick (lots of uses of the "seq" function everywhere).

In general, the Haskell community is very tolerant of language extensions (people will recommend you enable about a dozen of them in every project), so this is pretty standard.

tome · on Oct 16, 2019

> Having to use a syntax extension explicitly to forgo laziness feels like cheating to me... But then I have never used Haskell so perhaps that's actually normal.

The `seq` function achieves the same end and is in the standard library. There's nothing cheating about it. The bang patterns extension just provides some nicer syntax for it.

ummonk · on Oct 16, 2019

In fairness, throwing multiple cores at it would require a lot more boilerplate in C.

Whereas for single-core the C code would be simpler than the optimized single-core Haskell code presented here.

vkazanov · on Oct 16, 2019

Structured concurrency with OpenMP can be very convenient at times... :-)

gpderetta · on Oct 16, 2019

#pragma omp for

Goes a long way.

ummonk · on Oct 16, 2019

Hah, I take it back then...

InvOfSmallC · on Oct 16, 2019

What is the cost of parallelism of C though?

rootw0rm · on Oct 16, 2019

spinning up threads and context switching, just like the cost for everyone else?

chii · on Oct 16, 2019

The Haskell version is provably going to work without issue when parrellelized (due to monoid laws), but I doubt the concurrency primitives in C is as trivially usable, and I expect no sane person would want to write that version.

cormacrelf · on Oct 16, 2019

There's nothing stopping you from following the monoid laws in C. There is also nothing stopping you from breaking the monoid laws in Haskell.

sasasassy · on Oct 16, 2019

OpenMP IS trivially usable.

keldaris · on Oct 16, 2019

Why? This is not a hard problem and there's nothing wrong with parallelism in C. I'd much rather work on parallelizing the C code than deal with any amount of Haskell. I suspect there are more people who would agree with me than the total amount of people happy to write Haskell.

raverbashing · on Oct 16, 2019

Whatever the cost in C, Haskell is incurring the same cost, unless it is using green threads.

(You could use green threads in C - or something similar, though to be honest counting words in a file is a bit of an annoying problem for parallelism)

pcstl · on Oct 16, 2019

Haskell does use green threads. The thing is, Haskell code is managed by a Haskell runtime, and each runtime manages a single (processor) thread. So the usual parallelism model for Haskell is to spawn a runtime for each processor core and then let each runtime manage its own green threads.

test9753 · on Oct 16, 2019

Beating Decades of Optimized C with 27 Lines of Ocaml:

  type t = { mutable words: int; mutable chars: int; mutable lines: int; mutable in_word: bool ; mutable finished: bool};;

  let () =
    match Core.Sys.argv with
    | [| prog_name |] -> Core.Printf.eprintf "Usage: %s file1 file2 ...\n" prog_name
    | _ -> (
      let args = Core.Array.slice Core.Sys.argv 1 @@ Core.Array.length Core.Sys.argv in
      let buf_size = 65536 in (* 64 KB -> Caml IO buffer size *)
      let buf = Core.Bytes.create buf_size in
      Core.Array.fold args ~init:() ~f:(fun _ file ->
        Core.In_channel.with_file file ~f:(fun in_ch ->
          let c = { words = 0; chars = 0; lines = 0; in_word = false; finished = false } in
          let set_words () = if c.in_word then c.words <- c.words + 1 in
          while not c.finished do
            let len = Core.In_channel.input in_ch ~buf ~pos:0 ~len:buf_size |> Core.Int.to_int in
            if len > 0 then (
              for i = 0 to (len - 1) do
                match (Core.Caml.Bytes.get buf i) with
                | ' ' | '\t' | '\r' -> (c.chars <- c.chars + 1; set_words (); c.in_word <- false)
                | '\n'              -> (c.chars <- c.chars + 1; set_words (); c.in_word <- false; c.lines <- c.lines + 1)
                | _                 -> (c.chars <- c.chars + 1; c.in_word <- true)
              done
            ) else ( c.finished <- true )
          done;
          set_words ();
          Core.Printf.printf "%s -> lines: %d, words: %d, characters: %d\n" file c.lines c.words c.chars)))
  ;;

Testing:

  $ ls -lh test_file.txt 
  -rw-r--r--  1 user  group   508M Oct 16 12:01 test_file.txt

  $ time wc test_file.txt 
  4863460 54621760 532480000 test_file.txt

  real 0m3.368s
  user 0m3.137s
  sys  0m0.195s

  #Ocaml version:
  $ time ./wcl.native test_file.txt 
  test_file.txt -> lines: 4863460, words: 54621760, characters: 532480000

  real 0m2.480s
  user 0m2.345s
  sys  0m0.123s

geocar · on Oct 16, 2019

Oh this is fun! Beating Decades of Optimised C (best of three)

    $ time wc w.txt
     3156098 6312196 380648004 w.txt
    
    real 0m1.358s
    user 0m1.277s
    sys 0m0.066s

... with one line of q (best of three):

    q)\t g:{(sum[1<deltas where x]+sum[not x:max x],0),sum max x 0 1}0x0d0a2009=\:read1 `:w.txt
    783
    q)g
    380648004i
    6312196
    3156098i

783msec: almost twice as fast!

yiyus · on Oct 26, 2019

Won't this give an incorrect number of bytes for several consecutive whitespace characters? Why not using # to count the bytes instead?

tluyben2 · on Oct 31, 2019

in k, but with that same bug';

(b:#a;b++/,/a=/:" ";b+#,/a)

tluyben2 · on Oct 31, 2019

Here is one without the bug;

b:#a;c:,/a=/:" ";(b;b++/c@&~c&=':c;b+#,/a)

Now to see if I can check the performance.

yiyus · on Oct 31, 2019

We discussed this at length in the shakti mailing list [1]. This version by chrispsn is my favorite:

    {+/["\n"=x],+/[<':~x in "\n\r\t "],#x}

I think it's the one that more clearly express the intent. However, this other one by Attila Vrabecz performs better in the current version:

    {(+/x in"\n\n";+/0>':x in" \n\t\r";#x)}@1:"big.txt"

[1] https://groups.google.com/forum/?utm_medium=email&utm_source...

tluyben2 · on Oct 31, 2019

Nice, thanks for that!

gmfawcett · on Oct 18, 2019

That is quite beautiful. It may also have awoken Cthulhu, but that's a separate issue. :)

bonzini · on Oct 16, 2019

The system wc is multibyte friendly, the Ocaml version is not.

vbarrielle · on Oct 16, 2019

While I wouldn't be surpised that OCaml is fast, I'd like to be sure your disk cache was warm in both cases. I'd test it myself, but I don't have OCaml setup on my work computer.

vbarrielle · on Oct 16, 2019

I setup it on my computer, used it to count the lines of 250 files, it is indeed faster than my system's (ubuntu 18.04) wc, but it doesn't compute the total. I don't think this should change the timings much though.

A best of five run of the system's wc takes 0.042s, vs 0.028s for the OCaml version.

gpderetta · on Oct 16, 2019

Nice.

I was going to say that it is unreadable, after spending a bit trying to understand it, (for someone that knows no ocaml) it actually is not. Only things I'm unsure are the ~ symbol and <-. What do they do?

edit: <- is just mutating assignment. I thought that ~ introduced named arguments but doesn't seem right.

pcstl · on Oct 16, 2019

~ does indeed denote named arguments. To call a function named f which has an argument named arg with arg set to the value x, the syntax is:

    f ~arg:x

gpderetta · on Oct 16, 2019

But what about ~buf here:

  let len = Core.In_channel.input in_ch ~buf ~pos:0 ~len:buf_size

test9753 · on Oct 16, 2019

~buf is shorthand for ~buf:buf

gpderetta · on Oct 16, 2019

Oh, nice.

edflsafoiewq · on Oct 16, 2019

A naive C version is also faster than the system wc.

rubystallion · on Oct 22, 2019

Why was system wc implemented in a slower way than a naive C version?

igouy · on Oct 24, 2019

Once upon a time

https://salsa.debian.org/benchmarksgame-team/archive-alioth-...

https://web.archive.org/web/20010616231931/http://www.bagley...

galfarragem · on Oct 16, 2019

Could somebody write this in Clojure? I'm too noob to do it myself. Will it be an order of magnitude slower?

weavie · on Oct 16, 2019

Many languages will have counted a 1gb file before Clojure has even started up.

maxresdefault · on Oct 17, 2019

> Many languages will have counted a 1gb file before Clojure has even started up.

You mean before the JVM has started up. One could compile Clojure with GraalVM to eliminate startup time.

LeonidasXIV · on Oct 17, 2019

Not really, the initialization of the JVM is fast(er), but initializing the Clojure runtime took a lot of time. At least in 1.7, not sure if it got better in the meantime.

I could start and run Java programs way faster than just starting the Clojure REPL.

Noe2097 · on Oct 16, 2019

What a clickbait indeed. The end version is still eating 3 times more memory, but the worst is that it just doesn't work in the general case.

The way I generally use "wc" is inside a somewhat complex command line, with commands preceding it. As in "feeding characters" to it.

There is a reason why "wc" is not multithreaded: it just can't. It must work sequentially, because in the general case the input of "wc" cannot be skipped over.

This is one of the two big assumptions that are made by the author ("wc works only for files, so we can lseek") -- the second, identified by the author, being that the underlying hardware and filesystem must support concurrent access to the same location efficiently.

orlp · on Oct 16, 2019

> There is a reason why "wc" is not multithreaded: it just can't. It must work sequentially, because in the general case the input of "wc" cannot be skipped over.

Eh what? Word counting is embarrassingly parallel, at least for files.

You start up k threads each working on its own chunk of the file, assuming that there is a word boundary at the start of its chunk. Then you sum up the total word count but inspect whether the chunk boundaries actually were word boundaries, and subtract 1 each time this was violated.

foxhill · on Oct 16, 2019

> Eh what? Word counting is embarrassingly parallel, at least for files.

wc does not make the presumption that it is working on a file that can be seeked into (and, more importantly, re-wound). in c++ terms, it behaves as though the source it is reading from is an "input iterator"[1].

if you are piping into wc, then you'll never be able to parallelise it. word counting a file is the edge case here, not the general use.

[1] - https://en.cppreference.com/w/cpp/named_req/InputIterator

vidarh · on Oct 16, 2019

You don't need to seek: Spin up workers, have them wait on a mutex, and read blocks. If the producer is slow, then you won't need to feed multiple workers. If it's fast, it's a matter of finding the tradeoff between block size vs. the processing time per block.

foxhill · on Oct 16, 2019

i expect you're being downvoted because it's likely that the cost of deciding which thread to assign work to is an order-of-magnitude more expensive than accumulating the reqiured counters.

in any case: you're right, it is of course possible. what i should have said was:

> if you are piping into wc, then you'll never see a performance gain from any attempts at parallelisation.

vidarh · on Oct 17, 2019

Well, I'm back at 1 point now. In any case, the amount of work needed to assign to a thread is trivial. For a small number of threads, it at most requires a couple of mutex operations and a couple of load/stores to distribute each block. Given the block size can be set arbitrarily high, the overhead of work assignment per byte can be made arbitrarily low. The biggest issue is that you won't know the input size, and for small input sizes there will be a cutoff point where multiple threads adds cost. But for small input sizes, performance won't be an issue anyway.

EDIT: Actually, the simplest way of assigning work is probably just to have each thread try to acquire a mutex, call read(), and then release the mutex and do their work.

Noe2097 · on Oct 16, 2019

> work counting is embarrassingly parallel, at least for files.

True; and "wc" works not just on files. "wc" also works on character devices, on streams of data that cannot be randomly accessed.

heinrich5991 · on Oct 16, 2019

I think your parent meant that you can't do that if you get your data from a pipe. You can't `lseek` on a pipe.

taneq · on Oct 16, 2019

The sentence before your quote explains why that's not always an option.

(Of course wc could switch to a multithreaded algorithm while working on files, but the point did make sense.)

easytiger · on Oct 16, 2019

Pipes.

Used Unix every day for 20 years and I'd say 90% of WC use cases is read from a pipe

zelly · on Oct 16, 2019

It's absurd that something as toyish as wc would ever need multiple cores, even if possible by some hacky means.

krick · on Oct 16, 2019

It isn't absurd, if it achieves better performance on sufficiently large real-life cases. (Cannot remember if I actually needed that though.)

cosarara · on Oct 16, 2019

Last week he had "Learning Haskell is no harder than learning any other programming language"[1]. This week we have:

> The program takes more than a few minutes and quickly spikes up to more than 3 GB of memory! What's gone wrong? Well, we used the strict version of foldl (indicated by the trailing tick '); BUT it's only strict up to "Weak Head Normal Form" (WHNF), which means it'll be strict in the structure of the tuple accumulator, but not the actual values!

As easy as any other programming language!

[1] https://news.ycombinator.com/item?id=21170547

skocznymroczny · on Oct 16, 2019

A monad is just a monoid in the category of endofunctors, what's the problem?

Jestar342 · on Oct 16, 2019

I wish I knew enough to appreciate this joke.

NullPrefix · on Oct 16, 2019

http://james-iry.blogspot.com/2009/05/brief-incomplete-and-m...

KineticLensman · on Oct 16, 2019

> 1957 - John Backus and IBM create FORTRAN. There's nothing funny about IBM or FORTRAN. It is a syntax error to write FORTRAN while not wearing a blue tie.

Not the same joke, but really rather awesome.

pcstl · on Oct 16, 2019

You really don't need to know more than "it's technically true, but also completely useless for anything practical".

non-entity · on Oct 16, 2019

I'm still a bit starstruck by mildly comical terms like zygohistomorphic prepromorphism

gpderetta · on Oct 16, 2019

/me googles the phrase

Oh, you didn't make those up!

dhagz · on Oct 18, 2019

Welcome to functional programming.

lgas · on Oct 16, 2019

If it helps, a category is just a monad in the bicategory of spans.

erichocean · on Oct 16, 2019

This will never get old...

endgame · on Oct 16, 2019

Only if you choose the right tensor product for your monoidal category of endofunctors.

But seriously, nobody says this in earnest and posting it doesn't help the discourse.

dgellow · on Oct 16, 2019

I believe that was the joke.

abhir00p · on Oct 16, 2019

I believe in the cited article the author was simply pointing out to the fact that it is easy to learn the syntax and basic semantics of the language.

It is extremely hard to write high performance Haskell, same as it is hard to write high performance C. And additionally I think it is foolish to claim that "Haskell is no harder to learn than any languages". Definitely there are certain misconceptions (like you need to know category theory to learn Haskell) which needs to be debunked, but it is quite difficult to grasp the evaluation semantics of Haskell and pretending that it is easy and natural will simply shove more users away from the language.

philzook · on Oct 16, 2019

Huge Haskell fanboy here. True high performance, fair enough that's hard everywhere. I think it is unlikely a novice C programmer would accidentally balloon their memory like this. The simplest Haskell program that a novice might write has abysmal performance.

I think it is easier to write clear, succinct code in Haskell and easier to write decently fast code in C. Neither is impossible in either.

runeks · on Oct 16, 2019

The previous article spoke about learning enough Haskell to write useful programs. It didn’t speak about learning enough Haskell to do performance optimization. I don’t think there exists a language where it’s enough to learn the surface language in order to write optimized code.

Also, the previous article recommends newcomers to use the “Strict” language extension, which would avoid the exploding memory usage of the second version (“simple-fold-wc”).

roca · on Oct 16, 2019

You're implicitly assuming that there are only two classes of problems: ones where performance doesn't matter at all, no matter how awful, and ones where one is willing to expend unlimited amounts of wizard-level optimization effort to achieve maximum performance.

On the contrary, in practice, it is important that the obvious code a programmer would write should be reasonably efficient (let's say, within an order of magnitude time and space of the optimal single-core solution). For this (very simple problem), this article makes it clear that is not true in Haskell. In contrast I believe it would be true in Rust and even Python (without having written the programs, but using BufReader::lines() and String::split() in Rust and the lines iterator and split() in Python).

It's also a problem for Haskell that the first thing you do for performance is to disable one of the key features that distinguishes Haskell from other languages.

xiphias2 · on Oct 16, 2019

Rust is generally fast-by-default. My favourite example is the collections API, where you can do mapping/reducing/filtering operations only by inlining multiple operations and going through the collection only once.

With Haskell these things are possible, but not efficient by default (the default Haskell library uses liked lists instead of vectors for example).

Annatar · on Oct 16, 2019

"I don’t think there exists a language where it’s enough to learn the surface language in order to write optimized code."

Good news, there is indeed such a language: Fortran. Modern Fortran syntax is similar to Python (but without indentation restrictions) and the compiler generates wicked fast machine code.

Tade0 · on Oct 16, 2019

I don't think he problem here is necessarily with the language.

Because Haskell is not just a language - it's a whole culture and by learning it you have to adopt that culture. That's the degree of mental shift you have to go through to really understand it.

Ultimately you don't become a Haskell programmer, but a Haskell person.

quickthrower2 · on Oct 16, 2019

Exactly. Do people complain about skiing being “hard” compared to walking? Or cycling for that matter? Nope. (Well kids will complain about falling off their bike once or twice)

Is skiing even hard? Crawling -> Walking is arguably a lot harder.

nuxi · on Oct 16, 2019

"Keep calm and curry on", eh? :)

chii · on Oct 16, 2019

Every language has their esoteric quirks.

Tell me that it isn't weird to write this https://en.wikipedia.org/wiki/Duff%27s_device

Tell me it isn't strange that array access is the same as pointer deref (aka a[index] is same as index[a])

cosarara · on Oct 16, 2019

The Duff's device is a low level optimization that can even hurt modern processors due to branch prediction. It's a niche trick from the olden days.

It is very strange on first sight that a[index] is same as index[a], but you will never see the second one used in practice. When you are learning, you can write your code using normal array access syntax and solve problems. And then, you can learn to use pointers and deref them. And then one day you might realize that x[y] is *(x+y) and thus y[x] and that it looks funny. But that will be a long time after writing `wc`.

sifar · on Oct 16, 2019

How would this work for unaligned accesses - which index would very likely be. A neat way to look at things nevertheless, thanks.

cosarara · on Oct 16, 2019

The alignment doesn't matter. `index` is not an offset in bytes, it's just an index, a number. The position in memory of the variable also doesn't matter as it is not used, only the value.

   `index[arr]` means `*(index + arr)`.

Which is exactly the same thing as arr[index], no matter how you look at it. The type of arr is taken into account for pointer arithmetic.

Whatever works with `arr[index]` will also work with `index[arr]`. You can also use a constant, for example `2[arr]`.

Damn I hate HN formatting.

sifar · on Oct 17, 2019

yes, I realized this after asking the question :).I must admit I never thought it in this way.

Thinking more about this, this might fail if the compiler uses post-incrementing load/stores which update the pointer addresses too. Or will it not use them if I am using index[arr] ?

Lets say the load/store updates the pointer too. So after first execution, ptr will index+arr in both cases, but during the next iteration , the former will be ptr[index] while the latter will be ptr[arr].

tom_mellior · on Oct 16, 2019

Unless you try hard to provoke one, there is no unaligned access. Adding a pointer and an integer in C always displaces the pointer by elements, not by bytes. So if p is a pointer to some type T and is aligned, then p + 1 will be aligned as well.

You need to cast to another pointer type, do arithmetic on that, then cast back to produce an unaligned access.

heavenlyblue · on Oct 16, 2019

Does that even work in C when sizeof x[i] > 1?

cosarara · on Oct 16, 2019

Yes, it does. You can try it yourself with this code example: https://repl.it/repls/PutridUselessComputer

heavenlyblue · on Oct 16, 2019

It still made more sense to me more than foldl vd foldl’.

coldtea · on Oct 16, 2019

>Tell me that it isn't weird to write this https://en.wikipedia.org/wiki/Duff%27s_device*
Duff's device is an abuse of an obscure syntax option nobody expected, and it's also seldom used.
Foldr, on the other hand, is bread and butter in Haskell.

>Tell me it isn't strange that array access is the same as pointer deref (aka a[index] is same as index[a])*

I can tell you, it isn't strange.

klingonopera · on Oct 16, 2019

No, it is strange, because C decides to dereference according to the datatype of "a" in both cases, and not according to how the user syntactically declares what is to be understood as array and index.

In other words, if a is char[]/char* and index is int, C does

  *((char*)a + (int)index) and *((int)index + (char*)a)

and not as one could expect

  *((char*)a + (int)index) and *((int*)index + (char)a)

for a[index] and index[a] respectively.

simias · on Oct 16, 2019

I always understood it as some kind of compiler-level macro, where a[b] is expanded into <star>(a + b) and then normal arithmetic rules take over, in which case there's no strangeness as far as I can tell (i.e. it doesn't matter which of a or b is the pointer and which is the integer because C defines both "int + ptr" and "ptr + int" as doing the same thing).

But frankly it's a bad choice if we're talking about C's "esoteric quirks" because both this and Duff's Device are things that you effectively never encounter in the wild (it's the "Buffalo buffalo Buffalo buffalo buffalo buffalo Buffalo buffalo" of the C language, they're interesting to know about but not of much practical use).

There are much better "features" of C to single out, for instance the two million different meanings for the "static" keyword, the fact that arrays declared as function parameters do not behave like arrays but like pointers (hey, you can put "static" in there too now! But it still doesn't make it work like a proper array), the many surprising undefined behaviors regarding integer overflow, aliasing rules etc...

gpderetta · on Oct 16, 2019

Adding a pointer and an index yield a pointer no matter which side the pointer is, i.e operator+ is commutative. It would be extremely surprising if switching sides of the sum would implicitly convert an index to a pointer.

You could argue that operator[] should not allow a non pointer on the LHS, and you would probably right, but as it stands now, it is just syntactic sugar for operator+ and just dereferencing the while expression.

chii · on Oct 16, 2019

understanding pointer dereferencing and its meaning is no different than understanding lazy evaluation and strict evaluation in terms of difficulty. I would argue that lazy and strict is easier a concept to understand.

The reason a lot of people think that haskell's lazy/strict behaviour is quirky is only because they are not familiar with functional programming concepts.

Nobody bats an eye when an article talks about cache lines and how you organize data in a multi-dimensional array (zig-zagging one way is faster than the alternative). But this info is just as difficult to have learnt, 'cept that as a seasoned C programmer, you would've likely to have encountered this already.

coldtea · on Oct 16, 2019

>understanding pointer dereferencing and its meaning is no different than understanding lazy evaluation and strict evaluation in terms of difficulty. I would argue that lazy and strict is easier a concept to understand.

Not even close. Pointer and dereference are something we do all the time in the real world (e.g. a pointer is like the semantic between word and thing) and can have easy to understand examples (the classic storage box with holds a value, and the pointer being the number of the box, etc). Still tons of programmers can't get their head around them...

Lazy/strict is far more challenging, and whereas pointers have no mystery after understanding the basic mechanism, telling how a lazy program behaves and spends memory etc is nigh impossible above a certain complexity -- heck, this is a common lament of seasoned Haskell programmers.

>Nobody bats an eye when an article talks about cache lines and how you organize data in a multi-dimensional array (zig-zagging one way is faster than the alternative).

Many bat an eye. Heck, many not even get pointers, much less cache lines, and optimal access/sharing patterns...

gpderetta · on Oct 16, 2019

> Nobody bats an eye when an article talks about cache lines and how you organize data in a multi-dimensional array

the vast majority of programmers won't have any idea about any of that. Arguably (not by me BTW) they don't need.

edit: also, caches are a necessity of modern (and not so modern) computer architectures, every program has to deal with them. Laziness is sort of a self inflicted weakness of Haskell.

edit2: I can't write any Haskell, but I can read a bit of it (like python, it makes for excellent pseudocode I think). Sometimes I fail to understand how some algo could possibly work, but then 'ah, right, laziness'. I'm sure it comes natural after a while, but it is definitely a barrier.

simias · on Oct 16, 2019

>Nobody bats an eye when an article talks about cache lines and how you organize data in a multi-dimensional array (zig-zagging one way is faster than the alternative).

I don't follow, this is not really about programing languages and more about optimizing for a given architecture. Surely this problem would also arise if you're trying to optimize Haskell code? Or, for that matter, assembly, Go, Pascal or any language that lets you have low-level control on memory layout of your data.

enlyth · on Oct 16, 2019

> counting words > esoteric quirks

It shouldn't be this easy to shoot yourself in the foot when doing simple tasks

boomlinde · on Oct 16, 2019

Strange, but unsurprising after seeing the examples, IMO. Knowing that the memory spike in Haskell is due to the strict form of foldl (as indicated by the tick) only being strict up to "weak head normal form" seems quite a bit more esoteric to me.

But maybe I am biased due to learning more about how computer implementations fundamentally operate (in which case the only surprising thing about a[b] == b[a] is that the C standard allows it) than FP theory and how that abstraction can be baked down into something that the computer can actually execute efficiently.

mschaef · on Oct 16, 2019

> Every language has their esoteric quirks.

But most let you avoid them in favor of concepts manageable by an average middle schooler.

UserIsUnused · on Oct 16, 2019

And haskell does this. This is way over engineering for performance. Different level. It's the same in other language when you go deep into performance engineering, it just depends how deep considering if the algorithm fits better a functional or imperative model.

mschaef · on Oct 16, 2019

> It's the same in other language when you go deep into performance engineering

Take a look at the optimized code for the C version of wc:

https://opensource.apple.com/source/text_cmds/text_cmds-68/w...

There are a couple special cases for a line count and a total character count, but it's essentially a couple of nested loops.

I agree with you that going deep into performance engineering brings out a lot of esoteric edge cases, but in this case for the C program it wasn't even necessary to go down that path. Basic concepts in C outran or at least matched something heavily optimized in Haskell.

I'm a big fan of higher level languages, and use them almost exclusively myself in my day job, but there's something to be said for basic constructs that just run quickly out of the box. It may be harder to optimize the C ode, but you're also less likely to need to do so.

cmroanirgo · on Oct 16, 2019

Disclaimer: old timer (not sure if you whippersnappers do the same anymore).

The few times that I really needed performance out of a particular C/C++ function, I'd enable the compiler's output to generate ASM, and then see what was being generated, and either rework the code (to keep it readable for the next guy), or optimise directly with ASM, and often an iterative mix of the two. Loop unrolling (as in Duff's device) was a common one for array iteration performance improvements. I don't see a lot of "optimized C" in the wc impl. I'd imagine that wc optimised for use in a GPU would be pretty spectacular, even beyond using a few ordinary cores like the Haskell result did... but as you say, it's pretty unnecessary in this case.

mschaef · on Oct 16, 2019

> Disclaimer: old timer (not sure if you whippersnappers do the same anymore).

Heh... I don't know that I qualify as an 'old timer', but it's been a while since 'whippersnapper' would've applied to me. :-)

> The few times that I really needed performance out of a particular C/C++ function

For me, at least, performance has almost never been a problem, particularly when writing in C.

> I don't see a lot of "optimized C" in the wc impl. I'd imagine that wc optimised for use in a GPU would be pretty spectacular,

Whenever I think of optimized C code, I immediately think of the work done to make Gnu Grep fast.

https://lists.freebsd.org/pipermail/freebsd-current/2010-Aug...

A lot of that boils down to picking an algorithm that lets them defer as much work as possible until you know you need to do it. I suppose that's one of lazy evaluation's claims to fame, but I think the level reached in Grep would be difficult to achieve in a more modern functional lannguage.

cmroanirgo · on Oct 16, 2019

I was writing rasterisers, image manipulators and various matrix/vec math operators for 2d and 3d. Every clock cycle counts in some situations.

roca · on Oct 16, 2019

Normal programmers never write those things, and one practically never sees them in real life, so they're not relevant.

quickthrower2 · on Oct 16, 2019

WHNF is no harder to understand than JS promises IMO.

jacobush · on Oct 16, 2019

But no one pretends Javascript is not batshit insane. Right?

1ris · on Oct 16, 2019

That's a quirk of the standard library. I don't think you need to know this in order to have practical knowlege of the language.

Today I saw somebody implementing C++'s std::is_integral in C# and the attempt to generate random values of generic type. It wasn't pretty, but stuff like this does not make C# a hard language.

tirumaraiselvan · on Oct 16, 2019

You don't have to get into the intricacies of laziness when you are learning the language IMHO.

otabdeveloper4 · on Oct 16, 2019

Yes, if you want your shells scripts to consume 3 gigabytes of memory.

cosarara · on Oct 16, 2019

Unless you are trying to write `wc` without iterating the file 3 times, it seems.

runeks · on Oct 16, 2019

How is that related to laziness?

cosarara · on Oct 16, 2019

The article explains that.

runeks · on Oct 16, 2019

No, laziness has nothing to do with iterating over the file multiple times. If you perform three separate operations on the same file (as the first, naive version does), you will iterate over the file three times regardless of the evaluation strategy.

The high memory usage is related to the lazy evaluation of the arguments to the “go” function. When these arguments are evaluated strictly, iterating over the file three times would increase running time by a factor of three, but the memory usage would still be low (ie. roughly proportional to the file size).

cosarara · on Oct 16, 2019

What I quoted is talking about the 2nd version of the code, that tries to count without iterating three times and then encounters a laziness problem.

The first version iterates three times and the reason it takes long has nothing to do with lazyness, we agree on that.

runeks · on Oct 16, 2019

I see your point now. My mistake.

In order to avoid iterating over the file three times, he writes the second version, which does require understanding laziness (in order to avoid high memory usage).

I stand corrected.

UserIsUnused · on Oct 16, 2019

that has nothing to do with programing language, but with tools/libraries.

Haskell is one of the simplest programming languages, but judging it by that statement is judging java for spring2 AbstractSingletonProxyFactory bean, ruby for all of rails monkey patching and magic, etc.

antisemiotic · on Oct 16, 2019

How is it so? Lazy evaluation (and thus the concept of a (weak-head) normal form, and the possibilty of space leaks) is pretty fundamental to Haskell. It's also impossible to write generic folds (i.e. without adding an NFData constraint, which comes from outside the standard library) that evaluate further than WHNF. And just making all things may lead to worse performance or even non-termination in some cases, anyway.

Cieplak · on Oct 16, 2019

Writing Haskell is almost trivial in practice. You just start with the magic fifty line {-# LANGUAGE ... #-} incantation to fast-forward to 2017, then add 150 libraries that you’ve blessed by trial and error to your cabal file, and then just write down a single type signature whose inhabitant is Refl to your program. In fact if your program is longer than your import list you’re clearly doing Haskell all wrong.

Stephen Diehl — Reflecting on Haskell in 2017

Matthias247 · on Oct 16, 2019

To be fair for the C version: The linked code [1] doesn't really look like "decades of optimized C" to me. More like a straightforward implementation of this, which had been likely had not seen many updates for decades.

It could probably gain performance by playing around with input buffer sizes and trying to make use of vectorized instructions (the compiler might try to auto-vectorize the lops, but there is no guarantee).

[1] https://opensource.apple.com/source/text_cmds/text_cmds-68/w...

abainbridge · on Oct 16, 2019

Exactly this. I see this as "Simplest possible C code beats carefully hand optimized Haskell".

saagarjha · on Oct 16, 2019

> the compiler might try to auto-vectorize the lops

It does on my computer! The wc binary on my machine has movaps instructions.

peteretep · on Oct 16, 2019

> We're still orders of magnitude away from wc unfortunately

Just a reminder that “orders of magnitude” means something specific, rather than “quite a bit”, and 3 is not “orders of magnitude” more than 0.3, and neither is 5.48 “orders of magnitude” more than 1.86

sclangdon · on Oct 16, 2019

Actually, 3 is two orders of magnitude (or just 1 depending on whether you count 1 as an order of magnitude or not) more than 0.3

0.3 = 3 * 10^-1, which is -1 order of magnitude

3 = 0.3 * 10^1, which is 1 order of magnitude

You are right that 5.48 is not an order of magnitude more than 1.86, however. They are the same order of magnitude.

murderfs · on Oct 16, 2019

> (or just 1 depending on whether you count 1 as an order of magnitude or not

Of course you don't count 1: otherwise 3 would be 1 order of magnitude greater than 3, which is obviously wrong.

peteretep · on Oct 16, 2019

Mate, 3 is a single order of magnitude more than 0.3, not two.

tom_mellior · on Oct 16, 2019

I grabbed the source linked from the article, made it run on Linux, compiled it with gcc -O3, and it beats my system's wc by a factor of almost 2x. C is faster than C!

System wc, fastest of 5 runs:

      59180000  213180000 1407437834 wc_input
    11.53user 0.19system 0:11.74elapsed 99%CPU (0avgtext+0avgdata 2428maxresident)k
    0inputs+0outputs (0major+102minor)pagefaults 0swaps

Manually compiled wc, fastest of 5 runs:

     59180000 213180000 1407437834 wc_input
    6.08user 0.20system 0:06.29elapsed 99%CPU (0avgtext+0avgdata 2180maxresident)k
    0inputs+0outputs (0major+92minor)pagefaults 0swaps

This is on Ubuntu 18.04 on x86_64. If the author's wc also came precompiled (which it seems to be), they cannot make any valid comparison at all. If manually compiling on their machine with maximum optimization would also get them a 2x speedup, their fancy multicore Haskell would still be slower than C... and that's fine.

z92 · on Oct 16, 2019

Title should be "Beating Decades of Optimized Single Threaded* C with 80 Lines of Multi Threaded Haskell"

That's not what I expected to read.

dig1 · on Oct 16, 2019

I find funny this "beating C with X" attempts and articles, popular since the dawn of computing.

Although all of them are trying hard to be a little bit faster (and shorter) than C code, this is usually the price they have to pay:

1. Abysmal compilation time.

2. Huge binaries. Sure, C static binaries aren't small, but much smaller than equivalent Go programs.

3. Complex language. C is still simple language (sans edgy cases). No need to dive into type theory, learn about functors and so on just to write something.

4. Complex implementations (compilers, garbage collectors...). Not something you can hack over the weekend.

5. Complex distribution. C natural environment is Unix/Linux and libc/glibc is always there. Everything else requires additional (usually tons of) libraries or static linking.

6. Language instability. Because of 3), these languages are always evolving. You can read and compile C code 20-30 years ago and I'm expecting the same for the next, at least, 10 years.

_bxg1 · on Oct 16, 2019

Very interesting, and a simple enough use-case that a Haskell novice can get the gist of it. It's very informative of how real Haskell development works; it's a brass-tacks kind of example which you don't necessarily see a lot of in the Haskell world.

zelly · on Oct 16, 2019

Every ”X new thing I made on a weekend is faster than Y old thing” is inevitably a half-implemented version of the old thing. Of course doing less instructions is going to be faster. You could also make a benchmark-winning web server that just write()s to a socket, skipping all sanitization and error handling.

djmips · on Oct 16, 2019

Is this code that they are beating actually 'hand-optimized C'? Because the C code at a glance I didn't get that impression.

yxhuvud · on Oct 16, 2019

Yes, the C is much too readable. Also I wonder how fresh it is - the Macs have been known to run some really old versions of GPL tools due to license issues with newer versions.

saagarjha · on Oct 16, 2019

2008, and I'll have you know that it's

  Copyright (c) 1980, 1987, 1991, 1993
      The Regents of the University of California.  All rights reserved.

jdc · on Oct 16, 2019

Where would your hand-optimized C differ?

rini17 · on Oct 16, 2019

For a start, getc() isn't a simple macro accessing read buffer anymore, but a function which always acquires/releases a lock (even in single-threaded program). Changing it into getc_unlocked() usually makes big difference.

http://man7.org/linux/man-pages/man3/unlocked_stdio.3.html

bonzini · on Oct 16, 2019

The linked code uses read(), not getc().

Smithalicious · on Oct 16, 2019

> Basically we need to count the number of times a given invariant has changed from the start to the end of a sequence.

I can say with confidence that the invariant will change exactly 0 times. No clue what the author is trying to say here.

zmix · on Oct 16, 2019

Hasn't Pandoc[1] been written in Haskell? The only real slowdown I experience with it is startup time.

    > If you need to convert files from one markup format into another, pandoc is your swiss-army knife.

So, Pandoc does a lot of text processing.

[1] https://pandoc.org/

UweSchmidt · on Oct 16, 2019

In general terms, how is it possible to beat really optimized C code? In particular, what property of Haskell or any other functional language would make that possible (in theory or practice).

ummonk · on Oct 16, 2019

The C code is not particularly optimized, but more importantly, C being a low level language makes writing multi-threaded programs much more work. wc though cannot make use of multi-threading regardless because it must have the ability to work on streams, not just seekable files, as the haskell code here does.

Annatar · on Oct 16, 2019

C is a high level language with capability of low level access: we regularly made fun of C programmers in high school (like my math teacher) because they didn't know how to code in assembler and had to write abstract high level code in C, in order for the compiler to do it for them. The biggest brouhaha was always disassembly of "optimized" machine code generated by the C compiler, full of needless stack twiddling and housekeeping code, wasting hundreds or even thousands of CPU cycles on what amounts to useless abstractions. That's nonsense one finds only in a high level language. But it's always super for a good brouhaha!

ummonk · on Oct 16, 2019

We've long since moved on from an era when anything above assembly could be considered high level.

And the performance difference between idiomatic C code compiled with -O3 (to inline functions and minimize the stack twiddling and housekeeping code) and handwritten straightforward assembly is low enough as to be negligible. Not so with, say, Haskell vs C.

Annatar · on Oct 16, 2019

Neither is true. Well, the first might be true for you in your mind, but doesn't change the fact that C is a high level language modelling an abstract system, and providing that model through its standard libraries (which is what makes it possible to write portable code).

And the second with -O3 could not be more wrong, and I can prove it: compile anything with gcc -S -O3 and behold the extra code which a human would never write because it's unnecessary.

tom_mellior · on Oct 16, 2019

> I can prove it: compile anything with gcc -S -O3 and behold the extra code which a human would never write because it's unnecessary.

Since you claim that that's true for "anything", care to show us an example consisting of a 5-10 line C function and your assembly version?

Annatar · on Oct 18, 2019

Considering I wrote above exactly what to do (gcc -S), I believe you are not making this request in good faith; rather, you seek to test whether I know what I'm writing about. However, had you done what I told you to do, you wouldn't have asked. That's why I think you are actually trolling me.

tom_mellior · on Oct 18, 2019

Input code:

    float dot_product(float *a, float *b, int n) {
        float result = 0.0f;
        for (int i = 0; i < n; i++) {
            result += a[i] * b[i];
        }
        return result;
    }

Output from gcc -O3 -S:

        .file   "foo.c"
        .text
        .p2align 4,,15
        .globl  dot_product
        .type   dot_product, @function
    dot_product:
    .LFB0:
        .cfi_startproc
        testl   %edx, %edx
        jle .L4
        leal    -1(%rdx), %eax
        pxor    %xmm0, %xmm0
        leaq    4(,%rax,4), %rdx
        xorl    %eax, %eax
        .p2align 4,,10
        .p2align 3
    .L3:
        movss   (%rdi,%rax), %xmm1
        mulss   (%rsi,%rax), %xmm1
        addq    $4, %rax
        cmpq    %rax, %rdx
        addss   %xmm1, %xmm0
        jne .L3
        rep ret
        .p2align 4,,10
        .p2align 3
    .L4:
        pxor    %xmm0, %xmm0
        ret
        .cfi_endproc
    .LFE0:
        .size   dot_product, .-dot_product
        .ident  "GCC: 7.4.0"
        .section    .note.GNU-stack,"",@progbits

Where's the unnecessary code? What does your hand-optimized version look like? Is your hand-optimized version faster than this?

Annatar · on Oct 19, 2019

This is how a human would code the same (a human would know, for instance, that (s)he wants to multiply 50 times, passing that in %rax (that would be your "n" in C), something a compiler can never know, which is one of many reasons (and there are many!) why a compiler can never, ever beat a human at coding in assembler:

  DotProduct: subss %xmm0, %xmm0
  .Loop:      movss (%rdi, %rax, 4), %xmm1
              mulss (%rdi, %rax, 4), %xmm1
              addss %xmm1, %xmm0
              subq $1, %rax
              cmpq $0, %rax
              jnz .Loop
              ret

...potentially, depending on how 80x86 processors work, the Z flag might be set when subq $1, %rax reaches 0 and the cmpl $0, %rax could be optimized away, making the code even shorter (and faster). But I'll leave it in there because this handily beats the compiler in both size and speed, even with the compiler often trying to cheat with nopw instructions to warm up the data cache. Now compare that with the garbage GCC generated. At -O3 no less! What a bunch of garbage!

On a personal note, you forced me to write code for a processor architecture I do not even know (80x86 family of processors in 64-bit mode), at all, so you just assumed my primary system is powered by that intel / AMD garbage of processors, and I don't appreciate that one bit. And you doubted me, which I appreciate even less.

tom_mellior · on Oct 19, 2019

> a human would know, for instance, that (s)he wants to multiply 50 times

A human would turn the for loop that counts up into a do-while loop that counts down? That's doable on the C level, and in that case GCC also generates less ceremony:

    dot_product_2:
    .LFB1:
        .cfi_startproc
        pxor    %xmm0, %xmm0
        movslq  %edx, %rdx
        .p2align 4,,10
        .p2align 3
    .L8:
        movss   (%rdi,%rdx,4), %xmm1
        mulss   (%rsi,%rdx,4), %xmm1
        subq    $1, %rdx
        leal    1(%rdx), %eax
        testl   %eax, %eax
        addss   %xmm1, %xmm0
        jg  .L8
        rep ret

But of course if you do this you have changed the implied original spec, and you have reassociated a floating-point computation, computing different results. At that point you might as well pass -ffast-math to GCC, and it will beat your code by 8x due to vectorization.

> But I'll leave it in there because this handily beats the compiler in both size and speed

Not on my machine. On arrays of 1048576 (2 ^ 20) elements I didn't measure any difference between my first variant, the do-while variant and yours. On arrays of 1024 elements I still didn't measure a difference between the two compiled versions, but yours is about 1-2% slower. On 128 elements do-while looks a bit slower than for, and your variant is more than 20% slower.

Here are some numbers for n = 128, calling each function 10000000 times. dot_product is the original, dot_product_2 uses do-while, dot_product_3 is your code (with an extra move of edx to rax at the start of the function, of course).

    dot_product:   1.036514 sec
    dot_product_2: 1.097186 sec
    dot_product_3: 1.290411 sec

(Edit: This seems to be due to your use of subss instead of pxor to zero xmm0 at the beginning. Removing that gets your code back up to only 1-2% slower than the for loop version.)

> even with the compiler often trying to cheat with nopw instructions to warm up the data cache

What?

> Now compare that with the garbage GCC generated. At -O3 no less! What a bunch of garbage!

You still haven't pointed out what exactly you think is garbage. Which line? Which instruction? Or don't you like that it generates some metadata?

> you forced me to write code for a processor architecture I do not even know

I didn't force you to do anything, I specifically asked you to provide an example of your choice. You could have chosen a piece of code where you knew in advance GCC would do badly. You could have chosen an architecture that you know well and/or one where you know that GCC's backend hasn't had as much work as x86-64. You chose not to do this. I didn't force you to do anything.

> And you doubted me

Based on the data it looks like I was right to doubt you.

Annatar · on Oct 20, 2019

"A human would turn the for loop that counts up into a do-while loop that counts down?"

Of course an experienced coder would do that; that's the entire point of why a compiler will never be able to generate as short and as fast of code as a human would. An experienced coder can see opportunities that a program with a hard-coded logic such as a compiler is incapable of seeing; if it were, we would have true artificial intelligence, with the neural network of a human brain or denser.

"That's doable on the C level,"

For some bizarre reason unbeknown to me, you just cannot seem to get your head around writing a program in assembler from start to finish. An experienced assembler coder looking for performance would not use a high level language like C because it would be a waste of her or his time, since they already know from experience that they can easily beat hard-coded program logic of a compiler (and if they did, they would chose Fortran over C). Programming in a high-level language only makes sense when the trade-off of losing performance is acceptable in the name of portability. Even then, there are situations where a portable program in a high level language has hand-written, per-processor assembler code (OpenSSL comes to mind). Why do you think that is?

"What?"

Some versions of GCC for intel 80x86 family of processors will generate one or more nopw instructions, the idea being to give the processor time to prime the data cache. It's a cheap gimmick of desperate compiler constructors, more of a gamble really, because cache games heavily depend on a very particular processor model in a processor family.

"At that point you might as well pass -ffast-math to GCC, and it will beat your code by 8x due to vectorization."

I could have vectorized this myself (that was actually the first thought I had) but chose not to because I didn't want to spend the time researching how to do that on the intel 80x86 family of processors; I could have also unrolled the loop about four to eight times for even more of a performance gain, but chose not to do so because I believe I made my point. Now you are just being purposely obtuse, because it seems that you just don't want to accept that high level compilers still aren't as fast as a human coder is: you chose a dot product because you likely knew that the compiler would generate pretty fast code, but that one isolated example doesn't really prove anything. If you do more comparisons over decades like I have, you will see how silly it was trying to argue that compilers generate faster code than coders.

Also, -ffast-math can generate code which will run the fastest only on that particular processor (and might generate instructions which will not run on other processors of the same family), but the far worse danger is that -ffast-math won't generate IEEE-754 compliant results. When one needs very high precision floating point arithmetic, -ffast-math cannot guarantee correct results. For these two reasons, -ffast-math is useless in real life applications. And indeed, wouldn't you know it:

     -ffast-math
         Sets        the         options         -fno-math-errno,
         -funsafe-math-optimizations,         -ffinite-math-only,
         -fno-rounding-math,       -fno-signaling-nans        and
         -fcx-limited-range.

         This    option    causes    the    preprocessor    macro
         "__FAST_MATH__" to be defined.

         This option is not turned on by any  -O  option  besides
         -Ofast  since it can result in incorrect output for pro-
         grams that depend on an exact implementation of IEEE  or
         ISO  rules/specifications  for  math  functions. It may,
         however, yield faster code  for  programs  that  do  not
         require the guarantees of these specifications.

"(Edit: This seems to be due to your use of subss instead of pxor to zero xmm0 at the beginning. Removing that gets your code back up to only 1-2% slower than the for loop version.)"

There are two ways to do it fast: eor (or what the idiotic intel architecture calls "xor"), or the sub instruction. On the Motorola MC680x0 family, both run at the same speed. Up until that time, I have never in my entire life written a single line of assembler code for the intel 80x86 family of processors. What you saw is the first assembler code I ever wrote for that platform; I even had to look up which registers to use for floating point and how to do floating point arithmetic on that family of processors, and my hard target was less than 15 minutes of total time wasted on a someone is wrong on the InterNet (https://www.xkcd.com/386/) discussion. If I could do that in under 15 minutes for a processor family I literally know nothing about, imagine what a real coder experienced in writing 64-bit assembler code for the intel 80x86 family of processors can do.

"You still haven't pointed out what exactly you think is garbage. Which line? Which instruction? Or don't you like that it generates some metadata?"

I haven't because I was on a mobile telephone and it was just too much of a bother to quote and reply what I wanted to address. I had to log into a workstation to reply to this and I resent that too. I don't like all of the above, the code and the metadata the compiler generated, because it's machine generated nonsense which jumps back and forth and it is hard to read and comprehend. Assembler code written by a human isn't. Now, since I already resent what I'm doing, I might as well go all the way. Let's pick apart the compiler generated examples in reverse order.

        movslq  %edx, %rdx

movslq isn't even necessary, but the compiler, having hard-coded logic and being just a program generated it anyway. Waste of CPU cycles.

        .p2align 4,,10
        .p2align 3

more assembler directives because a compiler cannot figure out whether such packing is necessary or not. My code didn't need them because I know that I don't.

        leal    1(%rdx), %eax
        testl   %eax, %eax

again, nonsense: two instructions wasted with what could amount to a single subq. This isn't more efficient.

        rep ret

Why, when a ret will do? Obviously in my code it worked, but the compiler felt that it had to generate this nonsense, which on top of being nonsense is harder to read and understand, and isn't any faster.

Next example:

        testl   %edx, %edx
        jle .L4
        leal    -1(%rdx), %eax

testl is a total waste of processor cycles here. Which purpose does it serve? Then more cycles are wasted on jumping to an instruction to clear %xmm0, presumably to be able to utilize the instruction pipeline, but all the jump does is waste cycles.

        xorl    %eax, %eax

why xorl %eax in a loop when it could be done once? More waste of processor cycles. That code is extremely inefficient, and hopes to trade off readability for banking on a specific feature of intel 80x86, hoping that the instruction cache and pipelining will offset the losses. Idiotic in the extreme considering a human could get the same performance with far less and simpler code!

"Based on the data it looks like I was right to doubt you."

It might look that way to you now, but if you choose to acquire more experience in coding in assembler, you will eventually realize that I'm correct. You just need more experience, and from that will come insight. Run more tests on more algorithms; study the intel / AMD architecture manuals. Try to construct your own compiler, which I think will really make you realize why a program cannot beat a human. Shame only that you have to deal with such a shitty family of processors, but even that insight will be extremely valuable.

tom_mellior · on Oct 20, 2019

> Even then, there are situations where a portable program in a high level language has hand-written, per-processor assembler code (OpenSSL comes to mind). Why do you think that is?

It makes sense at times, in specific contexts. Which is a far cry from your "compile anything with gcc -S -O3 and behold the extra code which a human would never write" (emphasis mine).

> nopw instructions, the idea being to give the processor time to prime the data cache.

I'm still only 95% sure what you mean by "priming the data cache", but if you mean prefetching, (a) nop doesn't prefetch; and (b) there are prefetch instructions for prefetching. And in any case prefetching wouldn't be cheating.

> If you do more comparisons over decades like I have

Compilers have gotten much better over the last few decades. Maybe the things you "know" are "always" true were true at some point in time and aren't true anymore.

> the far worse danger is that -ffast-math won't generate IEEE-754 compliant results

Yes, that's what I said. You took my original function and wrote a version that didn't generate IEEE-754 compliant results. If you're OK with changing the semantics, you should use -ffast-math and let the compiler vectorize.

> There are two ways to do it fast: eor (or what the idiotic intel architecture calls "xor"), or the sub instruction

For whatever it's worth, I think sub would be incorrect if the original bit pattern in the register were a NaN.

> two instructions wasted with what could amount to a single subq. This isn't more efficient.

And yet, somehow, this version of the code runs faster than your version.

> testl is a total waste of processor cycles here. Which purpose does it serve? Then more cycles are wasted on jumping to an instruction to clear %xmm0

That's the code for testing whether the for loop should ever be entered. If n is less than or equal to zero (that's what's being tested by testl/jne), the function returns zero. Which is why xmm0 (the return register) needs to be zeroed in that case.

> why xorl %eax in a loop when it could be done once?

It's not in a loop. It's the "i = 0" setup code before the loop.

You have made it very clear that you don't feel qualified to write highly optimized x86-64 code. Neither are you qualified to judge the quality of x86-64 code if you can't tell what is inside a loop and what isn't.

Two last points before I drop this thread:

> you will see how silly it was trying to argue that compilers generate faster code than coders

I didn't argue that. I argued that your assertion "compile anything with gcc -S -O3 and behold the extra code which a human would never write" (emphasis mine, again) was incorrect. That doesn't mean that I think that gcc will always, or even sometimes, beat human coders. But it can match them very very often.

> you chose a dot product because you likely knew that the compiler would generate pretty fast code [...] More waste of processor cycles. [...] Idiotic in the extreme

You're contradicting yourself. And you are calling idiots the many GCC developers and Intel/AMD microarchitecture experts who very likely have pored over every single instruction of this very code and decided that this is the way it should be written for maximum performance.

I hope you have a wonderful day.

Annatar · on Oct 20, 2019

"(a) nop doesn't prefetch;"

I wrote nopsw and you are writing about nop. Are you doing this on purpose? nop doesn't, but only on intel family of processors, nopsw has a side-effect of prefetching.

"Yes, that's what I said. You took my original function and wrote a version that didn't generate IEEE-754 compliant results."

Turns out, so did the GCC compiler, at least the one I have, so I'd say your point is moot.

Truth of the matter is, you picked a really bad example: to solve it correctly, one would have to implement at least a portion of the algorithms in the GNU multiple precision library ("GMP"). I suspect you picking a floating point example was not by accident.

"I think sub would be incorrect if the original bit pattern in the register were a NaN."

Even NaN has to be represented by a bit pattern, so subtracting that bit pattern from the register will yield zero.

"That's the code for testing whether the for loop should ever be entered."

And here we come back to my point: if you were coding this from scratch in assembler, you wouldn't write a generic function, and you'd know that n will never be zero. And the reason why you'd never write a generic function is because they lose you speed and increase code size. But a compiler cannot know that and cannot optimize for such a situation. It's just a dumb program.

"You have made it very clear that you don't feel qualified to write highly optimized x86-64 code. Neither are you qualified to judge the quality of x86-64 code if you can't tell what is inside a loop and what isn't."

I spent 30 seconds looking at assembler code for a processor family I have never coded on. I spent less than 15 minutes writing a piece of optimized assembler code for that family and using GNU as, an assembler I never wrote code in. Now you judge me on mis-interpreting one clumsily generated compiler instruction. By which logic, considering I was able to do all of this in under 15 minutes am I not qualified? I'm very pleased with myself, for the time budget, an unknown processor and unknown assembler I think I did very well. We will have to disagree, vehemently if you please.

I stand by my assertion that a compiler will never be able to beat a human at generating fast, optimized code, nor will it ever be capable of generating smaller code. In addition I don't hold the GCC developers in high regard, considering how notoriously bad their compilers are when compared to say, intel or Sun Studio ones. Even the Microsoft compilers beat GCC in generating code which runs faster. In fact, pretty much every compiler beats GCC in performance, which means that people working on those GCC compilers aren't good enough. GCC's only undisputed strength is in the vast support of different processors. There, they are #1, but everywhere else they're last. The GCC developers just don't have what it takes to be the best in that business.

"And yet, somehow, this version of the code runs faster than your version."

I don't know that; you ran code which I wrote blindly; I was not even able to reproduce your output with my GCC. That it runs faster is just your assertion. Based on my experience, I have no reason to believe that.

"I hope you have a wonderful day."

As a matter of fact, I am about to go create a SVR4 OS package of GCC 9.2.0 which I patched and managed to bootstrap after a week worth of work on Solaris 10 on sparc, so yes I will have a wonderful day enjoying the fruits of my labors. I wish you a wonderful day as well.

Athas · on Oct 16, 2019

You are always at the mercy of the compiler, unless you use intrinsics or assembly directly (and then it's debatable whether what you are writing is really "C"). C compilers are pretty good, but there are things they cannot do.

For example, ISPC[1] is a low-level language that is very similar to C, but which makes it straightforward for the compiler to use vector instructions very aggressively - far more than any existing C compiler would be able to.

[1]: https://ispc.github.io/

rienbdj · on Oct 16, 2019

In theory, C and assembly will give you the fastest possible implementation. However, in practice, we're just not smart enough to write this code without introducing bugs. High level languages allow us to specify what we want the program to do in a concise form and then allow the compiler and runtime to optimize it for us. This works to varying degrees in reality.

A simple example is that pure functions can always be inlined and memoized.

easytiger · on Oct 16, 2019

In this case the answer is: do something entirely different than the program you are comparing it with

hoseja · on Oct 16, 2019

Here, it's apparently FEARLESS CONCURRENCY

atemerev · on Oct 16, 2019

So, if you can write research-level stuff, invent some new category-theoretical structures that warrant a scientific paper, know the exact language extension to enable from the multitude of exotic optimizations available in GHC, than yes, that’s perfectly simple and accessible to everyone!

chii · on Oct 16, 2019

It's an unfair comparison - monads are relatively well known concept, and compiler options and libraries are just as baffling in C as it is in GHC.

In all respects, the only "esotetic" thing is the WHNF (which, if you worked with haskell for any while, is quite understandable).

atemerev · on Oct 16, 2019

Flux monoids? You have to invent the new subclass of traversing monoids to achieve somewhat comparable performance to imperative code?

chii · on Oct 16, 2019

it is merely a monoid with some special property (which, is a logical property of concurrently counting words - "count the number of times a given invariant has changed from the start to the end of a sequence").

he wrote the monoid directly, rather than import the library - and it's a rather unimportant small detail.

He didn't invent a new research topic that nobody has seen before. You would've implicitly done this same thing if you wrote a map/reduce algorithm to count words across different machines.

loosetypes · on Oct 16, 2019

Reminded me of the following paper, Using Coq to Write Fast and Correct Haskell.

https://www.cs.purdue.edu/homes/bendy/Fiat/FiatByteString.ht...

maccard · on Oct 16, 2019

As a few have posted, this isn't decades of optimised C vs 80 lines of Haskell, this is basic C vs basic Haskell, and C is orders of magnitude faster.

The entire source for wc is here [0], and is 900 lines of very readable C, it does more than count words, includes options, arguments, usage, licenses. WHen I strip everything back, it's only about 300 lines of very readable C.

[0] https://github.com/coreutils/coreutils/blob/master/src/wc.c

saagarjha · on Oct 16, 2019

Here is some more up-to-date source, though it still hasn't been updated since 2008: https://opensource.apple.com/source/text_cmds/text_cmds-99/w...

Someone · on Oct 16, 2019

  >>> foldMap countChar "one two\nthree"
  Counts {charCount = 13, wordCount = Flux NotSpace 3
  NotSpace, lineCount = 1}”*

That lineCount = 1 surprised me, but testing on a Mac showed it to be ‘correct’. At least there, you can have a non-empty file that has zero lines.

boomlinde · on Oct 16, 2019

In POSIX, a text file is a file consisting of zero or more lines. A line is a sequence of non-null characters, shorter than a system defined maximum and terminated by a line feed character.

So a file that completely matches the string "one two\nthree" is by POSIX standards not a text file. Of course, wc doesn't necessarily operate on text files so you could interpret the result as a "complete POSIX line count".

axilmar · on Oct 16, 2019

Program running with multiple threads beats program running without multiple threads. News at 11.

sergeyz_prg · on Oct 17, 2019

The author compares a multi-core application performance to a single-core one . Is it fair?

olliej · on Oct 16, 2019

Duplicate of https://news.ycombinator.com/item?id=21261004

jonny383 · on Oct 16, 2019

[flagged]

tinco · on Oct 16, 2019

Why wouldn't they? It's a very good programming language that's underrepresented in the market. I don't get why people get so upset over language evangelists. Is it that you already know the languages they advertise are good, or that you don't want to know?

ncmncm · on Oct 16, 2019

It's that, by their own demonstration, the language is nowhere near as good as they say it is. What does "underrepresented" even mean, in this context? It is not used much because it is a PITA to get tolerable performance, where the C code -- and, moreso, the C++ code -- written in the most naive way is near optimal.

I am not going to bother demonstrating the parallel C++ version, but you already know it will be only a little longer than the naive one, and quite a lot faster than the Haskell one.

tinco · on Oct 16, 2019

You're right and Haskell doesn't really deserve more representation. I've built a couple of small web services, a network protocol and a compiler in Haskell, and for those cases it was super suitable, god forbid I'd have to do those things in C++. But definitely, there's a whole bunch of programming languages that would have been as suitable or better for those (except the compiler, Haskell really is the perfect compiler language).

I meant underrepresentation more generally, I appreciated Ruby evangelism back in 2006, when adopting Ruby really improved the lives of web developers. And I appreciate Rust evangelism now, because it improves the lives of zero overhead developers. At some point those languages could be at their "optimal" representation though. And not all representation is fair, I still thing it's insane that people are advocating Rust for web development, but I also think it's insane to use Javascript for backend web development, and here we are with half the industry mired in shit.

verttii · on Oct 16, 2019

Haskell's focus is arguably making correct software, not the fastest implementations. Such intensive focus on correctness is not usual. In that light I do feel it's somewhat under-represented.

johnisgood · on Oct 17, 2019

Ada/SPARK is underrepresented as well. Correctness? You most definitely got it. Performance? Yep. Easy-to-use language constructs for concurrency? Yep. And so on and so on.

FpUser · on Oct 16, 2019

"Is it that you already know the languages they advertise are good, or that you don't want to know?"

How about do not care option. There are many decent languages to choose from, some general, some targeting particular areas. They're just tools.

juped · on Oct 16, 2019

You're upset that people who like Haskell write articles about Haskell? That people who like things think they're good?

verisimilitudes · on Oct 16, 2019

This is interesting and also unsurprising. Haskell is an abstract and high-level language and so doesn't dictate irrelevant machine details as C does. The only reason C's considered efficient at all is due to what was effectively a meme from forty years ago; there have been lifetimes of human effort wasted making C fast, only to trivially be beaten by someone using better tools. A proper analogy would be beating the world's faste runner in a mile-long dash by using a bicycle or perhaps using a cart being more efficient than carrying items on one's head; the main difference is people haven't been deluded to think that humans are faster or better at carrying cargo than a creation of mankind.