ARMv7 vs. x86-64: Pathfinding benchmark of C++, D, Go, Nim, Ocaml, and more (github.com/logicchains)
121 points by logicchains on Dec 20, 2014 | 115 comments


    (defun get-longest-path (nodes node-id visited)
      (declare (optimize (speed 3) (space 0) (debug 0) (safety 0)
                         (compilation-speed 0)
                         #+lispworks (fixnum-safety 0))
               (type fixnum node-id)
               (type (vector node) nodes)
               (type (vector atom) visited))
      (setf (aref visited node-id) t)
  (let ((max (loop for neighbour of-type route across (node-neighbours (aref nodes node-id))
                       unless (aref visited (route-dest neighbour))
                       maximize (the fixnum
                                     (+ (the fixnum (route-cost neighbour))
                                        (the fixnum (get-longest-path nodes (route-dest neighbour) visited)))))))
        (declare (fixnum max))
        (setf (aref visited node-id) nil)
        max))
The Common Lisp version above improves the runtime from 8.5 to 3.6 seconds on SBCL, and from 30 seconds to 2 seconds on LispWorks 64-bit. Computer: i7 Mac mini.


I've created a version with some explanations about performance issues...

https://gist.github.com/lispm/e9372894519f8e6feae1


Nice, re-running the benchmark with the new code now.

*Edit: and, it's done.


Here is a much faster version:

https://gist.github.com/lispm/6066e1eeadf943910c47

You might want to adapt it...


I'm off to bed now, I'll pull it in the morning.


I'd really like to see a "pull out all the stops" benchmark using highly-optimised Asm for the two architectures, as then it's just a matter of how much you can squeeze out of the CPU itself and not something limited by the thick layers of language abstractions on top of that. That would be a nice theoretical maximum to compare against.

Edit: I tested the C++ version on my 5-year-old i7, with an even older compiler (just had to modify the code to not use C++11 features), and with the max optimisation level, it produces a result of 1465ms - which is pretty damn amazing, considering that this is a 16-year-old compiler generating 32-bit code and the most recent CPU it had knowledge of was the Pentium Pro (P6)! I'm convinced that an Asm version could be <1s though, so there's still plenty of room for improvement.


You have to give Intel/AMD a lot of credit for improving the performance of existing binaries. They've reduced CPI and improved ILP a lot in 16 years, not to mention vastly improved various forms of prediction, caching and speculation. A lot of the performance improvement gleaned from modern compilers comes from better lowering of the C++ language itself, e.g. smarter inlining, LTO, devirtualisation, etc.


So if it's a "Galaxy S3 with 2GB of RAM and a quad-core 1.3GHz processor" then this should be the version with the Exynos SoC, which means the cores inside are A9s. The Intel cores are from the Westmere generation.

I wonder how much of the difference we're seeing between various languages is the quality of the code their compilers generate for various backends and how much is due to the different languages benefiting more from architectural differences between the two chips.

I imagine that languages that generate code with more indirection are going to exercise the prefetcher and branch predictor of the core they're running on much more than languages that generate code with simpler control flow. Both the A9 and the Nehalem cores are out of order, but the Nehalem has a much, much more sophisticated set of facilities for that. I predict that if you were to re-run the benchmarks on an iPhone 5S you'd see much less of a difference between the various ARM times, and if you were to run them on a cheap Android phone with A7 or A53 cores you'd see a much larger difference.


Correct. Not to mention that benchmarking on a mobile phone is tricky. You need to make sure your CPU won't throttle due to heat. Your average mobile CPU can't run at 100% for very long. The more benchmarks you run back to back, the slower you get, so your measurements can easily be completely wrong.
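
One low-tech way to check is to run the same workload back to back and watch the timings drift upward. A minimal sketch in C++ (the workload here is a stand-in, not the actual benchmark):

    #include <chrono>
    #include <cstdio>

    // Stand-in workload; substitute the real benchmark body here.
    volatile unsigned long long sink;

    static void workload() {
        unsigned long long acc = 0;
        for (unsigned long long i = 0; i < 200000000ULL; ++i) acc += i * i;
        sink = acc;
    }

    int main() {
        using clk = std::chrono::steady_clock;
        for (int run = 0; run < 10; ++run) {
            auto t0 = clk::now();
            workload();
            auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(clk::now() - t0).count();
            // Steadily rising times across runs suggest thermal throttling.
            std::printf("run %d: %lld ms\n", run, static_cast<long long>(ms));
        }
    }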


Exactly, the speed of mobiles is limited to short bursts to preserve the battery. And sometimes even worse code gets more CPU if it hits the right patterns.

The timings are also presented confusingly, for my taste.


Based on profiling the C++, this seems to be more a benchmark of function call cost in various languages than anything else - this code makes 42975348 calls to

  int getLongestPath<16>(std::vector<node, std::allocator<node> > const&, int, std::bitset<16>)
Which is fine, but seems a slightly limited platform / language comparison!
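
For context, a minimal C++ sketch of that hot function's shape (reconstructed from the signature above; the route/node layout is an assumption, not the repo's exact code):

    #include <bitset>
    #include <cstddef>
    #include <vector>

    struct route { int dest; int cost; };
    struct node  { std::vector<route> neighbours; };

    // N is the node count, fixed at compile time. visited travels by value,
    // so each of those ~43M calls copies just 16 bits of state.
    template <std::size_t N>
    int getLongestPath(const std::vector<node>& nodes, int nodeID,
                       std::bitset<N> visited) {
        visited[nodeID] = true;
        int max = 0;
        for (const route& r : nodes[nodeID].neighbours) {
            if (!visited[r.dest]) {
                int dist = r.cost + getLongestPath<N>(nodes, r.dest, visited);
                if (dist > max) max = dist;
            }
        }
        return max; // no unmarking needed: this frame owns its own copy
    }

Since the recursion dominates, the per-call overhead (argument passing, stack frame setup) is a large part of what the benchmark ends up measuring.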


Which version of OCaml? The ARMv7 backend was rewritten about 2 years ago, and merged in 4.00.

http://caml.inria.fr/mantis/view.php?id=5433

The new backend is supposed to be considerably faster on floating point code. This code looks integer only, and I don't have relative performance of old/new backend for integer code.

As a wider question: Who cares much about ARMv7? ARMv8 is a completely different beast, requiring a different backend, with much better raw performance (on the same terms as x86-64). That's where languages should be concentrating their current efforts.


Everyone cares about ARMv7 because it is the CPU architecture deployed in over 95% of mobile devices.


OCaml version 4.02.0. I don't have an ARMv8 device to benchmark on, unfortunately.


Except it's really hard to benchmark on ARMv8 since non-Apple hardware is expensive.


Big companies have plenty of access. I work for Red Hat and have one right next to my desk. I also know SUSE and MSFT have the hardware, and I'm sure many more.

I've been working on fixing bugs in the ARMv8 backend in the OCaml compiler - it's now stable, and performs very well (I can't go into exact performance details for contract reasons).

http://caml.inria.fr/mantis/view.php?id=6489

http://caml.inria.fr/mantis/view.php?id=6284

http://caml.inria.fr/mantis/view.php?id=6283

(and more ...)


> hard to benchmark on ARMv8 since non-Apple hardware is expensive.

If Android will do: Cortex-A53 smartphones like the Huawei Ascend Y550 cost 120€ nowadays in Europe (about 4x the price of a Raspberry Pi).


Also the Nexus 9, for quite a bit more money. The problem with these phones and tablets is that they are quite unlike server hardware:

- Far too little memory (2GB vs 16/32GB+)

- Slow flash vs SATA disks

- The "zoo" of u-boot/proprietary kernel crap, instead of UEFI, ACPI and standard upstream kernels

- Nonsense like locked bootloaders (Nexus 9 disables HYP mode in the bootloader!!!)


> Functional code in Haskell/OCaml can be faster than imperative code using iorefs.

IORefs involve locking. They are bad performance-wise. An algorithm like this should be done either fully functionally, without any mutation at all, or in ST.


GHC, at least, uses the same implementation for IORef and STRef: http://hackage.haskell.org/package/base-4.3.1.0/docs/src/GHC...

Poking through GHC's runtime source, I see a write barrier but no locking -- did I miss something?


Looking deeper, you're right; it seems to use atomic CAS operations. Still, mutable data is bad for the GC and for optimization passes (expressions can't be shared, etc.).


Do the writes to the unboxed mutable bool vector also involve locking?


I would assume not, but AFAIK GHC has a separate (badly optimized) path for garbage collection of mutable data.


This is a comparison of specific implementations of the two processor architectures. The benchmarking is still interesting work, but it isn't a straight comparison of language performance across architectures. It might be more enlightening to know the number of opcodes each run executed.


No love for Nim in this thread? Nim is as fast as C++ with Clang. http://www.reddit.com/r/programming/comments/2pvf68/armv7_vs...


I think the Nim code is also pretty readable: https://github.com/def-/LPATHBench/blob/master/nim.nim

And Nim doesn't look bad in the other statistics (on an x86_64 Intel Core2Quad Q9300):

    Lang    Time [ms]  Memory [KB]  Compile Time [ms]  Compressed Code [B]
    Nim          1400         1460                893                  486
    C++          1478         2717                774                  728
    D            1518         2388               1614                  669
    Rust         1623         2632               6735                  934
    Java         1874        24428                812                  778
    OCaml        2384         4496                125                  782
    Go           3116         1664                596                  618
    Haskell      3329         5268               3002                 1091
    LuaJit       3857         2368                  -                  519
    Lisp         8219        15876               1043                 1007
    Racket       8503       130284              24793                  741
Code size measured with gzip -9 < nim.nim | wc -c. Unused functions were removed in the Haskell version.


I'm really impressed you implemented this in so many languages!

I was curious how fast Javascript would run this, so I ported the Lua example over; it ran in 2200msec on my Mac laptop. I'm not sure it's a great comparison benchmark, however, because there are some obvious optimizations you can make. I added a simple cache to avoid recalculating travel costs for leaves and it now runs in 150msec.

  var fs = require('fs');

  var nodes = [], visited = [];
  var visitCache = {};

  function readPlaces() {
    var lines = fs.readFileSync('agraph','utf8').split('\n');
    var numnodes = lines[0]|0;
    for(var i=0;i<numnodes;i++) {
      nodes[i] = [];
      visited[i] = false;
    }
    for(i=1;i<lines.length;i++) {
      var line = lines[i].split(' ');
      nodes[line[0]|0].push({dst: line[1]|0, cost: line[2]|0});
    }
  }

  function getLongestPath(nodes, nodeid, visited) {
    visited[nodeid] = true;
    var neighbours = nodes[nodeid];
    var max = 0
    for(var i=0;i<neighbours.length;i++) {
      if (!visited[neighbours[i].dst]) {
        var dist = neighbours[i].cost + getLongestPath(nodes, neighbours[i].dst, visited);
        if (dist > max) {
          max = dist
        }
      }
    }
    visited[nodeid] = false;
    return max;
  }

  function getLongestPathCached(nodes, nodeid, visited, depth) {
    var idx;
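    // In this variant visited is an integer bitmask rather than a bool array,
    // so it doubles as a compact cache key: (visited * 32) + nodeid packs the
    // mask and the current node into one number.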
    visited |= 1 << nodeid;
    if (depth == nodes.length - 7) {
      idx = (visited * 32) + nodeid;
      if (visitCache[idx]) {
        return visitCache[idx];
      }
    }
    var neighbours = nodes[nodeid];
    var max = 0
    for(var i=0;i<neighbours.length;i++) {
      if (!(visited & (1 << neighbours[i].dst))) {
        var dist = neighbours[i].cost + getLongestPathCached(nodes, neighbours[i].dst, visited, depth+1);
        if (dist > max) {
          max = dist
        }
      }
    }
    if (idx) {
      visitCache[idx] = max;
    }
    if (typeof visited != 'number') {
      visited[nodeid] = false;
    }
    return max;
  }

  readPlaces();
  var start = Date.now();
  var length;
  if (nodes.length < 32) {
    length = getLongestPathCached(nodes, 0, 0, 0);
  } else {
    length = getLongestPath(nodes, 0, visited);
  }
  console.log(length+' LANGUAGE Javascript: '+(Date.now() - start));


Nice, your cached version is the fastest by far, although it uses a different algorithm. Running with `node js.js` without the cache, it takes around 6 seconds, close to Dart and Haskell.


I just realized I overthought the caching scheme, and removing one check made it 3x faster :)

  function getLongestPathCached(nodes, nodeid, visited, depth) {
    var idx;
    visited |= 1 << nodeid;
    idx = (visited * 32) + nodeid;
    if (visitCache[idx]) {
      return visitCache[idx];
    }
    var neighbours = nodes[nodeid];
    var max = 0
    for(var i=0;i<neighbours.length;i++) {
      if (!(visited & (1 << neighbours[i].dst))) {
        var dist = neighbours[i].cost + getLongestPathCached(nodes, neighbours[i].dst, visited, depth+1);
        if (dist > max) {
          max = dist
        }
      }
    }
    visitCache[idx] = max;
    return max;
  }


Nice! That's almost as fast as the C++ one using caching; your algorithm must be better.


Seeing as how they're likely the most widely distributed Java runtimes on ARM, I would have liked to see Dalvik and ART benchmarks for the Java code.

Key takeaway for me is that statically typed languages that are compiled to native code are still 2-3x faster than the fastest JITs. On both platforms.


Interestingly, now that someone has submitted a change to the Java implementation such that it doesn't use classes (unboxing is simulated by using arrays for the data instead), it runs much closer to native speed.


Can you explain a little further?


I think it's easier to read the code:

https://github.com/logicchains/LPATHBench/blob/master/jv.jav...

Instead of a vector of node classes, there's a

    static final int[][] nodes;
Which is used in a similar manner to a vector of node classes, but due to containing primitives (ints) is unboxed.


When "the longest path problem" is reduced to indexed-access to an integer-sequence, does it become a duplicate of fannkuch?

http://benchmarksgame.alioth.debian.org/u64q/performance.php...


And you don't copy the bool array around, as was done in the previous version.


The Racket and Lisp comments are a bit odd. To the best of my knowledge Typed Racket does support gradual typing, and comparing its compatibility to Scheme is a category error: Racket is not Scheme anymore; that's why it's called Racket now and not PLT Scheme.


>To the best of my knowledge Typed Racket does support gradual typing

If you take a Racket program and change the language to Typed Racket, you'll get flooded with compiler errors. Or at least that's been my experience.

>Racket is not Scheme anymore

I'll update the post to note this, thanks.


> If you take a Racket program and change the language to Typed Racket, you'll get flooded with compiler errors. Or at least that's been my experience.

That's because when using `#lang typed/racket` you literally change the language you're writing and that includes the syntax of some very fundamental forms (like define or struct).

You can mix typed and untyped code freely as long as they are separated by module boundaries. Something like this:

    #lang racket/base

    (module with-types typed/racket
      (provide a)
      (: a (-> String Integer))
      (define (a x) 0))

    (module without-types racket
      (require (submod ".." with-types))
      (provide a))
But it's definitely not as convenient as Erlang's Dialyzer, TypeScript, Dylan or similar success/gradual/occurrence/soft-typing solutions.


You can use typed/racket with optional types, it’s kinda easy.


How? The docs say (http://docs.racket-lang.org/ts-guide/more.html#(part._when-a...) that you have to provide type annotations in certain cases.


I just made the code a little more idiomatic: avoided using the box around max, used a named let for the loop, and made it tail recursive.

Removing the box improved performance from 10s to 9s.

  #lang typed/racket

  (struct: route ([dest : Integer] [cost : Integer]) #:transparent)

  (struct: node ([neighbours : (Listof route)]) #:transparent)

  (: str->int (String -> Integer))
  (define (str->int str)
    (define n (string->number str))
    (if n (numerator (inexact->exact (real-part n))) 0))

  (: read-places (-> (Vectorof node)))
  (define (read-places)
    (define lines
      (file->lines "agraph"))
    (define num-lines (str->int (car lines)))
    (define nodes (build-vector num-lines (lambda (n) (node `()))))
    (let loop ([i : Integer 0])
      (define nums (string-split (list-ref (cdr lines) i)))
      (define len (length nums))
      (when (and (> len 2) (> (length lines) (+ i 2)))
          (let ([node-id (str->int (list-ref nums 0))]
                [neighbour (str->int (list-ref nums 1))]
                [cost (str->int (list-ref nums 2))])
            (define new-node (node
                              (append (node-neighbours (vector-ref nodes node-id))
                                      (list (route neighbour cost)))))
            (vector-set! nodes node-id new-node)
            (loop (+ i 1)))))
    nodes)

  (: get-longest-path ((Vectorof node) Integer (Vectorof Boolean) -> Integer))
  (define (get-longest-path nodes node-id visited)
    (vector-set! visited node-id #t)
    (define sum
      (foldr
       (lambda ([neighbour : route] [max : Integer])
         (if (not (vector-ref visited (route-dest neighbour)))
             (let ([dist (+ (route-cost neighbour) (get-longest-path nodes (route-dest neighbour) visited))])
               (if (> dist max)
                   dist
                   max))
             max))
       0
       (node-neighbours (vector-ref nodes node-id))))
    (vector-set! visited node-id #f)
    sum)

  (define nodes (read-places))
  (define visited : (Vectorof Boolean) (build-vector (vector-length nodes) (lambda (n) #f)))
  (define start (current-inexact-milliseconds))
  (define len (get-longest-path nodes 0 visited))
  (define duration (- (current-inexact-milliseconds) start))
  (printf "~a LANGUAGE Racket ~a\n" len (inexact->exact (floor duration)))


Thanks! I updated the code.


I'm not really sure what this data means because amd64 and ARMv7 are ISAs. For instance, you could make a very deep and superscalar ARMv7 chip that blows a typical amd64 out of the water if you sacrifice size and power. Is the intent simply to show that some language backends are not optimized? Otherwise, without something like "These two chips and clock-for-clock or watt-for-watt it looks like this" it seems meaningless.


It's not meant to compare AMD64 and ARMv7 architectures, it's meant to compare the performance of various language compilers/runtimes on two common AMD64 and ARMv7 chips.


Not sure why the downvote as that was my request for clarification.

Point being, if you want to benchmark language backends, it doesn't make a lot of sense to cross chips without making a multi-dimensional benchmark (i.e. adding in clock or power vectors).

After rereading closely, the article's intent is to say "50% slowdown for backend X on x86, 70% slowdown for backend X on ARM." The subtle difference is that you keep comparisons of slowdown within the same family, and only draw an analogy to the other arch by slowdown percentage.

It's still susceptible to ISA implementation (for instance an in order Atom might fare a lot worse for typical backends), but mildly interesting.


I expect that the differential performance between the various languages has just as much to do with architectural features such as prefetchers as it does with the ISA itself.


The LuaJIT results don't surprise me. I've always been impressed with LuaJIT. The OpenJDK results also don't surprise me. If you work in a Java shop, you learn very quickly to throw out OpenJDK in favor of Sun/Oracle Java. OpenJDK is indeed a "steaming pile of crap".

However, I would have liked to see Julia and Javascript benchmarks in those results. I've heard great things about Julia, and knowing just how incredibly far we've brought the Javascript VMs over the past decade, it wouldn't surprise me to see Javascript fairly high on the list.


>OpenJDK is indeed a "steaming pile of crap"

Just to clarify, it performed as well as the Oracle JVM on x86. Its poor performance on ARM is just due to its lack of JIT compilation.

>However, I would have liked to see Julia and Javascript benchmarks in those results.

I'm happy to include Javascript or Julia implementations if someone supplies them. I wasn't comfortable enough with Julia to write one myself.


>> OpenJDK is indeed a "steaming pile of crap"

> Just to clarify, it performed as well as the Oracle JVM on x86. Its poor performance on ARM is just due to its lack of JIT compilation.

Understood. Admittedly, I'm a system administrator first, developer second. However, my experience has shown that the Sun/Oracle JVM usually outperforms OpenJDK. Even in development, on the Java teams I've had to support, OpenJDK is never preferred or wanted.

Further, RHEL ships the OpenJDK compiled with the GNU compiler for Java (GCJ), as well as GNU's classpath. From what I've seen supporting these Java teams, and not being a Java developer, it's unstable and slow.

So, in this specific instance, OpenJDK may have performed as well as the Sun/Oracle JVM on x86-64, but in the broader scope, it doesn't seem to hold up. Just my experiences though. Take it with a grain of salt.


> OpenJDK is never preferred or wanted

I would argue it's actually risky.

Almost every Java developer uses the Oracle version because (a) it is what is recommended for OSX which is a popular development platform and (b) the bundled tooling is much better than OpenJDK. Hence you shouldn't mix/match JVMs just in case you hit implementation differences.


>However, my experience has shown that Sun/Oracle JVM usually out performs OpenJDK. Even in development

Your experience seems to be filtered through your misconceptions. It is the same code base.


I wonder what magic sauce Oracle adds to their JDK, then? We can measure a very, very significant difference in performance on both x86 and ARM.


It's not my misconception; it's that of the developers I support. I am told to install the Oracle JVM, not OpenJDK. When I push back, I'm told they don't want it.

I am not a Java developer. I am a system administrator.


It seems to me, though, that the Java numbers aren't representative. As far as I know, JVM benchmarks should allow HotSpot etc. to optimise the functions by calling them ~1k times before the actual benchmark happens. That doesn't seem to be the case in this code.


The same function is being called over and over, millions of times, so that should hopefully give HotSpot a chance to kick in. I didn't want to add an artificial warm-up, as I don't think that's representative of real code.



This is some quality work. The Java results are particularly striking! It would be interesting to see some more languages that have more than one implementation compared.

What is the difference between FSharp and F#?


FSharp instead of F# is just a way of getting around issues with using the # character in the shell scripts that run the benchmark.


But he says "F#, Haskell, Rust and Dart send their apologies. F# didn't have an Arch Linux package for ARM, and when I built it myself the compiler and runtime segfaulted upon use."


Wow, you're right, no idea where that FSharp entry on the ARM results is coming from.


Something not measured that is just as important as speed (if not more so) is how long it took to create a working program in each language.


Agreed. We must also consider what a pain in the ass it was to set up the toolchains and/or runtimes for several of the tested languages.

I'm willing to trade a little performance and do a little extra error checking if it means I can just do:

    GOARCH=arm go build
and then copy the binary over.


> Agreed. We must also consider what a pain in the ass it was to set up the toolchains and/or runtimes for several of the tested languages.

The Arch Linux package repository proved really awesome here: for the vast majority of things (everything on x86) I could just `pacman -S myLang`. The hardest to install was SBCL on ARM, as I needed another Common Lisp to compile it, so had to get ClozureCL and use that.


> [Rust] 0.12 is so much prettier; vec[i] instead of vec.get(i), for instance

That was just a momentary bit of weirdness, vec[i] has since returned.

EDIT: Further down, the author acknowledges that this has been fixed. That's what I get for commenting before scrolling the whole way down!


I wish there was more substance to the Rust section. Most of it is in strike-out and the remaining text has more to say about how you would do things in OCaml than Rust.


It seems kinda unfair to compare 32bit ARMv7 against 64bit x86-64. Wouldn't it be much fairer to compare ARMv7 against the register-starved x86, or to compare AArch64 against x86-64?


It's not comparing the architectures against each other, it's comparing language implementations against each other on those architectures. As x86-64 is the most common x86 architecture nowadays, and ARMv7 the most common ARM architecture (at least on mobile), comparing languages' performance on them should be more relevant, as more developers use those platforms.


Not to mention clock and cache differences.


Obligatory question: what compiler options did you use for C++? A good first try is -O3 -march=native. It really makes a huge difference in the case of C/C++.


It's in the makefile in that repo:

    g++ cpp.cpp -std=c++11 -Wall -O2 -march=native -o cpp
I was originally using -O3 but then someone found that -O2 is actually faster in this case.


When visited was changed from a vector<bool> to a bitset<T>, why is it not passed by reference anymore?


The most common call in the benchmark is to `getLongestPath<16>`. Passing 16 bits by value is cheaper and easier for the compiler to optimize (i.e., no aliasing) than passing by reference. I'd expect graphs with many nodes to behave differently.
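
For contrast, a by-reference version over a shared visited vector has to undo its mark on the way out (a sketch, assuming the same route/node layout as the C++ sketch earlier in the thread):

    #include <vector>

    struct route { int dest; int cost; };
    struct node  { std::vector<route> neighbours; };

    int getLongestPathRef(const std::vector<node>& nodes, int nodeID,
                          std::vector<bool>& visited) {
        visited[nodeID] = true; // mark shared state...
        int max = 0;
        for (const route& r : nodes[nodeID].neighbours) {
            if (!visited[r.dest]) {
                int dist = r.cost + getLongestPathRef(nodes, r.dest, visited);
                if (dist > max) max = dist;
            }
        }
        visited[nodeID] = false; // ...and unmark on the way out
        return max;
    }

The by-value bitset version needs no such cleanup, and the callee can keep the bits in a register across the loop.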


I updated the Rust version and submitted a PR: https://github.com/logicchains/LPATHBench/pull/33


In the Go code, if you want to ignore an err you can write _ in its place.

Another thing, which I'm not sure you can even do anymore, is disabling bounds checking by adding "-gcflags -B" to the compile.


> In the Go code, if you want to ignore an err you can write _ in its place.

Not a good habit to get into though.


That is some horrible Common Lisp code. I have to try this out....



Thanks for the effort! Not sure how fleshed out the other implementations are, but Lisp vs C now shows what I would have expected:

* SBCL can produce impressive x86 code.

* Its ARM branch is pretty new, so being within 30% isn't too bad.


Nope, sorry. I tried to fix this up a little, but this is too convoluted for me to work with.

Q1: Why do you need an adjustable array?

Q2: Why do you need a structure with a single slot (node)?

If you want to benchmark a piece of code, please write a nice version and then optimize it. How can I reason about a benchmark result if the code is not understandable?


Many others were able to send pull requests. What's your issue?


The original code was so badly written that it was hard to understand the algorithm. I tried to rewrite it but it took longer than I had patience for.

The snarky tone comes from the fact that people often post "benchmarks" including languages they can't program in resulting in a misrepresentation.

E.g. Lisp jumped from the bottom to the center (as expected) of the benchmarks after somebody donated a sane implementation.


You should ask for a refund.


I should; he used SBCL ticks (a relative unit of time) as if they were milliseconds.


"The F# was however nowhere near as fast as the OCaml"

I find this a bit odd. Microsoft should be able to build a better compiler. I thought F# was getting a lot of traction.


F# is quite fast (I've written a good deal of F# code) and it has seen a big uptick in adoption over the last couple of years.

I had a look at the F# code and it's not all that efficient (there are a number of intermediate data structures being created that don't need to be). I'll have a go at optimizing the code to see if I can improve the performance.


>I find this a bit odd. Microsoft should be able to build a better compiler.

It was running on Mono, not the Microsoft CLR.


C++ wins again. There's really no competition.


Meh. Only if you value performance above correctness and maintainability.

Besides, C wasn't tested. C would win if it were, I'm pretty sure.


Why do people still assume C is always fastest? You can write C++ and D like it's C and it'll produce identical machine code. Which means that after profiling an idiomatic C++/D program (and assuming for the sake of argument that C is faster, which isn't even close to necessarily true) you could rewrite the hotspots in C style and achieve the same performance.

I don't understand why this myth persists, despite the fact that Fortran has been beating C at several benchmarks for as long as C has existed.


I'm pretty sure you can write correct and maintainable code in any language. Whether its users do is a different question.


If even top-notch teams like Mozilla cannot ship large C/C++ codebases without security holes, it seems legitimate to state that people cannot write correct code in that language.


Compared to equally large codebases with comparable attack surface written in...


Really? You wanna make the case that memory safe apps would expose all sorts of RCEs if they were only larger? How about nearly every website in the world? All that comes to mind is eval() like exploits, like Rails had.


Is Mozilla really a "top-notch team", though? When I look at their track record, it's not what I'd expect from a top-notch team:

* They've alienated many Firefox users thanks to many bad UI changes. They've continued to do this even after the users have strenuously objected to these unwanted changes. Firefox's share of the market has thus dropped from 35% to probably sub-10% these days.

* Many of the remaining Firefox users still point out that Firefox is slower and more bloated than other browsers. Although Mozilla often rejects or ignores these complaints, my years of software development experience have taught me that when many users say there's a problem, there very likely is one, even if we the developers can't reproduce it.

* Firefox for Android hasn't been picked up by many users.

* Firefox isn't an option on iOS. (Although I guess we can't fully blame Mozilla for this.)

* Firefox OS is floundering. The devices available so far have fared very poorly in reviews. Some of these reviews are among the harshest I've ever seen for any software or hardware product.

* Thunderbird is on life support.

* Rust is still pre-1.0, and will be like this for several more months, at the very least.

* Servo depends on Rust, so it being a viable option is still years away.

* There was that whole Eich debacle. It was pathetic, no matter how you look at it.

* Bugzilla is long forgotten these days.

* Despite absolutely massive funding from Google and now Yahoo, Mozilla hasn't managed to put out any other product that people actually want to use.

When I look at that track record, it's just one failure or disaster after another. It's not top-notch at all. So I'm not surprised that they have trouble using C and C++. They seem to be having severe trouble with pretty much everything they do!


How many of your points actually deal with their ability to write fast/safe C++? Most of them deal with more management or UI design and their desire to write in Rust instead.

And considering where firefox was five years they have done a remarkable job at trimming it down, in particular in memory usage.


Almost all of those are just front-end/"marketing" and are orthogonal to the quality of the core C++ code itself (e.g. the JS engine, or the audio decoders).


I have no particular love for Mozilla (no more than any other corporation), but your points all seem facile.

1) Regardless of whether anyone thinks Firefox's UI changes have been for the better, it's not possible to alienate Firefox users via UI changes because Firefox still offers the most customizable UI of any browser you've heard of. What are users going to say? "Damn you Mozilla, you made your browser look just like Chrome! I hate that so much, I'm switching to Chrome!"

2) Many of the remaining Firefox users point out that they have switched back to Firefox because Chrome has become bloated and slow. (Personally I think that everyone making this argument, on all sides, merely fails to appreciate what sort of benefit a totally fresh user profile brings to a browser.) In overall benchmarks of memory usage and browser engine/Javascript engine speed, neither Chrome nor Firefox is better by any significant margin.

3) Firefox for Android has between 50 and 100 million downloads on Google Play, and has a higher user rating than Chrome for Android (4.4 to 4.2).

4) Mozilla has announced earlier this month that they'll be shipping a Firefox for iOS, but given the crippling of third-party browsers on iOS I doubt Firefox will be any less hobbled than Chrome for iOS, and will certainly be worse than Safari.

5) I have no sales stats on Firefox OS, but given that they're still persisting in setting up new carrier partnerships I'd say they're better off than at least the Ubuntu phone. I'll probably never need a Firefox OS phone, but honestly if it weren't for the audacity of Mozilla trying to penetrate the OS market (and hence trying to end their reliance on the willing participation of third-party platforms to host their browser (which isn't so "willing" these days with the advent of locked-down platforms like iOS and WinRT (is that still even a thing?))), then I'd have already written off Firefox as dead in the water.

6) Thunderbird was never a moneymaker nor key to Mozilla's strategy, especially after the meteoric rise of web-based email clients.

7) Rust is the most interesting systems language to emerge in years, and its influence will be felt on every future systems programming language to come (though I am certainly biased here).

8) Servo is Mozilla's other project, aside from Firefox OS, that is so unbelievably audacious that I can't help but cheer them on. I have spoken with its developers and they're all astounded with the performance they're seeing, though they're holding off on releasing concrete numbers until the feature set is comparable with more complete browsers. Having seen the Servo devs in action, I can assure you they are on top of their game.

All this said, your original point was that you don't think that Mozilla has a top-notch team of C++ developers. And here's the thing: all of this is irrelevant to whether or not Mozilla's C++ developers are top-notch. I bet John Carmack's team at Id software was as top-notch a team of C++ developers as will ever be assembled, and yet Rage was still a commercial failure. To know whether or not the team is good, you have to look at their code and you have to look at their process.


Regarding 1, although a single data point does not a trend make, I will say that I switched away from Firefox due to multiple UI changes and removing customization options. I switched to Pale Moon.

In other words, personally, you are incorrect. FF no longer "still offers the most customizable UI of any browser you've heard of". (Simple enough: FF removed options that PM kept. Hence, FF is not more customizable than PM.)


I switched away from Firefox for the same reasons. Even if the UI is as customizable as is claimed, I don't want to waste my time recustomizing it every six weeks, or whenever a new Firefox release comes out.


Pick your team then. How many people are actually shipping memory safe C code?


That's some alien Java code indeed.


Any improvements you can see? The 'inner loop' of the code is:

    int getLongestPath(ArrayList<node> nodes, int nodeID, boolean[] visited){
        visited[nodeID] = true;
        int dist, max=0;
        for(route neighbour: nodes.get(nodeID).neighbours){
            if (!visited[neighbour.dest]){
                dist = neighbour.cost + getLongestPath(nodes, neighbour.dest, visited);
                if (dist > max){
                    max = dist;
                }
            }
        }
        visited[nodeID] = false;
        return max;
    }
The ArrayList of nodes could be changed to an array, but I don't imagine that'd be much faster (I've always heard that ArrayLists are just as fast as arrays, except for primitives).

*Edit: someone found a massive improvement, by replacing the node class with arrays, to simulate unboxing.


I don't think this improvement is really fair, though. Everybody has heard the same thing you did: ArrayLists are OK performance-wise. The optimisation of using arrays and static global variables and removing any use of objects isn't something that leads to readable code in the long run, nor is it the code you regularly see in everyday software.

You should include both your old Java code and the new one. Call your version "enterprise Java" and the other "optimized Java". If anything, that could be useful to people coding in Java who are having performance issues (otherwise they'd have to look for it in the git history, which they'd have no reason to do).


It is idiomatic in highly performance-sensitive Java code though. Hopefully it won't be necessary in Java 9 or 10 when unboxing support is brought in.

I've added a note regarding the change, but I'm off to bed in a moment (AEST timezone). I'll separate the Java versions tomorrow.


Do you have a link for Java 9/10 class unboxing? I've been googling for it and can't find any information on it.


Have you googled something like value classes?


Managed to get a 22% improvement on Core i7 OSX / Oracle JDK8 by replacing the iterator-style for loop with a C-style one. Am I missing something? Because this shouldn't be faster.

  ArrayList<route> neighbours = nodes.get(nodeID).neighbours;
  for(int i=0;i<neighbours.size();i++){
      route neighbour = neighbours.get(i);
Also, adding -Xbatch as a parameter gives another few percentage points.


I refactored it to be more static and final/const. Also used an int[][] for the node data.

Original comment is on proggit. Went from 1600ms to 900ms.

http://pastebin.com/w2BC8fNg

http://www.reddit.com/r/programming/comments/2pvf68/armv7_vs...


I was able to get the Scala version down to within ~100ms of the Java version while still using case classes.


Wow, Dart is impressive: above Lisp and close to F#.


It's not; look again, the benchmark code was bad.


Not that the article does this, but since my current bandwagon is that assessing performance by comparing the source of two programs without consideration of the compiler and target processor is silly, I decided to try out the C++ version with several compilers and options. Renaming the file to 'lpath.cpp' and compiling with 'cc lpath.cpp -std=gnu++11 -Wall -Oxxx -march=native -o lpath-cc-Oxxx', here's what I found on an i7 Haswell running at 3.4 GHz:

  lpath-clang3.5-O2 8981 LANGUAGE C++ 763
  lpath-gcc4.7-O2 8981 LANGUAGE C++ 769
  lpath-gcc4.8-O2 8981 LANGUAGE C++ 746
  lpath-icpc14-O2 8981 LANGUAGE C++ 750
  lpath-icpc15-O2 8981 LANGUAGE C++ 735
  lpath-clang3.5-O3 8981 LANGUAGE C++ 734
  lpath-gcc4.7-O3 8981 LANGUAGE C++ 943
  lpath-gcc4.8-O3 8981 LANGUAGE C++ 946
  lpath-icpc14-O3 8981 LANGUAGE C++ 664
  lpath-icpc15-O3 8981 LANGUAGE C++ 655
The last column is the time reported by the program in milliseconds. What this shows is that the same source compiled with Intel's icpc 14 or 15 at -O3 is about 50% faster than when compiled with g++ 4.7 or 4.8 at -O3, and about 20% faster than g++ at -O2. The point isn't that Intel's compiler is so much better, but that this degree of variation is normal for a benchmark like this. There are times when Clang or GCC will come out ahead by the same margin. The lesson is that you aren't benchmarking source code, you are benchmarking a particular compiler with particular options running on a particular processor.

The article is wonderfully specific about what was used, but one should be very careful extrapolating to different combinations. In addition to the compiler differences, note for example that although I did my tests on a processor running less than 1.5x faster, I got runtimes that were almost 2-3x faster. Most likely, this is because the Haswell processor I tested on is more efficient than the several generation old Westmere that the author used. There's nothing right or wrong about either choice, but the degree of difference is why it's always important to specify.

I glanced briefly at the code with 'perf record -F10000', and my quick conclusion was that the Intel version was running faster because it was making better use of branchless cmovs than the other compilers, and thus had 10,000,000 fewer branch prediction misses. At 20 cycles per miss, this accounts for over half the difference between icpc and the others. The use of the new fast variable shift instructions (shlx) and the once-again-fast bit test (bt) instruction is probably the rest of the difference.
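
The source-level pattern in question is just the running-max update; whether it compiles to a cmov or to a compare-and-branch is the compiler's choice (a sketch, not the benchmark's exact code):

    // A cmov makes the "is this the new max?" update branchless, so there is
    // nothing for the branch predictor to miss; icpc appears to choose it
    // more aggressively here than gcc or clang.
    static inline int runningMax(int max, int dist) {
        return dist > max ? dist : max;
    }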

The difference between g++ -O2 and -O3 seems to be that the -O3 version is doing almost everything off the stack rather than in registers. It's bad enough that this is probably a performance bug rather than intended behavior.


>>The article is wonderfully specific about what was used…<<

I'm missing something -- where does it say which version of JDK was used?


tl;dr

Measurement is highly specific -- the time taken for this benchmark task, by this toy program, with this programming language implementation, with these options, on this computer, with these workloads.

Same toy program, same computer, same workload -- but much slower.

Measurement is not prophecy.



