Huge thanks for Neanderthal and Bayadera! There is stuff that is basically impossible to run in Stan that becomes feasible on a laptop with a GPU with Bayadera, and the fact that I can do all this in Clojure without having to migrate to Python or worse R has been a great boon.
Awesome! I didn't expect you to write that blog post so soon. It will be interesting to find out where the differences (beyond the 'f' ordering thing) are that make Neanderthal so fast, since when I originally started the comparison, both basically boiled down to calling MKL through JNI.
For completeness' sake, could you also provide your machine specs and operating system (and, if on Linux, your glibc and kernel versions)?
TL;DR: In ND4J, historically, `mmuli` returns an F-ordered array. If you specify a C-ordered result, you'll have an F->C conversion underneath, and that assign takes time.
ND4J returns the `mmuli` result in F order, and DL4J is aware of that; we design our algorithms around it. It is possible to get the `mmuli` result in C order, but that causes an additional conversion operation call, which is expensive. That's exactly what Dragan did. So on one hand he's right: in the CCC case, Neanderthal is faster, because it doesn't have to do the F->C conversion of the result.
We make `mmuli` always return F because of CUDA: the cuBLAS implementation has no option for C-ordered output.
`mmuli` with all-C-ordered arguments (CCC) always introduces a dup for us. We expect the gemm result to be F-ordered, so if the requested result is not F, each operation allocates an F-ordered tempResult array and does result.assign(tempResult).
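The cost described above can be sketched with NumPy, whose `order` flag mirrors the F/C ordering distinction (this is an illustration of the idea, not ND4J's actual implementation; the names `temp_result`/`result` just follow the description above):

```python
import numpy as np

m, k, n = 4, 3, 5
a = np.asfortranarray(np.arange(m * k, dtype=np.float32).reshape(m, k))
b = np.asfortranarray(np.arange(k * n, dtype=np.float32).reshape(k, n))

# The "native" gemm result: F-ordered, no extra work.
temp_result = np.asfortranarray(a @ b)
assert temp_result.flags['F_CONTIGUOUS']

# The caller asked for a C-ordered result, so the F-ordered temp_result
# must be copied element by element into the C-ordered buffer
# (the result.assign(tempResult) step) - a full O(m*n) pass, not a view.
result = np.empty((m, n), dtype=np.float32, order='C')
result[...] = temp_result
assert result.flags['C_CONTIGUOUS']
```

The values are identical either way; the extra time goes purely into re-walking the buffer in the other memory order.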
Neanderthal handles C order in cuBLAS without problems, though.
Thanks for the explanation. I used the code that the ND4J guys provided, assuming they'd do the right thing. Also, see my response to crockpotveggies: provide the ND4J code you think is optimal, along with ND4J and Neanderthal results on your machine, and I'll be happy to write a follow-up.
That's great, but that's not the essential point for us. With cuBLAS, there is no room for an output-layout argument; transpose is the only option:
cublasStatus_t cublasSgemm(cublasHandle_t handle,
                           cublasOperation_t transa, cublasOperation_t transb,
                           int m, int n, int k,
                           const float *alpha,
                           const float *A, int lda,
                           const float *B, int ldb,
                           const float *beta,
                           float *C, int ldc)
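For what it's worth, the usual way to get a row-major (C-ordered) result out of a column-major-only gemm like this is the transpose identity (A·B)^T = B^T·A^T: a row-major buffer handed to a column-major routine is already seen as the transpose, so calling gemm with the operands swapped leaves A·B in the output buffer when it is read back row-major, with no conversion pass. A NumPy sketch of the idea (my illustration; I can't say whether Neanderthal does exactly this):

```python
import numpy as np

# Row-major (C-ordered) inputs, as a JVM caller would typically hold them.
m, k, n = 2, 3, 4
A = np.arange(m * k, dtype=np.float32).reshape(m, k)
B = np.arange(k * n, dtype=np.float32).reshape(k, n)

# A column-major routine reading A's buffer sees A.T (likewise for B):
# .T is a zero-copy view, i.e. exactly that buffer reinterpretation.
A_as_seen = A.T            # k x m, column-major view of A's buffer
B_as_seen = B.T            # n x k, column-major view of B's buffer

# Ask the "column-major gemm" for B^T . A^T (operands swapped)...
C_cm = B_as_seen @ A_as_seen          # n x m, equals (A.B)^T

# ...and reinterpret the output buffer row-major: that is A.B, no copy.
C = C_cm.T
assert np.array_equal(C, A @ B)
```

No transposition is ever materialized; both "transposes" are just the two routines disagreeing about what the same bytes mean.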
Your hypothesis is that MKL takes the same amount of time in both and that FFI overhead is the problem. A profiler could confirm the guess about MKL and tell you which part of the overhead dominates.
The source is in the text, but conveniently accessible here: https://github.com/uncomplicate/neanderthal/blob/master/exam...