Huge thanks for Neanderthal and Bayadera! There is stuff that is basically impossible to run in Stan that becomes feasible on a laptop with a GPU with Bayadera, and the fact that I can do all this in Clojure without having to migrate to Python or worse R has been a great boon.
Awesome! I didn't expect you to write that blog post so soon. It will be interesting to find out where the differences (beyond the 'f' ordering thing) are that make Neanderthal so fast, since when I originally started the comparison, both basically boiled down to calling MKL through JNI.
For completeness' sake, could you also provide your machine specs and operating system (and, if on Linux, your glibc and kernel versions)?
TL;DR: In ND4J, historically, `mmuli` returns an F-ordered array. If you specify a C-ordered result, you'll have an F->C conversion underneath, and that assign takes time.
ND4J returns the `mmuli` result in F order, and DL4J is aware of that; we design our algorithms around it. It is possible to get the `mmuli` result in C order, but that causes an additional conversion operation call, which is expensive. That's exactly what Dragan did. So on one hand he's right: in the CCC case, Neanderthal is faster, because it doesn't have to do the F->C conversion of the result.
We make `mmuli` always return F because of CUDA: the cuBLAS implementation has no option for C-ordered output.
`mmuli` with all-C-ordered arguments (CCC) always introduces a dup for us. We expect the gemm result to be F-ordered, so if the requested result is not F, each operation allocates an F-ordered tempResult array and does result.assign(tempResult).
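The cost described above can be sketched with NumPy, whose `order` flag mirrors the F/C ordering distinction (this is an illustration of the idea, not ND4J's actual implementation; the names `temp_result`/`result` just follow the description above):

```python
import numpy as np

m, k, n = 4, 3, 5
a = np.asfortranarray(np.arange(m * k, dtype=np.float32).reshape(m, k))
b = np.asfortranarray(np.arange(k * n, dtype=np.float32).reshape(k, n))

# The "native" gemm result: F-ordered, no extra work.
temp_result = np.asfortranarray(a @ b)
assert temp_result.flags['F_CONTIGUOUS']

# The caller asked for a C-ordered result, so the F-ordered temp_result
# must be copied element by element into the C-ordered buffer
# (the result.assign(tempResult) step) - a full O(m*n) pass, not a view.
result = np.empty((m, n), dtype=np.float32, order='C')
result[...] = temp_result
assert result.flags['C_CONTIGUOUS']
```

The values are identical either way; the extra time goes purely into re-walking the buffer in the other memory order.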
Neanderthal handles C order in cuBLAS without problems, though.
Thanks for the explanation. I used the code that the ND4J guys provided, assuming they'd do the right thing. Also, see my response to crockpotveggies: provide the ND4J code you think is optimal, along with ND4J and Neanderthal results on your machine, and I'll be happy to write a follow-up.
That's great, but that's not the essential point for us. With cuBLAS, there is no room for an output-layout argument; transpose is the only option:
cublasStatus_t cublasSgemm(cublasHandle_t handle,
                           cublasOperation_t transa, cublasOperation_t transb,
                           int m, int n, int k,
                           const float *alpha,
                           const float *A, int lda,
                           const float *B, int ldb,
                           const float *beta,
                           float *C, int ldc)
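For what it's worth, the usual way to get a row-major (C-ordered) result out of a column-major-only gemm like this is the transpose identity (A·B)^T = B^T·A^T: a row-major buffer handed to a column-major routine is already seen as the transpose, so calling gemm with the operands swapped leaves A·B in the output buffer when it is read back row-major, with no conversion pass. A NumPy sketch of the idea (my illustration; I can't say whether Neanderthal does exactly this):

```python
import numpy as np

# Row-major (C-ordered) inputs, as a JVM caller would typically hold them.
m, k, n = 2, 3, 4
A = np.arange(m * k, dtype=np.float32).reshape(m, k)
B = np.arange(k * n, dtype=np.float32).reshape(k, n)

# A column-major routine reading A's buffer sees A.T (likewise for B):
# .T is a zero-copy view, i.e. exactly that buffer reinterpretation.
A_as_seen = A.T            # k x m, column-major view of A's buffer
B_as_seen = B.T            # n x k, column-major view of B's buffer

# Ask the "column-major gemm" for B^T . A^T (operands swapped)...
C_cm = B_as_seen @ A_as_seen          # n x m, equals (A.B)^T

# ...and reinterpret the output buffer row-major: that is A.B, no copy.
C = C_cm.T
assert np.array_equal(C, A @ B)
```

No transposition is ever materialized; both "transposes" are just the two routines disagreeing about what the same bytes mean.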
Your hypothesis is that MKL takes the same amount of time in both and that FFI overhead is the problem. A profiler could confirm the guess about MKL and tell you which part of the overhead dominates.
The source is in the text, but conveniently accessible here: https://github.com/uncomplicate/neanderthal/blob/master/exam...