Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

Author here. I'm open for discussion.

The source is in the text, but conveniently accessible here: https://github.com/uncomplicate/neanderthal/blob/master/exam...



Huge thanks for Neanderthal and Bayadera! There is stuff that is basically impossible to run in Stan that becomes feasible on a laptop with a GPU with Bayadera, and the fact that I can do all this in Clojure without having to migrate to Python or worse R has been a great boon.


Thanks for the thumbs up!


Awesome! I didn't expect you to write that blog post so soon. It will be interesting to find out where the differences (beyond the 'f' ordering thing) are that make Neanderthal so fast, since when I originally started the comparison, both basically boiled down to calling MKL through JNI.

For completeness sake, could you also provide some information about your machine specs and operating system (and if on linux your glibc and kernel version)?


TL/DR In Nd4j, historically, mmuli returns an F-ordered array. If you specify C-ordered result, you'll have F->C conversion underneath, and C -> F assign takes time.

ND4J returns the `mmuli` result in F order, and DL$J is aware of that. We design our algorithms around this. It is possible to get the `mmuli` result in C order, but it'll cause an additional conversion operation call, which is expensive. That's exactly what Dragan did. So on one hand - he's right: in the CCC case, Neanderthal is faster, because he doesn't have to do the F->C conversion of the result.

We make mmuli always return F because of Cuda. The cuBLAS implementation has no option for C ordered output.

`mmuli` CCC always introduces a dup for us. We expect the gemm result to always be F, so if result is not F, each operation allocates tempResult array, F ordered, and does result.assign(tempResult).


Neanderthal handles C order in cuBLAS without problems, though.

Thanks for the explanation. I used the code that ND4J guys provided, assuming that they'd do the right thing. Also read my response to crockpotveggies: provide the ND4J code that you think is optimal, and ND4J and Neanderthal results on your machine, and I'll be happy to write a follow up.


That's great, but that's not the essential point for us. with cuBLAS, there is no room for an output layout argument.

transpose is the only option.

cublasStatus_t cublasSgemm(cublasHandle_t handle, cublasOperation_t transa, cublasOperation_t transb, int m, int n, int k, const float alpha, const float A, int lda, const float B, int ldb, const float beta, float *C, int ldc)


Is there a guide for getting neanderthal running with MKL on Windows? I tried my best but I'm not smart enough to get passed the require statement.


Yes:

https://neanderthal.uncomplicate.org/articles/getting_starte...

And just to be sure, choose option 2 under that heading (add the folder with appropriate dlls to PATH)


How about running these benchmarks through a time profiler so you can see where the time is actually going?


Sure! What exactly are you interested in?


Your hypothesis is that MKL is taking the same amount of time and FFI overhead is the problem. A profiler could confirm your guess about MKL and tell you which part of the overhead is taking the time.


I have to leave some exercises to the reader though ;)




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: