A bit over 5 years ago, I added initial Cuda (https://github.com/mratsim/Arraymancer/pull/24) and OpenCL (https://github.com/mratsim/Arraymancer/pull/184) support to Arraymancer.
Since that was a while ago and I am now confronted with similar questions for another project, I'd like to share the results of my investigations.
Arraymancer supports GPU operations on Cuda and OpenCL for matrix/tensor primitives (additions, multiplications, ...) despite several troubles.
For Cuda:
OpenCL is actually significantly less troublesome to support as the only issues are #4 (Nim artifacts) and #7 (Windows or Mac installs).
For my cryptographic library Constantine, I wish to add a GPU backend to bigint and cryptographic code to meet the growing throughput demands of blockchains and, maybe in the future, homomorphic encryption (to enable computation on encrypted data).
I considered a similar refactoring for Arraymancer: creating a custom DSL that would compile to Nim code on CPU, or be JIT-ed to LLVM IR and then to GPU code at runtime for GPU compute (https://github.com/numforge/laser/tree/master/laser/lux_compiler). Using a compiler approach would avoid creating, debugging and optimizing compute kernels one-by-one for CPU, Cuda, OpenCL, Metal, Vulkan, ... and then implementing the gradient version of each function.
Unfortunately time is scarce. As I could see that plan taking a while, I started wrapping the PyTorch backend in SciNim/flambeau instead, but then didn't have time to revisit GPU code for years (just investigating and writing Weave took a long time!).
However, can-lehmann independently had a similar idea with exprgrad and went way farther than I did with Laser/Lux: https://github.com/can-lehmann/exprgrad and the talk https://www.youtube.com/watch?v=YXD8ZODahts
There are 3 main ways to generate GPU code:
Compile-time source compilation is the usual Cuda way, with a .cu (or .nim) file compiled with nvcc or clang and linked into the main application binary. It may require shipping the GPU artifacts separately, for example on Windows.
Runtime source code compilation is the usual OpenCL or shader way. It was also introduced for Cuda via NVRTC in 2015. This allows an application to generate Cuda or OpenCL source code, call the GPU compiler and then call the compiled kernel.
JIT to LLVM IR means generating LLVM IR and having LLVM deal with platform specialization.
As mentioned in the context and history section, the compile-time codegen plain doesn't work today for Cuda, and isn't an option for OpenCL.
Runtime source code generation is my default recommendation. Either kernels can be copy-pasted as strings, or the application has an internal IR and generates OpenCL/Cuda source from it. In fact, Arraymancer's Cuda backend should be reconfigured to use NVRTC (Nvidia Runtime Compilation) instead of nvcc, and most woes would be solved with few code changes.
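For illustration, here is a minimal sketch of that runtime flow in Nim, hand-declaring the handful of NVRTC entry points it needs. It assumes libnvrtc is installed and visible to the linker; the wrapper declarations and the toy kernel are made up for this example, they are not Arraymancer code.

```nim
# Minimal NVRTC sketch (hypothetical bindings, not Arraymancer code).
{.passl: "-lnvrtc".}

type
  NvrtcProgram = distinct pointer   # opaque handle, mirrors nvrtcProgram
  NvrtcResult  = cint               # 0 == NVRTC_SUCCESS

{.push importc, cdecl.}
proc nvrtcCreateProgram(prog: var NvrtcProgram, src, name: cstring,
                        numHeaders: cint,
                        headers, includeNames: cstringArray): NvrtcResult
proc nvrtcCompileProgram(prog: NvrtcProgram, numOptions: cint,
                         options: cstringArray): NvrtcResult
proc nvrtcGetPTXSize(prog: NvrtcProgram, ptxSize: var csize_t): NvrtcResult
proc nvrtcGetPTX(prog: NvrtcProgram, ptx: cstring): NvrtcResult
proc nvrtcDestroyProgram(prog: var NvrtcProgram): NvrtcResult
{.pop.}

const kernelSrc = """
extern "C" __global__ void addOne(float* x, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) x[i] += 1.0f;
}
"""

proc compileToPtx(src: string): string =
  ## Compiles Cuda C++ source to PTX at runtime.
  var prog: NvrtcProgram
  doAssert nvrtcCreateProgram(prog, src.cstring, "kernel.cu", 0, nil, nil) == 0
  # Real code should fetch the compilation log on failure instead of asserting.
  doAssert nvrtcCompileProgram(prog, 0, nil) == 0
  var size: csize_t
  doAssert nvrtcGetPTXSize(prog, size) == 0       # size includes the trailing NUL
  result = newString(size.int)
  doAssert nvrtcGetPTX(prog, result.cstring) == 0
  result.setLen(result.len - 1)                   # drop the trailing NUL
  doAssert nvrtcDestroyProgram(prog) == 0

when isMainModule:
  # The PTX can then be loaded with the Cuda driver API (cuModuleLoadData).
  echo compileToPtx(kernelSrc)
```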
The JIT-to-LLVM-IR route is recommended if you need LLVM IR anyway, or if you need to support a large number of architectures, including DirectX, OpenGL/Vulkan/SPIR-V or Qualcomm Hexagon, which might be quite different from AMD ROCm, Nvidia Cuda or OpenCL.
LLVM has full support for Nvidia PTX codegen for Cuda/Nvidia GPUs.
LLVM has support for SPIR-V codegen since LLVM 15 (2022), which is necessary to target Intel GPUs or to build OpenCL and Vulkan backends via LLVM IR.
LLVM supports AMD GPU and DirectX targets as well.
Compiling to Apple Metal cannot be done via LLVM IR despite Apple using LLVM :/ so it has to go through source code.
You have a proof-of-concept LLVM JIT here:
And a proof-of-concept LLVM + Nvidia NVVM JIT here:
Note: The conversion from LLVM IR to Nvidia PTX (assembly) can be done either through LLVM or through Nvidia NVVM which has extra proprietary optimization passes.
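For illustration, here is a minimal sketch of the LLVM route in Nim, going from textual LLVM IR to PTX through the LLVM-C API. The declarations below are a hand-rolled subset of llvm-c rather than an official Nim wrapper, and the triple/CPU choices ("nvptx64-nvidia-cuda", "sm_80") are just examples.

```nim
# Minimal LLVM-C sketch: textual LLVM IR -> PTX via the NVPTX backend.
# Assumes a system libLLVM discoverable through llvm-config.
{.passc: gorge("llvm-config --cflags").}
{.passl: gorge("llvm-config --libs --system-libs").}

type
  LLVMBool             = cint
  LLVMContextRef       = distinct pointer
  LLVMMemoryBufferRef  = distinct pointer
  LLVMModuleRef        = distinct pointer
  LLVMTargetRef        = distinct pointer
  LLVMTargetMachineRef = distinct pointer

{.push importc, cdecl.}
proc LLVMInitializeNVPTXTargetInfo()
proc LLVMInitializeNVPTXTarget()
proc LLVMInitializeNVPTXTargetMC()
proc LLVMInitializeNVPTXAsmPrinter()
proc LLVMContextCreate(): LLVMContextRef
proc LLVMCreateMemoryBufferWithMemoryRangeCopy(
       data: cstring, len: csize_t, name: cstring): LLVMMemoryBufferRef
proc LLVMParseIRInContext(ctx: LLVMContextRef, buf: LLVMMemoryBufferRef,
       module: var LLVMModuleRef, errMsg: var cstring): LLVMBool
proc LLVMGetTargetFromTriple(triple: cstring, target: var LLVMTargetRef,
       errMsg: var cstring): LLVMBool
proc LLVMCreateTargetMachine(target: LLVMTargetRef,
       triple, cpu, features: cstring,
       optLevel, reloc, codeModel: cint): LLVMTargetMachineRef
proc LLVMTargetMachineEmitToMemoryBuffer(tm: LLVMTargetMachineRef,
       module: LLVMModuleRef, fileType: cint, errMsg: var cstring,
       outBuf: var LLVMMemoryBufferRef): LLVMBool
proc LLVMGetBufferStart(buf: LLVMMemoryBufferRef): cstring
proc LLVMGetBufferSize(buf: LLVMMemoryBufferRef): csize_t
{.pop.}

proc irToPtx(ir: string): string =
  ## Lowers textual LLVM IR to PTX assembly via the NVPTX backend.
  LLVMInitializeNVPTXTargetInfo()
  LLVMInitializeNVPTXTarget()
  LLVMInitializeNVPTXTargetMC()
  LLVMInitializeNVPTXAsmPrinter()

  var err: cstring    # real code should surface this message on failure
  let ctx = LLVMContextCreate()
  let buf = LLVMCreateMemoryBufferWithMemoryRangeCopy(
              ir.cstring, ir.len.csize_t, "kernels")
  var module: LLVMModuleRef
  doAssert LLVMParseIRInContext(ctx, buf, module, err) == 0

  var target: LLVMTargetRef
  doAssert LLVMGetTargetFromTriple("nvptx64-nvidia-cuda", target, err) == 0
  let tm = LLVMCreateTargetMachine(target, "nvptx64-nvidia-cuda", "sm_80", "",
                                   3, 0, 0)  # aggressive opts, default reloc/code model

  var ptx: LLVMMemoryBufferRef
  # fileType 0 = LLVMAssemblyFile, i.e. textual PTX for the NVPTX backend
  doAssert LLVMTargetMachineEmitToMemoryBuffer(tm, module, 0, err, ptx) == 0
  result = newString(LLVMGetBufferSize(ptx).int)
  copyMem(result[0].addr, cast[pointer](LLVMGetBufferStart(ptx)), result.len)
```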
Thanks for this overview, very interesting.
About compile-time generation, for info I experimented with a CUDA-like approach based on nlvm some time ago: https://github.com/guibar64/axel. It is still quite unfinished, though.
Also, your project is compiled twice for CPU and GPU (nvptx for now) respectively, which is admittedly a bit clunky.
About compile-time generation, for info I experimented with a CUDA-like approach based on nlvm some time ago: https://github.com/guibar64/axel. It is still quite unfinished, though.
I wasn't aware of Axel and it's definitely very interesting. I'm surprised you managed to have {.kernel.} work without going through the pain of rewriting the AST to avoid Nim artifacts like in https://github.com/jcosborn/cudanim/blob/338be78/inline.nim
Also, your project is compiled twice for CPU and GPU (nvptx for now) respectively, which is admittedly a bit clunky.
Which project?
Sorry, bad phrasing, I was commenting on axel.
What I meant was: when you compile a project with axel, a second compilation of your project is done via staticExec with nlvm-gpu.
I wasn't aware of Axel and it's definitely very interesting. I'm surprised you managed to have {.kernel.} work without going through the pain of rewriting the AST to avoid Nim artifacts like in https://github.com/jcosborn/cudanim/blob/338be78/inline.nim
kernel mostly tags the proc with a custom pragma; it is then handled in the modified nlvm to add the proper annotations. The rest is already handled by nlvm's main codegen.
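For reference, the tagging side can be sketched in plain Nim roughly like this (hypothetical names, not axel's actual code); a user-defined pragma carries no semantics by itself, the GPU-aware backend is what recognises it and emits the nvptx annotations.

```nim
# Sketch of a tagging pragma (hypothetical, not axel's code).
template kernel*() {.pragma.}

proc saxpy(n: int, a: float32,
           x, y: ptr UncheckedArray[float32]) {.kernel.} =
  # On a stock Nim compiler the pragma is inert metadata; a GPU-aware
  # backend such as a modified nlvm can detect it and emit a GPU entry point.
  for i in 0 ..< n:
    y[i] = a * x[i] + y[i]
```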
Thanks for the writeup @mratsim!
It'd be great to be able to do GPU / NN stuff in Nim (again).
Note that I really dislike LLVM -- it sucks as a user because the libraries break compatibility regularly. Your distro provides LLVM 13 but your compiler needs LLVM 14, etc. Then Apple's LLVM builds often don't work with regular LLVM libs. I guess that's just part of the nature of using any of the machine learning libraries though. :/
I'm not sure if pure CUDA C++ kernels have the same issue. Also, for the C++ compilation flags it would be possible to compile to source and then manually compile the generated C++ code. For CUDA kernels I'd guess that Nim features relying on GNU extensions, like computed gotos, wouldn't be used.
It'd be great to be able to do GPU / NN stuff in Nim (again).
Unfortunately for the time being my focus is cryptography. For GPU / NN, I think Flambeau is the better bet.
It will have the side-effect of solving Cuda/OpenCL/Vulkan codegen via LLVM, but the kernels would still have to be (re-)implemented.
Note that I really dislike LLVM -- it sucks as a user because the libraries break compatibility regularly. Your distro provides LLVM 13 but your compiler needs LLVM 14, etc. Then Apple's LLVM builds often don't work with regular LLVM libs. I guess that's just part of the nature of using any of the machine learning libraries though. :/
That was one of my concerns:
However, I don't think it will be an issue for my libraries:
Basically the only versioning woes I should get would be around support for new backends; for example, OpenCL, OpenGL and Vulkan kernel generation via SPIR-V is only available from LLVM 15 onward.
I'm not sure if pure CUDA C++ kernels have the same issue.
Cuda C++ is also usually forward compatible; only new features like tensor cores, new synchronization primitives or unified memory require newer versions.
The latest big breakage was hardware-level, with the RTX 2XXX series and Independent Thread Scheduling. GPU threads are organized in groups of 32 called a warp; within a warp they used to execute the same instructions, which also meant executing all branches of an if/then/else if at least one thread had to take a different branch from the others. RTX 2XXX and later allow independent branching, and since lockstep execution was a decade-old assumption, lots of synchronization code broke on those GPUs.
Also, for the C++ compilation flags it would be possible to compile to source and then manually compile the generated C++ code. For CUDA kernels I'd guess that Nim features relying on GNU extensions, like computed gotos, wouldn't be used.
If you do it at compile time:
If you do it at runtime via NVRTC, which is my recommendation if you only want to support Nvidia, it should be the easiest to maintain and deploy. For my part I'm only considering LLVM, so I write LLVM IR and then all the backends supported by LLVM (Nvidia, AMD, Intel via OpenCL/SPIR-V) are available.
Thanks for the detailed reply!
Instead of hardcoding a libLLVM-15.so version, you can autodetect it with {.passl: gorge("llvm-config --libs").}
That's handy to know.
LLVM IR by necessity is very stable and hasn't been changed for years. The C API is also fully-featured (compared to many other popular C++ projects like OpenCV) and depended on by many languages like Rust, Julia, ...
Do you mean the text or the binary format? I thought the binary format was not considered stable and only for usage with the current major version. Though, it's been a long while since I actually used either.
Either way, if you can target the LLVM IR rather than the C++ API it should be much more stable.
I'm not sure you can portably do staticWrite("foo.cpp");compile("foo.cpp"), you might need 2-stage compilation
No, that probably wouldn't work well. I was thinking more of a nim c --compileOnly someKernel.nim. Then you can write a Ninja file or something, or modify the build.sh that Nim produces.
Or you have Nim -> Cuda codegen with the same woes as Arraymancer regarding compiler and compilation flags config.
That's where I was figuring you could avoid a lot of those woes by manually compiling the C++ kernels. That would entail a two-stage compilation setup, but would avoid trying to batter Nim into using the right C++ flags. Perhaps I'm viewing it as more similar to some embedded devices, which share a few features with GPU-style targets -- weird libraries / compilers, differences in stdlibs, etc.
I don't have any concrete plans, but I'd like to know if the options are available.
I'll take a look at Flambeau. Is it usable? As in, I don't mind updating things like older library versions etc., but is the core C++ interface fairly stable or usable with the latest PyTorch libs?
Also, could you point me to a good example of a CUDA kernel in Arraymancer? I'd like to toy with the compilation stuff a bit at some point.