A bit over 5 years ago, I added initial Cuda (https://github.com/mratsim/Arraymancer/pull/24) and OpenCL (https://github.com/mratsim/Arraymancer/pull/184) support to Arraymancer.
Since that was a while ago and I am now confronted with similar questions for another project, I'd like to share the results of my investigations.
Arraymancer supports GPU operations on Cuda and OpenCL for matrix/tensor primitives (additions, multiplications, ...) despite several troubles.
For Cuda:
OpenCL is actually significantly less troublesome to support as the only issues are #4 (Nim artifacts) and #7 (Windows or Mac installs).
For my cryptographic library Constantine, I wish to add a GPU backend to bigint and cryptographic code to meet the growing throughput demands of blockchains and, maybe in the future, homomorphic encryption (to enable computation on encrypted data).
I considered a similar refactoring for Arraymancer: creating a custom DSL that would compile to Nim code on CPU, or be JIT-ed to LLVM IR and then to GPU code at runtime for GPU compute (https://github.com/numforge/laser/tree/master/laser/lux_compiler). Using a compiler approach would avoid creating, debugging and optimizing compute kernels one-by-one for CPU, Cuda, OpenCL, Metal, Vulkan, ... and then implementing the gradient version of each function.
Unfortunately time is scarce. As I could see that plan taking a while, I started wrapping the PyTorch backend in SciNim/flambeau instead, but then didn't have time to revisit GPU code for years (just investigating and writing Weave took a long time!).
However, can-lehmann independently had a similar idea with exprgrad and went way farther than I did with Laser/Lux: https://github.com/can-lehmann/exprgrad and the talk https://www.youtube.com/watch?v=YXD8ZODahts
There are 3 main ways to generate GPU code:
Compile-time source compilation is the usual Cuda way, with a .cu (or .nim) file compiled with nvcc or clang and linked into the main application binary. It may require shipping the GPU artifacts separately, for example on Windows.
Runtime source code compilation is the usual OpenCL or shader way. It was also introduced for Cuda via NVRTC in 2015. This allows an application to generate Cuda or OpenCL source code, call the GPU compiler and then call the compiled kernel.
JIT to LLVM IR means generating LLVM IR and having LLVM deal with platform specialization.
As mentioned in the context and history section, the compile-time codegen plain doesn't work today for Cuda, and isn't an option for OpenCL.
Runtime source code generation is my default recommendation. Either kernels can be copy-pasted as strings, or the application has an internal IR and generates OpenCL/Cuda source from it. In fact, Arraymancer's Cuda backend should be reconfigured to use NVRTC (Nvidia Runtime Compilation) instead of nvcc, and most woes would be solved with few code changes.
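For illustration, here is a minimal sketch of that runtime flow in Nim, hand-declaring the handful of NVRTC entry points it needs. It assumes libnvrtc is installed and visible to the linker; the wrapper declarations and the toy kernel are made up for this example, they are not Arraymancer code.

```nim
# Minimal NVRTC sketch (hypothetical bindings, not Arraymancer code).
{.passl: "-lnvrtc".}

type
  NvrtcProgram = distinct pointer   # opaque handle, mirrors nvrtcProgram
  NvrtcResult  = cint               # 0 == NVRTC_SUCCESS

{.push importc, cdecl.}
proc nvrtcCreateProgram(prog: var NvrtcProgram, src, name: cstring,
                        numHeaders: cint,
                        headers, includeNames: cstringArray): NvrtcResult
proc nvrtcCompileProgram(prog: NvrtcProgram, numOptions: cint,
                         options: cstringArray): NvrtcResult
proc nvrtcGetPTXSize(prog: NvrtcProgram, ptxSize: var csize_t): NvrtcResult
proc nvrtcGetPTX(prog: NvrtcProgram, ptx: cstring): NvrtcResult
proc nvrtcDestroyProgram(prog: var NvrtcProgram): NvrtcResult
{.pop.}

const kernelSrc = """
extern "C" __global__ void addOne(float* x, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) x[i] += 1.0f;
}
"""

proc compileToPtx(src: string): string =
  ## Compiles Cuda C++ source to PTX at runtime.
  var prog: NvrtcProgram
  doAssert nvrtcCreateProgram(prog, src.cstring, "kernel.cu", 0, nil, nil) == 0
  # Real code should fetch the compilation log on failure instead of asserting.
  doAssert nvrtcCompileProgram(prog, 0, nil) == 0
  var size: csize_t
  doAssert nvrtcGetPTXSize(prog, size) == 0       # size includes the trailing NUL
  result = newString(size.int)
  doAssert nvrtcGetPTX(prog, result.cstring) == 0
  result.setLen(result.len - 1)                   # drop the trailing NUL
  doAssert nvrtcDestroyProgram(prog) == 0

when isMainModule:
  # The PTX can then be loaded with the Cuda driver API (cuModuleLoadData).
  echo compileToPtx(kernelSrc)
```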
The JIT-to-LLVM-IR route is recommended if you need LLVM IR anyway, or if you need to support a large number of architectures, including DirectX, OpenGL/Vulkan/SPIR-V or Qualcomm Hexagon, which might be quite different from AMD ROCm, Nvidia Cuda or OpenCL.
LLVM has full support for Nvidia PTX codegen for Cuda/Nvidia GPUs.
LLVM has support for SPIR-V codegen since LLVM 15 (2022), which is necessary to target Intel GPUs or to build OpenCL and Vulkan backends via LLVM IR.
LLVM supports AMD GPU and DirectX targets as well.
Compiling to Apple Metal cannot be done via LLVM IR despite Apple using LLVM :/ so it has to go through source code.
You have a proof-of-concept LLVM JIT here:
And a proof-of-concept LLVM + Nvidia NVVM JIT here:
Note: The conversion from LLVM IR to Nvidia PTX (assembly) can be done either through LLVM or through Nvidia NVVM which has extra proprietary optimization passes.
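For illustration, here is a minimal sketch of the LLVM route in Nim, going from textual LLVM IR to PTX through the LLVM-C API. The declarations below are a hand-rolled subset of llvm-c rather than an official Nim wrapper, and the triple/CPU choices ("nvptx64-nvidia-cuda", "sm_80") are just examples.

```nim
# Minimal LLVM-C sketch: textual LLVM IR -> PTX via the NVPTX backend.
# Assumes a system libLLVM discoverable through llvm-config.
{.passc: gorge("llvm-config --cflags").}
{.passl: gorge("llvm-config --libs --system-libs").}

type
  LLVMBool             = cint
  LLVMContextRef       = distinct pointer
  LLVMMemoryBufferRef  = distinct pointer
  LLVMModuleRef        = distinct pointer
  LLVMTargetRef        = distinct pointer
  LLVMTargetMachineRef = distinct pointer

{.push importc, cdecl.}
proc LLVMInitializeNVPTXTargetInfo()
proc LLVMInitializeNVPTXTarget()
proc LLVMInitializeNVPTXTargetMC()
proc LLVMInitializeNVPTXAsmPrinter()
proc LLVMContextCreate(): LLVMContextRef
proc LLVMCreateMemoryBufferWithMemoryRangeCopy(
       data: cstring, len: csize_t, name: cstring): LLVMMemoryBufferRef
proc LLVMParseIRInContext(ctx: LLVMContextRef, buf: LLVMMemoryBufferRef,
       module: var LLVMModuleRef, errMsg: var cstring): LLVMBool
proc LLVMGetTargetFromTriple(triple: cstring, target: var LLVMTargetRef,
       errMsg: var cstring): LLVMBool
proc LLVMCreateTargetMachine(target: LLVMTargetRef,
       triple, cpu, features: cstring,
       optLevel, reloc, codeModel: cint): LLVMTargetMachineRef
proc LLVMTargetMachineEmitToMemoryBuffer(tm: LLVMTargetMachineRef,
       module: LLVMModuleRef, fileType: cint, errMsg: var cstring,
       outBuf: var LLVMMemoryBufferRef): LLVMBool
proc LLVMGetBufferStart(buf: LLVMMemoryBufferRef): cstring
proc LLVMGetBufferSize(buf: LLVMMemoryBufferRef): csize_t
{.pop.}

proc irToPtx(ir: string): string =
  ## Lowers textual LLVM IR to PTX assembly via the NVPTX backend.
  LLVMInitializeNVPTXTargetInfo()
  LLVMInitializeNVPTXTarget()
  LLVMInitializeNVPTXTargetMC()
  LLVMInitializeNVPTXAsmPrinter()

  var err: cstring    # real code should surface this message on failure
  let ctx = LLVMContextCreate()
  let buf = LLVMCreateMemoryBufferWithMemoryRangeCopy(
              ir.cstring, ir.len.csize_t, "kernels")
  var module: LLVMModuleRef
  doAssert LLVMParseIRInContext(ctx, buf, module, err) == 0

  var target: LLVMTargetRef
  doAssert LLVMGetTargetFromTriple("nvptx64-nvidia-cuda", target, err) == 0
  let tm = LLVMCreateTargetMachine(target, "nvptx64-nvidia-cuda", "sm_80", "",
                                   3, 0, 0)  # aggressive opts, default reloc/code model

  var ptx: LLVMMemoryBufferRef
  # fileType 0 = LLVMAssemblyFile, i.e. textual PTX for the NVPTX backend
  doAssert LLVMTargetMachineEmitToMemoryBuffer(tm, module, 0, err, ptx) == 0
  result = newString(LLVMGetBufferSize(ptx).int)
  copyMem(result[0].addr, cast[pointer](LLVMGetBufferStart(ptx)), result.len)
```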
Thanks for this overview, very interesting.
About compile-time generation, for info I experimented with a CUDA-like approach based on nlvm some time ago: https://github.com/guibar64/axel. It is still quite unfinished, though.
Also, your project is compiled twice for CPU and GPU (nvptx for now) respectively, which is admittedly a bit clunky.
About compile-time generation, for info I experimented with a CUDA-like approach based on nlvm some time ago: https://github.com/guibar64/axel. It is still quite unfinished, though.
I wasn't aware of Axel and it's definitely very interesting. I'm surprised you managed to have {.kernel.} work without going through the pain of rewriting the AST to avoid Nim artifacts like in https://github.com/jcosborn/cudanim/blob/338be78/inline.nim
Also, your project is compiled twice for CPU and GPU (nvptx for now) respectively, which is admittedly a bit clunky.
Which project?
Sorry, bad phrasing, I was commenting on axel.
What I meant was: when you compile a project with axel, a second compilation of your project is done via staticExec with nlvm-gpu.
I wasn't aware of Axel and it's definitely very interesting. I'm surprised you managed to have {.kernel.} work without going through the pain of rewriting the AST to avoid Nim artifacts like in https://github.com/jcosborn/cudanim/blob/338be78/inline.nim
kernel mostly tags the proc with a custom pragma; it is then handled in the modified nlvm to add the proper annotations. The rest is already handled by nlvm's main codegen.
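For reference, the tagging side can be sketched in plain Nim roughly like this (hypothetical names, not axel's actual code); a user-defined pragma carries no semantics by itself, the GPU-aware backend is what recognises it and emits the nvptx annotations.

```nim
# Sketch of a tagging pragma (hypothetical, not axel's code).
template kernel*() {.pragma.}

proc saxpy(n: int, a: float32,
           x, y: ptr UncheckedArray[float32]) {.kernel.} =
  # On a stock Nim compiler the pragma is inert metadata; a GPU-aware
  # backend such as a modified nlvm can detect it and emit a GPU entry point.
  for i in 0 ..< n:
    y[i] = a * x[i] + y[i]
```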
Thanks for the writeup @mratsim!
It'd be great to be able to do GPU / NN stuff in Nim (again).
Note that I really dislike LLVM -- it sucks as a user because the libraries break compatibility regularly. Your distro provides LLVM 13 but your compiler needs LLVM 14, etc. Then Apple's LLVM builds often don't work with regular LLVM libs. I guess that's just part of the nature of using any of the machine learning libraries though. :/
I'm not sure if pure CUDA C++ kernels have the same issue. Also, for the C++ compilation flags it would be possible to compile to source and then manually compile the generated C++ code. For CUDA kernels I'd guess that Nim features relying on GNU extensions, like computed gotos, wouldn't be used.
It'd be great to be able to do GPU / NN stuff in Nim (again).
Unfortunately for the time being my focus is cryptography. For GPU / NN, I think Flambeau is the better bet.
It will have the side-effect of solving Cuda/OpenCL/Vulkan codegen via LLVM, but the kernels would still have to be (re-)implemented.
Note that I really dislike LLVM -- it sucks as a user because the libraries break compatibility regularly. Your distro provides LLVM 13 but your compiler needs LLVM 14, etc. Then Apple's LLVM builds often don't work with regular LLVM libs. I guess that's just part of the nature of using any of the machine learning libraries though. :/
That was one of my concerns:
However, I don't think it will be an issue for my libraries:
Basically the only versioning woes I should get would be around support for new backends; for example, OpenCL, OpenGL and Vulkan kernel generation via SPIR-V is only available from LLVM 15 onward.
I'm not sure if pure CUDA C++ kernels have the same issue.
Cuda C++ is also usually forward compatible; only new features like tensor cores, new synchronization primitives or unified memory require newer versions.
The latest big breakage was hardware-level, with the RTX 2XXX series and Independent Thread Scheduling. GPU threads are organized in groups of 32 called a warp; within a warp they used to execute the same instructions, which also meant executing all branches of an if/then/else if at least one thread had to take a different branch from the others. RTX 2XXX and later allow independent branching, and since lockstep execution was a decade-old assumption, lots of synchronization code broke on those GPUs.
Also, for the C++ compilation flags it would be possible to compile to source and then manually compile the generated C++ code. For CUDA kernels I'd guess that Nim features relying on GNU extensions, like computed gotos, wouldn't be used.
If you do it at compile time:
If you do it at runtime via NVRTC, which is my recommendation if you only want to support Nvidia, it should be the easiest to maintain and deploy. For my part I'm only considering LLVM, so I write LLVM IR and then all the backends supported by LLVM (Nvidia, AMD, Intel via OpenCL/SPIR-V) are available.
Thanks for the detailed reply!
Instead of hardcoding a libLLVM-15.so version, you can autodetect it with {.passl: gorge("llvm-config --libs").}
That's handy to know.
LLVM IR by necessity is very stable and hasn't been changed for years. The C API is also fully-featured (compared to many other popular C++ projects like OpenCV) and depended on by many languages like Rust, Julia, ...
Do you mean the text or the binary format? I thought the binary format was not considered stable and only for usage with the current major version. Though, it's been a long while since I actually used either.
Either way, if you can target the LLVM IR rather than the C++ API it should be much more stable.
I'm not sure you can portably do staticWrite("foo.cpp");compile("foo.cpp"), you might need 2-stage compilation
No, that probably wouldn't work well. I was thinking more of a nim c --compileOnly someKernel.nim. Then you can write a Ninja file or something, or modify the build.sh that Nim produces.
Or you have Nim -> Cuda codegen with the same woes as Arraymancer regarding compiler and compilation flags config.
That's where I was figuring you could avoid a lot of those woes by manually compiling the C++ kernels. That would entail a two-stage compilation setup, but would avoid trying to batter Nim into using the right C++ flags. Perhaps I'm viewing it as more similar to some embedded devices, which share a few features with GPU-style targets -- weird libraries / compilers, differences in stdlibs, etc.
I don't have any concrete plans, but I'd like to know if the options are available.
I'll take a look at Flambeau. Is it usable? As in, I don't mind updating things like older library versions etc., but is the core C++ interface fairly stable or usable with the latest PyTorch libs?
Also, could you point me to a good example of a CUDA kernel in Arraymancer? I'd like to toy with the compilation stuff a bit at some point.