In my latest PR I managed to JIT compile from LLVM IR to AMD GPU:
https://github.com/mratsim/constantine/pull/453
Now the question becomes: can we do Nim -> AMD GPU (or Nvidia, as the AMD toolchain supports both)?
As we have LLVM-IR => AMD (and LLVM-IR => Nvidia) paths, we could use those as NLVM backends.
Alternatively, all GPU languages are C-like (AMD HIP, Nvidia CUDA, OpenCL) and WebGPU is C/Rust/OCaml/Nim-like, so it should be possible to add a Nim backend targeting those languages and then use runtime compilation:
See https://llvm.org/docs/AMDGPUUsage.html#processors
I develop on RDNA3, but anything from the Radeon HD 7790 (2014) to today should work.
Awesome, does that mean that I can write something like the following and have it run on the GPU?
proc reductionShader(env: GlEnvironment, barrier: BarrierHandle,
    buffers: Locker[tuple[input: seq[int32], output: Atomic[int32]]],
    smem: ptr seq[int32], n: uint) {.gcsafe.} =
  let localIdx = env.gl_LocalInvocationID.x
  let localSize = env.gl_WorkGroupSize.x
  let gridSize = localSize * 2 * env.gl_NumWorkGroups.x
  var globalIdx = env.gl_WorkGroupID.x * localSize * 2 + localIdx

  var sum: int32 = 0
  while globalIdx < n:
    # echo "ThreadId ", localIdx, " indices: ", globalIdx, " + ", globalIdx + localSize
    unprotected buffers as b:
      sum = sum + b.input[globalIdx] + b.input[globalIdx + localSize]
    globalIdx = globalIdx + gridSize
  smem[localIdx] = sum
  wait barrier

  var stride = localSize div 2
  while stride > 0:
    if localIdx < stride:
      # echo "Final reduction ", localIdx, " + ", localIdx + stride
      smem[localIdx] += smem[localIdx + stride]
    wait barrier # was memoryBarrierShared
    stride = stride div 2

  if localIdx == 0:
    unprotected buffers as b:
      atomicInc b.output, smem[0]
This year I read through CUDA by Example and decided that I wanted to try getting all of its examples working in Nim.
I started working on a library named Hippo that adds templates and macros for programming CUDA C or HIP in Nim. I got the basics working with multiple targets: CUDA on Nvidia, HIP -> ROCm, HIP -> CUDA, and CPU-only with HIP-CPU (handy for debugging). https://github.com/monofuel/hippo
I recently got a Nim PR merged adding backends for nvcc and hipcc, now available in Nim >= 2.1.9. Both CUDA C and HIP require using Nim's C++ backend.
It still needs a lot more work, but I'm amazed that I've made at least this much progress. Here's an example of a julia set generator using Hippo: https://github.com/monofuel/hippo/blob/master/tests/hip/julia.nim
My workflow has been to do the exercises from the book in CUDA C, port them to HIP (usually as easy as running hipify), and then rewrite them with Nim + Hippo. There is room for improvement to make the library more Nim-y, but things are at least working.
@planetis, sorry I missed your question. At the moment, no, because I write LLVM IR directly. However, using the technique from the following PR it should be possible: https://github.com/mratsim/constantine/pull/487. Note that it is CUDA-focused, but it should be straightforward to adapt to AMD.
@monofuel, I've seen your nvcc/hipcc PR and hippo, great work!