I wonder how one would use the fused multiply-add intrinsic defined by the IEEE 754-2008 specification in Nim.
So far, I have only found these declarations (slightly modified) in a nine-year-old GitHub gist:
{.passC: "-march=native".}
proc fma(x,y,z: float32): float32 {.importc: "fmaf", header: "<math.h>".}
proc fma(x,y,z: float64): float64 {.importc: "fma", header: "<math.h>".}
Is the option "-march=native" mandatory? Is this an omission in the standard library? Why are these functions not available directly in Nim? I don't mind writing a pull request to add them to the system module or the standard library.
On a slightly related note, is there a reason for math.floorMod to call the mod operator instead of using the fmod function from C++?
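For context, floored and truncated modulo disagree whenever the operands have opposite signs, so the choice is observable in results; a small illustrative sketch (my own example, not taken from the library docs):
import std/math

echo floorMod(-7.5, 2.0)   # 0.5  -- floored: the result takes the sign of the divisor
echo (-7.5) mod 2.0        # -1.5 -- truncated, matching C's fmod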
Best regards,
There are many (recent!) x86 and ARM processors that do not support true non-rounded FMAs, e.g. the Celerons. Implicitly setting -march=native would also inherently make many Nim programs non-portable by design, which is an assumption we really do not want to make. The libc implementation does implicitly state that there should be a fallback, but you would really need to make sure that whatever you're doing is portable (including working on the JS backend).
I don't really like how much std/math relies on libc anyway, but what can you do.
Thanks for the additional information concerning CPUs and your opinion on the question.
I do not think that -march=native is mandatory for FMAs. The JS backend is problematic; we would probably have to either emulate the FMA or replace it with an axpy. I would go the first route.
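As a minimal sketch of the shape this could take (the fma name and the JS branch are just my illustration, not existing stdlib code; note that the JS fallback below is a plain multiply-add and therefore rounds twice, so it is not a true fused operation):
when defined(js):
  # No fused multiply-add on the JS backend: fall back to an ordinary
  # multiply-add. This rounds twice, so it only approximates a real FMA.
  proc fma(x, y, z: float64): float64 = x * y + z
else:
  # Delegate to libc; the C compiler may lower this to a hardware FMA
  # where the target has one, otherwise libm provides a software fallback.
  proc fma(x, y, z: float64): float64 {.importc: "fma", header: "<math.h>".}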
I have opened an issue in the Nim Github repository: https://github.com/nim-lang/Nim/issues/23342#issuecomment-1959812536.
https://godbolt.org/z/MrqTno7xs - FMA will be used if there's a reasonable instruction for it; -march=haswell, for example, has one.
Regarding rounding, IEEE allows the use of FMA even if rounding differs: https://stackoverflow.com/questions/34436233/fused-multiply-add-and-default-rounding-modes/34817983#34817983 - though there needs to be a way to choose, which I guess Nim does not have.
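A rough sketch of how the flag could stay opt-in instead of hard-coding -march=native (the useFma define is hypothetical, and -mfma assumes GCC or Clang):
# Only enable FMA code generation when the user explicitly asks for it,
# e.g. nim c -d:useFma foo.nim, so the default build stays portable.
when defined(useFma) and not defined(js):
  {.passC: "-mfma".}   # narrower than -march=native; GCC/Clang only

proc fma(x, y, z: float64): float64 {.importc: "fma", header: "<math.h>".}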
In general, for high-performance computing, you might as well use SIMD directly: see fmadd in https://github.com/mratsim/Arraymancer/blob/7d6d21c/src/arraymancer/laser/primitives/matrix_multiplication/gemm_ukernel_avx_fma.nim#L27
ukernel_generator(
  x86_AVX_FMA,
  typ = float32,
  vectype = m256,
  nb_scalars = 8,
  simd_setZero = mm256_setzero_ps,
  simd_broadcast_value = mm256_set1_ps,
  simd_load_aligned = mm256_load_ps,
  simd_load_unaligned = mm256_loadu_ps,
  simd_store_unaligned = mm256_storeu_ps,
  simd_mul = mm256_mul_ps,
  simd_add = mm256_add_ps,
  simd_fma = mm256_fmadd_ps
)
ukernel_generator(
  x86_AVX_FMA,
  typ = float64,
  vectype = m256d,
  nb_scalars = 4,
  simd_setZero = mm256_setzero_pd,
  simd_broadcast_value = mm256_set1_pd,
  simd_load_aligned = mm256_load_pd,
  simd_load_unaligned = mm256_loadu_pd,
  simd_store_unaligned = mm256_storeu_pd,
  simd_mul = mm256_mul_pd,
  simd_add = mm256_add_pd,
  simd_fma = mm256_fmadd_pd
)
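For a smaller, self-contained illustration of calling the FMA intrinsic directly (a sketch of my own that assumes an x86-64 CPU with AVX and FMA3 and a GCC/Clang toolchain; the bindings are declared by hand here rather than taken from a SIMD library):
{.passC: "-mavx -mfma".}   # assumes the target CPU actually supports AVX + FMA3

type m256 {.importc: "__m256", header: "immintrin.h".} = object

proc mm256_loadu_ps(p: ptr float32): m256 {.importc: "_mm256_loadu_ps", header: "immintrin.h".}
proc mm256_fmadd_ps(a, b, c: m256): m256 {.importc: "_mm256_fmadd_ps", header: "immintrin.h".}
proc mm256_storeu_ps(p: ptr float32, a: m256) {.importc: "_mm256_storeu_ps", header: "immintrin.h".}

var a, b, c, r: array[8, float32]
for i in 0 ..< 8:
  a[i] = float32(i)
  b[i] = 2.0'f32
  c[i] = 1.0'f32

# Compute r = a*b + c, eight lanes at a time, with a single rounding per lane.
let va = mm256_loadu_ps(addr a[0])
let vb = mm256_loadu_ps(addr b[0])
let vc = mm256_loadu_ps(addr c[0])
mm256_storeu_ps(addr r[0], mm256_fmadd_ps(va, vb, vc))
echo r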