I wonder how one would use the fused multiply-add intrinsic defined by the IEEE 754-2008 specification in Nim.
So far, I have only found these declarations (slightly modified) in a nine-year-old GitHub gist:
{.passC: "-march=native".}
proc fma(x,y,z: float32): float32 {.importc: "fmaf", header: "<math.h>".}
proc fma(x,y,z: float64): float64 {.importc: "fma", header: "<math.h>".}
Is the option "-march=native" mandatory? Is this an omission in the standard library? Why are these functions not available directly in Nim? I don't mind writing a pull request to add them to the system module or the standard library.
On a slightly related note, is there a reason for math.floorMod to call the mod operator instead of using the fmod function from C++?
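For context, floored and truncated modulo disagree whenever the operands have opposite signs, so the choice is observable in results; a small illustrative sketch (my own example, not taken from the library docs):
import std/math

echo floorMod(-7.5, 2.0)   # 0.5  -- floored: the result takes the sign of the divisor
echo (-7.5) mod 2.0        # -1.5 -- truncated, matching C's fmod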
Best regards,
There are many (recent!) x86 and ARM processors that do not support true non-rounded FMAs, e.g. the Celerons. Implicitly setting -march=native would also inherently make many Nim programs non-portable by design, which is an assumption we really do not want to make. The libc implementation does implicitly state that there should be a fallback, but you would really need to make sure that whatever you're doing is portable (including working on the JS backend).
I don't really like how much std/math relies on libc anyway, but what can you do.
Thanks for the additional information concerning CPUs and your opinion on the question.
I do not think that -march=native is mandatory for FMAs. The JS backend is problematic; we would probably have to either emulate the FMA or replace it with an axpy. I would go the first route.
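As a minimal sketch of the shape this could take (the fma name and the JS branch are just my illustration, not existing stdlib code; note that the JS fallback below is a plain multiply-add and therefore rounds twice, so it is not a true fused operation):
when defined(js):
  # No fused multiply-add on the JS backend: fall back to an ordinary
  # multiply-add. This rounds twice, so it only approximates a real FMA.
  proc fma(x, y, z: float64): float64 = x * y + z
else:
  # Delegate to libc; the C compiler may lower this to a hardware FMA
  # where the target has one, otherwise libm provides a software fallback.
  proc fma(x, y, z: float64): float64 {.importc: "fma", header: "<math.h>".}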
I have opened an issue in the Nim Github repository: https://github.com/nim-lang/Nim/issues/23342#issuecomment-1959812536.
https://godbolt.org/z/MrqTno7xs - FMA will be used if there's a reasonable instruction for it; -march=haswell, for example, has one.
Regarding rounding, IEEE allows the use of FMA even if rounding differs: https://stackoverflow.com/questions/34436233/fused-multiply-add-and-default-rounding-modes/34817983#34817983 - though there needs to be a way to choose, which I guess Nim does not have.
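A rough sketch of how the flag could stay opt-in instead of hard-coding -march=native (the useFma define is hypothetical, and -mfma assumes GCC or Clang):
# Only enable FMA code generation when the user explicitly asks for it,
# e.g. nim c -d:useFma foo.nim, so the default build stays portable.
when defined(useFma) and not defined(js):
  {.passC: "-mfma".}   # narrower than -march=native; GCC/Clang only

proc fma(x, y, z: float64): float64 {.importc: "fma", header: "<math.h>".}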
In general, for high-performance computing, you might as well use SIMD directly: see fmadd in https://github.com/mratsim/Arraymancer/blob/7d6d21c/src/arraymancer/laser/primitives/matrix_multiplication/gemm_ukernel_avx_fma.nim#L27
ukernel_generator(
  x86_AVX_FMA,
  typ = float32,
  vectype = m256,
  nb_scalars = 8,
  simd_setZero = mm256_setzero_ps,
  simd_broadcast_value = mm256_set1_ps,
  simd_load_aligned = mm256_load_ps,
  simd_load_unaligned = mm256_loadu_ps,
  simd_store_unaligned = mm256_storeu_ps,
  simd_mul = mm256_mul_ps,
  simd_add = mm256_add_ps,
  simd_fma = mm256_fmadd_ps
)
ukernel_generator(
  x86_AVX_FMA,
  typ = float64,
  vectype = m256d,
  nb_scalars = 4,
  simd_setZero = mm256_setzero_pd,
  simd_broadcast_value = mm256_set1_pd,
  simd_load_aligned = mm256_load_pd,
  simd_load_unaligned = mm256_loadu_pd,
  simd_store_unaligned = mm256_storeu_pd,
  simd_mul = mm256_mul_pd,
  simd_add = mm256_add_pd,
  simd_fma = mm256_fmadd_pd
)
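For a smaller, self-contained illustration of calling the FMA intrinsic directly (a sketch of my own that assumes an x86-64 CPU with AVX and FMA3 and a GCC/Clang toolchain; the bindings are declared by hand here rather than taken from a SIMD library):
{.passC: "-mavx -mfma".}   # assumes the target CPU actually supports AVX + FMA3

type m256 {.importc: "__m256", header: "immintrin.h".} = object

proc mm256_loadu_ps(p: ptr float32): m256 {.importc: "_mm256_loadu_ps", header: "immintrin.h".}
proc mm256_fmadd_ps(a, b, c: m256): m256 {.importc: "_mm256_fmadd_ps", header: "immintrin.h".}
proc mm256_storeu_ps(p: ptr float32, a: m256) {.importc: "_mm256_storeu_ps", header: "immintrin.h".}

var a, b, c, r: array[8, float32]
for i in 0 ..< 8:
  a[i] = float32(i)
  b[i] = 2.0'f32
  c[i] = 1.0'f32

# Compute r = a*b + c, eight lanes at a time, with a single rounding per lane.
let va = mm256_loadu_ps(addr a[0])
let vb = mm256_loadu_ps(addr b[0])
let vc = mm256_loadu_ps(addr c[0])
mm256_storeu_ps(addr r[0], mm256_fmadd_ps(va, vb, vc))
echo r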