Started work on some basic integer primitives: https://github.com/status-im/nim-stew/pull/187 (actually, I started a few years ago and recently rediscovered them in my git stash while spring cleaning)
The ops are intended primarily for implementing other, presumably safer, integer operations (big integer support like https://github.com/status-im/nim-stint/, multi-precision libraries, crypto stuff etc) but also for working with saturating, overflowing and wrapping arithmetic in general - this is often a lot more convenient than having to worry about pesky Defects ruining the day.
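To give a rough idea of the kind of primitive meant here, a sketch in C terms - the name add_overflowing_u64 is made up for illustration and is not the PR's API - an overflowing add that hands back the wrapped result together with the carry instead of raising a Defect, built on the __builtin_add_overflow builtin that both GCC and Clang provide:

#include <stdbool.h>
#include <stdint.h>

// Hypothetical illustration of an "overflowing add" primitive: returns the
// wrapped (mod 2^64) sum and reports the carry instead of raising a Defect.
static inline uint64_t add_overflowing_u64(uint64_t a, uint64_t b, bool *carry) {
  uint64_t res;
  *carry = __builtin_add_overflow(a, b, &res);  // GCC/Clang builtin
  return res;                                   // res == a + b mod 2^64 in all cases
}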
Similar functions already exist in nim to implement the integers themselves - they are, however, not meant to be used directly, hence a new set of public APIs for working with them.
I'm raising the PR early mainly to get the bikeshedding on names, calling conventions etc out of the way - if you have interest in primitives like this, do take a look and feel free to comment.
A few open questions I'm interested in:
C23 will have _BitInt(N), arbitrarily bit-sized integers (the proposal that standardizes Clang's _ExtInt), and this can effectively replace Stint.
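For illustration, a minimal sketch of what that looks like (assuming a recent clang, e.g. with -std=c2x; older versions capped _BitInt at 128 bits and the maximum width is implementation-defined):

// C23 _BitInt sketch: the compiler itself emits the add/adc carry chain
// for the wide type, no library needed.
typedef unsigned _BitInt(256) u256;

u256 add_u256(u256 a, u256 b) {
  return a + b;
}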
> crypto stuff

the carry part in the code uses comparisons and so is not constant-time, so it cannot be used for crypto.
> division is left out for now - unsigned division is overflow-safe but raises a defect for division by 0 - potentially this could be covered by a separate set of functions called divChecked, addChecked etc which would return Opt[T] or a more precise error
> more complex operations like divmod etc are also left out, though common on CPUs - similar story: the API needs to decide what to do with division by 0
I would make the primitive API divrem(quotient, remainder: var uint64, aHi, aLo, b: uint64): bool with the bool indicating an error. In hardware, division can raise a SIGFPE not only on b == 0 but also if aHi >= b, because then the result can overflow quotient/remainder.
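In C terms, a rough sketch of that signature (divrem_u64 is a made-up name, and the GCC/Clang unsigned __int128 extension is used for brevity):

#include <stdbool.h>
#include <stdint.h>

// Sketch of the proposed divrem primitive: divides the 128-bit value (aHi:aLo)
// by b; the bool reports an error, i.e. the cases where hardware division
// would trap: b == 0, or aHi >= b (the quotient would not fit in 64 bits).
bool divrem_u64(uint64_t *quotient, uint64_t *remainder,
                uint64_t aHi, uint64_t aLo, uint64_t b) {
  if (b == 0 || aHi >= b)
    return true;  // error, result undefined
  unsigned __int128 a = ((unsigned __int128)aHi << 64) | aLo;
  *quotient  = (uint64_t)(a / b);
  *remainder = (uint64_t)(a % b);
  return false;
}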
should "wrapping" (long-name versions of uint ops) and "saturating" versions be in there? checked versions (ie such that return`Opt[T]`)?
I personally don't use saturating add/sub but LLVM does provide those as intrinsics for both signed and unsigned integers.
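For reference, a saturating unsigned add is easy to express on top of the overflow builtin - a sketch, not something from the PR:

#include <stdint.h>

// Saturating unsigned add: clamp to the maximum instead of wrapping.
static inline uint64_t add_saturating_u64(uint64_t a, uint64_t b) {
  uint64_t res;
  if (__builtin_add_overflow(a, b, &res))
    return UINT64_MAX;
  return res;
}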
> The ops are intended primarily for implementing other, presumably safer, integer operations (big integer support like https://github.com/status-im/nim-stint/
Re stint, the plan in the total refactoring at https://github.com/status-im/nim-stint/pull/104 was also to build a core set of primitives (https://github.com/status-im/nim-stint/tree/59dd89638128636d4b636f25048bbda5bc7f5a4c/stint/private/primitives):
This mirrors Constantine (https://github.com/mratsim/constantine/tree/1c5341f/constantine/platforms/intrinsics) and is enough to implement the fastest™ unsigned bigint backend.
Do note that Clang generates significantly better code for arithmetic, especially following the improvements made for the Ethereum EVM and intx (https://github.com/chfast/intx) by chfast (https://github.com/llvm/llvm-project/issues?q=author%3Achfast+).
> C23
...and it's already in clang today under the name _ExtInt - but that's far out, ie it'll take 5-10 years for such things to permeate the ecosystem
> carry part in the code uses comparisons
have a good reference on why comparison operators are problematic? ie when there's no actual branching based on it?
> was also to build a core set of primitives
yes, I remember we discussed it at some point but I didn't actually know there was code written - nice!
will take a look - supporting stint, but also more general use, would indeed be one target here (though the focus is primarily on exposing what compilers have readily available today and maybe building something like the stint / weave primitives on top) - this to make progress on something small, while things like big refactorings often get stuck for years.
> Do note that Clang generates significantly better code for arithmetic,
interesting, didn't know chfast had actually gone all the way to the compiler with his work - it's good news though, my next plan for intops was to turn them into nlvm intrinsics as well which means we should be able to get assembly-level performance without writing assembly, hopefully.
> crypto stuff
also, my assumption here would be that the built-ins don't have any comparison problems since they fetch the carry bit from a cpu flag - ie any time you're doing "crypto stuff", you kind of make assumptions about the compiler you're working with given the lack of standardization around constant-time-ness.
Given the state of compilers, it's perhaps slightly beyond the reach of the primitives here to test for that, although it would indeed be a huge advantage if it was verified in unit tests. LLVM for example tests the assembly output of a given program in certain cases; it would actually be pretty cool to have a test that ensures that for a given platform, a "construct" compiles to a certain instruction so that "crypto stuff" can be built more reliably.
This also reminds me of C++ atomics where the standard provides a compile-time constant that describes whether they are true atomics or emulated with locks on any given platform - something similar would perhaps be interesting here, ie that on "verified" platforms, it would be set to true to signal to the calling library that it's "safe".
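for reference, the C11 side of that precedent looks roughly like this (standard <stdatomic.h>, nothing from the PR):

#include <stdatomic.h>
#include <stdio.h>

// The atomics precedent: the platform advertises at compile time whether
// atomics are native or emulated with locks (2 == always lock-free).
int main(void) {
  _Atomic long long x = 0;
  printf("ATOMIC_LLONG_LOCK_FREE  = %d\n", ATOMIC_LLONG_LOCK_FREE);
  printf("atomic_is_lock_free(&x) = %d\n", (int)atomic_is_lock_free(&x));
  return 0;
}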
This would again be trivial to implement in nlvm - in nim, it's a bit more tricky because of the interaction with the underlying C compiler zoo.
> have a good reference on why comparison operators are problematic? ie when there's no actual branching based on it?
As branch predictors improve, compilers tend to be updated to reintroduce branches; see:
#include <stdint.h>
#include <stdbool.h>

int ct_is_non_zero_u32(uint32_t x) {
  return (x | -x) >> 31;
}

uint32_t ct_mask_u32(uint32_t bit) {
  return -(uint32_t)ct_is_non_zero_u32(bit);
}

// Four alternative implementations of ct_select_u32; VERSION 1 is active,
// the others are left commented out so the file compiles as-is.
uint32_t ct_select_u32(uint32_t x, uint32_t y, bool bit /* = {0,1} */) {
  // VERSION 1
  uint32_t m = ct_mask_u32(bit);
  return (x & m) | (y & ~m);

  // VERSION 2. Same as VERSION 1 but without using multiple functions
  // uint32_t m = -(uint32_t)(((uint32_t)bit | -(uint32_t)bit) >> 31);
  // return (x & m) | (y & ~m);

  // VERSION 3
  // signed b = 1 - bit;
  // return (x * bit) | (y * b);

  // VERSION 4
  // signed b = 0 - bit;
  // return (x & b) | (y & ~b);
}
On Clang 3.0, all the versions were constant-time; in 3.9, only the slowest VERSION 1 with separate functions was left constant-time - branches were introduced in the rest.
In short, and sadly, you need assembly not only for performance but also for security in cryptography.
> my assumption here would be that the built-ins don't have any comparison problems since they fetch the carry bit from a cpu flag - ie any time you're doing "crypto stuff", you kind of make assumptions about the compiler you're working with given the lack of standardization around constant-time-ness.
This is also covered by previous answer.
One thing to note is that, surprisingly, compilers have no notion of a carry flag or overflow flag in their IR; this is something one level lower, for LLVM in the MIR (MachineIR). The proper generation of add-with-carry depends on pattern matching, i.e. the "missed optimization" reports from @chfast. Unfortunately compilers optimize for speed and there is no feature flag for "secret data, don't branch".
2 bad codegen cases:
#include <stdint.h>
#include <x86intrin.h>
void add256(uint64_t a[4], uint64_t b[4]) {
  uint8_t carry = 0;
  for (int i = 0; i < 4; ++i)
    carry = _addcarry_u64(carry, a[i], b[i], &a[i]);
}
GCC 9 asm
add256:
movq (%rsi), %rax
addq (%rdi), %rax
setc %dl
movq %rax, (%rdi)
movq 8(%rdi), %rax
addb $-1, %dl
adcq 8(%rsi), %rax
setc %dl
movq %rax, 8(%rdi)
movq 16(%rdi), %rax
addb $-1, %dl
adcq 16(%rsi), %rax
setc %dl
movq %rax, 16(%rdi)
movq 24(%rsi), %rax
addb $-1, %dl
adcq %rax, 24(%rdi)
ret
Clang 9 asm (perfect, and there is a unit test)
add256:
movq (%rsi), %rax
addq %rax, (%rdi)
movq 8(%rsi), %rax
adcq %rax, 8(%rdi)
movq 16(%rsi), %rax
adcq %rax, 16(%rdi)
movq 24(%rsi), %rax
adcq %rax, 24(%rdi)
retq
Annoyingly, while some pattern matching or builtins might work well on x86-64, they don't on other architectures like Nvidia GPUs, even when using builtin types like _ExtInt(128) that map directly to LLVM i128.
// Compile with LLVM
// /usr/lib/llvm13/bin/clang++ -S -emit-llvm \
// build/nvidia/wideint128.cu \
// --cuda-gpu-arch=sm_86 \
// -L/opt/cuda/lib64 \
// -lcudart_static -ldl -lrt -pthread
// /usr/lib/llvm13/bin/clang++ build/nvidia/wideint128.cu \
// -o build/nvidia/wideint128 \
// --cuda-gpu-arch=sm_86 \
// -L/opt/cuda/lib64 \
// -lcudart_static -ldl -lrt -pthread
// llc -mcpu=sm_86 build/nvidia/wideint128-cuda-nvptx64-nvidia-cuda-sm_86.ll -o build/nvidia/wideint128_llvm.ptx
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <cstdint>
#include <stdio.h>
typedef _ExtInt(128) u128;
cudaError_t add128();
__global__ void add128Kernel(u128* r, u128 a, u128 b) {
    *r = a + b;
    // for (int i = 0; i < 16; i++) {
    //     printf("%02X", ((unsigned char*)(r))[i]);
    // }
}
int main()
{
    cudaError_t cudaStatus = add128();
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "addWithCuda failed!");
        return 1;
    }
    cudaStatus = cudaDeviceReset();
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaDeviceReset failed!");
        return 1;
    }
    getchar();
    return 0;
}
cudaError_t add128()
{
    cudaError_t cudaStatus;
    u128* d_r = nullptr; // the result must live in device memory
    u128 a = 0xFFFFFFFFFFFFFFFFULL;
    u128 b = 0x0010000000000000ULL;
    cudaStatus = cudaSetDevice(0);
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaSetDevice failed!\n");
        goto Error;
    }
    cudaStatus = cudaMalloc(&d_r, sizeof(u128));
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed!\n");
        goto Error;
    }
    add128Kernel<<<1, 1>>>(d_r, a, b);
    cudaStatus = cudaGetLastError();
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "addKernel launch failed: %s\n", cudaGetErrorString(cudaStatus));
        goto Error;
    }
    cudaStatus = cudaDeviceSynchronize();
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaDeviceSynchronize returned error code %d after launching addKernel!\n", cudaStatus);
        goto Error;
    }
Error:
    cudaFree(d_r);
    return cudaStatus;
}
LLVM IR (properly uses i128)
; ModuleID = 'build/nvidia/wideint128.cu'
source_filename = "build/nvidia/wideint128.cu"
target datalayout = "e-i64:64-i128:128-v16:16-v32:32-n16:32:64"
target triple = "nvptx64-nvidia-cuda"
; Function Attrs: convergent mustprogress noinline norecurse nounwind optnone
define dso_local void @_Z12add128KernelPU7_ExtIntILi128EEiS_S_(i128* %0, i128 %1, i128 %2) #0 {
%4 = alloca i128*, align 8
%5 = alloca i128, align 8
%6 = alloca i128, align 8
store i128* %0, i128** %4, align 8
store i128 %1, i128* %5, align 8
store i128 %2, i128* %6, align 8
%7 = load i128, i128* %5, align 8
%8 = load i128, i128* %6, align 8
%9 = add nsw i128 %7, %8
%10 = load i128*, i128** %4, align 8
store i128 %9, i128* %10, align 8
ret void
}
attributes #0 = { convergent mustprogress noinline norecurse nounwind optnone "frame-pointer"="all" "min-legal-vector-width"="0" "no-trapping-math"="true" "stack-protector-buffer-size"="8" "target-cpu"="sm_86" "target-features"="+ptx72,+sm_86" }
!llvm.module.flags = !{!0, !1, !2, !3}
!nvvm.annotations = !{!4}
!llvm.ident = !{!5, !6}
!0 = !{i32 2, !"SDK Version", [2 x i32] [i32 11, i32 2]}
!1 = !{i32 1, !"wchar_size", i32 4}
!2 = !{i32 4, !"nvvm-reflect-ftz", i32 0}
!3 = !{i32 7, !"frame-pointer", i32 2}
!4 = !{void (i128*, i128, i128)* @_Z12add128KernelPU7_ExtIntILi128EEiS_S_, !"kernel", i32 1}
!5 = !{!"clang version 13.0.1"}
!6 = !{!"clang version 3.8.0 (tags/RELEASE_380/final)"}
Nvidia PTX
//
// Generated by LLVM NVPTX Back-End
//
.version 7.1
.target sm_86
.address_size 64
// .globl _Z12add128KernelPU7_ExtIntILi128EEiS_S_ // -- Begin function _Z12add128KernelPU7_ExtIntILi128EEiS_S_
// @_Z12add128KernelPU7_ExtIntILi128EEiS_S_
.visible .entry _Z12add128KernelPU7_ExtIntILi128EEiS_S_(
.param .u64 _Z12add128KernelPU7_ExtIntILi128EEiS_S__param_0,
.param .align 16 .b8 _Z12add128KernelPU7_ExtIntILi128EEiS_S__param_1[16],
.param .align 16 .b8 _Z12add128KernelPU7_ExtIntILi128EEiS_S__param_2[16]
)
{
.local .align 8 .b8 __local_depot0[40];
.reg .b64 %SP;
.reg .b64 %SPL;
.reg .pred %p<3>;
.reg .b64 %rd<18>;
// %bb.0:
mov.u64 %SPL, __local_depot0;
cvta.local.u64 %SP, %SPL;
ld.param.v2.u64 {%rd4, %rd5}, [_Z12add128KernelPU7_ExtIntILi128EEiS_S__param_2];
ld.param.v2.u64 {%rd2, %rd3}, [_Z12add128KernelPU7_ExtIntILi128EEiS_S__param_1];
ld.param.u64 %rd1, [_Z12add128KernelPU7_ExtIntILi128EEiS_S__param_0];
cvta.to.global.u64 %rd6, %rd1;
cvta.global.u64 %rd7, %rd6;
st.u64 [%SP+0], %rd7;
st.u64 [%SP+16], %rd3;
st.u64 [%SP+8], %rd2;
st.u64 [%SP+32], %rd5;
st.u64 [%SP+24], %rd4;
ld.u64 %rd8, [%SP+16];
ld.u64 %rd9, [%SP+8];
ld.u64 %rd10, [%SP+32];
ld.u64 %rd11, [%SP+24];
add.s64 %rd12, %rd9, %rd11;
setp.lt.u64 %p1, %rd12, %rd11;
setp.lt.u64 %p2, %rd12, %rd9;
selp.u64 %rd13, 1, 0, %p2;
selp.b64 %rd14, 1, %rd13, %p1;
add.s64 %rd15, %rd8, %rd10;
add.s64 %rd16, %rd15, %rd14;
ld.u64 %rd17, [%SP+0];
st.u64 [%rd17], %rd12;
st.u64 [%rd17+8], %rd16;
ret;
// -- End function
}
but PTX does support add-with-carry natively (add.cc / addc) - it's just not being used here.
> Given the state of compilers, it's perhaps slightly beyond the reach of the primitives here to test for that, although it would indeed be a huge advantage if it was verified in unit tests. LLVM for example tests the assembly output of a given program in certain cases; it would actually be pretty cool to have a test that ensures that for a given platform, a "construct" compiles to a certain instruction so that "crypto stuff" can be built more reliably.
Yes, I actually spent some time looking into LLVM tests, for example add256 at https://github.com/llvm/llvm-project/blob/main/llvm/test/CodeGen/X86/addcarry.ll#L47-L63, but those are architecture-specific - there's no such guarantee on Nvidia GPUs, so I need my own GPU code generator (via LLVM IR inline assembly) despite not having a constant-time requirement on GPUs :/
> One thing to note is that, surprisingly, compilers have no notion of a carry flag or overflow flag in their IR,
well, the builtins do: they simply return the result + carry in a tuple: https://llvm.org/docs/LangRef.html#llvm-uadd-with-overflow-intrinsics which is close enough as far as IR goes.
So the point of these functions is really to abstract away all the mess and provide functions that return precisely the information needed to implement "common" integer stuff based on the language primitives nim offers (which for now is limited to 64 bits) - they describe a perfect underlying compiler world towards which the underlying optimizers can strive - the focus must thus be to accurately and minimally describe constraints, inputs and outputs so the optimizers have the right information to work with.
Even with underlying `_ExtInt` support, Nim still needs a library to "expose" it - stint could very well be that library, and it would use either these primitives or C23 etc - even then these primitives will still be needed, ie for arbitrary / runtime precision work.
The @llvm.uadd.with.overflow builtins are unfortunately not enough
They have a signature of
%res = call {i64, i1} @llvm.uadd.with.overflow.i64(i64 %a, i64 %b)
but we need the carry-in, the same as the x86-only intrinsics:
%res = call { i8, i64 } @llvm.x86.addcarry.64(i8 %carry_in, i64 %a, i64 %b)
I tried to look in LLVM if they compile the x86 intrinsics to the LLVM intrinsics at IR level for example here (https://github.com/llvm/llvm-project/blob/d526e2e/llvm/lib/Target/X86/X86InstCombineIntrinsic.cpp#L544-L555) but no it directly creates an ADDCARRY ISel (instruction selection) node.
So at the LLVM level there are 4 solutions for i128 addition, for example splitting into 64-bit limbs:

(c_out, lo) = a_lo + b_lo
(_, hi)     = a_hi + b_hi
hi         += c_out

hoping that they will be merged into add+adc.
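Sketched in C with the generic builtin (which has no carry-in), this is the pattern the backend then has to recognize:

#include <stdbool.h>
#include <stdint.h>

// Split i128 addition into 64-bit limbs using the generic overflow builtin;
// there is no carry-in, so the carry is re-added separately and the backend
// has to merge this back into add + adc.
static inline void add128_limbs(uint64_t *hi, uint64_t *lo,
                                uint64_t aHi, uint64_t aLo,
                                uint64_t bHi, uint64_t bLo) {
  uint64_t c_out = __builtin_add_overflow(aLo, bLo, lo);  // (c_out, lo) = aLo + bLo
  *hi = aHi + bHi + c_out;                                // hi = aHi + bHi + c_out
}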
But that's just a temporary problem - in the long run RISC-V takes over everything, and it cannot be bothered with such things as "carry flags" because the entire architecture was designed to run 4 old single-threaded unsafe C programs well...
(Sorry to digress but I really dislike RISC-V.)