Started work on some basic integer primitives: https://github.com/status-im/nim-stew/pull/187 (actually, I started a few years ago and recently rediscovered them in my git stash while spring cleaning)
The ops are intended primarily for implementing other, presumably safer, integer operations (big integer support like https://github.com/status-im/nim-stint/, multi-precision libraries, crypto stuff etc) but also for working with saturating, overflowing and wrapping arithmetic in general - this is often a lot more convenient than having to worry about pesky Defects ruining the day.
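To give a rough idea of the kind of primitive meant here, a sketch in C terms - the name add_overflowing_u64 is made up for illustration and is not the PR's API - an overflowing add that hands back the wrapped result together with the carry instead of raising a Defect, built on the __builtin_add_overflow builtin that both GCC and Clang provide:

#include <stdbool.h>
#include <stdint.h>

// Hypothetical illustration of an "overflowing add" primitive: returns the
// wrapped (mod 2^64) sum and reports the carry instead of raising a Defect.
static inline uint64_t add_overflowing_u64(uint64_t a, uint64_t b, bool *carry) {
  uint64_t res;
  *carry = __builtin_add_overflow(a, b, &res);  // GCC/Clang builtin
  return res;                                   // res == a + b mod 2^64 in all cases
}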
Similar functions already exist in nim to implement the integers themselves - they are, however, not meant to be used directly, hence a new set of public APIs for working with them.
I'm raising the PR early mainly to get the bikeshedding on names, calling conventions etc out of the way - if you have interest in primitives like this, do take a look and feel free to comment.
A few open questions I'm interested in:
C23 will have _BitInt(N), arbitrarily bit-sized integers (the proposal that standardizes Clang's _ExtInt), and this can effectively replace Stint.
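For illustration, a minimal sketch of what that looks like (assuming a recent clang, e.g. with -std=c2x; older versions capped _BitInt at 128 bits and the maximum width is implementation-defined):

// C23 _BitInt sketch: the compiler itself emits the add/adc carry chain
// for the wide type, no library needed.
typedef unsigned _BitInt(256) u256;

u256 add_u256(u256 a, u256 b) {
  return a + b;
}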
> crypto stuff

the carry part in the code uses comparisons and so is not constant-time, so it cannot be used for crypto.
> division is left out for now - unsigned division is overflow-safe but raises a defect for division by 0 - potentially this could be covered by a separate set of functions called divChecked, addChecked etc which would return Opt[T] or a more precise error
> more complex operations like divmod etc are also left out, though common on CPUs - similar story: the API needs to decide what to do with division by 0
I would make the primitive API divrem(quotient, remainder: var uint64, aHi, aLo, b: uint64): bool with the bool indicating an error. In hardware, division can raise a SIGFPE not only on b == 0 but also if aHi >= b, because then the result can overflow quotient/remainder.
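In C terms, a rough sketch of that signature (divrem_u64 is a made-up name, and the GCC/Clang unsigned __int128 extension is used for brevity):

#include <stdbool.h>
#include <stdint.h>

// Sketch of the proposed divrem primitive: divides the 128-bit value (aHi:aLo)
// by b; the bool reports an error, i.e. the cases where hardware division
// would trap: b == 0, or aHi >= b (the quotient would not fit in 64 bits).
bool divrem_u64(uint64_t *quotient, uint64_t *remainder,
                uint64_t aHi, uint64_t aLo, uint64_t b) {
  if (b == 0 || aHi >= b)
    return true;  // error, result undefined
  unsigned __int128 a = ((unsigned __int128)aHi << 64) | aLo;
  *quotient  = (uint64_t)(a / b);
  *remainder = (uint64_t)(a % b);
  return false;
}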
should "wrapping" (long-name versions of uint ops) and "saturating" versions be in there? checked versions (ie such that return`Opt[T]`)?
I personally don't use saturating add/sub but LLVM does provide those as intrinsics for both signed and unsigned integers.
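For reference, a saturating unsigned add is easy to express on top of the overflow builtin - a sketch, not something from the PR:

#include <stdint.h>

// Saturating unsigned add: clamp to the maximum instead of wrapping.
static inline uint64_t add_saturating_u64(uint64_t a, uint64_t b) {
  uint64_t res;
  if (__builtin_add_overflow(a, b, &res))
    return UINT64_MAX;
  return res;
}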
> The ops are intended primarily for implementing other, presumably safer, integer operations (big integer support like https://github.com/status-im/nim-stint/
Re stint, the plan in the total refactoring at https://github.com/status-im/nim-stint/pull/104 was also to build a core set of primitives (https://github.com/status-im/nim-stint/tree/59dd89638128636d4b636f25048bbda5bc7f5a4c/stint/private/primitives):
This mirrors Constantine (https://github.com/mratsim/constantine/tree/1c5341f/constantine/platforms/intrinsics) and is enough to implement the fastest™ unsigned bigint backend.
Do note that Clang generates significantly better code for arithmetic, especially following the improvements made for the Ethereum EVM and intx (https://github.com/chfast/intx) by chfast (https://github.com/llvm/llvm-project/issues?q=author%3Achfast+).
> C23
...and it's already in clang today under the name _ExtInt - but that's far out, ie it'll take 5-10 years for such things to permeate the ecosystem
> carry part in the code uses comparisons
have a good reference on why comparison operators are problematic? ie when there's no actual branching based on it?
> was also to build a core set of primitives
yes, I remember we discussed it at some point but I didn't actually know there was code written - nice!
will take a look - supporting stint, but also more general use, would indeed be one target here (though the focus is primarily on exposing what compilers have readily available today and maybe building something like the stint / weave primitives on top) - this to make progress on something small, while things like big refactorings often get stuck for years.
> Do note that Clang generates significantly better code for arithmetic,
interesting, didn't know chfast had actually gone all the way to the compiler with his work - it's good news though, my next plan for intops was to turn them into nlvm intrinsics as well which means we should be able to get assembly-level performance without writing assembly, hopefully.
> crypto stuff
also, my assumption here would be that the built-ins don't have any comparison problems since they fetch the carry bit from a cpu flag - ie any time you're doing "crypto stuff", you kind of make assumptions about the compiler you're working with given the lack of standardization around constant-time-ness.
Given the state of compilers, it's perhaps slightly beyond the reach of the primitives here to test for that, although it would indeed be a huge advantage if it was verified in unit tests. LLVM for example tests the assembly output of a given program in certain cases; it would actually be pretty cool to have a test that ensures that for a given platform, a "construct" compiles to a certain instruction so that "crypto stuff" can be built more reliably.
This also reminds me of C++ atomics where the standard provides a compile-time constant that describes whether they are true atomics or emulated with locks on any given platform - something similar would perhaps be interesting here, ie that on "verified" platforms, it would be set to true to signal to the calling library that it's "safe".
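for reference, the C11 side of that precedent looks roughly like this (standard <stdatomic.h>, nothing from the PR):

#include <stdatomic.h>
#include <stdio.h>

// The atomics precedent: the platform advertises at compile time whether
// atomics are native or emulated with locks (2 == always lock-free).
int main(void) {
  _Atomic long long x = 0;
  printf("ATOMIC_LLONG_LOCK_FREE  = %d\n", ATOMIC_LLONG_LOCK_FREE);
  printf("atomic_is_lock_free(&x) = %d\n", (int)atomic_is_lock_free(&x));
  return 0;
}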
This would again be trivial to implement in nlvm - in nim, it's a bit more tricky because of the interaction with the underlying C compiler zoo.
> have a good reference on why comparison operators are problematic? ie when there's no actual branching based on it?
As branch predictors improve, compilers tend to be updated to reintroduce branches; see:
#include <stdint.h>
#include <stdbool.h>

int ct_is_non_zero_u32(uint32_t x) {
  return (x | -x) >> 31;
}

uint32_t ct_mask_u32(uint32_t bit) {
  return -(uint32_t)ct_is_non_zero_u32(bit);
}

// Four alternative implementations of ct_select_u32; VERSION 1 is active,
// the others are left commented out so the file compiles as-is.
uint32_t ct_select_u32(uint32_t x, uint32_t y, bool bit /* = {0,1} */) {
  // VERSION 1
  uint32_t m = ct_mask_u32(bit);
  return (x & m) | (y & ~m);

  // VERSION 2. Same as VERSION 1 but without using multiple functions
  // uint32_t m = -(uint32_t)(((uint32_t)bit | -(uint32_t)bit) >> 31);
  // return (x & m) | (y & ~m);

  // VERSION 3
  // signed b = 1 - bit;
  // return (x * bit) | (y * b);

  // VERSION 4
  // signed b = 0 - bit;
  // return (x & b) | (y & ~b);
}
On Clang 3.0, all the versions were constant-time; in 3.9, only the slowest VERSION 1 with separate functions was left constant-time - branches were introduced in the rest.
In short, and sadly, you need assembly not only for performance but also for security in cryptography.
> my assumption here would be that the built-ins don't have any comparison problems since they fetch the carry bit from a cpu flag - ie any time you're doing "crypto stuff", you kind of make assumptions about the compiler you're working with given the lack of standardization around constant-time-ness.
This is also covered by previous answer.
One thing to note is that, surprisingly, compilers have no notion of a carry flag or overflow flag in their IR; this is something one level lower, for LLVM in the MIR (MachineIR). The proper generation of add-with-carry depends on pattern matching, i.e. the "missed optimization" reports from @chfast. Unfortunately compilers optimize for speed and there is no feature flag for "secret data, don't branch".
2 bad codegen cases:
#include <stdint.h>
#include <x86intrin.h>
void add256(uint64_t a[4], uint64_t b[4]) {
  uint8_t carry = 0;
  for (int i = 0; i < 4; ++i)
    carry = _addcarry_u64(carry, a[i], b[i], &a[i]);
}
GCC 9 asm
add256:
movq (%rsi), %rax
addq (%rdi), %rax
setc %dl
movq %rax, (%rdi)
movq 8(%rdi), %rax
addb $-1, %dl
adcq 8(%rsi), %rax
setc %dl
movq %rax, 8(%rdi)
movq 16(%rdi), %rax
addb $-1, %dl
adcq 16(%rsi), %rax
setc %dl
movq %rax, 16(%rdi)
movq 24(%rsi), %rax
addb $-1, %dl
adcq %rax, 24(%rdi)
ret
Clang 9 asm (perfect, and there is a unit test)
add256:
movq (%rsi), %rax
addq %rax, (%rdi)
movq 8(%rsi), %rax
adcq %rax, 8(%rdi)
movq 16(%rsi), %rax
adcq %rax, 16(%rdi)
movq 24(%rsi), %rax
adcq %rax, 24(%rdi)
retq
Annoyingly, while some pattern matching or builtins might work well on x86-64, they don't on other architectures like Nvidia GPUs, even when using builtin types like _ExtInt(128) that map directly to LLVM i128.
// Compile with LLVM
// /usr/lib/llvm13/bin/clang++ -S -emit-llvm \
// build/nvidia/wideint128.cu \
// --cuda-gpu-arch=sm_86 \
// -L/opt/cuda/lib64 \
// -lcudart_static -ldl -lrt -pthread
// /usr/lib/llvm13/bin/clang++ build/nvidia/wideint128.cu \
// -o build/nvidia/wideint128 \
// --cuda-gpu-arch=sm_86 \
// -L/opt/cuda/lib64 \
// -lcudart_static -ldl -lrt -pthread
// llc -mcpu=sm_86 build/nvidia/wideint128-cuda-nvptx64-nvidia-cuda-sm_86.ll -o build/nvidia/wideint128_llvm.ptx
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <cstdint>
#include <stdio.h>
typedef _ExtInt(128) u128;
cudaError_t add128();
__global__ void add128Kernel(u128* r, u128 a, u128 b) {
    *r = a + b;
    // for (int i = 0; i < 16; i++) {
    //     printf("%02X", ((unsigned char*)(r))[i]);
    // }
}
int main()
{
    cudaError_t cudaStatus = add128();
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "addWithCuda failed!");
        return 1;
    }
    cudaStatus = cudaDeviceReset();
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaDeviceReset failed!");
        return 1;
    }
    getchar();
    return 0;
}
cudaError_t add128()
{
    cudaError_t cudaStatus;
    u128* d_r = nullptr; // the result must live in device memory
    u128 a = 0xFFFFFFFFFFFFFFFFULL;
    u128 b = 0x0010000000000000ULL;
    cudaStatus = cudaSetDevice(0);
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaSetDevice failed!\n");
        goto Error;
    }
    cudaStatus = cudaMalloc(&d_r, sizeof(u128));
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed!\n");
        goto Error;
    }
    add128Kernel<<<1, 1>>>(d_r, a, b);
    cudaStatus = cudaGetLastError();
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "addKernel launch failed: %s\n", cudaGetErrorString(cudaStatus));
        goto Error;
    }
    cudaStatus = cudaDeviceSynchronize();
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaDeviceSynchronize returned error code %d after launching addKernel!\n", cudaStatus);
        goto Error;
    }
Error:
    cudaFree(d_r);
    return cudaStatus;
}
LLVM IR (properly uses i128)
; ModuleID = 'build/nvidia/wideint128.cu'
source_filename = "build/nvidia/wideint128.cu"
target datalayout = "e-i64:64-i128:128-v16:16-v32:32-n16:32:64"
target triple = "nvptx64-nvidia-cuda"
; Function Attrs: convergent mustprogress noinline norecurse nounwind optnone
define dso_local void @_Z12add128KernelPU7_ExtIntILi128EEiS_S_(i128* %0, i128 %1, i128 %2) #0 {
%4 = alloca i128*, align 8
%5 = alloca i128, align 8
%6 = alloca i128, align 8
store i128* %0, i128** %4, align 8
store i128 %1, i128* %5, align 8
store i128 %2, i128* %6, align 8
%7 = load i128, i128* %5, align 8
%8 = load i128, i128* %6, align 8
%9 = add nsw i128 %7, %8
%10 = load i128*, i128** %4, align 8
store i128 %9, i128* %10, align 8
ret void
}
attributes #0 = { convergent mustprogress noinline norecurse nounwind optnone "frame-pointer"="all" "min-legal-vector-width"="0" "no-trapping-math"="true" "stack-protector-buffer-size"="8" "target-cpu"="sm_86" "target-features"="+ptx72,+sm_86" }
!llvm.module.flags = !{!0, !1, !2, !3}
!nvvm.annotations = !{!4}
!llvm.ident = !{!5, !6}
!0 = !{i32 2, !"SDK Version", [2 x i32] [i32 11, i32 2]}
!1 = !{i32 1, !"wchar_size", i32 4}
!2 = !{i32 4, !"nvvm-reflect-ftz", i32 0}
!3 = !{i32 7, !"frame-pointer", i32 2}
!4 = !{void (i128*, i128, i128)* @_Z12add128KernelPU7_ExtIntILi128EEiS_S_, !"kernel", i32 1}
!5 = !{!"clang version 13.0.1"}
!6 = !{!"clang version 3.8.0 (tags/RELEASE_380/final)"}
Nvidia PTX
//
// Generated by LLVM NVPTX Back-End
//
.version 7.1
.target sm_86
.address_size 64
// .globl _Z12add128KernelPU7_ExtIntILi128EEiS_S_ // -- Begin function _Z12add128KernelPU7_ExtIntILi128EEiS_S_
// @_Z12add128KernelPU7_ExtIntILi128EEiS_S_
.visible .entry _Z12add128KernelPU7_ExtIntILi128EEiS_S_(
.param .u64 _Z12add128KernelPU7_ExtIntILi128EEiS_S__param_0,
.param .align 16 .b8 _Z12add128KernelPU7_ExtIntILi128EEiS_S__param_1[16],
.param .align 16 .b8 _Z12add128KernelPU7_ExtIntILi128EEiS_S__param_2[16]
)
{
.local .align 8 .b8 __local_depot0[40];
.reg .b64 %SP;
.reg .b64 %SPL;
.reg .pred %p<3>;
.reg .b64 %rd<18>;
// %bb.0:
mov.u64 %SPL, __local_depot0;
cvta.local.u64 %SP, %SPL;
ld.param.v2.u64 {%rd4, %rd5}, [_Z12add128KernelPU7_ExtIntILi128EEiS_S__param_2];
ld.param.v2.u64 {%rd2, %rd3}, [_Z12add128KernelPU7_ExtIntILi128EEiS_S__param_1];
ld.param.u64 %rd1, [_Z12add128KernelPU7_ExtIntILi128EEiS_S__param_0];
cvta.to.global.u64 %rd6, %rd1;
cvta.global.u64 %rd7, %rd6;
st.u64 [%SP+0], %rd7;
st.u64 [%SP+16], %rd3;
st.u64 [%SP+8], %rd2;
st.u64 [%SP+32], %rd5;
st.u64 [%SP+24], %rd4;
ld.u64 %rd8, [%SP+16];
ld.u64 %rd9, [%SP+8];
ld.u64 %rd10, [%SP+32];
ld.u64 %rd11, [%SP+24];
add.s64 %rd12, %rd9, %rd11;
setp.lt.u64 %p1, %rd12, %rd11;
setp.lt.u64 %p2, %rd12, %rd9;
selp.u64 %rd13, 1, 0, %p2;
selp.b64 %rd14, 1, %rd13, %p1;
add.s64 %rd15, %rd8, %rd10;
add.s64 %rd16, %rd15, %rd14;
ld.u64 %rd17, [%SP+0];
st.u64 [%rd17], %rd12;
st.u64 [%rd17+8], %rd16;
ret;
// -- End function
}
but PTX does support add-with-carry natively (add.cc / addc) - it's just not being used here.
> Given the state of compilers, it's perhaps slightly beyond the reach of the primitives here to test for that, although it would indeed be a huge advantage if it was verified in unit tests. LLVM for example tests the assembly output of a given program in certain cases; it would actually be pretty cool to have a test that ensures that for a given platform, a "construct" compiles to a certain instruction so that "crypto stuff" can be built more reliably.
Yes, I actually spent some time looking into LLVM tests, for example add256 at https://github.com/llvm/llvm-project/blob/main/llvm/test/CodeGen/X86/addcarry.ll#L47-L63, but those are architecture-specific - there's no such guarantee on Nvidia GPUs, so I need my own GPU code generator (via LLVM IR inline assembly) despite not having a constant-time requirement on GPUs :/
> One thing to note is that, surprisingly, compilers have no notion of a carry flag or overflow flag in their IR,
well, the builtins do: they simply return the result + carry in a tuple: https://llvm.org/docs/LangRef.html#llvm-uadd-with-overflow-intrinsics which is close enough as far as IR goes.
So the point of these functions is really to abstract away all the mess and provide functions that return precisely the information needed to implement "common" integer stuff based on the language primitives nim offers (which for now is limited to 64 bits) - they describe a perfect underlying compiler world towards which the underlying optimizers can strive - the focus must thus be to accurately and minimally describe constraints, inputs and outputs so the optimizers have the right information to work with.
Even with underlying `_ExtInt` support, Nim still needs a library to "expose" it - stint could very well be that library, and it would use either these primitives or C23 etc - even then these primitives will still be needed, ie for arbitrary / runtime precision work.
The @llvm.uadd.with.overflow builtins are unfortunately not enough
They have a signature of
%res = call {i64, i1} @llvm.uadd.with.overflow.i64(i64 %a, i64 %b)
but we need the carry-in, the same as the x86-only intrinsics:
%res = call { i8, i64 } @llvm.x86.addcarry.64(i8 %carry_in, i64 %a, i64 %b)
I tried to look in LLVM if they compile the x86 intrinsics to the LLVM intrinsics at IR level for example here (https://github.com/llvm/llvm-project/blob/d526e2e/llvm/lib/Target/X86/X86InstCombineIntrinsic.cpp#L544-L555) but no it directly creates an ADDCARRY ISel (instruction selection) node.
So at the LLVM level there are 4 solutions for i128 addition, for example splitting into 64-bit limbs:

(c_out, lo) = a_lo + b_lo
(_, hi)     = a_hi + b_hi
hi         += c_out

hoping that they will be merged into add+adc.
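Sketched in C with the generic builtin (which has no carry-in), this is the pattern the backend then has to recognize:

#include <stdbool.h>
#include <stdint.h>

// Split i128 addition into 64-bit limbs using the generic overflow builtin;
// there is no carry-in, so the carry is re-added separately and the backend
// has to merge this back into add + adc.
static inline void add128_limbs(uint64_t *hi, uint64_t *lo,
                                uint64_t aHi, uint64_t aLo,
                                uint64_t bHi, uint64_t bLo) {
  uint64_t c_out = __builtin_add_overflow(aLo, bLo, lo);  // (c_out, lo) = aLo + bLo
  *hi = aHi + bHi + c_out;                                // hi = aHi + bHi + c_out
}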
But that's just a temporary problem - in the long run RISC-V takes over everything, and it cannot be bothered with such things as "carry flags" because the entire architecture was designed to run 4 old single-threaded unsafe C programs well...
(Sorry to digress but I really dislike RISC-V.)