I needed to calculate Euclidean integer division (similar to divmod, except the remainder is always >= 0) for multi-dimensional array indexing. There didn't seem to be an implementation in the Nim standard libraries, so I wrote a euclideanQuotRem template that uses Nim's div and mod operators.
In turn, this needed a sgn(x) function (return [-1, 0, 1] depending on whether x is [negative, zero, positive]). I also wrote a trivial divmod template for completeness (since euclideanQuotRem is almost, but not quite, the same as divmod, so having one but not the other in the standard library might cause confusion).
Perhaps this code could become part of the math module?
[I was originally going to suggest that the C stdlib functions div and ldiv be importc'ed into the math module (perhaps as cdiv and cldiv), but then I did some timing and discovered that (in C at least), quot = dividend / divisor; rem = dividend - divisor*quot; is consistently faster than both ldiv(dividend, divisor) and quot = dividend / divisor; rem = dividend % divisor, both with and without optimisation flags.]
Questions/comments/suggestions about this code?
{. push overflowChecks: off .}
template sgn*(x: expr): int8 =
## Return (-1, 0, 1) if `x` is (negative, zero, positive).
##
## The return value is an `int8`, so it can be implicitly widened to
## any other integer type.
##
## The branchless approach used here was demonstrated in the following
## answer on StackOverflow:
## http://stackoverflow.com/a/4609795
##
let val = x # Avoid multiple occurrences of side effects.
let zero: type(val) = 0
(zero < val).int8 - (val < zero).int8
template euclideanQuotRem*(dividend, divisor: int): tuple[quot, rem: int] =
## Return the quotient & remainder of the Euclidean division of `dividend`
## by `divisor`.
##
## According to the definition of Euclidean integer division, the remainder
## must never be negative:
## https://en.wikipedia.org/wiki/Euclidean_division#Statement_of_the_theorem
##
## The remainder returned by this function will always be zero or positive,
## contained in the half-open range `[0, abs(divisor))`, regardless of the
## sign of `dividend` or `divisor`.
##
## Thus, the definition of this function differs from both Haskell's
## `quotRem` (which returns a remainder with the same sign as `dividend`,
## by truncating the integer division result towards zero) and Haskell's
## `divMod` (which returns a remainder with the same sign as `divisor`,
## by truncating the integer division result towards negative infinity).
## Neither of Haskell's functions conform to the definition of Euclidean
## integer division.
##
## If `dividend` is negative, the quotient & remainder returned by this
## function will also differ from the results from Nim's `div` and `mod`
## operators (which are translated to the C operators `integer / integer`
## and `%`, respectively, which return a remainder with the same sign as
## `dividend`). For results that match the `div` and `mod` operators,
## use the `divmod` function (defined below) instead.
##
let x: int = dividend # Avoid multiple occurrences of side effects.
let y: int = divisor
let sgn_divisor: int = y.sgn # [-1, 0, 1] if [negative, 0, positive]
let q = x div y
# On my machine (Intel Core i7), `rem = (dividend - divisor * quotient)`
# is consistently faster than both `rem = dividend mod divisor` and the
# C stdlib function `ldiv` -- both with the GCC "-O3" optimisation flag
# and without any optimisation flags.
let r = x - y * q
let rem_isneg = (r < 0).int8 # 1 if negative; 0 otherwise
let correction = rem_isneg * sgn_divisor
tuple(quot: q - correction, rem: r + correction * divisor)
{. pop .}
template divmod*(dividend, divisor: int): tuple[d, m: int] =
## Return the result of `div` and `mod` simultaneously.
##
## Since Nim's `div` and `mod` operators are translated to the C operators
## `integer / integer` and `%`, respectively, this function will always
## return a remainder with the same sign as `dividend`, with the integer
## division result truncated towards zero.
##
## The `integer / integer` and `%` operators in C++, Java and JavaScript
## behave the same as in C:
## https://en.wikipedia.org/wiki/Modulo_operation
##
## The Haskell equivalent of this function is called `quotRem`. There is
## also a Haskell function called `divMod`, which returns a remainder with
## the same sign as `divisor`. The Python `divmod` function behaves the
## same as the Haskell `divMod`.
##
## Haskell programming advice suggests avoiding Haskell `quotRem` in favor
## of Haskell `divMod`:
## https://wiki.haskell.org/Haskell_programming_tips#Forget_about_quot_and_rem
## However, neither of Haskell's functions conform to the definition of
## Euclidean integer division (in which the remainder is never negative).
## For correct Euclidean division, use the `euclideanQuotRem` function
## (defined above) instead.
##
let x: int = dividend # Avoid multiple occurrences of side effects.
let y: int = divisor
let quot = x div y
# On my machine (Intel Core i7), `rem = (dividend - divisor * quotient)`
# is consistently faster than both `rem = dividend mod divisor` and the
# C stdlib function `ldiv` -- both with the GCC "-O3" optimisation flag
# and without any optimisation flags.
let rem = x - y * quot
tuple(d: quot, m: rem)
Why templates instead of procs?
Yes, would be fine to have some information when we should better use procs, and when better templates.
Currently I use procs whenever possible. I have still some problems to predict the exact behaviour of templates, and my feeling is that procs are "more clean". If inlined, there should be no overhead. And -- when do we should use explicit inline pragma for procs? I tend to let the inline decision to the compilers.
Yes, would be fine to have some information when we should better use procs, and when better templates.
Use the least powerful construct that works, which means in this order:
I would only explicitly use the inline pragma if you can measure a speedup, otherwise trust the compiler.
Edit: It's also in the manual: http://nim-lang.org/manual.html#statement-macros
jboy: a general feeling that I should use templates in Nim when I would have marked a function as inline in C++.
I have found you should not do this. Using templates force-inlines things (before the C codegen to my understanding) and can have unexpected performance results. For instance, when I make all the basic Vector arithmetic procedures in my raytrace benchmark templates instead of inline procs then the performance significantly decreases. I assume this is due to the C optimizers better understanding of cache lines.
Also, unless you're compiling with --checks:on there is no reason to {.push checks:off.}. Bounds checking is automatically disabled in release builds.
Stefan_Salewski: And -- when do we should use explicit inline pragma for procs? I tend to let the inline decision to the compilers.
What the {.inline.} pragma does is to (1) annotate the generated C function as inline and (2) emit code for it in all modules that call it so that the C compiler can actually inline it. The decision whether to do any inlining still rests with the C compiler.
If the procedure is in the same module as its caller and there's no forward declaration, then you generally don't need {.inline.}, as modern C compilers are smart enough to figure that out themselves.
Templates do inlining at the Nim level (and also allow for some things that procs can't do). Gratuitous use of templates can lead to the size of the code blowing up and result in performance degradation (though, situationally, templates can also make the generated code smaller and inlining can obviously also result in speedups). I generally only use templates if (1) I need them for something that procedures don't do, if (2) the procedure version is unnecessarily inefficient or clumsy, or (3) as an optimization where I need the optimization (and it actually generates a speedup).
In the case of a simple arithmetic operation that only results in a few instructions at the assembly level, inlining (whether via templates or {.inline.}) is generally more efficient, since you don't save a lot of space, if any at all (and having to pass parameters to a function can actually make register allocation worse) and you eliminate the procedure call overhead.
Hi @filwit, thanks for the feedback. :)
That is quite a strange result that your code is slower after making those tiny procs into templates.
In my defense,
In general, sure -- I can understand the argument for making them procs rather than templates. I just figured I'd leave them in the slightly-more-subtle-to-implement-but-already-done template form for now, since I find it's easier to convert templates to procs than the other way around.
The reason for {. push overflowChecks: off .} was because it was performing a double-sided overflow check each time I subtracted [0,1] (a bool converted to an int8) from [0,1]:
TMP146 = subInt(((NI8) ((zero_99113 < val_99111))), ((NI8) ((val_99111 < zero_99113))));
if (TMP146 < -128 || TMP146 > 127) raiseOverflow();
There is absolutely no way it could overflow, so I didn't see the point of the checks at all, even in non-release mode. (I would actually be quite happy if anyone could explain how to disable the overflow checks just for that line of code.)
@Jehan, thanks for that explanation. Very interesting.