Are there any disadvantages to using computed goto? (I'm not aware of any, but I could have missed something.)
If not, why not make {.computedGoto.} implied automatically for all switch statements over enum (or finite-range) variables (see https://nim-lang.org/docs/manual.html#pragmas-computedgoto-pragma) and deprecate the pragma, which would become a no-op? As is already the case for {.computedGoto.}, the switch statement would only be transformed into a computed goto if the backend supports it.
If I understood it correctly, computed gotos are used to reduce branching in interpreter loops (a head moving through data and then executing something depending on the value read).
From one of your links:
int interp_switch(unsigned char* code, int initval) {
    int pc = 0;
    int val = initval;
    while (1) {
        switch (code[pc++]) {  // branch depending on the value
            case OP_HALT:
                return val;
            case OP_INC:
                val++;
                break;  // branch to the end of the switch statement
            case OP_DEC:
                val--;
                break;
            case OP_MUL2:
                val *= 2;
                break;
            case OP_DIV2:
                val /= 2;
                break;
            case OP_ADD7:
                val += 7;
                break;
            case OP_NEG:
                val = -val;
                break;
            default:
                return val;
        }
    }  // branch to the beginning of the loop; the compiler might optimise
       // this so that the break jumps directly to the top
}
So in this unoptimised case we have two or three branches per interpreted instruction.
The computed goto allows us to reduce it to only one branch per instruction:
int interp_cgoto(unsigned char* code, int initval) {
    /* The indices of labels in the dispatch_table are the relevant opcodes */
    static void* dispatch_table[] = {
        &&do_halt, &&do_inc, &&do_dec, &&do_mul2,
        &&do_div2, &&do_add7, &&do_neg};
    #define DISPATCH() goto *dispatch_table[code[pc++]]

    int pc = 0;
    int val = initval;

    DISPATCH();  // jump to the next instruction
    while (1) {
        do_halt:
            return val;
        do_inc:
            val++;
            DISPATCH();  // directly jump to the next instruction
        do_dec:
            val--;
            DISPATCH();
        do_mul2:
            val *= 2;
            DISPATCH();
        do_div2:
            val /= 2;
            DISPATCH();
        do_add7:
            val += 7;
            DISPATCH();
        do_neg:
            val = -val;
            DISPATCH();
    }
}
Well, I'm not sure it's only for "interpreter loops"; it seems applicable any time you have a switch on a finite range (e.g. an enum; maybe also 8-bit integers if all values are covered).
So the question remains: is there any disadvantage to always turning on this optimization? (Which, as I mentioned, would only apply for backends that support it.)
it can use more memory than a "naked" switch
That doesn't make sense to me; it would be an array (i.e. of compile-time-known length), so it would only use stack allocation; furthermore, I don't see how a switch would use less stack memory.
The point of computed goto is not that an array lookup can be used; compilers do that for switch too, and Nim helps the C compilers by injecting something like default: unreachable() for exhaustive case statements, so that the index check before the array access can be eliminated. (Something which the oh-so-efficient C cannot do with its weakly typed enums.)
The point of computed goto is to duplicate the dispatching logic so that the branch predictors in the CPU have multiple different conditions to predict. This may sound weird, but consider that in an interpreter there are correlations between opcodes, for example, "after a push instruction, a call instruction is more likely to follow", and duplicating the dispatching logic takes advantage of these correlations.
This only makes sense when your case is within a loop, it's not generally applicable for case statements.
I was wondering how long a switch needs to be before using things like computedGoto or linearScanEnd pays off.
If you have >100 lines of switch it obviously makes sense, but is using it on a switch with only about 3 options considered bad style?
I have done extensive benchmarking of various dispatching techniques in Nim with a toy 7-instruction VM.
Results are the following:
# interp_switch took 8.604712000000003s for 1000000000 instructions: 116.2153945419672 Mips (M instructions/s)
# interp_cgoto took 7.367597000000004s for 1000000000 instructions: 135.7294651159665 Mips (M instructions/s)
# interp_ftable took 8.957571000000002s for 1000000000 instructions: 111.6374070604631 Mips (M instructions/s)
# interp_handlers took 11.039072s for 1000000000 instructions: 90.58732473164413 Mips (M instructions/s)
# interp_methods took 23.359635s for 1000000000 instructions: 42.80888806695823 Mips (M instructions/s)
@Araq is right, the main advantage of computed gotos is to better use the hardware indirect branch predictor if your case statement is done in a loop. Using a table instead would be a guaranteed cache miss.
Besides the indirect branch predictor there are also the following hardware predictors:
In assembly, a computed goto generates a jump, but if we could generate call and ret instead (without pushing and popping function parameters on the stack) we could get even better speed; see the Context Threading section.
I can't find it right now, and I might be mistaken, but IIRC Mike Pall (of LuaJIT fame) managed to make this call/ret trickery obsolete in the LuaJIT 2 interpreter, by pipelining the decoding of the next instruction to overlap with the fetching of the next address, along the lines described in https://nominolo.blogspot.com/2012/07/implementing-fast-interpreters.html.
In my opinion, if you have to resort to machine code (which the context-threading solution does), then you might as well spend a little more effort to overlap the operations; it will be obsolete within 10 years either way, thanks to architecture differences. Their solution carries over more easily to other architectures, I'll give them that, but the question is "how fast can you go", not "how fast can you go in only 400 implementation lines".
It's not obsolete; what is described in your link is leveraging instruction-level parallelism (ILP). This helps when there is a difference between the cycle latency (with data dependencies) or cycle throughput (without data dependencies) of the instructions. There is no need to use assembly for ILP; a simple a += 1; goto foo; in C will make use of it.
If we take the LuaJIT 2 example:
-- 3. Dispatch next instruction
mov   eax, [esi]        -- Load next instruction into eax
movzx ecx, ah           -- Predecode A into ecx
movzx ebp, al           -- Zero-extend OP into ebp
add   esi, 4            -- Increment program counter
shr   eax, 16           -- Predecode D
jmp   [ebx + ebp * 4]   -- Jump to next instruction via dispatch table
add and shr can be interleaved with the jmp, as they don't touch the ebx and ebp registers that the jump address depends on. Checking the Haswell arch as a baseline
This means that there is still an overhead of 1 cycle + branch misprediction cost. Context threading is still relevant here to improve branch prediction.