Are there any disadvantages to using computed goto? (I'm not aware of any, but I could have missed something.)
If not, why not make {.computedGoto.} implied automatically for all switch statements over enum (or finite-range) variables (see https://nim-lang.org/docs/manual.html#pragmas-computedgoto-pragma) and deprecate the pragma, which would become a no-op? As is already the case for {.computedGoto.}, the switch statement would only be transformed into a computed goto if the backend supports it.
If I understood it correctly, computed gotos are used to reduce branching in interpreter loops (a head moving through data and then executing something depending on the value read).
From one of your links:
int interp_switch(unsigned char* code, int initval) {
    int pc = 0;
    int val = initval;
    while (1) {
        switch (code[pc++]) {  // branch depending on the value
            case OP_HALT:
                return val;
            case OP_INC:
                val++;
                break;  // branch to the end of the switch statement
            case OP_DEC:
                val--;
                break;
            case OP_MUL2:
                val *= 2;
                break;
            case OP_DIV2:
                val /= 2;
                break;
            case OP_ADD7:
                val += 7;
                break;
            case OP_NEG:
                val = -val;
                break;
            default:
                return val;
        }
    }  // branch to the beginning of the loop; the compiler might optimise
       // this so that the break jumps directly to the top
}
So in this unoptimised case we have two or three branches per interpreted instruction.
The computed goto allows us to reduce it to only one branch per instruction:
int interp_cgoto(unsigned char* code, int initval) {
    /* The indices of labels in the dispatch_table are the relevant opcodes */
    static void* dispatch_table[] = {
        &&do_halt, &&do_inc, &&do_dec, &&do_mul2,
        &&do_div2, &&do_add7, &&do_neg};
    #define DISPATCH() goto *dispatch_table[code[pc++]]

    int pc = 0;
    int val = initval;

    DISPATCH();  // jump to the next instruction
    while (1) {
        do_halt:
            return val;
        do_inc:
            val++;
            DISPATCH();  // directly jump to the next instruction
        do_dec:
            val--;
            DISPATCH();
        do_mul2:
            val *= 2;
            DISPATCH();
        do_div2:
            val /= 2;
            DISPATCH();
        do_add7:
            val += 7;
            DISPATCH();
        do_neg:
            val = -val;
            DISPATCH();
    }
}
Well, I'm not sure it's only for "interpreter loops"; it seems applicable any time you have a switch on a finite range (e.g. an enum; maybe also 8-bit integers if all values are covered).
So the question remains: is there any disadvantage to always turning on this optimization? (Which, as I mentioned, would only apply for backends that support it.)
it can use more memory than a "naked" switch
That doesn't make sense to me; it would be an array (i.e. of compile-time-known length), so it would only use stack allocation; furthermore, I don't see how a switch would use less stack memory.
The point of computed goto is not that an array lookup can be used; compilers do that for switch too, and Nim helps the C compilers by injecting something like default: unreachable() for exhaustive case statements, so that the index check before the array access can be eliminated. (Something which the oh-so-efficient C cannot do with its weakly typed enums.)
The point of computed goto is to duplicate the dispatching logic so that the branch predictors in the CPU have multiple different conditions to predict. This may sound weird, but consider that in an interpreter there are correlations between opcodes, for example, "after a push instruction, a call instruction is more likely to follow", and duplicating the dispatching logic takes advantage of these correlations.
This only makes sense when your case is within a loop, it's not generally applicable for case statements.
I was wondering how long a switch needs to be before using things like computedGoto or linearScanEnd pays off.
If you have >100 lines of switch it obviously makes sense, but is using it on a switch with only about 3 options considered bad style?
I have done extensive benchmarking of various dispatching techniques in Nim with a toy 7-instruction VM.
Results are the following:
# interp_switch took 8.604712000000003s for 1000000000 instructions: 116.2153945419672 Mips (M instructions/s)
# interp_cgoto took 7.367597000000004s for 1000000000 instructions: 135.7294651159665 Mips (M instructions/s)
# interp_ftable took 8.957571000000002s for 1000000000 instructions: 111.6374070604631 Mips (M instructions/s)
# interp_handlers took 11.039072s for 1000000000 instructions: 90.58732473164413 Mips (M instructions/s)
# interp_methods took 23.359635s for 1000000000 instructions: 42.80888806695823 Mips (M instructions/s)
@Araq is right, the main advantage of computed gotos is to better use the hardware indirect branch predictor if your case statement is done in a loop. Using a table instead would be a guaranteed cache miss.
Besides the indirect branch predictor there are also the following hardware predictors:
In assembly, a computed goto generates a jump, but if we could generate call and ret instead (without pushing and popping function parameters on the stack) we could get even better speed; see the Context Threading section.
I can't find it right now, and I might be mistaken, but IIRC Mike Pall (of LuaJIT fame) managed to make this call/ret trickery obsolete in the LuaJIT 2 interpreter, by pipelining the decoding of the next instruction to overlap with the fetching of the next address, along the lines described in https://nominolo.blogspot.com/2012/07/implementing-fast-interpreters.html.
In my opinion, if you have to resort to machine code (which the context-threading solution does), then you might as well spend a little more effort to overlap the operations; it will be obsolete within 10 years either way, thanks to architecture differences. Their solution carries over more easily to other architectures, I'll give them that, but the question is "how fast can you go", not "how fast can you go in only 400 implementation lines".
It's not obsolete; what is described in your link is leveraging instruction-level parallelism (ILP). This helps when there is a difference between the cycle latency (with data dependencies) or cycle throughput (without data dependencies) of the instructions. There is no need to use assembly for ILP; a simple a += 1; goto foo; in C will make use of it.
If we take the LuaJIT 2 example:
-- 3. Dispatch next instruction
mov   eax, [esi]        -- Load next instruction into eax
movzx ecx, ah           -- Predecode A into ecx
movzx ebp, al           -- Zero-extend OP into ebp
add   esi, 4            -- Increment program counter
shr   eax, 16           -- Predecode D
jmp   [ebx + ebp * 4]   -- Jump to next instruction via dispatch table
add and shr can be interleaved with the jmp, as they don't touch the ebx and ebp registers that the jump address depends on. Checking the Haswell arch as a baseline
This means that there is still an overhead of 1 cycle + branch misprediction cost. Context threading is still relevant here to improve branch prediction.