Recently def- published the "How I start" guide for Nim, with a Brainfuck interpreter as the example!
I followed along and found something I want to ask about. Taking the interpreter in its final state and comparing it with the version before it is enclosed in a proc (to define the library), I get different speeds when running the Mandelbrot brainfuck code:
time nim c -r -d:release brainfuck examples/mandelbrot.b
import os

proc interpret*(code: string) =
  ## Interprets the brainfuck `code` string, reading from stdin and writing to
  ## stdout.
  var
    tape = newSeq[char]()
    codePos = 0
    tapePos = 0

  proc run(skip = false): bool =
    while tapePos >= 0 and codePos < code.len:
      if tapePos >= tape.len:
        tape.add '\0'

      if code[codePos] == '[':
        inc codePos
        let oldPos = codePos
        while run(tape[tapePos] == '\0'):
          codePos = oldPos
      elif code[codePos] == ']':
        return tape[tapePos] != '\0'
      elif not skip:
        #echo "codePos: ", codePos, " tapePos: ", tapePos
        case code[codePos]
        of '+': inc tape[tapePos]
        of '-': dec tape[tapePos]
        of '>': inc tapePos
        of '<': dec tapePos
        of '.': stdout.write tape[tapePos]
        of ',': tape[tapePos] = stdin.readChar
        else: discard

      inc codePos

  discard run()

when isMainModule:
  let code = if paramCount() > 0: readFile paramStr(1)
             else: readAll stdin
  interpret code
is about 10%-20% slower than the following version:
import os

let code = if paramCount() > 0: readFile paramStr(1)
           else: readAll stdin

var
  tape = newSeq[char]()
  codePos = 0
  tapePos = 0

proc run(skip = false): bool =
  while tapePos >= 0 and codePos < code.len:
    if tapePos >= tape.len:
      tape.add '\0'

    if code[codePos] == '[':
      inc codePos
      let oldPos = codePos
      while run(tape[tapePos] == '\0'):
        codePos = oldPos
    elif code[codePos] == ']':
      return tape[tapePos] != '\0'
    elif not skip:
      case code[codePos]
      of '+': inc tape[tapePos]
      of '-': dec tape[tapePos]
      of '>': inc tapePos
      of '<': dec tapePos
      of '.': stdout.write tape[tapePos]
      of ',': tape[tapePos] = stdin.readChar
      else: discard

    inc codePos

discard run()
Why is the second version always faster? Is it something to do with the recursive calls, and/or because run becomes a closure?
What needs to be changed to avoid the performance degradation caused by "just" enclosing everything in another proc?
Yes, I thought this too (but see the last version in this post).
So I moved the vars back into the global scope. Like this:
import os

var
  tape = newSeq[char]()
  codePos = 0
  tapePos = 0

let code = if paramCount() > 0: readFile paramStr(1)
           else: readAll stdin

proc interpret*(code: string) =
  proc run(skip = false): bool =
    while tapePos >= 0 and codePos < code.len:
      if tapePos >= tape.len:
        tape.add '\0'

      if code[codePos] == '[':
        inc codePos
        let oldPos = codePos
        while run(tape[tapePos] == '\0'):
          codePos = oldPos
      elif code[codePos] == ']':
        return tape[tapePos] != '\0'
      elif not skip:
        case code[codePos]
        of '+': inc tape[tapePos]
        of '-': dec tape[tapePos]
        of '>': inc tapePos
        of '<': dec tapePos
        of '.': stdout.write tape[tapePos]
        of ',': tape[tapePos] = stdin.readChar
        else: discard

      inc codePos

  discard run()

interpret code
It is still slower, by the same amount, than if the proc were not enclosed by another proc.
Then I made code global once again, and that results in the "fast" version, even when the proc is encapsulated:
import os

let code = if paramCount() > 0: readFile paramStr(1)
           else: readAll stdin

var
  tape = newSeq[char]()
  codePos = 0
  tapePos = 0

proc interpret*() =
  proc run(skip = false): bool =
    while tapePos >= 0 and codePos < code.len:
      if tapePos >= tape.len:
        tape.add '\0'

      if code[codePos] == '[':
        inc codePos
        let oldPos = codePos
        while run(tape[tapePos] == '\0'):
          codePos = oldPos
      elif code[codePos] == ']':
        return tape[tapePos] != '\0'
      elif not skip:
        case code[codePos]
        of '+': inc tape[tapePos]
        of '-': dec tape[tapePos]
        of '>': inc tapePos
        of '<': dec tapePos
        of '.': stdout.write tape[tapePos]
        of ',': tape[tapePos] = stdin.readChar
        else: discard

      inc codePos

  discard run()

interpret()
So I moved everything inside the proc, and that again results in the fast version:
import os

proc interpret*() =
  let code = if paramCount() > 0: readFile paramStr(1)
             else: readAll stdin

  var
    tape = newSeq[char]()
    codePos = 0
    tapePos = 0

  proc run(skip = false): bool =
    while tapePos >= 0 and codePos < code.len:
      if tapePos >= tape.len:
        tape.add '\0'

      if code[codePos] == '[':
        inc codePos
        let oldPos = codePos
        while run(tape[tapePos] == '\0'):
          codePos = oldPos
      elif code[codePos] == ']':
        return tape[tapePos] != '\0'
      elif not skip:
        case code[codePos]
        of '+': inc tape[tapePos]
        of '-': dec tape[tapePos]
        of '>': inc tapePos
        of '<': dec tapePos
        of '.': stdout.write tape[tapePos]
        of ',': tape[tapePos] = stdin.readChar
        else: discard

      inc codePos

  discard run()

interpret()
I suspect this is because of optimizations. But in general I just don't like that something gets so much slower out of the blue; I wouldn't expect this for this code. The fact that the first version in this post (where only the parameter code is local) is slower than the second one seems "unfair" to me :)
So I made a "copy" of the parameter inside of "interpret()" and ... voila! This runs fast:
import os

proc interpret*(codeIn: string) =
  var
    tape = newSeq[char]()
    codePos = 0
    tapePos = 0

  let
    code = codeIn # creating a 'copy' of the parameter

  proc run(skip = false): bool =
    while tapePos >= 0 and codePos < code.len:
      if tapePos >= tape.len:
        tape.add '\0'

      if code[codePos] == '[':
        inc codePos
        let oldPos = codePos
        while run(tape[tapePos] == '\0'):
          codePos = oldPos
      elif code[codePos] == ']':
        return tape[tapePos] != '\0'
      elif not skip:
        case code[codePos]
        of '+': inc tape[tapePos]
        of '-': dec tape[tapePos]
        of '>': inc tapePos
        of '<': dec tapePos
        of '.': stdout.write tape[tapePos]
        of ',': tape[tapePos] = stdin.readChar
        else: discard

      inc codePos

  discard run()

let codeExt = if paramCount() > 0: readFile paramStr(1)
              else: readAll stdin
interpret(codeExt)
Isn't this something which could be done by the compiler (if that is actually the reason)?
P.S.: This feels a lot like coding in Rust ;)
OderWat: So I moved the vars back into the global scope. Like this: [...] It is still slower, by the same amount, than if the proc were not enclosed by another proc.
Moving the variables into the global scope (or annotating them with {.global.}) still requires an implicit pointer to the environment for run. This could conceivably be optimized away (because there are no variables in the environment anymore), but the point of having nested procedures is generally that you want them to have access to variables in the outer procedure (otherwise you could just move the entire procedure outside). So, I'm not sure if it's a high priority to optimize for that particular case.
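For illustration, here is one way (a sketch of my own, not from the original post or the tutorial) to sidestep the closure entirely: hoist run to module level and pass the interpreter state explicitly, so no implicit environment pointer is needed. Whether this actually runs faster would of course have to be measured.

```nim
import os

type
  State = object
    tape: seq[char]
    codePos, tapePos: int

proc run(s: var State; code: string; skip = false): bool =
  ## Same interpreter loop as before, but all state is passed
  ## explicitly, so `run` is a plain proc rather than a closure.
  while s.tapePos >= 0 and s.codePos < code.len:
    if s.tapePos >= s.tape.len:
      s.tape.add '\0'

    if code[s.codePos] == '[':
      inc s.codePos
      let oldPos = s.codePos
      while run(s, code, s.tape[s.tapePos] == '\0'):
        s.codePos = oldPos
    elif code[s.codePos] == ']':
      return s.tape[s.tapePos] != '\0'
    elif not skip:
      case code[s.codePos]
      of '+': inc s.tape[s.tapePos]
      of '-': dec s.tape[s.tapePos]
      of '>': inc s.tapePos
      of '<': dec s.tapePos
      of '.': stdout.write s.tape[s.tapePos]
      of ',': s.tape[s.tapePos] = stdin.readChar
      else: discard

    inc s.codePos

proc interpret*(code: string) =
  var s = State(tape: newSeq[char]())
  discard run(s, code)
```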
OderWat: So I made a "copy" of the parameter inside of "interpret()" and ... voila! This runs fast:
For what it's worth, they all seem to have about the same performance for me (+/- 2% or so). I cannot reproduce the speedup (for any of the supposedly faster versions in your second post). That may be because of different OSes/compilers or whatever.
OderWat: But in general I just don't like it that something gets so much slower out of the blue. I don't expect this for this code.
Welcome to the world of modern CPUs and compilers where performance is essentially unpredictable (at least at that level of precision)? Can you, for example, predict all of the places where you may encounter pipeline stalls just from looking at the code of a reasonably high-level language?
time nim c -r -d:release brainfuck examples/mandelbrot.b
This measures the time of the compilation and the execution, not just the execution. Funnily, when measuring just the execution, I see the opposite of what you're describing: the first code you posted is 20% faster for me than the second one.
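For reference, the compile and run steps can be split so that only the execution is timed (the module name and example path here are taken from the command in the original post):

```shell
# Build a release binary once, without running it
nim c -d:release brainfuck
# Then time only the interpreter run
time ./brainfuck examples/mandelbrot.b
```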
I am not talking about 1-2%; it was about 10-20% (55 seconds against 63 seconds). I understand that you shrug off 1-2% with current CPUs/systems.
The compiler was Clang for all my tests. GCC performs slower overall for these (just tested).
Interestingly, I just created all the versions again, and it is now 58 vs 57 seconds for copying the parameter or not in the wrapped version. Basically what you just wrote!
Funnily enough... now the version which does not wrap "proc run()" (the second example in my first post) is slower than the wrapped ones, consistently taking >60 seconds. So it is reversed, and less of a difference now.
It was "fast" (<56 seconds) and "slow" (>63 seconds) consistently before (I swear), in the same shell/terminal. I would not have written anything for less of a difference!
Still, every test now shows that the original "unwrapped" code performs slower than the wrapped one.
So I was like: what ??? happened?
Going back to my original editor window, I can still undo/redo all the changes and versions. Trying the version which is supposed to be the "58 seconds" version: it now takes 64 seconds (instead of 58). Adding the "let code = codeIn" parameter copy to that source also results in 64 seconds. Same as before, but 7 seconds slower?
What?
Then I did what I did before I posted the code to the forum: deleting a comment and the "Welcome to brainfuck" echo command in the global context.
BANG... 57 seconds.
Adding the "echo" back in:
64 seconds...
Basically this:
import os

proc interpret*(code: string) =
  var
    tape = newSeq[char]()
    codePos = 0
    tapePos = 0

  proc run(skip = false): bool =
    while tapePos >= 0 and codePos < code.len:
      if tapePos >= tape.len:
        tape.add '\0'

      if code[codePos] == '[':
        inc codePos
        let oldPos = codePos
        while run(tape[tapePos] == '\0'):
          codePos = oldPos
      elif code[codePos] == ']':
        return tape[tapePos] != '\0'
      elif not skip:
        case code[codePos]
        of '+': inc tape[tapePos]
        of '-': dec tape[tapePos]
        of '>': inc tapePos
        of '<': dec tapePos
        of '.': stdout.write tape[tapePos]
        of ',': tape[tapePos] = stdin.readChar
        else: discard

      inc codePos

  discard run()

when isMainModule:
  echo "Welcome to brainfuck"
  let code = if paramCount() > 0: readFile paramStr(1)
             else: readAll stdin
  interpret code
runs 7-8 secs slower than this:
import os

proc interpret*(code: string) =
  var
    tape = newSeq[char]()
    codePos = 0
    tapePos = 0

  proc run(skip = false): bool =
    while tapePos >= 0 and codePos < code.len:
      if tapePos >= tape.len:
        tape.add '\0'

      if code[codePos] == '[':
        inc codePos
        let oldPos = codePos
        while run(tape[tapePos] == '\0'):
          codePos = oldPos
      elif code[codePos] == ']':
        return tape[tapePos] != '\0'
      elif not skip:
        case code[codePos]
        of '+': inc tape[tapePos]
        of '-': dec tape[tapePos]
        of '>': inc tapePos
        of '<': dec tapePos
        of '.': stdout.write tape[tapePos]
        of ',': tape[tapePos] = stdin.readChar
        else: discard

      inc codePos

  discard run()

when isMainModule:
  #echo "Welcome to brainfuck"
  let code = if paramCount() > 0: readFile paramStr(1)
             else: readAll stdin
  interpret code
In other words: I fooled myself while moving the code around, because I removed this single echo while modifying the examples I posted to the forum. AFTER writing my post I removed that echo and some comments from the code posted here ... because ... I thought it was unrelated to the testing.
In the end, there are now two different questions:
but the point of having nested procedures is generally that you want them to have access to variables in the outer procedure (otherwise you could just move the entire procedure outside)
There can also be the reason of structuring the code, so that procs used by just one other proc are defined inside it, keeping the outer level cleaner.
LeuGim: There can also be the reason of structuring the code, so that procs used by just one other proc are defined inside it, keeping the outer level cleaner.
I'm generally not a fan of that approach. If you have any reusable code that does not depend on a procedure's parameters or local state, then it's likely to be reusable by more than one procedure, and then you have to move it outside the procedure. This typically leads to modules or classes as the abstraction tool of choice, not a "Russian doll" kind of procedure.
Well, I think that just really depends on what pattern you want to establish there.
I would include procs inside other ones all day, just to make clear that they are helpers used only by those!
Which is the opposite of having a lot of procs in the outer scope that are used by only one of their siblings, for the purpose of refactoring or structure.
So yes, it is fine to have code reused, but there are a lot of reasons why this should not be overused. You may even end up with a situation where a helper proc that was "outside" got changed and copied as a specialized version for another purpose.
Taking a function "outside" also often means that you need to extend its functionality, and you end up with a "can do everything" version which is just not efficient for the original use case.
OderWat: I am not talking about 1-2%; it was about 10-20% (55 seconds against 63 seconds). I understand that you shrug off 1-2% with current CPUs/systems.
First of all, I'm not shrugging it off. I'm saying that I cannot reproduce a 10%-20% difference.
Second, 10%-20% overhead (or more) can occur through a number of reasons that you have no control over, such as a BTB collision or data being aligned poorly with cache lines. Or a compiler backend generating poor code for some reason (it happens, optimizers are horribly complex beasts these days with a ton of heuristics involved; even absent poor implementations, sometimes heuristics fail).
This is also why trying to optimize microbenchmarks is ultimately a waste of time. As soon as your code becomes part of a bigger system, totally different performance issues may manifest because non-local issues take over. As a simple example, while a microbenchmark may not utilize a cache fully and you can gain performance by precomputing values and storing them in a lookup table, this may squeeze out more important data in a large system, increase cache pressure, and degrade performance. The same goes for repeatedly inlining large segments of code. Making local variables global may, in a larger system, reduce the locality of memory access patterns because the most frequently accessed global variables are now spread out over more cache lines. Etc.
TL;DR: I am reasoning about a slowdown of between 10 and 20 percent in code which runs for 60 seconds, because I include an echo in the code which is called once and says "hello!".
Jehan, I analysed what causes the performance degradation, and it is not the fact that the proc is encapsulated in another one. That is in fact faster for me and for at least one other person.
You write "10%-20% can occur through ... or a compiler backend generating poor code ..." — well.
It can also occur because the compiler has a bug, is missing optimizations, or behaves that way by design. The same goes for the standard libraries. Afaik we are here in the "Nim Compiler & Language" forum. And I think that everybody should be interested in a 10%-20% performance degradation when outputting the text "Welcome to brainfuck" in the terminal. That of course probably has a simple explanation and could be unavoidable. I just want people to double-check.
Everything else about your philosophy, and lecturing me on where and when and how I should place my code, is not really helpful in this respect. It may be interesting to talk about that, but the problem comes down to including an echo in the code!
I also know the possible (and impossible) reasons why benchmarking has its own problems. But I was neither benchmarking nor optimizing the code! I just "felt" (without a clock) that the program was suddenly slower without changing how it works. Then I measured it to find the culprit.
The brainfuck interpreter using input/output and running Mandelbrot for a minute also hardly qualifies as a microbenchmark anymore (in my eyes).
It is reproducible every single time on my system. I don't know if you even saw the twist to the original post.
LeuGim, I agree. There is also the fact that in a large codebase you sometimes lose track of which procs are still in use. Not every language and setup has dead-code warnings, and you may also not want to check for such changes. IF Nim were slower when doing this, I would probably not use it. But I see no reason why that should be the case for the procs we are talking about.
TL;DR: I am reasoning about a slowdown of between 10 and 20 percent in code which runs for 60 seconds, because I include an echo in the code which is called once and says "hello!".
I can't reproduce this at all here (Linux, x86_64, GCC 4.9.2). With or without the echo it takes the same time to run for me. Maybe there are some other effects at play here, like your C compiler's optimizations or your CPU getting hotter.
And I think that everybody should be interested in a 10%-20% performance degradation when outputting the text "Welcome to brainfuck" in the terminal.
Yes, and people were interested and the result was "cannot reproduce". At this point you have to use a profiler and tell us why it is that much slower on your machine. Or you actually time the echo operation itself and see if it really takes 6 seconds.
I understand that something which can't be reproduced on your systems is not helpful.
BTW: I never said (or concluded) that the echo takes 6 seconds! I said that with the echo included, the code overall runs slower.
I guess that something which is linked, compiled, or optimized differently because of the echo statement has the effect of the slowdown. It may just be that the memory layout changes. I have no idea about compiler internals anymore. In the past I wrote C code and tuned it by comparing the result with the produced assembler. That was 10 years ago; we aligned structures manually. Today, with multiple cores, mapped memory, and multiple caches, this is all a big mess. My last core system drivers were written for Motorola CPUs.
So... just one more round of testing on different machines: ESXi VM Debian / iMac i7 / MBPR i7:
It comes down to this:
As it seems, "nobody" will be able to reproduce it, just because it is OS X / Clang specific (Apple ships its own special Clang version). I would be glad to hear from another Mac guy that I am not nuts. I still have another iMac and a MacBook I could test with, but that feels kinda pointless.
I am not sure if I have to apologize for anything. I am not here to bash anybody. Actually, I am pleased that you are still listening :)
P.S.: I am not sure what to do next. Probably just ignoring the problem. I could always just compile with gcc. For the most part it seems currently not important for the fate of Nim.
P.S.: I am not sure what to do next. Probably just ignoring the problem. I could always just compile with gcc. For the most part it seems currently not important for the fate of Nim.
You could try to narrow down the problem and submit a bug report to whatever is at fault, be it Clang, LLVM, or something Apple-specific (or check if they already have one). For what it's worth, on Linux with Clang 3.5.1 I get no slowdown with the echo.
OderWat: As it seems, "nobody" will be able to reproduce it, just because it is OS X / Clang specific (Apple ships its own special Clang version). I would be glad to hear from another Mac guy that I am not nuts. I still have another iMac and a MacBook I could test with, but that feels kinda pointless.
I've now been able to (sort of) reproduce it with that specific setup. There's only about a three-second difference, and there's considerable variance between runs, but on average the slowdown appears to be there. It disappears when you use echo "x" instead, so most likely it's simply a memory layout issue, i.e. nothing that you can really control.