Recently def- published the "How I start" guide for Nim, with a Brainfuck interpreter as the example!
I followed along and found something I want to ask about. Taking the interpreter in its final state and comparing it with the version before it is enclosed in a proc (to define the library), I get different speeds when running the Mandelbrot brainfuck code:
time nim c -r -d:release brainfuck examples/mandelbrot.b
import os

proc interpret*(code: string) =
  ## Interprets the brainfuck `code` string, reading from stdin and writing to
  ## stdout.
  var
    tape = newSeq[char]()
    codePos = 0
    tapePos = 0

  proc run(skip = false): bool =
    while tapePos >= 0 and codePos < code.len:
      if tapePos >= tape.len:
        tape.add '\0'

      if code[codePos] == '[':
        inc codePos
        let oldPos = codePos
        while run(tape[tapePos] == '\0'):
          codePos = oldPos
      elif code[codePos] == ']':
        return tape[tapePos] != '\0'
      elif not skip:
        #echo "codePos: ", codePos, " tapePos: ", tapePos
        case code[codePos]
        of '+': inc tape[tapePos]
        of '-': dec tape[tapePos]
        of '>': inc tapePos
        of '<': dec tapePos
        of '.': stdout.write tape[tapePos]
        of ',': tape[tapePos] = stdin.readChar
        else: discard

      inc codePos

  discard run()

when isMainModule:
  let code = if paramCount() > 0: readFile paramStr(1)
             else: readAll stdin
  interpret code
is about 10%-20% slower than the following version:
import os

let code = if paramCount() > 0: readFile paramStr(1)
           else: readAll stdin

var
  tape = newSeq[char]()
  codePos = 0
  tapePos = 0

proc run(skip = false): bool =
  while tapePos >= 0 and codePos < code.len:
    if tapePos >= tape.len:
      tape.add '\0'

    if code[codePos] == '[':
      inc codePos
      let oldPos = codePos
      while run(tape[tapePos] == '\0'):
        codePos = oldPos
    elif code[codePos] == ']':
      return tape[tapePos] != '\0'
    elif not skip:
      case code[codePos]
      of '+': inc tape[tapePos]
      of '-': dec tape[tapePos]
      of '>': inc tapePos
      of '<': dec tapePos
      of '.': stdout.write tape[tapePos]
      of ',': tape[tapePos] = stdin.readChar
      else: discard

    inc codePos

discard run()
Why is the second version always faster? Is it something to do with the recursive calls, and/or because run becomes a closure?
What needs to be changed to avoid the performance degradation caused by "just" enclosing everything in another proc?
Yes, I thought this too (but see the last version in this post).
So I moved the vars back into the global scope. Like this:
import os

var
  tape = newSeq[char]()
  codePos = 0
  tapePos = 0

let code = if paramCount() > 0: readFile paramStr(1)
           else: readAll stdin

proc interpret*(code: string) =
  proc run(skip = false): bool =
    while tapePos >= 0 and codePos < code.len:
      if tapePos >= tape.len:
        tape.add '\0'

      if code[codePos] == '[':
        inc codePos
        let oldPos = codePos
        while run(tape[tapePos] == '\0'):
          codePos = oldPos
      elif code[codePos] == ']':
        return tape[tapePos] != '\0'
      elif not skip:
        case code[codePos]
        of '+': inc tape[tapePos]
        of '-': dec tape[tapePos]
        of '>': inc tapePos
        of '<': dec tapePos
        of '.': stdout.write tape[tapePos]
        of ',': tape[tapePos] = stdin.readChar
        else: discard

      inc codePos

  discard run()

interpret code
It is still slower, by the same amount, than if the proc were not enclosed by another proc.
Then I made code global once again, and that results in the "fast" version, even when the proc is encapsulated:
import os

let code = if paramCount() > 0: readFile paramStr(1)
           else: readAll stdin

var
  tape = newSeq[char]()
  codePos = 0
  tapePos = 0

proc interpret*() =
  proc run(skip = false): bool =
    while tapePos >= 0 and codePos < code.len:
      if tapePos >= tape.len:
        tape.add '\0'

      if code[codePos] == '[':
        inc codePos
        let oldPos = codePos
        while run(tape[tapePos] == '\0'):
          codePos = oldPos
      elif code[codePos] == ']':
        return tape[tapePos] != '\0'
      elif not skip:
        case code[codePos]
        of '+': inc tape[tapePos]
        of '-': dec tape[tapePos]
        of '>': inc tapePos
        of '<': dec tapePos
        of '.': stdout.write tape[tapePos]
        of ',': tape[tapePos] = stdin.readChar
        else: discard

      inc codePos

  discard run()

interpret()
So I moved everything inside the proc, and that again results in the fast version:
import os

proc interpret*() =
  let code = if paramCount() > 0: readFile paramStr(1)
             else: readAll stdin

  var
    tape = newSeq[char]()
    codePos = 0
    tapePos = 0

  proc run(skip = false): bool =
    while tapePos >= 0 and codePos < code.len:
      if tapePos >= tape.len:
        tape.add '\0'

      if code[codePos] == '[':
        inc codePos
        let oldPos = codePos
        while run(tape[tapePos] == '\0'):
          codePos = oldPos
      elif code[codePos] == ']':
        return tape[tapePos] != '\0'
      elif not skip:
        case code[codePos]
        of '+': inc tape[tapePos]
        of '-': dec tape[tapePos]
        of '>': inc tapePos
        of '<': dec tapePos
        of '.': stdout.write tape[tapePos]
        of ',': tape[tapePos] = stdin.readChar
        else: discard

      inc codePos

  discard run()

interpret()
I suspect this is because of optimizations. But in general I just don't like that something gets so much slower out of the blue; I wouldn't expect this for this code. The fact that the first version in this post (where only the parameter code is local) is slower than the second one seems "unfair" to me :)
So I made a "copy" of the parameter inside of "interpret()" and ... voila! This runs fast:
import os

proc interpret*(codeIn: string) =
  var
    tape = newSeq[char]()
    codePos = 0
    tapePos = 0

  let
    code = codeIn # creating a 'copy' of the parameter

  proc run(skip = false): bool =
    while tapePos >= 0 and codePos < code.len:
      if tapePos >= tape.len:
        tape.add '\0'

      if code[codePos] == '[':
        inc codePos
        let oldPos = codePos
        while run(tape[tapePos] == '\0'):
          codePos = oldPos
      elif code[codePos] == ']':
        return tape[tapePos] != '\0'
      elif not skip:
        case code[codePos]
        of '+': inc tape[tapePos]
        of '-': dec tape[tapePos]
        of '>': inc tapePos
        of '<': dec tapePos
        of '.': stdout.write tape[tapePos]
        of ',': tape[tapePos] = stdin.readChar
        else: discard

      inc codePos

  discard run()

let codeExt = if paramCount() > 0: readFile paramStr(1)
              else: readAll stdin
interpret(codeExt)
Isn't this something which could be done by the compiler (if that is actually the reason)?
P.S.: This feels a lot like coding in Rust ;)
OderWat: So I moved the vars back into the global scope. Like this: [...] It is still slower, by the same amount, than if the proc were not enclosed by another proc.
Moving the variables into the global scope (or annotating them with {.global.}) still requires an implicit pointer to the environment for run. This could conceivably be optimized away (because there are no variables in the environment anymore), but the point of having nested procedures is generally that you want them to have access to variables in the outer procedure (otherwise you could just move the entire procedure outside). So, I'm not sure if it's a high priority to optimize for that particular case.
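For illustration, here is one way (a sketch of my own, not from the original post or the tutorial) to sidestep the closure entirely: hoist run to module level and pass the interpreter state explicitly, so no implicit environment pointer is needed. Whether this actually runs faster would of course have to be measured.

```nim
import os

type
  State = object
    tape: seq[char]
    codePos, tapePos: int

proc run(s: var State; code: string; skip = false): bool =
  ## Same interpreter loop as before, but all state is passed
  ## explicitly, so `run` is a plain proc rather than a closure.
  while s.tapePos >= 0 and s.codePos < code.len:
    if s.tapePos >= s.tape.len:
      s.tape.add '\0'

    if code[s.codePos] == '[':
      inc s.codePos
      let oldPos = s.codePos
      while run(s, code, s.tape[s.tapePos] == '\0'):
        s.codePos = oldPos
    elif code[s.codePos] == ']':
      return s.tape[s.tapePos] != '\0'
    elif not skip:
      case code[s.codePos]
      of '+': inc s.tape[s.tapePos]
      of '-': dec s.tape[s.tapePos]
      of '>': inc s.tapePos
      of '<': dec s.tapePos
      of '.': stdout.write s.tape[s.tapePos]
      of ',': s.tape[s.tapePos] = stdin.readChar
      else: discard

    inc s.codePos

proc interpret*(code: string) =
  var s = State(tape: newSeq[char]())
  discard run(s, code)
```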
OderWat: So I made a "copy" of the parameter inside of "interpret()" and ... voila! This runs fast:
For what it's worth, they all seem to have about the same performance for me (+/- 2% or so). I cannot reproduce the speedup (for any of the supposedly faster versions in your second post). That may be because of different OSes/compilers or whatever.
OderWat: But in general I just don't like it that something gets so much slower out of the blue. I don't expect this for this code.
Welcome to the world of modern CPUs and compilers where performance is essentially unpredictable (at least at that level of precision)? Can you, for example, predict all of the places where you may encounter pipeline stalls just from looking at the code of a reasonably high-level language?
time nim c -r -d:release brainfuck examples/mandelbrot.b
This measures the time of the compilation and the execution, not just the execution. Funnily, when measuring just the execution, I see the opposite of what you're describing: the first code you posted is 20% faster for me than the second one.
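For reference, the compile and run steps can be split so that only the execution is timed (the module name and example path here are taken from the command in the original post):

```shell
# Build a release binary once, without running it
nim c -d:release brainfuck
# Then time only the interpreter run
time ./brainfuck examples/mandelbrot.b
```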
I am not talking about 1-2%; it was about 10-20% (55 seconds against 63 seconds). I understand that you shrug off 1-2% with current CPUs/systems.
The compiler was Clang for all my tests. GCC performs slower overall for these (just tested).
Interestingly, I just created all the versions again, and it is now 58 vs 57 seconds for copying the parameter or not in the wrapped version. Basically what you just wrote!
Funnily enough... now the version which does not wrap "proc run()" (the second example in my first post) is slower than the wrapped ones, consistently taking >60 seconds. So it is reversed, and less of a difference now.
It was "fast" (<56 seconds) and "slow" (>63 seconds) consistently before (I swear), in the same shell/terminal. I would not have written anything for less of a difference!
Still, every test now shows that the original "unwrapped" code performs slower than the wrapped one.
So I was like: what ??? happened?
Going back to my original editor window, I can still undo/redo all the changes and versions. Trying the version which is supposed to be the "58 seconds" version: it now takes 64 seconds (instead of 58). Adding the "let code = codeIn" parameter copy to that source also results in 64 seconds. Same as before, but 7 seconds slower?
What?
Then I did what I did before I posted the code to the forum: deleting a comment and the "Welcome to brainfuck" echo command in the global context.
BANG... 57 seconds.
Adding the "echo" back in:
64 seconds...
Basically this:
import os

proc interpret*(code: string) =
  var
    tape = newSeq[char]()
    codePos = 0
    tapePos = 0

  proc run(skip = false): bool =
    while tapePos >= 0 and codePos < code.len:
      if tapePos >= tape.len:
        tape.add '\0'

      if code[codePos] == '[':
        inc codePos
        let oldPos = codePos
        while run(tape[tapePos] == '\0'):
          codePos = oldPos
      elif code[codePos] == ']':
        return tape[tapePos] != '\0'
      elif not skip:
        case code[codePos]
        of '+': inc tape[tapePos]
        of '-': dec tape[tapePos]
        of '>': inc tapePos
        of '<': dec tapePos
        of '.': stdout.write tape[tapePos]
        of ',': tape[tapePos] = stdin.readChar
        else: discard

      inc codePos

  discard run()

when isMainModule:
  echo "Welcome to brainfuck"
  let code = if paramCount() > 0: readFile paramStr(1)
             else: readAll stdin
  interpret code
runs 7-8 secs slower than this:
import os

proc interpret*(code: string) =
  var
    tape = newSeq[char]()
    codePos = 0
    tapePos = 0

  proc run(skip = false): bool =
    while tapePos >= 0 and codePos < code.len:
      if tapePos >= tape.len:
        tape.add '\0'

      if code[codePos] == '[':
        inc codePos
        let oldPos = codePos
        while run(tape[tapePos] == '\0'):
          codePos = oldPos
      elif code[codePos] == ']':
        return tape[tapePos] != '\0'
      elif not skip:
        case code[codePos]
        of '+': inc tape[tapePos]
        of '-': dec tape[tapePos]
        of '>': inc tapePos
        of '<': dec tapePos
        of '.': stdout.write tape[tapePos]
        of ',': tape[tapePos] = stdin.readChar
        else: discard

      inc codePos

  discard run()

when isMainModule:
  #echo "Welcome to brainfuck"
  let code = if paramCount() > 0: readFile paramStr(1)
             else: readAll stdin
  interpret code
In other words: I fooled myself while moving the code around, because I removed this single echo while modifying the examples I posted to the forum. AFTER writing my post I removed that echo and some comments from the code posted here ... because ... I thought it was unrelated to the testing.
In the end, there are now two different questions:
but the point of having nested procedures is generally that you want them to have access to variables in the outer procedure (otherwise you could just move the entire procedure outside)
There can also be the reason of structuring the code, so that procs used by just one other proc are defined inside it, keeping the outer level cleaner.
LeuGim: There can also be the reason of structuring the code, so that procs used by just one other proc are defined inside it, keeping the outer level cleaner.
I'm generally not a fan of that approach. If you have any reusable code that does not depend on a procedure's parameters or local state, then it's likely to be reusable by more than one procedure, and then you have to move it outside the procedure. This typically leads to modules or classes as the abstraction tool of choice, not a "Russian doll" kind of procedure.
Well, I think that just really depends on what pattern you want to establish there.
I would include procs inside other ones all day, just to make clear that they are helpers used only by those!
Which is the opposite of having a lot of procs in the outer scope that are used by only one of their siblings, for the purpose of refactoring or structure.
So yes, it is fine to have code reused, but there are a lot of reasons why this should not be overused. You may even end up with a situation where a helper proc that was "outside" got changed and copied as a specialized version for another purpose.
Taking a function "outside" also often means that you need to extend its functionality, and you end up with a "can do everything" version which is just not efficient for the original use case.
OderWat: I am not talking about 1-2%; it was about 10-20% (55 seconds against 63 seconds). I understand that you shrug off 1-2% with current CPUs/systems.
First of all, I'm not shrugging it off. I'm saying that I cannot reproduce a 10%-20% difference.
Second, 10%-20% overhead (or more) can occur through a number of reasons that you have no control over, such as a BTB collision or data being aligned poorly with cache lines. Or a compiler backend generating poor code for some reason (it happens, optimizers are horribly complex beasts these days with a ton of heuristics involved; even absent poor implementations, sometimes heuristics fail).
This is also why trying to optimize microbenchmarks is ultimately a waste of time. As soon as your code becomes part of a bigger system, totally different performance issues may manifest because non-local issues take over. As a simple example, while a microbenchmark may not utilize a cache fully and you can gain performance by precomputing values and storing them in a lookup table, this may squeeze out more important data in a large system, increase cache pressure, and degrade performance. The same goes for repeatedly inlining large segments of code. Making local variables global may, in a larger system, reduce the locality of memory access patterns because the most frequently accessed global variables are now spread out over more cache lines. Etc.
TL;DR: I am reasoning about a slowdown of between 10 and 20 percent in code which runs for 60 seconds, because I include an echo in the code which is called once and says "hello!".
Jehan, I analysed what causes the performance degradation, and it is not the fact that the proc is encapsulated in another one. That is in fact faster for me and for at least one other person.
You write "10%-20% can occur through ... or a compiler backend generating poor code ..." — well.
It can also occur because the compiler has a bug, is missing optimizations, or behaves that way by design. The same goes for the standard libraries. Afaik we are here in the "Nim Compiler & Language" forum. And I think that everybody should be interested in a 10%-20% performance degradation when outputting the text "Welcome to brainfuck" in the terminal. That of course probably has a simple explanation and could be unavoidable. I just want people to double-check.
Everything else about your philosophy, and lecturing me on where and when and how I should place my code, is not really helpful in this respect. It may be interesting to talk about that, but the problem comes down to including an echo in the code!
I also know the possible (and impossible) reasons why benchmarking has its own problems. But I was neither benchmarking nor optimizing the code! I just "felt" (without a clock) that the program was suddenly slower without changing how it works. Then I measured it to find the culprit.
The brainfuck interpreter using input/output and running Mandelbrot for a minute also hardly qualifies as a microbenchmark anymore (in my eyes).
It is reproducible every single time on my system. I don't know if you even saw the twist to the original post.
LeuGim, I agree. There is also the fact that in a large codebase you sometimes lose track of which procs are still in use. Not every language and setup has dead-code warnings, and you may also not want to check for such changes. IF Nim were slower when doing this, I would probably not use it. But I see no reason why that should be the case for the procs we are talking about.
TL;DR: I am reasoning about a slowdown of between 10 and 20 percent in code which runs for 60 seconds, because I include an echo in the code which is called once and says "hello!".
I can't reproduce this at all here (Linux, x86_64, GCC 4.9.2). With or without the echo it takes the same time to run for me. Maybe there are some other effects at play here, like your C compiler's optimizations or your CPU getting hotter.
And I think that everybody should be interested in a 10%-20% performance degradation when outputting the text "Welcome to brainfuck" in the terminal.
Yes, and people were interested and the result was "cannot reproduce". At this point you have to use a profiler and tell us why it is that much slower on your machine. Or you actually time the echo operation itself and see if it really takes 6 seconds.
I understand that something which can't be reproduced on your systems is not helpful.
BTW: I never said (or concluded) that the echo takes 6 seconds! I said that with the echo included, the code overall runs slower.
I guess that something which is linked, compiled, or optimized differently because of the echo statement has the effect of the slowdown. It may just be that the memory layout changes. I have no idea about compiler internals anymore. In the past I wrote C code and tuned it by comparing the result with the produced assembler. That was 10 years ago; we aligned structures manually. Today, with multiple cores, mapped memory, and multiple caches, this is all a big mess. My last core system drivers were written for Motorola CPUs.
So... just one more round of testing on different machines: ESXi VM Debian / iMac i7 / MBPR i7:
It comes down to this:
As it seems, "nobody" will be able to reproduce it, just because it is OS X / Clang specific (Apple ships its own special Clang version). I would be glad to hear from another Mac guy that I am not nuts. I still have another iMac and a MacBook I could test with, but that feels kinda pointless.
I am not sure if I have to apologize for anything. I am not here to bash anybody. Actually, I am pleased that you are still listening :)
P.S.: I am not sure what to do next. Probably just ignoring the problem. I could always just compile with gcc. For the most part it seems currently not important for the fate of Nim.
P.S.: I am not sure what to do next. Probably just ignoring the problem. I could always just compile with gcc. For the most part it seems currently not important for the fate of Nim.
You could try to narrow down the problem and submit a bug report to whatever is at fault, be it Clang, LLVM, or something Apple-specific (or check if they already have one). For what it's worth, on Linux with Clang 3.5.1 I get no slowdown with the echo.
OderWat: As it seems, "nobody" will be able to reproduce it, just because it is OS X / Clang specific (Apple ships its own special Clang version). I would be glad to hear from another Mac guy that I am not nuts. I still have another iMac and a MacBook I could test with, but that feels kinda pointless.
I've now been able to (sort of) reproduce it with that specific setup. There's only about a three-second difference, and there's considerable variance between runs, but on average the slowdown appears to be there. It disappears when you use echo "x" instead, so most likely it's simply a memory layout issue, i.e. nothing that you can really control.