I'm strongly considering porting a project to Nim. It is a single 1.7k-line C file (not C++). I've ported the core over with some refactors to see what working with Nim is like, since the main reason for porting is ergonomics.
Now I'd like to get an idea of how fast the binary will run, but without having to port the whole thing first.
Is there some way to do this? The file is here, and if I can give it this input file, that would be good enough as a speed test.
I know I can just keep the C file and call the main function, but since the main reason is ergonomics, I'd like it to be mostly in Nim and not C or even C-like Nim (unless the two are guaranteed to run at the same speed).
There's c2nim, but some of the translation also needs refactoring that automatic translation wouldn't know about. For example
A related question: is there some AST-based Nim refactoring tool? (So I can run pieces through c2nim and then the tool.)
Or should I bite the bullet and port the whole thing by hand (and take the loss if the final runtime speed isn't very good)? Another thing that slows down porting is that I don't know how to do certain things in Nim, like fscanf(f, "%s", buffer).
Finally, has anyone done something like this? Are there accounts of porting C projects in general?
And an unrelated question: Nim can also compile to JS, but is there a compatibility layer for some of the functionality, like reading from files? For example, can I make it read from a file when compiled to C but make a network request when compiled to JS (or maybe even just read the content of a DOM element)?
I don't know how to do certain things in Nim. Like fscanf(f, "%s", buffer).
Check out readLine
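For the fscanf(f, "%s", buffer) use case specifically, a minimal sketch (file name made up) that reads a line at a time and splits on whitespace:

import strutils

let f = open("input.txt")            # hypothetical input file
for line in f.lines:                 # buffers one line at a time
  for token in line.splitWhitespace:
    echo token
f.close()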
If you haven't already, also check out Nim for C programmers. I am not a C programmer, so I will let someone more familiar with C document the perfect equivalent of C's fscanf there.
Generally your binary will run at about the same speed as C if properly optimised (remember to enable a release build with -d:release; if you run into speed issues, you probably have a lot of unnecessary allocations somewhere).
c2nim is mostly used to create library wrappers, so it's not really meant to convert a project from C to Nim (and even then the result would be very C-like). What you could do is profile your original, find its most-used part, and then try to implement that in Nim and call it from C. But getting all that to work takes some effort, so it might be easier to just port the entire thing to Nim.
The fscanf thing can be done with let f = $buffer, where $ is the "to string" operator in Nim; it works with pretty much everything. Plenty of people have ported C code to Nim, myself included. Feel free to ask questions, but it might be faster to get a reply in any of the realtime chats like Discord, Gitter, or IRC (they are all linked together).
Nim has strong compile-time (and meta-programming) capabilities, so having a procedure do two different things depending on the target is not a problem, simply do when defined(js): to check if you're compiling for JavaScript (you could even check for Node.js and support reading from files there if you wanted).
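For example, a rough sketch of one proc with two implementations (the proc name and the idea of passing an element id are made up):

when defined(js):
  import dom
  proc readInput(src: string): string =
    # JS target: read the content of a DOM element instead of a file
    $document.getElementById(src.cstring).innerHTML
else:
  proc readInput(src: string): string =
    # C target: read from disk
    readFile(src)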
(Also replying to @kaushalmodi.)
So I'm specifically looking to read from a file, and I only need it for "%s", which reads the next whitespace-delimited token. If I use readLine then I need to reposition the file handle, which is what I've done (it seems a bit inefficient and ugly):
import strutils

proc next_token*(f: File): string =
  let start = getFilePos(f)
  var buffer = newString(2048)
  let read = f.readChars(buffer, 0, len(buffer))
  buffer.setLen(read)                          # drop bytes that weren't read
  var token = buffer.strip(trailing = false)   # remove leading whitespace only
  let leading = len(buffer) - len(token)       # how many chars were stripped
  token = token.split()[0]                     # first whitespace-delimited token
  setFilePos(f, start + len(token) + leading)  # rewind to just past the token
  return token
Another way I thought of is to read one character at a time and piece it together.
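Something like this sketch of that idea (untested):

iterator tokens(f: File): string =
  # yield whitespace-delimited tokens one at a time, one readChar per character
  var tok = ""
  while not endOfFile(f):
    let c = readChar(f)
    if c in {' ', '\t', '\r', '\n'}:
      if tok.len > 0:
        yield tok
        tok = ""
    else:
      tok.add c
  if tok.len > 0:
    yield tok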
the main reason for porting is ergonomics
Love it!
Thanks. If it's fairly easy to get speeds comparable to C, then it'd be worth a shot. There were no allocations except at the very beginning in the C version, though in the port I plan to use some allocations for strings. Does a seq growing and shrinking a lot result in a lot of allocations?
I'll have a look at the chatrooms, though getting same-day answers without being real-time is already pretty helpful.
when defined works! I definitely have other meta-programming questions, like
And I only need it for "%s", which reads the next whitespace-delimited token.
Then strscans or npeg will do the job.
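With strscans that might look like this sketch ($s skips whitespace, but $w only matches identifier-like characters, so punctuation-heavy tokens would need a custom matcher):

import strscans

var tok: string
if scanf("  one two 1234", "$s$w", tok):
  echo tok   # prints "one"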
Does npeg work on files or just strings?
Check out https://github.com/zevv/npeg#quickstart.
npeg works on strings, and since you can read a file into a string, it works on files too.
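A rough sketch along the lines of the quickstart, capturing the first run of non-space characters (untested):

import npeg

let firstToken = peg "token":
  token <- *Space * >+(1 - Space)

let r = firstToken.match("one two 1234 )(*& three")
if r.ok:
  echo r.captures[0]   # prints "one"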
Seqs have a fixed growth rate, and shrinking them using setLen shouldn't change the capacity.
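If growth does show up in a profile, pre-sizing is one option; a small sketch with a made-up capacity:

var s = newSeqOfCap[int](1024)   # a single up-front allocation
for i in 0 ..< 1000:
  s.add(i)                       # no reallocation until the capacity is exceeded
s.setLen(10)                     # length shrinks; the capacity stays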
Macros have very good introspection, you can read the entire procedure body from one and even add to/remove parts of it.
As procedures have an implicit result variable, you can either make a macro to set the default, or manually set it on the first line, like so:
proc lacksZero(a: openArray[int]): bool =
  result = true
  for x in a:
    if x == 0: return false
Hmm, so unless I'm misunderstanding, that example reads all the tokens from the file and returns them all in a seq, but I'm looking to read only one token at a time so that not all tokens are in memory at once. So if the file contains
one two 1234 )(*& three
then the first call to next_token should return one, the second call should return two, and so on.
The other reason I wanted tokens to be streamed is to have a smaller special case when the file is stdin.
Good idea, I ran it on a smaller example (precompiled/compiler.f) and the results were the same. My guess is that there isn't that much to learn without knowing more about the semantic meaning of the running program.
The entire compilation
nim c -d:danger --cc:clang --passC:"-flto -fprofile-instr-generate" --passL:"-flto -fprofile-instr-generate" --gc:orc flpc.nim
echo | ./flpc precompiled/compiler.f
llvm-profdata merge default.profraw -output data.profdata
nim c -d:danger --cc:clang --passC:"-flto -fprofile-instr-use=data.profdata" --passL:"-flto -fprofile-instr-use=data.profdata" --gc:orc flpc.nim
is now down to 10.8 s, which is pretty good considering just -d:danger --gc:orc is 2.4 s.
I also found out one of the reasons for the recent speed-up: I replaced .reversed (from algorithm), which allocates a reversed copy of the seq on every call, with an iteration over indices.
for elem in g.data_stack.reversed():
  ...

became

for i in countdown(g.data_stack.len - 1, 0):
  let elem = g.data_stack[i]
  ...
I've made some tweaks, including changes based on @PMunch's suggestions for next_token.
With nimprof, I'm getting these
Entry: 1/143 Calls: 193/1580 = 12.22% [sum: 193; 193/1580 = 12.22%]
lib/system.nim: [] 353/1580 = 22.34%
flpc.nim: step 1419/1580 = 89.81%
flpc.nim: main_loop 1557/1580 = 98.54%
flpc.nim: main 1578/1580 = 99.87%
flpc.nim: flpc 1580/1580 = 100.00%
Entry: 2/143 Calls: 187/1580 = 11.84% [sum: 380; 380/1580 = 24.05%]
flpc.nim: func_call 1169/1580 = 73.99%
flpc.nim: step 1419/1580 = 89.81%
flpc.nim: main_loop 1557/1580 = 98.54%
flpc.nim: main 1578/1580 = 99.87%
flpc.nim: flpc 1580/1580 = 100.00%
Entry: 3/143 Calls: 119/1580 = 7.53% [sum: 499; 499/1580 = 31.58%]
flpc.nim: ds_add 121/1580 = 7.66%
flpc.nim: func_call 1169/1580 = 73.99%
flpc.nim: step 1419/1580 = 89.81%
flpc.nim: main_loop 1557/1580 = 98.54%
flpc.nim: main 1578/1580 = 99.87%
flpc.nim: flpc 1580/1580 = 100.00%
with --gc:orc but no PGO. Full results here.
Is there an easy way to get line-based profiling for specific functions? I could try to wrap them in individual procs for those that don't need local vars.
Also, does inlining happen automatically? I know, for example, that ds_add and ds_pop are good candidates for inlining. Should I replace them with compile-time macros?
I haven't looked into callgrind yet since I haven't used it before, but is it well supported?
Also, does inlining happen automatically?
yes, and you can encourage that with {.inline.}, but often the compiler knows better than you.
this cppcon talk has some good slides on the effects of inlining, starting with an example of it leading to slower code, and then showing the results of experiments disabling programmers' manual always-inline/never-inline annotations.
a macro is overkill; changing proc to template is a way to 'always-inline', if the compiler is ignoring your hints and you're sure you know better.
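e.g. a sketch for something like the ds_add mentioned above (names made up):

# a template body is pasted at every call site, so it's effectively always inlined
template dsAdd(stack, v: untyped) =
  stack.add(v)

var s: seq[int]
s.dsAdd(42)
doAssert s == @[42]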
I'm happy to let the compiler pick, though it'd be nice if compiler optimization could be more structured and verbose, so we could try to figure out whether any manual changes will have the effect we want.
Adding {.inline.} seems to speed it up a bit but not significantly, so it could also be measurement error.
Is there some way to hint that a variable should have a register dedicated to it? I think C has a register keyword. The variable is the last element of a seq so I'd need to change the code to split it out first.
The gcc guys say the main boost from PGO is better inlining decisions (but it's hard to be perfect - see occasional sensitivity to training data in my comment above - a very different kind of measurement error).
register has meant even less than inline forever. Optimizing C compilers do their own fancy register allocation. Even if they didn't, smart CPUs map the newest stack entries to the register file (which from its name itself may indicate another level of indirection also misleading about the archaic intent of register). That said, there are ways to hint at register subsets. They will be C compiler specific and probably not help your performance much, though. You can maybe access them via emit in Nim.
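As a sketch of the emit route (this uses GCC's local register variable extension; the register choice is arbitrary, and the C variable is only visible to further emitted C in the same proc):

proc hotLoop() =
  # GCC extension: pin a C local to a specific register; rarely a win
  {.emit: """register long counter asm("r12") = 0;""".}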
These days if you just want to make your code faster then your developer time is usually better spent looking at cache performance with perf stat, cachegrind or something. You seem to have a simpler case of a streaming parser. So, branch (mis)prediction and internal, quite possibly data-dependent pipeline bubbles are likely more important than register decisions, but it is hard to make any absolute statements about such a complex, noisy system.
In any case, it seems your Nim is already faster than your C. There is almost always more work you can do to try to speed things up, but at some point there are diminishing returns. My advice would be to look for other things like your reversed mistake, minimize allocations, and only worry about microarchitectural stuff after you are sure there isn't a lot of wasted work. Otherwise when you do find wasted work, all your microarch will probably be invalidated/you'd have to start over.
one thing i tried was making your 'Specials' const, just as you have them in the c code. it seemed to speed things up for me (from ~6.2s to ~5.6s on my laptop), but only on 1.4.6 and there's lots of noise.
the idea behind it was that the test in ds_pop might be optimizable if the values were known to gcc at compile time.
to do that, i just changed the let to const, and the specials table to {.compiletime.}. however, there's a bug/issue where compiletime values don't get transferred over to runtime, so you have to do this workaround:
import tables

var specials_ct {.compiletime.}: Table[int, string]

proc fSpecial(x: int, y: string): int =
  specials_ct[x] = y
  x

const
  foo = fSpecial(5, "foo")
  bar = fSpecial(6, "bar")
  specials = specials_ct
These days if you just want to make your code faster then your developer time is usually better spent looking at cache performance with perf stat, cachegrind or something. You seem to have a simpler case of a streaming parser. So, branch (mis)prediction and internal, quite possibly data-dependent pipeline bubbles are likely more important than register decisions, but it is hard to make any absolute statements about such a complex, noisy system.
I'm not really familiar with these tools so I mostly just followed https://developers.redhat.com/blog/2014/03/10/determining-whether-an-application-has-poor-cache-performance-2
Branch misses don't seem that bad
Performance counter stats for './flpc precompiled/self-stage0.f':
3,489.71 msec task-clock:u # 0.707 CPUs utilized
0 context-switches:u # 0.000 K/sec
0 cpu-migrations:u # 0.000 K/sec
25,824 page-faults:u # 0.007 M/sec
11,186,650,622 cycles:u # 3.206 GHz
40,194,947,838 instructions:u # 3.59 insn per cycle
8,312,613,612 branches:u # 2382.035 M/sec
13,058,104 branch-misses:u # 0.16% of all branches
but cache misses can be high
Performance counter stats for './flpc precompiled/self-stage0.f':
3,467.17 msec task-clock:u # 0.859 CPUs utilized
10,881,612,884 cycles:u # 3.138 GHz
40,194,947,556 instructions:u # 3.69 insn per cycle
3,479,750 cache-references:u # 1.004 M/sec
1,992,847 cache-misses:u # 57.270 % of all cache refs
44.01% flpc libc-x.x.so [.] __memmove_avx_unaligned_erms
12.24% flpc flpc [.] main_loop__t47znkueRRIwI9bUeHkTfNg_12
10.45% flpc libc-x.x.so [.] __memset_avx2_erms
9.93% flpc flpc [.] func_call__yY7Yz9cePkdbDBIiAVdCOJA_2
I don't know where memmove and memset are called.
│ 50: mov %rdx,%rcx ▒
71.20 │ rep movsb %ds:(%rsi),%es:(%rdi) ▒
│ 55: ← retq
Is this from some kind of string operation? I also can't get it to show source lines; I tried --debugger:native and adding -g to --passC and --passL. It seems to be less bad on a larger input (precompiled/self.f).
16.88% flpc libc-x.x.so [.] __memmove_avx_unaligned_erms
14.51% flpc flpc [.] func_call__yY7Yz9cePkdbDBIiAVdCOJA>
12.62% flpc libc-x.x.so [.] __memset_avx2_erms
12.33% flpc flpc [.] main_loop__t47znkueRRIwI9bUeHkTfNg>
In any case, it seems your Nim is already faster than your C.
I hadn't checked, but it seems that with all the changes made, the versions without PGO are comparable for C and Nim. And the PGO version for Nim is faster than the PGO version for C.
Is there some way to keep a bit more of the stack trace without slowing it down too much? With --stacktrace:on? With C and gdb, I can usually get some frames even with optimization. Or alternatively, is there some way to call Nim functions from gdb (I use call and print in C) without too much slowdown?
My advice would be to look for other things like your reversed mistake, minimize allocations, and only worry about microarchitectural stuff after you are sure there isn't a lot of wasted work.
Thanks for the tips!
I was going to try to optimize the sequence of primitives executed. Part of the advantage of porting to Nim is to make trying things like that easier.
I found reversed through profiling and am not seeing anything else obvious at the moment. So it'd have to be scattered and not showing up in profiling results.
What are the sources of allocation? I think there are only new strings and .add for seqs in this program, but is there something that makes allocations that I'm not aware of?
Thanks. I should probably keep those as const anyway, so it's worth changing even if the speed improvement is uncertain.
an idiomatic Nim solution might be to use the fact that you can specify the string value of an enum really easily, thereby doing without the extra table entirely, but i don't know how much work it would be to turn your Specials into an enum.
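a sketch of what i mean (names made up):

type Special = enum
  spFoo = "foo"
  spBar = "bar"

echo $spFoo   # prints "foo", no Table lookup needed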
So I don't really need the specials table at all, since it's only read in one place and doesn't change after initialization.
if specials.contains(elem.value):
  return fmt"<{specials[elem.value]}>"
else:
  return fmt"<special {elem.value}>"
For example, a fixed lookup function would work as well (+ the implicit .str as you suggest).
However, Nim seems to give enums consecutive integer values, but (currently) mine have values 8*i + 5 (so 5, 13, 21, ...), and I'd need conversion both ways.
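Though from a quick look, enum members can apparently take explicit ordinal and string values, so a sketch like this (names made up) might cover the 8*i + 5 encoding:

type Special = enum
  spFoo = (5, "foo")
  spBar = (13, "bar")
  spBaz = (21, "baz")

echo ord(spBar)    # 13: enum -> int
echo $spBar        # "bar": the string value
echo Special(21)   # prints "baz"; holes aren't checked on conversion, so validate first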