I'm strongly considering porting a project to Nim. It is a single 1.7k-line C file (not C++). I've ported the core over with some refactors to see what working with Nim is like, since the main reason for porting is ergonomics.
Now I'd like to get an idea of how fast the binary will run, but without having to port the whole thing first.
Is there some way to do this? The file is here, and if I can give it this input file, that would be good enough as a speed test.
I know I can just keep the C file and call the main function, but since the main reason is ergonomics, I'd like it to be mostly in Nim and not C or even C-like Nim (unless the two are guaranteed to run at the same speed).
There's c2nim, but some of the translation also needs refactoring that automatic translation wouldn't know about. For example
A related question: is there some AST-based Nim refactoring tool? (So I can run pieces through c2nim and then the tool.)
Or should I bite the bullet and port the whole thing by hand (and take the loss if the final runtime speed isn't very good)? Another thing that slows down porting is that I don't know how to do certain things in Nim, like fscanf(f, "%s", buffer).
Finally, has anyone done something like this? Are there accounts of porting C projects in general?
And an unrelated question: Nim can also compile to JS, but is there a compatibility layer for some of the functionality, like reading from files? For example, can I make it read from a file when compiled to C but make a network request when compiled to JS (or maybe even just read the content of a DOM element)?
I don't know how to do certain things in Nim. Like fscanf(f, "%s", buffer).
Check out readLine
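For the fscanf(f, "%s", buffer) use case specifically, a minimal sketch (file name made up) that reads a line at a time and splits on whitespace:

import strutils

let f = open("input.txt")            # hypothetical input file
for line in f.lines:                 # buffers one line at a time
  for token in line.splitWhitespace:
    echo token
f.close()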
If you haven't already, also check out Nim for C programmers. I am not a C programmer, so I will let someone more familiar with C document the perfect equivalent of C's fscanf there.
Generally your binary will run at about the same speed as C if properly optimised (remember to enable a release build with -d:release; if you run into speed issues, you probably have a lot of unnecessary allocations somewhere).
c2nim is mostly used to create library wrappers, so it's not really meant to convert a project from C to Nim (and even then the result would be very C-like). What you could do is profile your original, find its most-used part, and then try to implement that in Nim and call it from C. But getting all that to work takes some effort, so it might be easier to just port the entire thing to Nim.
The fscanf thing can be done with let f = $buffer, where $ is the "to string" operator in Nim; it works with pretty much everything. Plenty of people have ported C code to Nim, myself included. Feel free to ask questions, but it might be faster to get a reply in any of the realtime chats like Discord, Gitter, or IRC (they are all linked together).
Nim has strong compile-time (and meta-programming) capabilities, so having a procedure do two different things depending on the target is not a problem, simply do when defined(js): to check if you're compiling for JavaScript (you could even check for Node.js and support reading from files there if you wanted).
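For example, a rough sketch of one proc with two implementations (the proc name and the idea of passing an element id are made up):

when defined(js):
  import dom
  proc readInput(src: string): string =
    # JS target: read the content of a DOM element instead of a file
    $document.getElementById(src.cstring).innerHTML
else:
  proc readInput(src: string): string =
    # C target: read from disk
    readFile(src)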
(Also replying to @kaushalmodi.)
So I'm specifically looking to read from a file, and I only need it for "%s", which reads the next whitespace-delimited token. If I use readLine then I need to reposition the file handle, which is what I've done (it seems a bit inefficient and ugly):
import strutils

proc next_token*(f: File): string =
  let start = getFilePos(f)
  var buffer = newString(2048)
  let read = f.readChars(buffer, 0, len(buffer))
  buffer.setLen(read)                          # drop bytes that weren't read
  var token = buffer.strip(trailing = false)   # remove leading whitespace only
  let leading = len(buffer) - len(token)       # how many chars were stripped
  token = token.split()[0]                     # first whitespace-delimited token
  setFilePos(f, start + len(token) + leading)  # rewind to just past the token
  return token
Another way I thought of is to read one character at a time and piece it together.
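Something like this sketch of that idea (untested):

iterator tokens(f: File): string =
  # yield whitespace-delimited tokens one at a time, one readChar per character
  var tok = ""
  while not endOfFile(f):
    let c = readChar(f)
    if c in {' ', '\t', '\r', '\n'}:
      if tok.len > 0:
        yield tok
        tok = ""
    else:
      tok.add c
  if tok.len > 0:
    yield tok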
the main reason for porting is ergonomics
Love it!
Thanks. If it's fairly easy to get speeds comparable to C, then it'd be worth a shot. There were no allocations except at the very beginning in the C version, though in the port I plan to use some allocations for strings. Does a seq growing and shrinking a lot result in a lot of allocations?
I'll have a look at the chatrooms, though getting same-day answers without being real-time is already pretty helpful.
when defined works! I definitely have other meta-programming questions, like
And I only need it for "%s", which reads the next whitespace-delimited token.
Then strscans or npeg will do the job.
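With strscans that might look like this sketch ($s skips whitespace, but $w only matches identifier-like characters, so punctuation-heavy tokens would need a custom matcher):

import strscans

var tok: string
if scanf("  one two 1234", "$s$w", tok):
  echo tok   # prints "one"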
Does npeg work on files or just strings?
Check out https://github.com/zevv/npeg#quickstart.
npeg works on strings, and since you can read a file into a string, it works on files too.
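A rough sketch along the lines of the quickstart, capturing the first run of non-space characters (untested):

import npeg

let firstToken = peg "token":
  token <- *Space * >+(1 - Space)

let r = firstToken.match("one two 1234 )(*& three")
if r.ok:
  echo r.captures[0]   # prints "one"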
Seqs have a fixed growth rate, and shrinking them using setLen shouldn't change the capacity.
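If growth does show up in a profile, pre-sizing is one option; a small sketch with a made-up capacity:

var s = newSeqOfCap[int](1024)   # a single up-front allocation
for i in 0 ..< 1000:
  s.add(i)                       # no reallocation until the capacity is exceeded
s.setLen(10)                     # length shrinks; the capacity stays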
Macros have very good introspection, you can read the entire procedure body from one and even add to/remove parts of it.
As procedures have an implicit result variable, you can either make a macro to set the default, or manually set it on the first line, like so:
proc lacksZero(a: openArray[int]): bool =
  result = true
  for x in a:
    if x == 0: return false
Hmm, so unless I'm misunderstanding, that example reads all the tokens from the file and returns them all in a seq, but I'm looking to read only one token at a time so that not all tokens are in memory at once. So if the file contains
one two 1234 )(*& three
then the first call to next_token should return one, the second call should return two, and so on.
The other reason I wanted tokens to be streamed is to have a smaller special case when the file is stdin.
Good idea, I ran it on a smaller example (precompiled/compiler.f) and the results were the same. My guess is that there isn't that much to learn without knowing more about the semantic meaning of the running program.
The entire compilation
nim c -d:danger --cc:clang --passC:"-flto -fprofile-instr-generate" --passL:"-flto -fprofile-instr-generate" --gc:orc flpc.nim
echo | ./flpc precompiled/compiler.f
llvm-profdata merge default.profraw -output data.profdata
nim c -d:danger --cc:clang --passC:"-flto -fprofile-instr-use=data.profdata" --passL:"-flto -fprofile-instr-use=data.profdata" --gc:orc flpc.nim
is now down to 10.8 s, which is pretty good considering just -d:danger --gc:orc is 2.4 s.
I also found out one of the reasons for the recent speed-up: I replaced .reversed (from algorithm), which allocates a reversed copy of the seq on every call, with an iteration over indices.
for elem in g.data_stack.reversed():
  ...

became

for i in countdown(g.data_stack.len - 1, 0):
  let elem = g.data_stack[i]
  ...
I've made some tweaks, including changes based on @PMunch's suggestions for next_token.
With nimprof, I'm getting these
Entry: 1/143 Calls: 193/1580 = 12.22% [sum: 193; 193/1580 = 12.22%]
lib/system.nim: [] 353/1580 = 22.34%
flpc.nim: step 1419/1580 = 89.81%
flpc.nim: main_loop 1557/1580 = 98.54%
flpc.nim: main 1578/1580 = 99.87%
flpc.nim: flpc 1580/1580 = 100.00%
Entry: 2/143 Calls: 187/1580 = 11.84% [sum: 380; 380/1580 = 24.05%]
flpc.nim: func_call 1169/1580 = 73.99%
flpc.nim: step 1419/1580 = 89.81%
flpc.nim: main_loop 1557/1580 = 98.54%
flpc.nim: main 1578/1580 = 99.87%
flpc.nim: flpc 1580/1580 = 100.00%
Entry: 3/143 Calls: 119/1580 = 7.53% [sum: 499; 499/1580 = 31.58%]
flpc.nim: ds_add 121/1580 = 7.66%
flpc.nim: func_call 1169/1580 = 73.99%
flpc.nim: step 1419/1580 = 89.81%
flpc.nim: main_loop 1557/1580 = 98.54%
flpc.nim: main 1578/1580 = 99.87%
flpc.nim: flpc 1580/1580 = 100.00%
with --gc:orc but no PGO. Full results here.
Is there an easy way to get line-based profiling for specific functions? I could try to wrap them in individual procs for those that don't need local vars.
Also, does inlining happen automatically? I know, for example, that ds_add and ds_pop are good candidates for inlining. Should I replace them with compile-time macros?
I haven't looked into callgrind yet since I haven't used it before, but is it well supported?
Also, does inlining happen automatically?
yes, and you can encourage that with {.inline.}, but often the compiler knows better than you.
this cppcon talk has some good slides on the effects of inlining, starting with an example of it leading to slower code, and then showing the results of experiments disabling programmers' manual always-inline/never-inline annotations.
a macro is overkill; changing proc to template is a way to 'always-inline', if the compiler is ignoring your hints and you're sure you know better.
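e.g. a sketch for something like the ds_add mentioned above (names made up):

# a template body is pasted at every call site, so it's effectively always inlined
template dsAdd(stack, v: untyped) =
  stack.add(v)

var s: seq[int]
s.dsAdd(42)
doAssert s == @[42]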
I'm happy to let the compiler pick, though it'd be nice if compiler optimization could be more structured and verbose, so we could try to figure out whether any manual changes will have the effect we want.
Adding {.inline.} seems to speed it up a bit but not significantly, so it could also be measurement error.
Is there some way to hint that a variable should have a register dedicated to it? I think C has a register keyword. The variable is the last element of a seq so I'd need to change the code to split it out first.
The gcc guys say the main boost from PGO is better inlining decisions (but it's hard to be perfect - see occasional sensitivity to training data in my comment above - a very different kind of measurement error).
register has meant even less than inline forever. Optimizing C compilers do their own fancy register allocation. Even if they didn't, smart CPUs map the newest stack entries to the register file (which from its name itself may indicate another level of indirection also misleading about the archaic intent of register). That said, there are ways to hint at register subsets. They will be C compiler specific and probably not help your performance much, though. You can maybe access them via emit in Nim.
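As a sketch of the emit route (this uses GCC's local register variable extension; the register choice is arbitrary, and the C variable is only visible to further emitted C in the same proc):

proc hotLoop() =
  # GCC extension: pin a C local to a specific register; rarely a win
  {.emit: """register long counter asm("r12") = 0;""".}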
These days if you just want to make your code faster then your developer time is usually better spent looking at cache performance with perf stat, cachegrind or something. You seem to have a simpler case of a streaming parser. So, branch (mis)prediction and internal, quite possibly data-dependent pipeline bubbles are likely more important than register decisions, but it is hard to make any absolute statements about such a complex, noisy system.
In any case, it seems your Nim is already faster than your C. There is almost always more work you can do to try to speed things up, but at some point there are diminishing returns. My advice would be to look for other things like your reversed mistake, minimize allocations, and only worry about microarchitectural stuff after you are sure there isn't a lot of wasted work. Otherwise when you do find wasted work, all your microarch will probably be invalidated/you'd have to start over.
one thing i tried was making your 'Specials' const, just as you have them in the c code. it seemed to speed things up for me (from ~6.2s to ~5.6s on my laptop), but only on 1.4.6 and there's lots of noise.
the idea behind it was that the test in ds_pop might be optimizable if the values were known to gcc at compile time.
to do that, i just changed the let to const, and the specials table to {.compiletime.}. however, there's a bug/issue where compiletime values don't get transferred over to runtime, so you have to do this workaround:
import tables

var specials_ct {.compiletime.}: Table[int, string]

proc fSpecial(x: int, y: string): int =
  specials_ct[x] = y
  x

const
  foo = fSpecial(5, "foo")
  bar = fSpecial(6, "bar")
  specials = specials_ct
These days if you just want to make your code faster then your developer time is usually better spent looking at cache performance with perf stat, cachegrind or something. You seem to have a simpler case of a streaming parser. So, branch (mis)prediction and internal, quite possibly data-dependent pipeline bubbles are likely more important than register decisions, but it is hard to make any absolute statements about such a complex, noisy system.
I'm not really familiar with these tools so I mostly just followed https://developers.redhat.com/blog/2014/03/10/determining-whether-an-application-has-poor-cache-performance-2
Branch misses don't seem that bad
Performance counter stats for './flpc precompiled/self-stage0.f':
3,489.71 msec task-clock:u # 0.707 CPUs utilized
0 context-switches:u # 0.000 K/sec
0 cpu-migrations:u # 0.000 K/sec
25,824 page-faults:u # 0.007 M/sec
11,186,650,622 cycles:u # 3.206 GHz
40,194,947,838 instructions:u # 3.59 insn per cycle
8,312,613,612 branches:u # 2382.035 M/sec
13,058,104 branch-misses:u # 0.16% of all branches
but cache misses can be high
Performance counter stats for './flpc precompiled/self-stage0.f':
3,467.17 msec task-clock:u # 0.859 CPUs utilized
10,881,612,884 cycles:u # 3.138 GHz
40,194,947,556 instructions:u # 3.69 insn per cycle
3,479,750 cache-references:u # 1.004 M/sec
1,992,847 cache-misses:u # 57.270 % of all cache refs
44.01% flpc libc-x.x.so [.] __memmove_avx_unaligned_erms
12.24% flpc flpc [.] main_loop__t47znkueRRIwI9bUeHkTfNg_12
10.45% flpc libc-x.x.so [.] __memset_avx2_erms
9.93% flpc flpc [.] func_call__yY7Yz9cePkdbDBIiAVdCOJA_2
I don't know where memmove and memset are called.
│ 50: mov %rdx,%rcx ▒
71.20 │ rep movsb %ds:(%rsi),%es:(%rdi) ▒
│ 55: ← retq
Is this from some kind of string operation? I also can't get it to show source lines; I tried --debugger:native and adding -g to --passC and --passL. It seems to be less bad on a larger input (precompiled/self.f).
16.88% flpc libc-x.x.so [.] __memmove_avx_unaligned_erms
14.51% flpc flpc [.] func_call__yY7Yz9cePkdbDBIiAVdCOJA>
12.62% flpc libc-x.x.so [.] __memset_avx2_erms
12.33% flpc flpc [.] main_loop__t47znkueRRIwI9bUeHkTfNg>
In any case, it seems your Nim is already faster than your C.
I hadn't checked, but it seems that with all the changes made, the versions without PGO are comparable for C and Nim. And the PGO version for Nim is faster than the PGO version for C.
Is there some way to keep a bit more of the stack trace without slowing it down too much? With --stacktrace:on? With C and gdb, I can usually get some frames even with optimization. Or alternatively, is there some way to call Nim functions from gdb (I use call and print in C) without too much slowdown?
My advice would be to look for other things like your reversed mistake, minimize allocations, and only worry about microarchitectural stuff after you are sure there isn't a lot of wasted work.
Thanks for the tips!
I was going to try to optimize the sequence of primitives executed. Part of the advantage of porting to Nim is to make trying things like that easier.
I found reversed through profiling and am not seeing anything else obvious at the moment. So it'd have to be scattered and not showing up in profiling results.
What are the sources of allocation? I think there are only new strings and .add for seqs in this program, but is there something that makes allocations that I'm not aware of?
Thanks. I should probably keep those as const anyway, so it's worth changing even if the speed improvement is uncertain.
an idiomatic Nim solution might be to use the fact that you can specify the string value of an enum really easily, thereby doing without the extra table entirely, but i don't know how much work it would be to turn your Specials into an enum.
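a sketch of what i mean (names made up):

type Special = enum
  spFoo = "foo"
  spBar = "bar"

echo $spFoo   # prints "foo", no Table lookup needed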
So I don't really need the specials table at all, since it's only read in one place and doesn't change after initialization.
if specials.contains(elem.value):
  return fmt"<{specials[elem.value]}>"
else:
  return fmt"<special {elem.value}>"
For example, a fixed lookup function would work as well (+ the implicit .str as you suggest).
However, Nim seems to give enums consecutive integer values, but (currently) mine have values 8*i + 5 (so 5, 13, 21, ...), and I'd need conversion both ways.
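Though from a quick look, enum members can apparently take explicit ordinal and string values, so a sketch like this (names made up) might cover the 8*i + 5 encoding:

type Special = enum
  spFoo = (5, "foo")
  spBar = (13, "bar")
  spBaz = (21, "baz")

echo ord(spBar)    # 13: enum -> int
echo $spBar        # "bar": the string value
echo Special(21)   # prints "baz"; holes aren't checked on conversion, so validate first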