This string test uses s.add('x') instead of s = s & x for Nim, and s += 'x' for Python.
ms:nim jim$ cat str1.nim
var
  s: string
for i in 0..100_000_000:
  s.add('x')
echo len(s)
ms:nim jim$ nim c -d:danger str1
Hint: 14210 LOC; 0.275 sec; 15.977MiB peakmem; Dangerous Release build; proj: /Users/jim/nim/str1; out: /Users/jim/nim/str1 [SuccessX]
ms:nim jim$ /usr/bin/time -l ./str1
100000001
0.68 real 0.56 user 0.10 sys
326627328 maximum resident set size
79753 page reclaims
8 page faults
1 voluntary context switches
6 involuntary context switches
ms:nim jim$ cat str1.py
s = ''
for i in xrange(100000000):
    s += 'x'
print len(s)
ms:nim jim$ /usr/bin/time -l py str1.py
100000000
20.74 real 20.67 user 0.06 sys
105099264 maximum resident set size
25834 page reclaims
9 involuntary context switches
Nim blows Python out of the water on this, though it uses 326M of RAM to create a 100M string.
Python's memory use is good, only 105M for a 100M string, but it's slow.
For these tests, I'm not so much looking to find the best way to create a 100M string in Nim or Python. I'm comparing the two to find out where there may be large performance differences, hopefully in Nim's favor, and to get a better understanding of how Nim works.
Thanks. I tried that just now:
ms:nim jim$ nim c -d:danger --gc:arc str1
Hint: 11937 LOC; 0.390 sec; 12.988MiB peakmem; Dangerous Release build; proj: /Users/jim/nim/str1; out: /Users/jim/nim/str1 [SuccessX]
ms:nim jim$ /usr/bin/time -l ./str1
100000001
0.90 real 0.73 user 0.15 sys
440176640 maximum resident set size
107478 page reclaims
5 page faults
1 voluntary context switches
4 involuntary context switches
Does this need 1.3x?
@cumulonimbus - I tried that. It didn't alter the behavior I was seeing.
If this behavior was not always there, then my guess is that some arc bug was causing a crash, got fixed, and now the fix causes this. Regardless of whether it was always there or appeared by bug-jello-squishing accident as I theorize, we should probably have a little suite of "memory use regression" tests to prevent the scenario I described. Such a suite would be a kind of correctness testing for deterministic memory management, and it could do a fuzzy/ball-park compare.
Maybe we already have such a suite, perhaps informally? If so, we should add this str1 to it. If not, it can be the first test. :-)
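As a rough illustration (not an existing test, and assuming getOccupiedMem() reflects heap usage under the GC being tested), such a ball-park check could look like this:
proc checkStringMem() =
  # build a ~100MB string the same way str1.nim does
  var s: string
  for i in 0..100_000_000:
    s.add('x')
  # generous factor so only gross regressions fail; the threshold is arbitrary
  let used = getOccupiedMem()
  doAssert used < 2 * s.len, "used " & $used & " bytes for a " & $s.len & "-char string"
checkStringMem()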
The reason I suggest comparing against Python 3 is that Python 2 is no longer supported by the CPython project. Also, by far most of the people who start with Python will use Python 3.
If Python 2 is faster in many string benchmarks, that's most likely because the default string type in Python 2 is simpler (just bytes) than in Python 3 (code points). If you see your data as just bytes and want to compare on those grounds, compare against Python 3's bytes type.
Now, when benchmarking Nim vs. Python, should you use a Python version and/or code style because it's more similar in implementation to Nim or should you use a Python version and/or code style because that's how most people would use Python? :-)
By the way, I think it's similar to the question: When benchmarking Nim, should you use the fastest implementation or the most idiomatic/straightforward implementation? I guess it depends.
@HashBackupJim - newSeqOfCap[T](someLen) also exists and, yes, pre-sizing can help a lot in Nim (and almost any lang that supports it).
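For example, a sketch with sizes matching the tests above (the exact numbers are mine, only the stdlib calls are the point):
var s = newStringOfCap(100_000_001)   # reserve the final length up front
for i in 0..100_000_000:
  s.add('x')                          # appends never need to reallocate
var q = newSeqOfCap[int](12_500_001)  # same idea for a seq
for i in 0..12_500_000:
  q.add(1)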
Profile-guided optimization at the gcc level can also help Nim run timings a lot; in this case 1.6x to 1.9x for various gc modes. https://forum.nim-lang.org/t/6295 explains how. LTO also helps, since most of the boost from PGO is probably due to well-chosen inlining.
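For the record, the gcc-level PGO build goes roughly like this (my paraphrase of the linked post, assuming a gcc backend; exact flags may differ):
nim c -d:danger --passC:-fprofile-generate --passL:-fprofile-generate str1
./str1                                # training run to collect profile data
nim c -f -d:danger --passC:-fprofile-use --passL:-fprofile-use str1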
@sschwartzer - not only string benchmarks, but also interpreter start-up, etc. Anyway, this isn't a Python forum, and benchmarking "always depends". :-) :-)
Someone else should reproduce my finding that --gc:arc uses more memory than --gc:none for the original str1.nim, or for a version with a main() (or both). I think this kicking of the tires has probably uncovered a real problem.
Indeed, there is a high priority bug lurking here, please keep investigating!
One other Nim-level thing I can say is that things work as expected for seq[int] of the same memory scale (100MB). I.e.,
proc main() =
  var s: seq[int]
  for i in 0..12_500_000: s.add 1
  echo len(s)
main()
produces a memory report (using /usr/bin/time on Linux) like:
187MB seq2-arc
250MB seq2-default
250MB seq2-msweep
265MB seq2-boehm
300MB seq2-none
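For reference, I assume the five binaries were built roughly like this (the exact invocations are my guess, not shown above):
nim c -d:danger --gc:arc -o:seq2-arc seq2
nim c -d:danger -o:seq2-default seq2
nim c -d:danger --gc:markAndSweep -o:seq2-msweep seq2
nim c -d:danger --gc:boehm -o:seq2-boehm seq2
nim c -d:danger --gc:none -o:seq2-none seq2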
So, this problem is only for Nim string. Indeed, if one changes string to seq[char] in the original example, usage goes down to 139MB, roughly what one would expect for a 3/2 growth policy.
I was mistaken: I was compiling my seq test with -d:useMalloc, which fixes the problem. Sorry, fiddling with too many knobs.
string and seq[char] behave identically with gc:arc, and both get fixed (139MB) with --gc:arc -d:useMalloc. Other GCs (including none) still beat gc:arc-without-useMalloc on string. However, the other GCs use more memory (around 420MB) than gc:arc on seq[char]. So, whatever is going on, seq is actually worse than string, not better. { But also a side note for @HashBackupJim: try -d:useMalloc with --gc:arc. }
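For completeness, the seq[char] variant of the original example is just (a sketch):
var s: seq[char]
for i in 0..100_000_000:
  s.add('x')
echo len(s)
and the build @HashBackupJim would try is something like:
nim c -d:danger --gc:arc -d:useMalloc str1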
At this point, I should raise an issue over at GitHub. I'll link it back here.