I wrote an application in Nim and discovered that an application with the same logic and data structures written in Python has better execution speed than the Nim version.
I was trying to find out why Nim has a longer execution time than Python here. I isolated some code and tested two small, equivalent Python and Nim programs.
For a loop count of 100000, Python takes 1.0 s and Nim takes 1.8 s. This speed ratio corresponds to the speed ratio in the real application.
I use Python version 2.7.3 and Nim compiler version 0.13.0. I compiled the Nim version with the -d:release option.
I'd like to know the reason why, in this case, Python beats Nim in execution speed.
Here are both programs. The problem is the proc do_something.
# Nim example to test execution speed against a Python version
import times, os

var board: seq[char] = @['0', 'p', '.', 'p', 'P', '.', 'p', 'P', '.', 'p', '.', 'p', 'P', '.', 'p']
const NE: seq[int] = @[8, 2, 13, 4, 5, 7, 8, 3, 9, 12, 7, 8, 3, 9, 12]
const Directions = [NE, NE, NE, NE] # for short: in the real application the array elements differ

proc do_something(board: seq[char]): seq[char] =
  var res: seq[char] = @[]
  for i, p in board:
    for d in Directions:
      res.add board[d[i]]
  return res

let t0 = cpuTime()
var res: seq[char]
let count = 100000
for i in 1..count:
  res = do_something(board)
let t1 = cpuTime()
echo "***** Time elapsed for Nim: ", $(t1 - t0), " Counts: ", $count
##echo res.repr
# Python example to test execution speed against a Nim version
import time, sys

board = ['0', 'p', '.', 'p', 'P', '.', 'p', 'P', '.', 'p', '.', 'p', 'P', '.', 'p']
NE = [8, 2, 13, 4, 5, 7, 8, 3, 9, 12, 7, 8, 3, 9, 12]
directions = [NE, NE, NE, NE] # for short: in the real application the array elements differ

def do_something(board):
    res = []
    for i, p in enumerate(board):
        for d in directions:
            res.append(board[d[i]])
    return res

t0 = time.time()
res = []
count = 100000
for i in range(count):
    res = do_something(board)
t1 = time.time()
print("***** Time elapsed for Python: " + str(t1 - t0) + " Counts: " + str(count))
##print res
Seqs in Nim have value-type semantics. Your inner loop (for d in Directions) actually assigns Directions[...] to d on every iteration, which involves a full copy of the seq. Here's the patched version. I have also preallocated the result seq.
import times, os

var board: seq[char] = @['0', 'p', '.', 'p', 'P', '.', 'p', 'P', '.', 'p', '.', 'p', 'P', '.', 'p']
const NE: seq[int] = @[8, 2, 13, 4, 5, 7, 8, 3, 9, 12, 7, 8, 3, 9, 12]
const Directions = [NE, NE, NE, NE] # for short: in the real application the array elements differ

proc do_something(board: openarray[char]): seq[char] =
  result = newSeq[char](board.len * Directions.len)
  var ir = 0
  for i, p in board:
    for d in 0 ..< Directions.len:
      result[ir] = board[Directions[d][i]]
      inc ir

let t0 = cpuTime()
var res: seq[char]
let count = 100000
for i in 1..count:
  res = do_something(board)
let t1 = cpuTime()
echo "***** Time elapsed for Nim: ", $(t1 - t0), " Counts: ", $count
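For contrast, Python's for d in directions binds d to the existing list object without copying anything, which is part of why the naive Python loop stays cheap. A minimal sketch (hypothetical toy data) of the reference-vs-copy distinction:

```python
directions = [[8, 2, 13], [7, 8, 3]]

# Iteration binds a reference; no element is copied.
for d in directions:
    assert any(d is orig for orig in directions)

# An explicit copy (roughly what Nim's value semantics did implicitly
# for each seq element in the unpatched loop) is a distinct object.
d_copy = list(directions[0])
assert d_copy == directions[0] and d_copy is not directions[0]
```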
Optimization, my favorite Nim topic!
Runtime of your version for me: 0.766 s
res.add board[d[i]]
This line resizes the res seq, instead you should allocate it to have the correct size directly. Like this:
proc do_something(board: seq[char]): seq[char] =
  var res = newSeq[char](board.len * Directions.len)
  for i, p in board:
    for j, d in Directions:
      res[i * Directions.len + j] = board[d[i]]
  return res
New runtime: 0.073 s
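The same trick applies on the Python side: when the result length is known up front, filling a preallocated list by index avoids repeated resizing. A sketch with small hypothetical inputs:

```python
board = ['0', 'p', '.']
directions = [[2, 0, 1], [1, 2, 0]]

def do_something_append(board):
    # Baseline: grow the list one append at a time.
    res = []
    for i, p in enumerate(board):
        for d in directions:
            res.append(board[d[i]])
    return res

def do_something_prealloc(board):
    # Allocate the full result once, then fill it by index.
    res = [None] * (len(board) * len(directions))
    for i in range(len(board)):
        for j, d in enumerate(directions):
            res[i * len(directions) + j] = board[d[i]]
    return res

assert do_something_prealloc(board) == do_something_append(board)
```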
With the new devel branch of Nim you can also use newSeqOfCap if you don't want to manually calculate the index:
proc do_something(board: seq[char]): seq[char] =
  var res = newSeqOfCap[char](board.len * Directions.len)
  for i, p in board:
    for d in Directions:
      res.add board[d[i]]
  return res
As a workaround you can also do var res = newSeq[char](board.len * Directions.len); res.setLen(0) to achieve the same effect.
res = do_something(board)
You create a new res seq for every call to do_something. Instead you could reuse a single data structure and pass it to do_something. But that's probably not what you want to benchmark then.
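In Python terms, that buffer-reuse suggestion would look roughly like this (a hypothetical sketch; del res[:] clears the list in place so its allocation can be reused across calls):

```python
board = ['0', 'p', '.']
directions = [[2, 0, 1], [1, 2, 0]]

def do_something_into(board, res):
    # Clear and refill the caller's list instead of allocating a new one.
    del res[:]
    for i, p in enumerate(board):
        for d in directions:
            res.append(board[d[i]])
    return res

buf = []
for _ in range(3):  # the same buffer is reused on every call
    do_something_into(board, buf)
```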
Full code with some cleanup (use implicit result variable, no $ for echo, arrays instead of seqs, more consts):
# Nim example to test execution speed against a Python version
import times, os

const
  board = ['0', 'p', '.', 'p', 'P', '.', 'p', 'P', '.', 'p', '.', 'p', 'P', '.', 'p']
  NE = [8, 2, 13, 4, 5, 7, 8, 3, 9, 12, 7, 8, 3, 9, 12]
  Directions = [NE, NE, NE, NE] # for short: in the real application the array elements differ

proc do_something(board: openarray[char]): seq[char] =
  result = newSeqOfCap[char](board.len * Directions.len)
  for i in 0 .. board.high:
    for d in Directions:
      result.add board[d[i]]

let t0 = cpuTime()
var res: seq[char]
const count = 100_000
for i in 1..count:
  res = do_something(board)
let t1 = cpuTime()
echo "***** Time elapsed for Nim: ", t1 - t0, " Counts: ", count
echo res.repr
New runtime: 0.065 s
For comparison, Python runtime: 0.783 s
@def, your version is 0.08 s for me. My version is 0.05 s ;)
import times, os

const
  board = ['0', 'p', '.', 'p', 'P', '.', 'p', 'P', '.', 'p', '.', 'p', 'P', '.', 'p']
  NE = [8, 2, 13, 4, 5, 7, 8, 3, 9, 12, 7, 8, 3, 9, 12]
  Directions = [NE, NE, NE, NE] # for short: in the real application the array elements differ

proc do_something(board: openarray[char]): seq[char] =
  result = newSeqOfCap[char](board.len * Directions.len)
  for i in 0 .. board.high:
    for d in 0 ..< Directions.len:
      result.add board[Directions[d][i]]

let t0 = cpuTime()
var res: seq[char]
const count = 100000
for i in 1..count:
  res = do_something(board)
let t1 = cpuTime()
echo "***** Time elapsed for Nim: ", t1 - t0, " Counts: ", count
@yglukhov: Ah, you don't make copies of the entries of Directions. I think that's an optimization the compiler should be able to do somehow. Also, you can use 0 .. Directions.high instead of 0 ..< Directions.len. Using mitems instead achieves the same effect, but needs a var:
# Nim example to test execution speed against a Python version
import times, os

const
  board = ['0', 'p', '.', 'p', 'P', '.', 'p', 'P', '.', 'p', '.', 'p', 'P', '.', 'p']
  NE = [8, 2, 13, 4, 5, 7, 8, 3, 9, 12, 7, 8, 3, 9, 12]
var Directions = [NE, NE, NE, NE] # for short: in the real application the array elements differ

proc do_something(board: openarray[char]): seq[char] =
  result = newSeqOfCap[char](board.len * Directions.len)
  for i in 0 .. board.high:
    for d in Directions.mitems:
      result.add board[d[i]]

let t0 = cpuTime()
var res: seq[char]
const count = 100_000
for i in 1..count:
  res = do_something(board)
let t1 = cpuTime()
echo "***** Time elapsed for Nim: ", t1 - t0, " Counts: ", count
echo res.repr
With the original version I got 1.822 s runtime. With yglukhov's (first) version I got 0.012 s runtime, after replacing 0 ..< Directions.len with 0 .. Directions.high.
Python runtime (for comparison) not available.
Hi, I am still very new to Nim and played around with the different solutions given here. To me, the biggest difference is made by changing the type of NE from seq to array. With this line:

const NE = @[8, 2, 13, 4, 5, 7, 8, 3, 9, 12, 7, 8, 3, 9, 12]

and 1_000_000 iterations it takes around 11 sec on my Linux box. With this line instead:

const NE = [8, 2, 13, 4, 5, 7, 8, 3, 9, 12, 7, 8, 3, 9, 12]

the time is reduced to ~1 sec.

The answers were very instructive. Thanks. Two questions:
I do not understand the use of the openarray parameter in relation to performance. Is it better for speed than using seq[char]?
If the length of the output array of a proc is not known in advance, are there general rules for achieving good performance? For example, using tables.
I do not understand the use of the openarray parameter in relation to performance.
openarray has nothing to do with performance. It is just a feature that allows implementing procs which accept both an array and a seq as a parameter.
The actual performance gain stems from using arrays instead of seqs. Because an array's length is known at compile time, it can be allocated on the stack. Seqs always need to be allocated on the heap. Therefore, it is always advisable to use arrays when possible. Other performance optimizations are explained in the previous posts.
If the length of the output array of a proc is not known in advance, are there general rules to accomplish a good performance?
If you can calculate the target length efficiently before actually filling the seq, initialize the result seq with as many elements as needed (as shown in the other posts), or use newSeqOfCap if you're on devel, or once this has made it into a release.
flyx: Seqs always need to be allocated on the heap.
Sorry, I do not know much about it...
Might it be possible to have one single seq per proc on the stack? Of course it would have to be the last element, and it could only work when that proc does not call other procs.
import os, strutils, sets

let file1 = open(paramStr(1))
let file2 = open(paramStr(2))

let old_lines = file1.readAll().splitLines()
let new_lines = file2.readAll().splitLines()
file1.close()
file2.close()

let old_lines_set = toSet(old_lines)
let new_lines_set = toSet(new_lines)

let old_added = old_lines_set - new_lines_set
let old_removed = new_lines_set - old_lines_set
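For comparison, a Python version of the same set-difference diff might look like this (a sketch with inlined lines instead of file input):

```python
old_lines = ["alpha", "beta", "gamma"]
new_lines = ["beta", "gamma", "delta"]

old_lines_set = set(old_lines)
new_lines_set = set(new_lines)

# Lines present only in the old input / only in the new input.
only_in_old = old_lines_set - new_lines_set
only_in_new = new_lines_set - old_lines_set
```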
Entry: 1/23 Calls: 54/230 = 23.48% [sum: 54; 54/230 = 23.48%]
system.nim: hash 185/230 = 80.43%
sets.nim: rawGet 196/230 = 85.22%
sets.nim: contains 152/230 = 66.09%
test.nim: test 230/230 = 100.00%
Entry: 2/23 Calls: 53/230 = 23.04% [sum: 107; 107/230 = 46.52%]
hashes.nim: !& 95/230 = 41.30%
hashes.nim: hash 185/230 = 80.43%
sets.nim: rawGet 196/230 = 85.22%
sets.nim: contains 152/230 = 66.09%
test.nim: test 230/230 = 100.00%
Entry: 3/23 Calls: 23/230 = 10.00% [sum: 130; 130/230 = 56.52%]
system.nim: hash 185/230 = 80.43%
sets.nim: rawGet 196/230 = 85.22%
sets.nim: incl 52/230 = 22.61%
sets.nim: toSet 43/230 = 18.70%
test.nim: test 230/230 = 100.00%
Entry: 4/23 Calls: 22/230 = 9.57% [sum: 152; 152/230 = 66.09%]
hashes.nim: !& 95/230 = 41.30%
hashes.nim: hash 185/230 = 80.43%
sets.nim: rawGet 196/230 = 85.22%
sets.nim: contains 152/230 = 66.09%
sets.nim: difference 44/230 = 19.13%
sets.nim: - 44/230 = 19.13%
test.nim: test 230/230 = 100.00%
As you can see, the hash function takes 60% of the time.
I do not know Nim well, so I do not know what could be changed. But maybe the set type is not the best option?
Python caches hashing of strings. Nim does not (it would be a challenge, as Nim strings are mutable). I suggest using string references or Hash objects if you want to compare performance.
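The caching point can be made visible from Python by counting how often hashing is actually requested during the set operations (a hypothetical micro-demo; CPython stores a str's hash inside the object after the first computation, so repeated lookups stay cheap):

```python
class CountingStr(str):
    # Counts how many times hashing is requested for our objects.
    calls = 0
    def __hash__(self):
        CountingStr.calls += 1
        return super().__hash__()

old = [CountingStr(s) for s in ("a", "b", "c")]
new = [CountingStr(s) for s in ("b", "c", "d")]

removed = set(old) - set(new)
# Each element is hashed when inserted into its set; the difference
# operation reuses the hash values already stored in the set entries.
assert removed == {"a"}
```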
Has anyone benchmarked C++ for this kind of test?
Nim Table and HashSet should be caching the hash code values in their tables and also using that as a "comparison prefix" (comparing string values only if the integer hash codes already match). The string hash of the new inputs is ineliminable - or rather has to be done at least once anyway.
My guess is the lines are long-ish (greater than 10 chars, say). The default Nim hash string function could be faster, especially for longer strings. E.g., the current default does byte-at-a-time manipulations and usually 8-byte/word-at-a-time can get about as good bit mixing much faster, especially for long strings like lines in files. Such a replacement might take that 60% hashing time in this particular benchmark down to 10-15% for a 2X-ish overall speed up.
I have benchmarked C++ for these kinds of tests and I suspect the current STL would not be faster than Nim with the -d:release in this case. (C++, too could benefit from faster default string hash).
(I should perhaps qualify beyond -d:release using gcc or clang with high optimization levels on the back end since jil210 revealed in another thread the baseline 5X performance mystery arose from using tcc as a back end.)
Also, for the curious, the reason C++ STL will usually underperform Nim's hash tables in benchmarks like this is that STL iterator deletion semantics make the natural STL hash table collision resolution implementation choice be external chaining. Those extra linked list indirections add latency, especially when tables do not fit in on-CPU cache.
In C#, we call them IEnumerable<T> which is quite a good name.
That's quite incorrect. openarray is more like the C# params keyword.
Python code is not optimised.
def do_something(board):
    res = []
    for i, p in enumerate(board):
        for d in directions:
            res.append(board[d[i]])
    return res
should be written as list comprehension:
def do_something(board):
    return [board[d[i]] for i, p in enumerate(board) for d in directions]
Comprehensions in Python are much faster than ordinary loops. You will gain a 2-3x performance boost.
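As a sanity check, the comprehension and the loop version can be verified to produce identical results (small hypothetical inputs; the comprehension's clauses must keep the same nesting order as the original loops):

```python
board = ['0', 'p', '.']
directions = [[2, 0, 1], [1, 2, 0]]

def do_something_loop(board):
    res = []
    for i, p in enumerate(board):
        for d in directions:
            res.append(board[d[i]])
    return res

def do_something_comp(board):
    # Clauses read left to right like the nested for loops above.
    return [board[d[i]] for i, p in enumerate(board) for d in directions]

assert do_something_loop(board) == do_something_comp(board)
```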