I wrote an application in Nim and discovered that an application with the same logic and data structures written in Python has better execution speed than the Nim version.
I was trying to find out why Nim has a longer execution time than Python here. I isolated some code and tested two small, equivalent Python and Nim programs.
For a loop count of 100000, Python takes 1.0 s and Nim takes 1.8 s. This speed ratio corresponds to the speed ratio in the real application.
I use Python version 2.7.3 and Nim compiler version 0.13.0. I compiled the Nim version with the -d:release option.
I'd like to know the reason why, in this case, Python beats Nim in execution speed.
Here are both programs. The problem is the proc do_something.
# Nim example to test execution speed against a Python version
import times, os

var board: seq[char] = @['0', 'p', '.', 'p', 'P', '.', 'p', 'P', '.', 'p', '.', 'p', 'P', '.', 'p']
const NE: seq[int] = @[8, 2, 13, 4, 5, 7, 8, 3, 9, 12, 7, 8, 3, 9, 12]
const Directions = [NE, NE, NE, NE] # for short: in the real application the array elements differ

proc do_something(board: seq[char]): seq[char] =
  var res: seq[char] = @[]
  for i, p in board:
    for d in Directions:
      res.add board[d[i]]
  return res

let t0 = cpuTime()
var res: seq[char]
let count = 100000
for i in 1..count:
  res = do_something(board)
let t1 = cpuTime()
echo "***** Time elapsed for Nim: ", $(t1 - t0), " Counts: ", $count
##echo res.repr
# Python example to test execution speed against a Nim version
import time, sys

board = ['0', 'p', '.', 'p', 'P', '.', 'p', 'P', '.', 'p', '.', 'p', 'P', '.', 'p']
NE = [8, 2, 13, 4, 5, 7, 8, 3, 9, 12, 7, 8, 3, 9, 12]
directions = [NE, NE, NE, NE] # for short: in the real application the array elements differ

def do_something(board):
    res = []
    for i, p in enumerate(board):
        for d in directions:
            res.append(board[d[i]])
    return res

t0 = time.time()
res = []
count = 100000
for i in range(count):
    res = do_something(board)
t1 = time.time()
print("***** Time elapsed for Python: " + str(t1 - t0) + " Counts: " + str(count))
##print res
Seqs in Nim have value-type semantics. Your inner loop (for d in Directions) actually assigns Directions[...] to d on every iteration, which involves a full copy of the seq. Here's the patched version. I have also preallocated the result seq.
import times, os

var board: seq[char] = @['0', 'p', '.', 'p', 'P', '.', 'p', 'P', '.', 'p', '.', 'p', 'P', '.', 'p']
const NE: seq[int] = @[8, 2, 13, 4, 5, 7, 8, 3, 9, 12, 7, 8, 3, 9, 12]
const Directions = [NE, NE, NE, NE] # for short: in the real application the array elements differ

proc do_something(board: openarray[char]): seq[char] =
  result = newSeq[char](board.len * Directions.len)
  var ir = 0
  for i, p in board:
    for d in 0 ..< Directions.len:
      result[ir] = board[Directions[d][i]]
      inc ir

let t0 = cpuTime()
var res: seq[char]
let count = 100000
for i in 1..count:
  res = do_something(board)
let t1 = cpuTime()
echo "***** Time elapsed for Nim: ", $(t1 - t0), " Counts: ", $count
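For contrast, Python's for d in directions binds d to the existing list object without copying anything, which is part of why the naive Python loop stays cheap. A minimal sketch (hypothetical toy data) of the reference-vs-copy distinction:

```python
directions = [[8, 2, 13], [7, 8, 3]]

# Iteration binds a reference; no element is copied.
for d in directions:
    assert any(d is orig for orig in directions)

# An explicit copy (roughly what Nim's value semantics did implicitly
# for each seq element in the unpatched loop) is a distinct object.
d_copy = list(directions[0])
assert d_copy == directions[0] and d_copy is not directions[0]
```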
Optimization, my favorite Nim topic!
Runtime of your version for me: 0.766 s
res.add board[d[i]]
This line resizes the res seq, instead you should allocate it to have the correct size directly. Like this:
proc do_something(board: seq[char]): seq[char] =
  var res = newSeq[char](board.len * Directions.len)
  for i, p in board:
    for j, d in Directions:
      res[i * Directions.len + j] = board[d[i]]
  return res
New runtime: 0.073 s
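The same trick applies on the Python side: when the result length is known up front, filling a preallocated list by index avoids repeated resizing. A sketch with small hypothetical inputs:

```python
board = ['0', 'p', '.']
directions = [[2, 0, 1], [1, 2, 0]]

def do_something_append(board):
    # Baseline: grow the list one append at a time.
    res = []
    for i, p in enumerate(board):
        for d in directions:
            res.append(board[d[i]])
    return res

def do_something_prealloc(board):
    # Allocate the full result once, then fill it by index.
    res = [None] * (len(board) * len(directions))
    for i in range(len(board)):
        for j, d in enumerate(directions):
            res[i * len(directions) + j] = board[d[i]]
    return res

assert do_something_prealloc(board) == do_something_append(board)
```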
With the new devel branch of Nim you can also use newSeqOfCap if you don't want to manually calculate the index:
proc do_something(board: seq[char]): seq[char] =
  var res = newSeqOfCap[char](board.len * Directions.len)
  for i, p in board:
    for d in Directions:
      res.add board[d[i]]
  return res
As a workaround you can also do var res = newSeq[char](board.len * Directions.len); res.setLen(0) to achieve the same effect.
res = do_something(board)
You create a new res seq for every call to do_something. Instead you could reuse a single data structure and pass it to do_something. But that's probably not what you want to benchmark then.
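In Python terms, that buffer-reuse suggestion would look roughly like this (a hypothetical sketch; del res[:] clears the list in place so its allocation can be reused across calls):

```python
board = ['0', 'p', '.']
directions = [[2, 0, 1], [1, 2, 0]]

def do_something_into(board, res):
    # Clear and refill the caller's list instead of allocating a new one.
    del res[:]
    for i, p in enumerate(board):
        for d in directions:
            res.append(board[d[i]])
    return res

buf = []
for _ in range(3):  # the same buffer is reused on every call
    do_something_into(board, buf)
```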
Full code with some cleanup (use implicit result variable, no $ for echo, arrays instead of seqs, more consts):
# Nim example to test execution speed against a Python version
import times, os

const
  board = ['0', 'p', '.', 'p', 'P', '.', 'p', 'P', '.', 'p', '.', 'p', 'P', '.', 'p']
  NE = [8, 2, 13, 4, 5, 7, 8, 3, 9, 12, 7, 8, 3, 9, 12]
  Directions = [NE, NE, NE, NE] # for short: in the real application the array elements differ

proc do_something(board: openarray[char]): seq[char] =
  result = newSeqOfCap[char](board.len * Directions.len)
  for i in 0 .. board.high:
    for d in Directions:
      result.add board[d[i]]

let t0 = cpuTime()
var res: seq[char]
const count = 100_000
for i in 1..count:
  res = do_something(board)
let t1 = cpuTime()
echo "***** Time elapsed for Nim: ", t1 - t0, " Counts: ", count
echo res.repr
New runtime: 0.065 s
For comparison, Python runtime: 0.783 s
@def, your version is 0.08 s for me. My version is 0.05 s ;)
import times, os

const
  board = ['0', 'p', '.', 'p', 'P', '.', 'p', 'P', '.', 'p', '.', 'p', 'P', '.', 'p']
  NE = [8, 2, 13, 4, 5, 7, 8, 3, 9, 12, 7, 8, 3, 9, 12]
  Directions = [NE, NE, NE, NE] # for short: in the real application the array elements differ

proc do_something(board: openarray[char]): seq[char] =
  result = newSeqOfCap[char](board.len * Directions.len)
  for i in 0 .. board.high:
    for d in 0 ..< Directions.len:
      result.add board[Directions[d][i]]

let t0 = cpuTime()
var res: seq[char]
const count = 100000
for i in 1..count:
  res = do_something(board)
let t1 = cpuTime()
echo "***** Time elapsed for Nim: ", t1 - t0, " Counts: ", count
@yglukhov: Ah, you don't make copies of the entries of Directions. I think that's an optimization the compiler should be able to do somehow. Also, you can use 0 .. Directions.high instead of 0 ..< Directions.len. Using mitems instead achieves the same effect, but needs a var:
# Nim example to test execution speed against a Python version
import times, os

const
  board = ['0', 'p', '.', 'p', 'P', '.', 'p', 'P', '.', 'p', '.', 'p', 'P', '.', 'p']
  NE = [8, 2, 13, 4, 5, 7, 8, 3, 9, 12, 7, 8, 3, 9, 12]
var Directions = [NE, NE, NE, NE] # for short: in the real application the array elements differ

proc do_something(board: openarray[char]): seq[char] =
  result = newSeqOfCap[char](board.len * Directions.len)
  for i in 0 .. board.high:
    for d in Directions.mitems:
      result.add board[d[i]]

let t0 = cpuTime()
var res: seq[char]
const count = 100_000
for i in 1..count:
  res = do_something(board)
let t1 = cpuTime()
echo "***** Time elapsed for Nim: ", t1 - t0, " Counts: ", count
echo res.repr
With the original version I got 1.822 s runtime. With yglukhov's (first) version I got 0.012 s runtime, after replacing 0 ..< Directions.len with 0 .. Directions.high.
Python runtime (for comparison) not available.
Hi, I am still very new to Nim and played around with the different solutions given here. To me, the biggest difference is made by changing the type of NE from seq to array. With this line:

const NE = @[8, 2, 13, 4, 5, 7, 8, 3, 9, 12, 7, 8, 3, 9, 12]

and 1_000_000 iterations it takes around 11 sec on my Linux box. With this line instead:

const NE = [8, 2, 13, 4, 5, 7, 8, 3, 9, 12, 7, 8, 3, 9, 12]

the time is reduced to ~1 sec.

The answers were very instructive. Thanks. Two questions:
I do not understand the use of the openarray parameter in relation to performance. Is it better for speed than using seq[char]?
If the length of the output array of a proc is not known in advance, are there general rules for achieving good performance? For example, using tables.
I do not understand the use of the openarray parameter in relation to performance.
openarray has nothing to do with performance. It is just a feature that allows implementing procs which accept both an array and a seq as a parameter.
The actual performance gain stems from using arrays instead of seqs. Because an array's length is known at compile time, it can be allocated on the stack. Seqs always need to be allocated on the heap. Therefore, it is always advisable to use arrays when possible. Other performance optimizations are explained in the previous posts.
If the length of the output array of a proc is not known in advance, are there general rules to accomplish a good performance?
If you can calculate the target length efficiently before actually filling the seq, initialize the result seq with as many elements as needed (as shown in the other posts), or use newSeqOfCap if you're on devel, or once this has made it into a release.
flyx: Seqs always need to be allocated on the heap.
Sorry, I do not know much about it...
Might it be possible to have one single seq per proc on the stack? Of course it would have to be the last element, and it could only work when that proc does not call other procs.
import os, strutils, sets

let file1 = open(paramStr(1))
let file2 = open(paramStr(2))

let old_lines = file1.readAll().splitLines()
let new_lines = file2.readAll().splitLines()
file1.close()
file2.close()

let old_lines_set = toSet(old_lines)
let new_lines_set = toSet(new_lines)

let old_added = old_lines_set - new_lines_set
let old_removed = new_lines_set - old_lines_set
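For comparison, a Python version of the same set-difference diff might look like this (a sketch with inlined lines instead of file input):

```python
old_lines = ["alpha", "beta", "gamma"]
new_lines = ["beta", "gamma", "delta"]

old_lines_set = set(old_lines)
new_lines_set = set(new_lines)

# Lines present only in the old input / only in the new input.
only_in_old = old_lines_set - new_lines_set
only_in_new = new_lines_set - old_lines_set
```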
Entry: 1/23 Calls: 54/230 = 23.48% [sum: 54; 54/230 = 23.48%]
system.nim: hash 185/230 = 80.43%
sets.nim: rawGet 196/230 = 85.22%
sets.nim: contains 152/230 = 66.09%
test.nim: test 230/230 = 100.00%
Entry: 2/23 Calls: 53/230 = 23.04% [sum: 107; 107/230 = 46.52%]
hashes.nim: !& 95/230 = 41.30%
hashes.nim: hash 185/230 = 80.43%
sets.nim: rawGet 196/230 = 85.22%
sets.nim: contains 152/230 = 66.09%
test.nim: test 230/230 = 100.00%
Entry: 3/23 Calls: 23/230 = 10.00% [sum: 130; 130/230 = 56.52%]
system.nim: hash 185/230 = 80.43%
sets.nim: rawGet 196/230 = 85.22%
sets.nim: incl 52/230 = 22.61%
sets.nim: toSet 43/230 = 18.70%
test.nim: test 230/230 = 100.00%
Entry: 4/23 Calls: 22/230 = 9.57% [sum: 152; 152/230 = 66.09%]
hashes.nim: !& 95/230 = 41.30%
hashes.nim: hash 185/230 = 80.43%
sets.nim: rawGet 196/230 = 85.22%
sets.nim: contains 152/230 = 66.09%
sets.nim: difference 44/230 = 19.13%
sets.nim: - 44/230 = 19.13%
test.nim: test 230/230 = 100.00%
As you can see, the hash function takes 60% of the time.
I do not know Nim well, so I do not know what could be changed. But maybe the set type is not the best option?
Python caches hashing of strings. Nim does not (it would be a challenge, as Nim strings are mutable). I suggest using string references or Hash objects if you want to compare performance.
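The caching point can be made visible from Python by counting how often hashing is actually requested during the set operations (a hypothetical micro-demo; CPython stores a str's hash inside the object after the first computation, so repeated lookups stay cheap):

```python
class CountingStr(str):
    # Counts how many times hashing is requested for our objects.
    calls = 0
    def __hash__(self):
        CountingStr.calls += 1
        return super().__hash__()

old = [CountingStr(s) for s in ("a", "b", "c")]
new = [CountingStr(s) for s in ("b", "c", "d")]

removed = set(old) - set(new)
# Each element is hashed when inserted into its set; the difference
# operation reuses the hash values already stored in the set entries.
assert removed == {"a"}
```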
Has anyone benchmarked C++ for this kind of test?
Nim Table and HashSet should be caching the hash code values in their tables and also using that as a "comparison prefix" (comparing string values only if the integer hash codes already match). The string hash of the new inputs is ineliminable - or rather has to be done at least once anyway.
My guess is the lines are long-ish (greater than 10 chars, say). The default Nim hash string function could be faster, especially for longer strings. E.g., the current default does byte-at-a-time manipulations and usually 8-byte/word-at-a-time can get about as good bit mixing much faster, especially for long strings like lines in files. Such a replacement might take that 60% hashing time in this particular benchmark down to 10-15% for a 2X-ish overall speed up.
I have benchmarked C++ for these kinds of tests and I suspect the current STL would not be faster than Nim with the -d:release in this case. (C++, too could benefit from faster default string hash).
(I should perhaps qualify beyond -d:release using gcc or clang with high optimization levels on the back end since jil210 revealed in another thread the baseline 5X performance mystery arose from using tcc as a back end.)
Also, for the curious, the reason C++ STL will usually underperform Nim's hash tables in benchmarks like this is that STL iterator deletion semantics make the natural STL hash table collision resolution implementation choice be external chaining. Those extra linked list indirections add latency, especially when tables do not fit in on-CPU cache.
In C#, we call them IEnumerable<T> which is quite a good name.
That's quite incorrect. openarray is more like the C# params keyword.
Python code is not optimised.
def do_something(board):
    res = []
    for i, p in enumerate(board):
        for d in directions:
            res.append(board[d[i]])
    return res
should be written as list comprehension:
def do_something(board):
    return [board[d[i]] for i, p in enumerate(board) for d in directions]
Comprehensions in Python are much faster than ordinary loops. You will gain a 2-3x performance boost.
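As a sanity check, the comprehension and the loop version can be verified to produce identical results (small hypothetical inputs; the comprehension's clauses must keep the same nesting order as the original loops):

```python
board = ['0', 'p', '.']
directions = [[2, 0, 1], [1, 2, 0]]

def do_something_loop(board):
    res = []
    for i, p in enumerate(board):
        for d in directions:
            res.append(board[d[i]])
    return res

def do_something_comp(board):
    # Clauses read left to right like the nested for loops above.
    return [board[d[i]] for i, p in enumerate(board) for d in directions]

assert do_something_loop(board) == do_something_comp(board)
```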