nimforum mirror - Understanding performance compared to Numpy

jlhouchin [2019-05-10T23:01:12+02:00]

Hello,

I have some Nim code, and I am trying to understand why it is slower than Python/Numpy. I am not a C programmer, and Nim is my first real attempt to learn a systems-level, statically compiled language. But to my understanding Nim should approach C speeds to a certain extent. If what I present can be improved, please let me know. I want to learn and I want to use Nim correctly.

I know Python/Numpy has been well optimized for certain tasks. I understand my Nim code may be naive. I would love to see the Nim code that competes with the Numpy code. I want to understand how to write it well.

Ubuntu 18.04 x86-64

Python/Numpy performs this in 0:50.0. Nim performs this in 1:43.151.

When I remove the sum() code Nim is 5x faster.

I need fast summing and averaging of slices of arrays. Numpy does this exceptionally.

I know about Arraymancer but am lost when looking at it. I do not do high-level maths. I only need simple arrays and accessing, summing, averaging, slicing and doing similar on the slices. I do not need any matrix-type operations.

I am okay with doing what I need with Arraymancer should it be the solution to speed up my app. But I am currently clueless on how to do so with the below code.

Any help in accomplishing these goals and competing reasonably with Python/Numpy would be greatly appreciated. I can easily complete my app in Python. And I may do so. But I would like either initially or app v2.x to be able to do it in something like Nim.

I understand very well I may have done something stupid and/or naive. I am willing to learn.

Thanks.

#latest nim dev
import strutils, times, math
proc pricesFromGainCsv*(fullpath: string): array[337588, float] =
  let csvfile = open(fullpath, fmRead)
  let contents = csvfile.readAll()
  let lines = contents.splitLines()
  for i in 1..<lines.len:
    let line = lines[i]
    if line.len > 0:
      let parts = line.split(",")
      result[i-1] = parseFloat(parts[5])
  csvfile.close()

var ctime = getTime()
echo("time: ", ctime)
var cpustart = cpuTime()
#http://ratedata.gaincapital.com/2018/12%20December/EUR_USD_Week1.zip
let gaincsvpath = "/home/jimmie/data/EUR_USD_Week1.txt"
let prices = pricesFromGainCsv(gaincsvpath)
var psum = 0.0
var pcount = -1
var pips = 0.0
var psumsum = 0.0
var farray: array[prices.len-1, float]
for price in prices:
  pcount += 1
  farray[pcount] = ((price * price) * ((pcount+1) / prices.len))
  pips += price
  psum = farray.sum()
  psumsum += psum
echo("pcount: ", $pcount, "  pips: ", $pips, "  farray[^1]: ", farray[^1], "  psum: ", $psum, "  psumsum: ", $psumsum)
echo("clock:  " & $(getTime()-ctime) & "   cpuTime: " & $(cpuTime()-cpustart))

#pcount: 337587  pips: 383662.5627699992  farray[^1]: 1.295654754912296  psum: 218237.9662717213  psumsum: 24517634293.9183
#clock:  1 minute, 43 seconds, 151 milliseconds, 497 microseconds, and 564 nanoseconds   cpuTime: 102.814928774

#python 3.7 latest numpy
import csv
from time import time
import numpy as np
#http://ratedata.gaincapital.com/2018/12%20December/EUR_USD_Week1.zip
stime = time()
fpcsv = "/home/jimmie/data/EUR_USD_Week1.txt"
count = -1
pips = 0
psum = 0
psumsum = 0
f = open(fpcsv,"r")
csvlines = f.readlines()[1:]
f.close()
csvreader = csv.reader(csvlines)
pasize = len(csvlines)
parray = np.ndarray(shape=(pasize,),dtype="f8")
for tid,d,pair,dt,b,a in csvreader:
    count += 1
    price = float(a)
    parray[count] = ((price * price) * ((count+1) / pasize))
    pips += price
    psum = parray.sum()
    psumsum += psum
print(count, pips, psum, psumsum, time()-stime)

#337587 383662.5627699992 218239.26158885768 73674955841.14886 50.33330512046814

Stefan_Salewski [2019-05-10T23:39:23+02:00]

Unfortunately I can not really understand your code...

But it looks a bit strange to me.

for price in prices:
  pcount += 1
  farray[pcount] = ((price * price) * ((pcount+1) / prices.len))
  pips += price
  psum = farray.sum()

In this loop you modify only one entry of farray, but you sum over the whole array again. Would you do that if you had to do it all in your head? (If it is not clear: when you modify only one field of the array, the sum of the whole array is altered only by that field, so you can store the old sum value and add (newfieldvalue - oldfieldvalue).)
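
A minimal sketch of that idea in Python, not taken from the original program: the function names and the three sample values are made up for illustration. It compares re-summing the whole array after every write with keeping a running total that is adjusted by the change in the single slot that was written.

```python
def naive_totals(values):
    """Re-sum the whole array after every single-element write: O(n) per step."""
    slots = [0.0] * len(values)      # zero-initialised, like farray
    totals = []
    for i, value in enumerate(values):
        slots[i] = value
        totals.append(sum(slots))
    return totals


def incremental_totals(values):
    """Keep a running total and adjust it by (new - old): O(1) per step."""
    slots = [0.0] * len(values)
    running = 0.0
    totals = []
    for i, value in enumerate(values):
        running += value - slots[i]  # only the written slot changed
        slots[i] = value
        totals.append(running)
    return totals


sample = [1.5, 2.25, 4.0]            # hypothetical values, exactly representable
print(naive_totals(sample))          # [1.5, 3.75, 7.75]
print(incremental_totals(sample))    # [1.5, 3.75, 7.75]
```

With real price data the two versions can differ in the last few bits, because the additions happen in a different order.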

And the other point: in the Nim version you are using a fixed-size array for your farray. I don't know Python well, as I was more of a Ruby user. But I would assume that the Python array grows dynamically, so there may be an advantage for Python, as it does not have to sum over the full size from the beginning.

And finally, array[337588, float] as the return type of a proc is really strange. Maybe it would be better to return a seq as a dynamic container. Or is that much slower?

mratsim [2019-05-10T23:56:45+02:00]

Hey there,

This would be an idiomatic Nim translation of your program (and the result is more similar to the Python one).

import os, strutils, times, math, parsecsv, streams

const CsvPath = "./build/EUR_USD_Week1.csv"
# Row: lTid,cDealable,CurrencyPair,RateDateTime,RateBid,RateAsk

proc main() =
  
  var ctime = getTime()
  echo("time: ", ctime)
  var cpustart = cpuTime()
  
  var csv: CsvParser
  let stream = newFileStream(CsvPath, mode = fmRead)
  csv.open( stream, CsvPath,
            separator = ',',
            quote = '\"',
            skipInitialSpace = true
          )
  defer: csv.close
  
  var
    psum = 0.0
    pcount = -1
    pips = 0.0
    psumsum = 0.0
    # Preallocating the seq.
    # Plain arrays are allocated on the stack and
    # stack size is very limited (a couple MB)
    # Use seq or ref arrays instead
    farray = newSeq[float](337588)
  
  discard csv.readRow # Skip header row
  
  while csv.readRow():
    let price = csv.row[5].parseFloat
    pcount += 1
    farray[pcount] = ((price * price) * ((pcount+1) / farray.len))
    pips += price
    psum = farray.sum()
    psumsum += psum
  
  echo("pcount: ", $pcount, "  pips: ", $pips, "  farray[^1]: ", farray[^1], "  psum: ", $psum, "  psumsum: ", $psumsum)
  echo("clock:  " & $(getTime()-ctime) & "   cpuTime: " & $(cpuTime()-cpustart))

main()

Now this idiomatic Nim is still 2.3x slower than Python but I'll have a look at the performance issue later.

Also, for your CSV processing (and tabular data in general), I suggest you use Python Pandas and NimData; they are made for this.

mratsim [2019-05-11T00:31:07+02:00]

Quick eureka before I sleep.

The bulk of the processing is in the sum function for both Python and Nim. As @Stefan_Salewski said, you should store the previous result, because right now the algorithm is doing an extra useless pass on farray. Furthermore, this relies on the initialization of that empty array being all zero.

Regarding the speed difference between Nim and Python/Numpy, I can recover that 2.4x speed difference by compiling the Nim code with nim c -r -d:release --passC:-march=native --passC:-ffast-math build/prices_new.nim.

Slowness explanation

The explanation is a bit complex. It starts from the fact that floating-point addition is not associative, i.e. (a + b) + c != a + (b + c) in general, due to floating-point rounding. That means that the compiler cannot change the order of your computation without changing the program; -ffast-math allows the compiler to do so.
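
As a small illustration of the non-associativity (a sketch in Python, whose floats are the same 64-bit doubles; the three constants are just a convenient example):

```python
a, b, c = 0.1, 0.2, 0.3

left = (a + b) + c    # rounds after a + b, then again after adding c
right = a + (b + c)   # rounds after b + c, then again after adding a

print(left)           # 0.6000000000000001
print(right)          # 0.6
print(left == right)  # False
```

Both results are valid roundings of the same mathematical sum, which is why a compiler that must preserve exact results cannot freely regroup the additions.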

The second part is that, at a low level, each instruction has a latency. Floating-point addition has a latency of 3 to 5 cycles depending on your CPU (a 4GHz CPU executes 4 billion cycles per second). Latency does not impact instructions on independent data, but in your case a sum reuses old data (each addition needs the result of the previous one), so in a vacuum sum is 3-5 times slower than an elementwise addition.

The way around that is to keep as many accumulators as your addition latency (3 for a latency of 3) and sum them at the end. This is what the compiler does with -ffast-math; otherwise it is not allowed to reorder the computation.
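
The reordering can be sketched as follows. This is only an illustration of the technique in Python, not Laser's or the compiler's actual code: the function names and the default of 4 lanes are assumptions, and pure Python will not get faster from it, since the benefit comes from the CPU overlapping independent additions in compiled code.

```python
def sum_single(xs):
    """One accumulator: every addition waits on the previous one."""
    total = 0.0
    for x in xs:
        total += x
    return total


def sum_lanes(xs, lanes=4):
    """Several independent accumulators, combined at the end."""
    acc = [0.0] * lanes
    for i, x in enumerate(xs):
        acc[i % lanes] += x      # each lane forms its own dependency chain
    total = 0.0
    for partial in acc:
        total += partial
    return total


data = [float(i) for i in range(1000)]   # small integers, so both orders are exact
print(sum_single(data), sum_lanes(data)) # 499500.0 499500.0
```

On general floating-point data the two functions may differ in the last bits, which is exactly the change in results that -ffast-math permits.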

Laser, the future revamped backend of Arraymancer, has several benchmarks of this effect and a sum implementation that reaches the maximum possible performance (capped by RAM speed) without the need for the -ffast-math compile flag.

A warning about floating point

The magnitude of your numbers is quite high, and you are accumulating a lot of floating-point rounding error, especially in your psumsum, which is on the order of 2.4e10. You might not see this in Python because it uses arbitrary precision floats (though Numpy does not).
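
To make the accumulation of rounding error concrete, here is a small sketch using 64-bit doubles in Python. The value 0.1 and the count of 10 are arbitrary choices for illustration; math.fsum is the standard library's accurate floating-point summation, and math.ulp needs Python 3.9 or newer.

```python
import math

naive = 0.0
for _ in range(10):
    naive += 0.1                 # each addition rounds to the nearest double

exact = math.fsum([0.1] * 10)    # fsum tracks the low-order bits that are lost

print(naive)                     # 0.9999999999999999
print(exact)                     # 1.0

# Spacing between adjacent doubles at the magnitude reported for psumsum.
print(math.ulp(2.4e10))          # 3.814697265625e-06
```

At a magnitude of 2.4e10, each addition can therefore be off by a few millionths, and hundreds of thousands of additions let that error build up.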

lscrd [2019-05-11T01:16:13+02:00]

Very interesting explanation.

As regards Python floats, they are native floats (64 bits long in IEEE 754 format), not arbitrary precision floats. Only integers are in arbitrary precision. So, I think that the rounding errors are also present in the Python program.
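
This can be checked from the interpreter. The snippet below is a small sketch that assumes a platform where Python's float is an IEEE 754 double, which is the case on common desktop and server systems:

```python
import sys

print(sys.float_info.mant_dig)          # 53 significand bits: an IEEE 754 double

big = 2 ** 53
print(big + 1 - big)                    # 1: integers are arbitrary precision
print(float(big) + 1.0 == float(big))   # True: the float cannot represent 2**53 + 1
```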

jlhouchin [2019-05-11T02:02:27+02:00]

Thanks for all the replies.

I will have to take some time later to analyze the more idiomatic code above and learn.

I understand my code may not appear to make sense. Its primary purpose is simply to provide somewhat of a stress test of what I might commonly do in various methods/functions. I like the idea of a compiled solution which is smaller and more lightweight than Numpy and such. My requirements are reasonable performance, and they do not require the breadth or depth of the Numpy stack.

My code was simply copied and pasted from a playground.nim file where I explore and play around trying to learn and understand.

Simply compiling my code with the above nim c -r -d:release --passC:-march=native --passC:-ffast-math ntplayground.nim worked wonders.

My code as it exists now runs in under 30 seconds, performing better than the Python/Numpy code. Yay!

I am very happy that Nim, properly understood, performed as I had hoped.

Hopefully I will have time to learn from this thread on Sunday. Time to spend time with the family.

Thanks again.

Mirror of forum.nim-lang.org

4832 :: Understanding performance compared to Numpy