nimforum mirror - Strange results with math.sum and stats.mean

jlhouchin (orginal) [2018-06-18T23:40:18+02:00] view original

Hello,

Some time ago I wrote micro-benchmark to test various languages in areas that are important to my application. Arrays, array access, array mutation, iteration, simple math on the arrays and such. I wanted the test to stress things that I do over and over. I wanted it to be significant enough to measure. I also wanted the end results to be comparable so that I could know that each implementation was truly doing the same thing.

Below is my version for Nim. My apologies if the code is not idiomatic or optimized. I had not learned much Nim when I wrote it. I am still in the process of learning and have not edited my original version.

In Xubuntu 18.04, on an i7-3720QM laptop with 16gm ram.

When I compile it using my versions of sum and mean it takes cpuTime(), 295 seconds. When I compile it using the math and stats libraries versions of sum, and mean, it takes cpuTime(), 1422 seconds.

When I decided to try the math and stats libraries I thought that they would be faster because they were written by people who know Nim and math. I am not necessarily good at either. My implementations are simple and naive.

I do not know why there is such a significant difference.

If you choose to try this, you will want to compile with -d:release. It takes a very long time in debug.

My Nim version is about 2x the time of the C/C++ version. About equal with the Python version running Numpy. Java was between C and Nim. But I have no stomach for Java. Everything else was slower, most significantly so. I have not tried Rust. Not really interested now. I have decided Nim is home.

I just wanted to throw this out there to see if anybody understood why math.sum and math.mean are slower and if that is okay?

Feel free to take this code and play with it, optimize. I know it is possibly quite naive in its implementation. But the nice thing is. That whatever you do it has to end up with the same results or it isn't doing the same thing. :)

Should you decide to go for shorter run times, you can reduce the array size (nasize) which is currently 28,800 items. Or the number of repetitions the loops make with the reps variable.

Thanks.

import times
import math, stats

const nasize = 60*24*5*4  #minutes * hours * days * weeks
var na: array[nasize, float]
var nsum = 0.0
var navg = 0.0
let reps = 100

proc normalize(n: float): float =
  result = n
  if result == 0:
    result = 0.000123456789
  while (result <= 0.0001):
    result = result * 10
  while (result >= 1):
    result = result * 0.1

discard """
#false to test differences with math.sum and math.mean
proc sum(arr: array[nasize, float]): float =
  result = 0.0
  for i in items(arr):
    result += i

proc mean(arr: array[nasize, float]): float =
  sum(arr)/nasize
"""

proc loop1(arr: var array[nasize, float], ns, na: var float): float =
  let ct = cpuTime()
  echo "loop1: start: ", $ct
  for i in 1..reps:
    for j in 0..<nasize:
      var n: float = arr[j]
      n = n * (float(i)+n) * (float(j)+1.0-n) * 0.1234567
      arr[j] = normalize((n*n*n))
      nsum = sum(arr)
      navg = mean(arr)
  echo "Loop1 end."
  cpuTime()-ct

proc loop2(arr: var array[nasize, float], ns, na: var float): float =
  let ct = cpuTime()
  echo "loop2: start: ", $ct
  for i in 1..reps:
    for j in 0..<nasize:
      var n: float = arr[j]
      arr[j] = normalize((float(i) * float(j+1) * (n + n + n) * 0.1234567))
      nsum = sum(arr)
      navg = mean(arr)
  echo "Loop2 end."
  cpuTime()-ct

proc initArray(arr: var array[nasize, float]) =
  for i in 0..< nasize:
    arr[i] = normalize(float(i)+1.0)

initArray(na)
let et = epochTime()
let ct = cpuTime()
echo "cputime: ", $ct
echo "Loop1 took: ", $loop1(na, nsum, navg)
echo "Loop2 took: ", $loop2(na, nsum, navg)
echo "finished, cpuTime(): ", $(cpuTime()-ct)
echo "finished, epochTime(): ", $(epochTime()-et)
echo "na size: ", $(len(na))
echo "nsum: ", $nsum
echo "navg: ", $navg
for i in 0..20:
  echo "Item ", $(i+1), ": ", $na[i]

cumulonimbus (orginal) [2018-06-19T09:10:22+02:00] view original

You are using stats.mean(), which does a lot of work you don't need, such as computing variance and kurotsis - but also does something you do need but are likely not aware of, which is a numerically stable mean; all of these cost execution time.

I would guess your C++, Python and Java don't do any of these. Do they?

mratsim (orginal) [2018-06-19T12:26:00+02:00] view original

The stats module from Nim are computing mean and variance using the Welford algorithm, it has the following advantages:

It is one pass-only and is suitable for a stream of data of unknown length

it is numerically stable like @cumulonimbus said

see Welford Algorithm on Wikipedia.

The main disavantages are:

You need to keep track of statistical moments of order 1 and 2 (Nim also keeps track of moment of order 3 and 4 for skewness and kurtosis)

You have divisions in a tight loop which is a costly operation at low-level, classic 2-pass algorithms can get away with only a single division for mean.

If you want to work with multidimensional arrays, I suggest you try either Neo or Arraymancer.

Disclaimer I'm the author of Arraymancer, doc is there. Don't forget to compile with -d:openmp for full multithreading support.

Unfortunately, statistics are still a bit barebones as I'm focusing on data science but you still have PCA, mean, variance and many basic stuff to build your own code efficiently.

cblake (orginal) [2018-06-19T15:07:36+02:00] view original

This is actually a larger topic than it might at first seem and the range of performance is indeed dramatic. The Welford algorithm is not bad accuracy-wise, but, as @mratsim points out, doing a double or long double division for every number when you have an openArray input is very expensive. Floating point divide is one of the most expensive operations on a modern CPU.

How relatively expensive depends on data set size and memory bandwidth of whatever level caches are engaged, of course. For L1-sized data, I measured a simple AVX optimized sum as 160X faster than Welford with long double. With AVX512 this could be about 320X faster - some 2.5 orders of magnitude in speed! That's a huge range - like Python Vs C type range - larger than many might expect naively..Much larger than your 1422/295 =~ 5X range. These techniques use a SIMD vector accumulator M units wide (4-float32 for SSE, 8 for AVX, 16 for AVX512), add up the columns first and then sum the columns at the end. The speed ratios are not all that surprising - they just mean an FP divide is ~20X slower than an AVX add. Several C/C++ compilers can auto-vectorize simple arithmetic loops these days.

So, a big question @jlhouchin is - does your data need precision or speed? Admittedly, that is not always easy to know ahead of time. If you want speed go with Arraymancer. If you want/need precise answers, even the seemingly simple question of how to total floating point numbers has many wrinkles. You don't even need much experience with mathematics or programming to see why. See this stack overflow thread, for example. https. The example in the answers there of what order to add {1, 1 billionth, -1} is thought provoking.

@cumulonimbus, Welford may be numerically "stable", but that doesn't make it the most precise/fastest/best-in-whatever-dimension/recommended algorithm (not that you made that specific claim!). Higher precision usually requires multiple floats and tricks like`https://en.wikipedia.org/wiki/Kahan_summation_algorithm`_ (mentioned in that StackOverflow thread) or per-binary-exponent accumulators as detailed by the Zhu-Hayes 2010 paper on ACM Algorithm 908 (also mentioned in that thread). Per-exponent accumulators are like 1024 bytes for float32 and 16KB for float64. That is either a lot or a little overhead depending on your situation/data/needs. I suspect Kahan summation would be faster and more precise than Welford for almost all real world situations and a good trade-off for a stdlib, but those sorts of questions are more subjective.

Personally, I do not like that the current stats just invokes the online algorithms that update everything it might be asked to compute such as correlations and kurtosis, etc. for its openArray proc variants. Coding-wise, sure that is small and elegant, but run-time-wise it is very inefficient, enough so to confuse probably more people than just @jlhouchin. I also don't like how there is a RunningStat.push, but no RunningStat.pop to remove the influence of a data point, as in a windowed/moving average problem setting.

jlhouchin (orginal) [2018-06-19T16:07:09+02:00] view original

Quick reply I don't have time. More later.

I am not a mathematician or even close. What I really need is not a statistical mean in the sense that you are speaking of. I am simply simulating moving averages. I am averaging. One thing that is confusing to a non-mathematician or CS guy is that the end result of what I want and the end result of what mean provides is the same. Which is one reason I don't see the difference per se.

I do not see an average() proc. It is not hard to do one. sum(arr)/arr.len That is all I need. That is also all that was done in any of the other implementations. The only library mean I used was Numpy's. I don't know what it does either. But it is as fast as the faster Nim implementation.

I read the replies and reply more later.

Thanks.

DeletedUser (orginal) [2018-06-19T19:38:52+02:00] view original

On the other hand, math.sum is very simply implemented akin to your implementation and other languages. I'd recommend not touching stats unless you need statistics, for your case I'd just assume the standard library didn't bother with something as simple as sum(arr) / arr.len. Maybe a disclaimer could be added to stats that says it's best used with multiple operations, or the mean(openarray[T]) proc could have a simpler implementation.

cumulonimbus (orginal) [2018-06-19T20:08:43+02:00] view original

The mean of @[1e300, -1e300, 3] is 1; The mean of @[1e300, 3, -1e300] and @[3, 1e300, -1e300] is 0. Let that sink in. This is an extreme example, and one that the Nim stats.mean() does not solve, but it there are quite a few less-extreme pitfalls to "just summing and dividing" that stats.mean() does solve.

Whoever implemented stats.mean() paid some performance to get more numerical stability. Other disagree with this exact tradeoff. But that's all it is.

If all of the numbers you are averaging are of similar magnitude, you can safely do sum(x)/x.len and forget about everything else. If they are of varying magnitude (but still within a factor of a few millions of each other), then stats.mean() is a better choice. If they vary widely, as in 10^20 of each other, you have no choice but to consider the more advanced algorithms mentioned by @cblake above.

jlhouchin (orginal) [2018-06-19T23:38:03+02:00] view original

Thanks for all the replies. I am very pleased with Nim's performance. Great performance isn't an absolute for me. But I do want good performance. The Python/Numpy implementation is equal to Nim at about 5min. The pure Python implementation takes over 6 hours. That is really what I was wanting to test. The test isn't necessarily representing anything other than array access and numerical operations on a decent sized array simply to do a reasonable stress test. Just busy work.

It accomplished what I wanted. It separated languages which were reasonable and those which were less suitable. LuaJit 6.6 minutes, Lua 3.5 hours.

I suspected that either sum() or mean() might be doing something beyond what I needed. The primary reason I tried them is that I thought maybe they would be written doing some low level magic that I know not of and performed magically.

While I am not a mathematician and I do not understand most of the above. I can appreciate the significant difference in what mean() does and what I need in an average. Which is why I inquired. Thanks for the explanation. I will stick with what I need and not mess with the stats library until such a time that I find a need.

mratsim (orginal) [2018-06-20T12:08:11+02:00] view original

In Python/Numpy, low-level magic is a for loop in C and you should avoid for looping in pure Python.

In Nim, for loop will be compiled into the best code possible so do not worry about performance :).

Having developed and used Nim for numerical computing, I can get performance at the same level or significantly higher than any library in any other language (including C, C++, Fortran and their corresponding Lua, Python, R or Julia wrapper). Only exceptions are assembly routines.

Benches and result in Arraymancer's readme<https://github.com/mratsim/arraymancer#speed>.

jlhouchin (orginal) [2018-06-20T16:10:41+02:00] view original

@mratsim

I am thoroughly enjoying Nim. I find it to be a very nice language. It is taking some time to learn a good work flow for a statically compiled language verses a dynamic one like Python or Smalltalk. I have used Python for a variety of things but never really cared for it. To many inconsistencies for me. I would consider Python/Numpy because of its performance due to Numpy and its ubiquity in data science and numerical computing.

But I think Nim provides a better option. As the numerical, data science improves in Nim. You have one language. No dropping down for performance. Nice.

With Nim I get to use one language which I consider to be nicer than Python and have good performance. I expect that even my naive Nim will perform reasonably well and will always be able to improved as I learn.

Also with Nim, I have the advantage that as I learn. I can explore and learn how well design libraries are written and improve my skills. I like the idea of one language that when mastered will do most anything. And Nim has the best ability I have seen to use the C/C++ ecosystem. Many can use C well but struggle with the C++ side.

With Python, Lua, etc, you will always have a two language situation. One for productivity and one for performance. Nim I think does reasonable on productivity and is demonstrably good on performance.

I look forward to looking into your Arraymancer library.

Thanks for your input and encouragement.

twetzel59 (orginal) [2018-06-20T18:54:39+02:00] view original

An additional point to consider: inlining. If you put your proc in a different module or package, does that change the performance significantly=? Normally we don't worry so much about inlining, but in tight loops it can really impact performance. If you want the C backend to consider inlining a proc in a different module, you can add the {.inline.} pragma:

proc addOneInline(input: int): int {.inline.} =
   input + 1

If this is irrelevant, I apologize :)

mratsim (orginal) [2018-06-20T20:10:59+02:00] view original

We're getting out of topic with perf but I suppose since @jlhouchin has his answer it's fine ;).

@twetzel59, yes inline will help tremendously, additionally -flto (Link-Time-Optimization) may brings huge optimization as well even on a single file as lots of system.nim and standard library are in seperate files.

Here is a benchmark on computational biology

Mirror of forum.nim-lang.org

3943 :: Strange results with math.sum and stats.mean