In HashBackup (a Python app), planning a restore requires a pre-traversal of all the data blocks to be restored, to decide how to manage the cache during the restore. For multi-TB restores this can take around 5 minutes, which isn't great.
This small test compares creating and sorting a 10M-element array of tuples in Python and Nim. First, Python:
# Python 2: build a list of 10M (random key, index) tuples, then sort it
import random
import time

r = random.randint
a = [(r(0, 1000000), i) for i in xrange(10000000)]
print 'sorting'
t = time.time()
a.sort()
print 'done', time.time() - t
ms:nim jim$ /usr/bin/time -l py sort2.py
sorting
done 29.7032859325
47.61 real 46.82 user 0.76 sys
1389780992 maximum resident set size
374738 page reclaims
408 page faults
11 block input operations
19 block output operations
67 voluntary context switches
127 involuntary context switches
Now Nim:
import random
import algorithm

const
  els = 10_000_000
  maxval = 1_000_000

type E = object
  r: int32
  i: int32

# wrap the big array in an object so it can be allocated on the heap via a ref
type A = object
  a: array[els, E]

func cmp(a, b: E): int = cmp(a.r, b.r)

proc main() =
  var
    a: ref A
  new a
  for i in 0..high(a.a):
    a.a[i].r = int32(rand(maxval))
    a.a[i].i = int32(i)
  a.a.sort(cmp)
  echo a.a[0..<5], " ... ", a.a[^5..^1]

main()
ms:nim jim$ nim c -d:danger sort2
Hint: 32979 LOC; 0.850 sec; 47.047MiB peakmem; Dangerous Release build; proj: /Users/jim/nim/sort2; out: /Users/jim/nim/sort2 [SuccessX]
ms:nim jim$ /usr/bin/time -l ./sort2
@[(r: 0, i: 861925), (r: 0, i: 1018886), (r: 0, i: 1112145), (r: 0, i: 3095791), (r: 0, i: 3127185)] ... @[(r: 1000000, i: 5801839), (r: 1000000, i: 6467043), (r: 1000000, i: 6768282), (r: 1000000, i: 7387067), (r: 1000000, i: 8454119)]
2.28 real 2.23 user 0.04 sys
120729600 maximum resident set size
29483 page reclaims
10 page faults
1 voluntary context switches
16 involuntary context switches
This is a great example of where Nim shines. Not only is it ~21x faster, it also uses 11.5x less memory: ~120MB vs ~1.4GB.
In the actual Python app, this huge memory use was a major issue to work around, because each unique integer in Python is a 24-byte structure (28 bytes for 64-bit integers). Instead of creating a list of tuples in memory, I had to write the tuples to a file, sort the file, then read the sorted file back to create the data structure I actually needed, and it still used around 400MB.

I could maybe have used Python's array.array somehow, by bit-shifting two 32-bit integers into a 64-bit array element, but array.array has no sort() method. I could have used NumPy, and probably should have, but HashBackup is a static build, which requires baking all 3rd-party code into a customized Python interpreter that is used to build the final static executable. Based on the other, much smaller things I've done this with, adding NumPy would probably be quite a challenge.
Here's the simpler Nim program I would have liked to have been able to write:
import random
import algorithm

const
  els = 10_000_000
  maxval = 1_000_000

type E = object
  r: int32
  i: int32

proc main() =
  var
    a: array[els, E]
  for i in 0..high(a):
    a[i].r = int32(rand(maxval))
    a[i].i = int32(i)
  a.sort()
  echo a[0..<5], " ... ", a[^5..^1]

main()
I'm a Nim newbie, so maybe something like that is possible. Still, even with a little extra complexity, I was able to put this Nim test together in a few minutes and the performance is great, whereas I spent a day or two hacking the Python code to get the memory usage down to 400MB. And that's not even counting that writing a disk file, sorting it, and reading it back in was obviously much slower than this in-memory Python test program.
Great result for Nim!
I would have liked to have been able to write:
What exactly was your problem?
The array? Well, an array in Nim is a value object; it lives on the stack, so there is no pointer indirection involved. But for large arrays the default stack size can be too small. In that case we generally use sequences. The use of sequences is explained in all the fine tutorials; for performance you should create them at the right size up front and assign with [] instead of growing them with add(). My explanation is at http://ssalewski.de/nimprogramming.html#_arrays_and_sequences -- let me know what is unclear or where the grammar is wrong; I may continue it in winter.
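A minimal sketch of that advice (the sizes and values here are just for illustration): create the seq at its final size and assign by index, instead of starting empty and growing it with add():

# Pre-sized seq, filled by index: one allocation, no reallocation on append.
var a = newSeq[int](1_000)
for i in 0 ..< a.len:
  a[i] = i * i

# The same data built with add() starts empty and may reallocate as it grows.
var b: seq[int]
for i in 0 ..< 1_000:
  b.add(i * i)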
Your "simpler" version can be made to work with a sequence (as Stefan said), and actually using a tuple instead of an object (since they have comparison operators defined):
import random
import algorithm

const
  els = 10_000_000
  maxval = 1_000_000

type E = tuple
  r: int32
  i: int32

proc main() =
  var a = newSeq[E](els)
  for i in 0..high(a):
    a[i].r = int32(rand(maxval))
    a[i].i = int32(i)
  a.sort()
  echo a[0..<5], " ... ", a[^5..^1]

main()
Also, if you compile with -d:useMalloc and run under valgrind (I did it with ARC), you can clearly see this:
==30089== Memcheck, a memory error detector
==30089== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==30089== Using Valgrind-3.16.0.GIT and LibVEX; rerun with -h for copyright info
==30089== Command: ./b
==30089==
@[(r: 0, i: 861925), (r: 0, i: 1018886), (r: 0, i: 1112145), (r: 0, i: 3095791), (r: 0, i: 3127185)] ... @[(r: 1000000, i: 5801839), (r: 1000000, i: 6467043), (r: 1000000, i: 6768282), (r: 1000000, i: 7387067), (r: 1000000, i: 8454119)]
==30089==
==30089== HEAP SUMMARY:
==30089== in use at exit: 0 bytes in 0 blocks
==30089== total heap usage: 53 allocs, 53 frees, 120,003,383 bytes allocated
==30089==
==30089== All heap blocks were freed -- no leaks are possible
==30089==
==30089== For lists of detected and suppressed errors, rerun with: -s
==30089== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
Your version using sequences works well. I wondered how an array version of that would look, so I tried your sequence version but with a ref array, i.e.:
***************
*** 12,15 ****
  proc main() =
!   var a = newSeq[E](els)
!
    for i in 0..high(a):
--- 12,17 ----
  proc main() =
!   var
!     a: ref array[els, E]
!
!   new a
    for i in 0..high(a):
ms:nim jim$ nim c sort4
/Users/jim/nim/sort4.nim(17, 19) Error: type mismatch: got <ref array[0..9999999, E]>
but expected one of:
proc high(T: typedesc[SomeFloat]): T:type
first type mismatch at position: 1
required type for T: type SomeFloat
but expression 'a' is of type: ref array[0..9999999, E]
proc high(x: cstring): int
first type mismatch at position: 1
required type for x: cstring
but expression 'a' is of type: ref array[0..9999999, E]
proc high(x: string): int
first type mismatch at position: 1
required type for x: string
but expression 'a' is of type: ref array[0..9999999, E]
proc high[I, T](x: array[I, T]): I
first type mismatch at position: 1
required type for x: array[I, T]
but expression 'a' is of type: ref array[0..9999999, E]
proc high[I, T](x: typedesc[array[I, T]]): I
first type mismatch at position: 1
required type for x: type array[I, T]
but expression 'a' is of type: ref array[0..9999999, E]
proc high[T: Ordinal | enum | range](x: T): T
first type mismatch at position: 1
required type for x: T: Ordinal or enum or range
but expression 'a' is of type: ref array[0..9999999, E]
proc high[T: Ordinal | enum | range](x: typedesc[T]): T
first type mismatch at position: 1
required type for x: type T: Ordinal or enum or range
but expression 'a' is of type: ref array[0..9999999, E]
proc high[T](x: openArray[T]): int
first type mismatch at position: 1
required type for x: openArray[T]
but expression 'a' is of type: ref array[0..9999999, E]
expression: high(a)
I don't understand why high(a) works if the array isn't a ref array, but fails on a ref array. But to continue, I changed 0..high(a) to 0..<els. Then it didn't like sort:
/Users/jim/nim/sort4a.nim(20, 4) Error: type mismatch: got <ref array[0..9999999, E]>
but expected one of:
func sort[T](a: var openArray[T]; cmp: proc (x, y: T): int {.closure.};
order = SortOrder.Ascending)
first type mismatch at position: 1
required type for a: var openArray[T]
but expression 'a' is of type: ref array[0..9999999, E]
proc sort[T](a: var openArray[T]; order = SortOrder.Ascending)
first type mismatch at position: 1
required type for a: var openArray[T]
but expression 'a' is of type: ref array[0..9999999, E]
expression: sort(a)
Shouldn't this change from array to ref array + new have worked?

ref means that it's a managed reference, so you need to dereference it (see https://nim-lang.org/docs/manual.html#types-reference-and-pointer-types).
import random
import algorithm

const
  els = 10_000_000
  maxval = 1_000_000

type E = tuple
  r: int32
  i: int32

proc main() =
  var a: ref array[els, E]
  new(a)
  for i in 0..high(a[]):
    a[i].r = int32(rand(maxval))   # indexing dereferences the ref automatically
    a[i].i = int32(i)
  a[].sort()                       # routine calls and slicing need an explicit deref
  echo a[][0..<5], " ... ", a[][^5..^1]

main()
Thanks, I should have read more before asking and will try to do better on that.
I did see that the experimental implicitDeref pragma would have allowed the a.sort() syntax to work (I think). In the original version, without ref, a.high worked fine too, so I guess with this experimental pragma a.high and high(a) would also deref and work correctly. I'll switch to the devel build today and play around with it. It seems like a good thing, so that if an array blows through the stack and has to be moved to the heap, existing code referencing it doesn't have to be changed, though the doc comment says this only happens if the first arg is a ref array. It seems like it should happen any time the array is referenced in a non-pointer context, but maybe that's hard for the compiler to determine.
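For context, a tiny sketch of what the explicit dereference looks like without the pragma (the array size is arbitrary); the implicitDeref part is only a comment, since that switch was experimental:

import algorithm

var a: ref array[5, int]
new(a)
echo a[].high   # explicit deref: high of the array, not of the ref
a[].sort()      # explicit deref so sort gets a var openArray
# the idea of {.experimental: "implicitDeref".} was to dereference the first
# argument automatically, allowing plain a.sort()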
so that if an array blows through the stack and has to be moved to the heap
Note that array references are rarely used in real life, as there is no benefit over sequences. A seq is an object on the stack which contains a pointer to the actual data, and an array ref is a pointer on the stack that points to the data elements, so performance should be identical. Even the performance advantage of a plain array over a seq should not be big -- a seq has one indirection, but when we work with the seq data, for example sorting it, we load the start address of the data once into a register and never care about that detail again. Mratsim did a performance comparison of array vs seq once; there was no difference.
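A rough sketch of the layout point (sizes and names are only illustrative): in both cases there is a single pointer between the stack and the element data, so element access goes through one indirection either way:

type E = tuple[r, i: int32]

var s = newSeq[E](10)      # seq header on the stack -> heap buffer of elements
var ra: ref array[10, E]
new(ra)                    # ref (one pointer) on the stack -> heap array

# element access goes through one pointer in both cases
s[3] = (r: 7'i32, i: 3'i32)
ra[3] = (r: 7'i32, i: 3'i32)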
I just noticed that you are the Mr Wilcoxson who did the grammar fixes for my book -- thanks again. But I have the feeling that I still have to explain some details better, maybe next winter; currently I am working on a GTK4 book.
Note that array references are rarely used in real life, as there is no benefit over sequences.
This is not true: array references have the benefit that their size is known at compile time. It makes the intention of the code clearer and eliminates potential errors such as forgetting to initialise the seq to the right size.
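For example (a small sketch with made-up names): the length is part of a ref array's type, so passing data of the wrong size is a compile-time error, while a seq's length is only a runtime property:

type Row = ref array[4, float]   # the length 4 is part of the type

proc sumRow(row: Row): float =
  for x in row[]:                # always exactly 4 elements, by construction
    result += x

var r: Row
new(r)
r[0] = 1.5
echo sumRow(r)   # a seq-based version would accept any length at runtime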