nimforum mirror - Getting max speed for an algorithm

kiwi (orginal) [2014-05-07T00:15:43+02:00] view original

I've got an algorithm I've been working on, basically a merge sort based inversion counter. It reads in a file, converts to int32 (for speed) and then sorts while counting inversions.

My original took 75ms (1 thread on 3.2G P5). After inlining the merge it dropped to 60ms and then globalizing the counter to avoid a tuple unpacking plus using

result[k] = L[i] instead of result.add(L[i])

reduced it to 40ms. So I have a couple of questions:

My observation is that result = A, copying an existing seq is faster than declaring var result and then newSeq(result, A.len) saving me a few ms. Am I wrong?

I couldn't figure out how to unpack tuples when returning a function call and as a result I suspect I incurred an extra copy of a large seq (hundreds of thousands of times).

type itup = tuple[aa:iseq, ii:int]

I would like to go

aa, ii = functionReturningItup()

rather than

itt = functionReturningItup() aa, ii = itt.aa, itt.ii

But I'm not sure if the second one is making copies or just a reference. Any help?

It annoys me that people are claiming 10-12ms for java while I can only get 40ms. Any thoughts on speeding up this code?

Note that the array (100,000 ints to start) must be returned along with inv the count. pastebin

Jehan (orginal) [2014-05-07T02:55:22+02:00] view original

Your implementation seems to mostly exercise the allocator and the garbage collector as you're allocating a couple hundred thousand subarrays. Just stopping the recursion at len(A) == 2 and avoiding allocating 100,000 one-element arrays seems to reduce runtime by about 1/3.

An in-place mergesort implementation should only need one other array (of the same length as the original) as its workspace.

Edit: Here's a quick and dirty implementation, not particularly optimized, sorts 100k elements in reverse order in .003 seconds on my machine. I didn't test if the number of inversions are right, but it seems to sort everything properly. Note that there's plenty of room for performance improvement left, I kept it simple.


import times, strutils

var inversions: int

proc merge(data, aux: var openarray[int32], start, nl, nr: int) =
  var lsize = nl
  var rsize = nr
  var lp = start
  var rp = start + nl
  var p = start
  while lsize > 0 and rsize > 0:
    if aux[lp] < aux[rp]:
      data[p] = aux[lp]
      p += 1; lp += 1; lsize -= 1
    else:
      data[p] = aux[rp]
      p += 1; rp += 1; rsize -= 1
      inversions += 1
  if lsize > 0:
    for i in 0..lsize-1:
      data[p+i] = aux[lp+i]
  else:
    for i in 0..rsize-1:
      data[p+i] = aux[rp+i]


proc mergesort(data, aux: var openarray[int32], start, count: int) =
  case count
  of 0, 1:
    discard
  of 2:
    if data[start] > data[start+1]:
      swap(data[start], data[start+1])
      inversions += 1
  else:
    var nl = count div 2
    var nr = count - nl
    if nl > 1:
      mergesort(data, aux, start, nl)
    if nr > 1:
      mergesort(data, aux, start+nl, nr)
    for i in 0..count-1:
      aux[start+i] = data[start+i]
    merge(data, aux, start, nl, nr)

proc mergesort(data: var openarray[int32]) =
  var aux = newSeq[int32](len(data))
  mergesort(data, aux, 0, len(data))

var m = newSeq[int32]()
var r = 100000
for i in 0..r-1:
  add(m, int32(r-i))
var tStart = epochTime()
mergesort(m)
var tEnd = epochTime()
echo formatFloat(tEnd-tStart, ffDecimal, 3)

Jehan (orginal) [2014-05-07T04:30:37+02:00] view original

This is probably also a good time as any to talk about the seq implementation. Sequences are Nimrod's dynamic arrays (i.e., ones that can be resized), and their internals are slightly different from what you may be used to in other languages.

Basically, a sequence is a reference to a a block of memory. That block of memory contains some header information (in particular, the length of the sequence), followed by the actual data. Basically:


.
              
              +--------+----------+
x: seq[T] --> | Header | Data ... |
              +--------+----------+

Most languages implement dynamic arrays slightly differently (if they have something like a Nimrod seq, it can generally not be resized). Arrays that follow Nimrod's layout generally cannot be resized. A dynamic array in those languages is a reference to an object, which in turn contains the length and a reference to the data. I.e., something like:


.
                   +--------+    +----------+
x: dynArray[T] --> | Header | -> | Data ... |
                   +--------+    +----------+

To avoid that, you need different types for dynamic and fixed-size arrays (e.g., what Java and C# do).

However, this comes at a slight cost. You cannot resize a Nimrod sequence without passing it as a var parameter (i.e., by reference). That's because in order to resize the sequence, you may have to copy the data to a new memory block and have to change the reference. Java etc. can avoid this because they have an additional level of indirection.

So, if you want to be able to fully modify everything about a Nimrod sequence, you have to pass it by reference (using var). That's still about efficient as other languages, as you just introduce the level of indirection temporarily that other languages always have for their dynamic arrays. If you just want to access the contents without needing to resize the sequence, it is generally more efficient to use openarray. This tells Nimrod that you won't need to resize it and allows you to modify the contents without adding the additional level of indirection that var introduces as a matter of necessity.

One other oddity of Nimrod sequences is that they are considered value types (strings also). That means that when you assign them to a variable (except when passing them as an argument to a proc or returning them as a result), they get copied. You can avoid these copying semantics by using shallow(s) to tell the runtime that a sequence or string s is to be treated as a reference.

Note: the above is to the best of my knowledge, Araq can correct any nonsense that I may have inadvertently written. :)

Varriount (orginal) [2014-05-07T04:34:46+02:00] view original

It might also be worth mentioning that the default GC is optimized for "responsiveness, not throughput". The other GC's, MarkAndSweep and the Boehm GC's. (Although it's worth knowing that I tested the original example using all GC's except for the Boehm, including using no GC, and the timeing for each was about the same.)

Jehan (orginal) [2014-05-07T04:47:37+02:00] view original

Varriount: It might also be worth mentioning that the default GC is optimized for "responsiveness, not throughput". The other GC's, MarkAndSweep and the Boehm GC's.

There are more tradeoffs, though. For example, the reference counting collector tends to perform better with large heaps (at least the last time I've tested it), because unless there are big cycles involved, it only needs to touch parts that are actually being used and can discard temporarily allocated memory quickly (somewhat like a generational collector). The main benefit of the Boehm GC is that it gives you a GCed concurrent heap, but it may require manual tuning if you use it for that.

gradha (orginal) [2014-05-07T08:39:26+02:00] view original

Tuple unpacking is mentioned in the tutorial, you should be able to write the assignment using parenthesis:


let (aa, ii) = functionReturningItup()

If you forget the parenthesis you will get two calls to functionReturningItup() instead.

kiwi (orginal) [2014-05-07T09:40:40+02:00] view original

Thanks for all the replies.

gradha ... thanks, I am sure I tried that and it didn't work ... but this time it did so I must have masked it with a different error (maybe not declaring the new variable). I just used var (i, j) = atest() and there it was.

jehan ... thanks for the very considered reply. For inversion counting you can't stop until length = 1 or ln = 0 in my case.

Thanks, I'm learning fast but have a way to go.

John

kiwi (orginal) [2014-05-07T09:57:20+02:00] view original

The sort is correct. And for small sample lists it got the inversion count right so onward. On my machine it takes 8.3 ms to do the run so ... wow !!! If I can only understand it and fix the slight count issue.

Q1. Why do you break the mergesort up by first doing this:

45 proc mergesort(data: var openarray[int32]) =
46   var aux = newSeq[int32](len(data))
47   mergesort(data, aux, 0, len(data))

Jehan (orginal) [2014-05-07T10:32:10+02:00] view original

Well, aux is a workspace area that is being used by merge(). When merging two subarrays of data, the segment is first copied over to aux and then the merge writes them from aux back to data.

This is inefficient because data gets copied twice as often as it needs to be, but has the virtue of simplicity.

An optimal solution merges directly from one array to other and back, switching them as need be, thus cutting the amount of copying in half. The recursion can also be eliminated, and the bulk copying at the end of merge() can be done more efficiently. You can also handle mostly sorted segments more efficiently by first comparing the last element of the left subarray to the first element of the right subarray to see if they are already in order.

Araq (orginal) [2014-05-07T11:09:34+02:00] view original

It would be interesting to benchmark this against algorithm.sort which implements a mergesort as well.

Jehan (orginal) [2014-05-07T18:18:32+02:00] view original

Araq: It would be interesting to benchmark this against ``algorithm.sort`` which implements a mergesort as well.

That would be a bit like apples and oranges, though. The one in algorithm has to deal with arbitrary comparison functions, while the one I threw together is optimized for integer comparisons (which are fast).

That said, having a variant in algorithm.sort that just uses the standard < ordering might be a good idea. This is what you need 90% of the time, anyway, and you'd avoid jumping through the cmp pointer for comparisons.

kiwi (orginal) [2014-05-08T00:17:09+02:00] view original

Jehan's merge sort (counting removed) is about 30% faster than the standard sort on my data with 1000 iterations.:

./w1_countInversions

Starting MergeSort 1000x  100000
@[54044, 14108, 79294, 29649, 25260, 60660]@[64399, 16774, 74018, 71187, 91901]
0 8.9091889858245850e+00s
@[1, 2, 3, 4, 5, 6]@[99996, 99997, 99998, 99999, 100000]

Starting QuickSort 1000x  100000
@[54044, 14108, 79294, 29649, 25260, 60660]@[64399, 16774, 74018, 71187, 91901]
0 1.1832314014434814e+01s
@[1, 2, 3, 4, 5, 6]@[99996, 99997, 99998, 99999, 100000]

Note that this applies to random arrays on 100,000+ elements. With 1000 elements the relative speeds are reversed and with 100 back to merge being faster.

I will take Jehan's suggestion and implement the array swap improvement and then report back.

kiwi (orginal) [2014-05-08T06:26:01+02:00] view original

I added an array swap by building a hashtable to determine which of data or alt should be processed. This was the fastest form and it was only very marginally faster than the basic strategy. Probably not worth implementing:

Pastebin

Mirror of forum.nim-lang.org

442 :: Getting max speed for an algorithm