nimforum mirror - beat c in fefes "competition"

enthus1ast (orginal) [2019-03-24T18:15:44+01:00] view original

Hi,

fefe (a german blogger, http://blog.fefe.de/ ) has called for a little benchmark. The challenge is to count words in an webserver logfile (from stdin), then output them ordered like so:


150 foobaa
80  faa
12  www

Blog entries:

https://blog.fefe.de/?ts=a2689de5

https://blog.fefe.de/?ts=b7c295e7

Benchmarks:

http://ptrace.fefe.de/wp/timings2019.txt

https://ptrace.fefe.de/wp/timings.txt

Competing scripts:

https://ptrace.fefe.de/wp/

My idiomatic version:

## compile with: nim c -d:release --opt:speed --passL:-s wp_idiomatic.nim
import tables, strutils

proc main() =
  var countTable = newCountTable[string]()
  var line: string = ""
  while true:
    if not readLine(stdin, line): break
    for word in line.split():
      countTable.inc(word)
  countTable.sort()
  for key, val in countTable.pairs():
    echo val, " ", key
main()

for my testfile i appended a few nginx logs together:


292M	/tmp/bigger.log

timings of the c version:


time cat /tmp/bigger.log | ./a.out
real	0m2,373s
user	0m2,276s
sys	0m0,276s

timings of the nim idiomatic version:


time cat /tmp/bigger.log | ./wp_idiomatic
real	0m2,869s
user	0m2,766s
sys	0m0,280s

which i think is already a good speed (for this naive approach)?! So can one beat the c version? :D

miran (orginal) [2019-03-24T19:15:25+01:00] view original

So can one beat the c version?

Long time ago I concluded that when it comes to optimizations, I can optimize all day and night, but then at some point @mratsim will come and will improve it by 50+% :D

So if you want to beat C, basically wait for @mratsim to show up and do his magic :)

Stefan_Salewski (orginal) [2019-03-24T20:02:37+01:00] view original

It seems to be obvious that in your approach the problem is, that split iterator allocates a new string for each call. There may exist other, faster but uglier solutions. Basically reusing the same string. Was that available in parseutils? I can not remember, but I guess faster examples may be available somewhere. Rosetta, or that blog post about faster command line tools? Dom's book -- he talked about allocations, but his conclusion was not fully correct in the first draft. I wonder if iterators are possible which do not allocate the result?

arnetheduck (orginal) [2019-03-25T03:15:45+01:00] view original

Given that Nim compiles to C, then a C compiler compiles the resulting code, I always find these faster-than-C benchmark claims slightly amusing ;)

I'd venture ahead and say that it's mostly the skill of the developer and not so much the language that determines performance, by and large - that includes being able to find good idiomatic solutions to the problem in the language by using its native features appropriately.

mratsim (orginal) [2019-03-25T11:20:00+01:00] view original

Optimising strings is a pain :P. I don't have enough experience in that yet, but I plan to acquire it as I would need to do that for text analysis/natural language processing anyway.

cblake (orginal) [2019-03-25T12:02:00+01:00] view original

As @Stefan mentioned the string stuff can be slow. My MemSlice techniques might be more broadly interesting. The tokenizer below only works for "strict" one-character delimiting, not, e.g., repeated whitespace. Also, I agree completely with @arnetheduck that algorithm choices matter more than languages for Nim vs C comparisons.

That said, the below Nim is faster than the C version on my machine (215 ms for C, 176 ms for Nim). @enthus1ast can try that on his data and report back if he likes. There are several little defined switches to play with.

FWIW, the C version referenced is a rather lame one (fixed 64 KiEntry hash table with external chaining resolution). I didn't try profile-guided-optimization which could improve the Nim results some. Since the C version "cheats" a bit by pre-sizing its hash table, my version below takes any estimate on the number of unique words that will be found as input and pre-sizes the Nim table to fit that estimate.

Also, beware CountTable.sort(). It uses Shell's sort, not a fast algorithm. They're always integers and typically small. A radix sort would be the best choice, but regular old merge sort from algorithm is a lot better. On my test input with 291,000 unique words, the version using count.sort took 120 seconds instead of 176 milliseconds, almost 3 full orders of magnitude slower. Times seemed insensitive to hash functions (order 1% variations), though both the sorting and hash-sensitivity will be data set dependent, obviously. For very high entropy distributions you are going to want to avoid the current CountTable.sort(). Also, CountTable itself was a little slower than regular Table using the same sorting procedure. So, you might want to just avoid CountTable in general in peformance-sensitive cases.

import memfiles, hashes, tables, os, algorithm, strutils

proc hash*(ms: MemSlice): Hash {.inline.} =
  when defined(useNimHash):
    result = hashData(ms.data, ms.size)
  elif defined(useCB2):
    proc c_memhash(h0: int, a: pointer, size: int): int {. nodecl,
            importc: "memhash_cb2", header: getHomeDir() & "s/cb/hash_cb2.c".}
    result = c_memhash(0, ms.data, ms.size)
  elif defined(useCB4):
    proc c_memhash(h0: int, a: pointer, size: int): int {. nodecl,
            importc: "memhash_cb4", header: getHomeDir() & "s/cb/hash_cb4.c".}
    result = c_memhash(0, ms.data, ms.size)
  else:
    result = 0
    for i in 0 ..< ms.size:
      result = (result + (result shl 5)) xor int(cast[cstring](ms.data)[i])

proc split*(ms: MemSlice, fields: var seq[MemSlice], nA: int,
            sep=' ', size = -1): int =
  proc c_memchr(s: cstring, c: char, n: csize): cstring {.
       importc: "memchr", header: "<string.h>".}
  proc `+!`(p: cstring, i: int): cstring {.inline.} = cast[cstring](cast[int](p) +% i)
  proc `-!`(p, q: cstring): int {.inline.} = cast[int](p) -% cast[int](q)
  var b   = cast[cstring](ms.data)
  var eob = cast[cstring](ms.data) +! (if size != -1: size else: ms.size)
  var e   = c_memchr(b, sep, eob -! b)
  while e != nil:
    fields[result].data = cast[pointer](b)
    fields[result].size = e -! b
    result += 1
    if result == nA - 2:  #max ix nA-1; Need one more slot for final field
      break
    b = e +! 1
    e = c_memchr(b, sep, eob -! b)
  fields[result].data = cast[pointer](b)
  fields[result].size = eob -! b
  result += 1

proc main(input: string, sizeGuess: int) =
  const mxCols = 1024     #file format-dependent max
  when defined(useCountTab):
    var count = initCountTable[MemSlice](rightSize(sizeGuess))
  else:
    var count = initTable[MemSlice, int](rightSize(sizeGuess))
  var fields = newSeq[MemSlice](mxCols)
  for line in memSlices(memfiles.open(input)):
    fields.setLen(split(line, fields, mxCols, ' '))
    for word in fields:
      when defined(useCountTab):
        count.inc(word)
      else:
        inc(count.mgetOrPut(word, 0))
  stderr.write count.len, " unique words\n"
  when defined(useCountTab) and defined(useCountSort):
    count.sort()          #uses Shell's sort internally
    for key, cnt in count:
      stdout.write cnt, " "
      discard stdout.writeBuffer(cast[cstring](key.data), key.size)
      stdout.write "\n"
  else:
    var answer = newSeq[tuple[cnt: int, key: MemSlice]](count.len)
    var i = 0
    for key, cnt in count.pairs():
      answer[i].cnt = cnt
      answer[i].key = key
      inc(i)
    proc myCmp(x, y: tuple[cnt: int, key: MemSlice]): int = - cmp(x.cnt, y.cnt)
    answer.sort(myCmp)
    for ans in answer:
      stdout.write ans.cnt, " "
      discard stdout.writeBuffer(cast[cstring](ans.key.data), ans.key.size)
      stdout.write "\n"

#Must call with (seekable/mmapable) input file name and a size guess
main(paramStr(1), parseInt(paramStr(2)))

cblake (orginal) [2019-03-25T12:44:16+01:00] view original

Just a quick follow-up, with gcc-8.3 on Linux x86_64 Skylake CPU, profile-guided optimization did get that 176ms time down to 150 ms (With -d:useNimHash it was 152 ms), a 1.17x boost to 1.43x faster than the C in @enthus1ast 's wp.c.

Of course, with 291,000 unique keys the average external chain length in the C is going to be 291./65.5 =~ 4.4 which is, um, not great. So, maybe this is only faster than a strawman C impl. YMMV. If you really want that particular C impl to be slow, just run it with 1e6 unique words. ;-) (But then you really better not use Nim's current CountTable.sort!)

cblake (orginal) [2019-03-25T14:13:48+01:00] view original

Oh, and two more stats about my input data important to reason about my timings - 43 MB and 4.8e6 total words total (coincidentally close to 1e-3 times my 1/4.8GHz clock cycle).

So, average string length around 43e6/4.8e6 =~ 9 bytes (also a web server log), and about 150e-3/4.8e6 =~ 31 nsec =~ 150 CPU clock cycles per word.

A little more fiddling (commenting out various sections) shows that the 150 cycles per word breaks down to:


  22% parsing (no hash table or output, just the MemSlice stuff)
  55% hash table counting
  23% output (sorting + binary->ascii on the numbers).

So, 150 cycles per input word might sound like a lot, but I personally doubt it could be improved much. 43MB does not fit in my 8MB L3 cache, but 2**ceil(lg(291e3*9B)) =~ 512kEntry*16 just barely does assuming about 16B per entry. Really, 41B per hash slot is more realistic (8B for the hash code, maybe 16B more for MemSlice, 8B for the count and about 9B for the string data). The string data is indirect, and most lookups are failed lookups, so 32B is probably the appropriate number.

So, that 55% of 150 = 82 ms / 4.8e6 is =~ 17 ns ~ 80 cycles. The hash table is probably only about 50% L3 resident (16B/32B) - not bad, but not great. Quoted latency on Skylake L3 is supposedly about 40 cycles while my DIMMs are probably about 100 cycles (20ns). So, the (AFAIK unavoidable 4.8e6 lookup latencies) is some kind of average of 40 and 100 cycles. 80 cycles is not implausible. .5*100 + .5*40 = 70 cycles after all (and 100 may be..optimistic for my DIMMs). The table is only 56% full (291/512). So, probe sequences should be short and fit in one or two L3 cache lines, the 2nd of which is often preloaded by the hardware in a linear search. Of course, the table accesses are in dynamic competition with the streaming read from the parser because CPU designers sadly don't let user code control cache policies.

So, maybe some hand-rolled assembly could shave off 20 cycles in the hash phase and maybe 15..20 cycles in each of the other two phases for maybe 60 cycles better than the Nim's 150. 150/90 is another 1.6x of potential improvement. It's probably a ton of work to get that. Also, I haven't done it. So 60 cycles is just some nominal upper bound guess. 150 cycles might also be just about as good as can be done.

If the table cannot be kept 50% resident in L3, but more like 10% L3 resident then most of that "assumed 60 cycles" improvement would be lost just to the DIMM vs L3 latency. Meanwhile, just packing those hash entries into 16B from 32B (using e.g. a 32-bit hash code and counter and inline string data) could potentially make almost as much a difference in pure Nim by keeping it 100% L3 resident and maybe getting down to 40 cycles from 80. There is, of course, a hazard of focusing too much on the dimensions of this one input data set here which is unlikely to be very informative.

The "usual" informative way to benchmark hash tables is to do (at least) a "fits in L1" case to check code generation effects without a lot of memory system interference and then also a "doesn't fit in Lany" case to check latency/probe sequence effects. A more extreme variant of the latter would be a cold-cache spinning rust Winchester disk that doesn't fit in RAM, but such are usually painfully slow. Everything in between those two extremes gets a funky average of inter-cache-level behaviors that is harder to reason about. I figured I'd parse a web log like @enthus1ast. Generated synthetic data would be more reproducible, although also not exercise the hash function well. Sorry for the verbosity, but it's tricky to benchmark this stuff in a way that actually generalizes, and I've seen a lot of not so informative benchmarks get a lot of readership.

Araq (orginal) [2019-03-25T15:01:14+01:00] view original

Reusing the string used up for the word inside for word in line.split(): is rather easy to do and can be a stdlib patch. It's not hard and Nim's strings are awesome because allocations are expensive and mutability helps for re-using memory aggressively.

cblake (orginal) [2019-03-25T15:01:48+01:00] view original

I wrote: "most lookups are failed lookups", but Oops! 291e3 unique/4.8e6 total = 6% ==> 94% of lookups are successful. So, the memcmp that happens only after hash codes match does happen almost all the time, and so my particular data set is more like 16B/(32B+9B) = 39% cached, not 50%. This doesn't really alter the rest of my analysis, though since 39% and 50% are pretty close.

A corollary of this is that (for that style of data set), Robin Hood hashing w/linear probing would not help. RH can help for failed lookups because probe sequences for missing items become about 50% as long (once the search hash code is > table hash code you know it's missing and where to put the new key). It costs a little work { O(1.15x) } to keep entries along probe sequences sorted by hash code. So, RH only pays off if you care about worst case times or have a workload dominated by failed lookups over long probe sequences. At 56% table load (291./512) the sequences aren't long unless the hash function is weak relative to the data (not the case for me unless all 4 functions were equally weak).

Anyway, RH would be about the only algorithmic speed-up I think could help in the poorly cached large table case, but only for very low hit rate data. It would not be crazy to add a defined(useRobinHood) in the Nim hash table impl (or even have a per-instance run-time flag) since whether RH helps or hurts is so dependent upon workload.

Mirror of forum.nim-lang.org

4738 :: beat c in fefes "competition"