I'm porting a Python NLP module to Nim, and it runs about 4 times slower than the Python version.
It's for a Chinese word segmentation task.
You can see the main test file (takes about 5 seconds):
https://github.com/bung87/finalseg/blob/master/tests/test2.nim
and the Python version's test (takes about 1 second):
https://github.com/bung87/finalseg/blob/master/tests/speed.py
The original Python module file:
https://github.com/fxsjy/jieba/blob/master/jieba/finalseg/__init__.py
nim: Nim Compiler Version 0.18.0 [MacOSX: amd64]
python: Python 3.6.5 (default, Jun 17 2018, 12:13:06)
cpu: 2.7 GHz Intel Core i5
ram: 8 GB 1867 MHz DDR3
First off, compiling with the command-line option -d:release (i.e. nim c -d:release tests/test2.nim) always speeds up Nim code. Even without release mode, though, Nim would be expected to beat Python here, so I suspect the nre module. Beyond that, here are some things I noticed in your code.
Let's go through the cut iterator that your code uses.
iterator cut*(sentence: string): string =
  let blocks: seq[string] = filter(nre.split(sentence, re_han), proc(x: string): bool = x.len > 0)
  var
    tmp = newSeq[string]()
    wordStr: string
  for blk in blocks:
    if isSome(blk.match(re_han)) == true:
      for word in internal_cut(blk):
        wordStr = $word
        if (wordStr in Force_Split_Words == false):
          yield wordStr
        else:
          for c in wordStr:
            yield $c
    else:
      tmp = filter(split(blk, re_skip), proc(x: string): bool = x.len > 0 or x.runeLen() > 0)
      for x in tmp:
        yield x
In two places you call filter and then immediately iterate over the result. Converting an iterator to a seq is fairly expensive, so it's best to do everything in a single pass.
iterator cut*(sentence: string): string =
  for blk in sentence.split(re_han):
    if blk.len == 0: continue
    if blk.match(re_han).isSome:
      for word in internal_cut(blk):
        let wordStr = $word
        if wordStr notin Force_Split_Words:
          yield wordStr
        else:
          for c in wordStr:
            yield $c
    else:
      for x in blk.split(re_skip):
        if x.len > 0 or x.runeLen > 0:
          yield x
This doesn't really improve performance, but I thought I'd include it anyway:
proc lcut*(sentence: string): seq[string] =
  result = lc[y | (y <- cut(sentence)), string]
There is already a template for this purpose in system.nim (the module imported by default) named accumulateResult. It's used like so:
proc lcut*(sentence: string): seq[string] =
  accumulateResult(cut(sentence))
But accumulateResult is deprecated in the devel branch; luckily, you can use sequtils.toSeq at your specific call site instead:
for line in lines:
  discard lcut(line).join("/")

becomes:

# top of file
from sequtils import toSeq

for line in lines:
  discard toSeq(cut(line)).join("/")
This probably doesn't have much to do with the slowness, but you can optimize Table objects with char keys. Tables are currently implemented as a seq of tuple[hash, key, value], and since a char key already serves as its own hash, each entry wastes 8 bytes storing a redundant hash. This might be optimized in a future version of Nim, but for now this works:
proc getFromCharTable[V](charTable: openarray[(char, V)], key: char): V =
  for it in charTable:
    if it[0] == key:
      return it[1]

let foo = {'A': 1, 'B': 2} # the type is an array of (char, int)
echo foo.getFromCharTable('B') # 2
@Hlaaftana, I might be completely off here about what you are suggesting; if so, let me know.
It seems to me, though, that you are doing a linear search through the table for the correct key. With only 256 possibilities this isn't necessarily terrible, but there is another part to a typical hash table: key placement. If the keys are placed in a predictable location and collisions are handled, the table may have to only scan a few keys to find the correct value.
On the other hand, this adds significant complexity and can require allocations (for rehashing), not to mention one would have to benchmark whether, for keys as cheap to compare as chars, the linear scan is an appreciable slowdown at all.
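For char keys specifically, the cheapest "predictable placement" degenerates into direct indexing: with only 256 possible keys, the key's ordinal can simply be its slot, so there are no collisions and no rehashing to worry about. A minimal sketch of that idea (the CharTable name and API are mine, not from any library):

```nim
# Direct-indexed table for char keys: the key's ordinal *is* its slot,
# so lookup is a single array access -- no hashing, no collisions.
type CharTable[V] = object
  values: array[256, V]
  present: array[256, bool]

proc `[]=`[V](t: var CharTable[V], key: char, val: V) =
  t.values[ord(key)] = val
  t.present[ord(key)] = true

proc getOrDefault[V](t: CharTable[V], key: char, default: V): V =
  if t.present[ord(key)]: t.values[ord(key)] else: default

var probs: CharTable[float]
probs['B'] = -0.26
echo probs.getOrDefault('B', -3.14e100)  # the stored value, -0.26
echo probs.getOrDefault('M', -3.14e100)  # missing key, falls back to the default
```

The cost is a fixed 256-slot array per table, which is only worthwhile when the value type is small; whether it beats the stdlib Table in practice would still need benchmarking.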
On Linux, perf is also very nice -- I still wonder why Dom didn't even mention it in his guide.
https://fedoramagazine.org/performance-profiling-perf/
Basic usage is
perf record yourNimExecutable && perf report
On macOS, you can use Instruments (part of XCode, but it can also be used as a standalone app) to do time profiling.
On macOS Sierra or later, you can run it from the commandline using:
instruments -l 10000 -D output.trace -t "Time Profiler" /path/to/executable args...
(On older versions of macOS, the iprofiler command does the same thing, though with different options; for those, see iprofiler --help or man iprofiler. You can also run programs from the Instruments UI, but that is more cumbersome to set up.)
The -l option is for the maximum number of seconds the program may run (after that, it will be killed), the -D option specifies the output directory for the trace, and the -t option allows you to choose what kind of analysis to run.
Then open output.trace to inspect the profile in the Instruments UI. You can see a sample screenshot of the UI with the output from the original code here. Don't forget to invert the call tree in the UI (using the button at the bottom of the window); it gives you a much more useful breakdown of where time is spent.
@bung You can replace all code like
p1 = if probRef.hasKey(vChar): probRef.getOrDefault(vChar) else: MIN_FLOAT
by
p1 = probRef.getOrDefault(vChar, MIN_FLOAT)
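This does the lookup once instead of twice. A self-contained illustration (the table's contents and the MIN_FLOAT value below are made-up stand-ins for the real model tables, mirroring jieba's "log of zero" constant):

```nim
import tables

const MIN_FLOAT = -3.14e100  # stand-in for the module's log-probability floor

var probRef = {'B': -0.26, 'S': -1.46}.toTable

# Two lookups: hasKey, then getOrDefault.
let p1Slow = if probRef.hasKey('M'): probRef.getOrDefault('M') else: MIN_FLOAT
# One lookup, same result: getOrDefault with an explicit default.
let p1Fast = probRef.getOrDefault('M', MIN_FLOAT)

doAssert p1Slow == p1Fast
```

Since these lookups sit inside the inner Viterbi loop, halving them can add up.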