nimforum mirror - strutils.parseEnum is slow.

Parashurama (orginal) [2017-05-08T14:05:12+02:00] view original

Recently I encountered a bottleneck while using parseEnum for some field inside a large CSV file.

parseEnum represented about 60% of the runtime, most of it due to repeated string allocation in a loop.

# strutils module.
proc parseEnum*[T: enum](s: string): T =
  ## Parses an enum ``T``.
  ##
  ## Raises ``ValueError`` for an invalid value in `s`. The comparison is
  ## done in a style insensitive way.
  for e in low(T)..high(T):
    if cmpIgnoreStyle(s, $e) == 0:
      return e
  raise newException(ValueError, "invalid enum value: " & s)

Since 'MyEnum' had many values, rather than hand-rolling a custom parseEnum proc, I wrote it using macros to autogenerate the case switch.

import macros

macro case_list(typ: typedesc[enum]; has_error: static[bool]): untyped =
    result = newStmtList()
    let cas = newNimNode(nnkCaseStmt)
    
    for node in getType(getTypeInst(typ)[1]):
        if node.kind  == nnkEmpty:
            cas.add ident"s"
        else:
            cas.add newNimNode(nnkOfBranch).add( newStrLitNode($node),
                newAssignment(ident"result", node)
            )
    
    if has_error:
        cas.add newNimNode(nnkElse).add(
            newNimNode(nnkRaiseStmt).add(
                newCall(ident("newException"), ident("ValueError"),
                    infix(newStrLitNode("invalid enum value: "), "&", ident("s"))
                )
            )
        )
    else:
        cas.add newNimNode(nnkElse).add(
            newAssignment(ident"result", ident("default"))
        )

proc parseEnum[T: enum](s: string): T =
    case_list(T, true)

proc parseEnum[T: enum](s: string; default: T): T =
    case_list(T, false)

It would be possible to replace the parseEnum proc in strutils by those above, but it would then import the macros module even if those function are not used. It also substantially slowdown nim compilation to c.

I think this is the right approach but would probably need to be implemented as compiler magic.

I have some idea how to do this, but would prefer an official okay to avoid wasting time.

def (orginal) [2017-05-08T14:53:19+02:00] view original

I had the same problem and a similar solution once: https://github.com/nim-lang/Nim/pull/2186

cdome (orginal) [2017-05-08T17:40:49+02:00] view original

Ops, I will be affected as well. In my case I will have lots of non consecutive enums hence it will be even slower.

@Araq, would you accept the fix with fields() and fieldPairs() magic iterators to start working on enums. I think it makes a lot of sense for non-ordinal enums to be able to iterate ids directly, not only low(T)..high(T) hitting lots of values that do not belong to the enum.

fields() # will iterate ints
fieldPairs() # will iterate (string, int) or (cstring, int) or (string{lit}, int)

The parseEnum implementation will become

proc parseEnum*[T: enum](s: string): T =
  ## Parses an enum ``T``.
  ##
  ## Raises ``ValueError`` for an invalid value in `s`. The comparison is
  ## done in a style insensitive way.
  for name, id in fieldPairs(T):
    if cmpIgnoreStyle(s, name) == 0:
      return id
  raise newException(ValueError, "invalid enum value: " & s)

and no string allocations.

Parashurama (orginal) [2017-05-08T20:09:58+02:00] view original

@def: ha I knew I had already seen some code that did this, but I couldn't remember where.

@cdome: That's a good idea. I have some half working version using RTTI hack but it isn't all that fast.

proc parseEnum[T: enum](s: string): T =
    let typ = cast[PNimType](getTypeInfo(T(0)))
    let n = typ.node
    var values = n.sons
    for i in 0..<n.len:
      #cmpIgnoreStyle(cstring, cstring)
      if cmpIgnoreStyle(s, values[i].name) == 0:
            return T(values[i].offset)

yglukhov (orginal) [2017-05-12T12:26:34+02:00] view original

As a side note, why do we need enum names in RTTI unconditionally?

Parashurama (orginal) [2017-05-13T07:46:03+02:00] view original

I think thats how $enum string are generated from.

from memory look for reprEnum proc

Araq (orginal) [2017-05-13T10:25:14+02:00] view original

would you accept the fix with fields() and fieldPairs() magic iterators to start working on enums. I think it makes a lot of sense for non-ordinal enums to be able to iterate ids directly, not only low(T)..high(T) hitting lots of values that do not belong to the enum.

Not sure I like it, with fieldPairs it's still O(N) where the string case is O(1).

cdome (orginal) [2017-05-13T21:57:29+02:00] view original

Hi Araq, could please elaborate "string case is O(1)" a bit more?

O(1) means hashtables to me, hence I presume something along these lines (not a final version):

import tables
import hashes

proc genTableForEnum[T: enum](): Table[string, T] =
  result = initTable[string, T](rightSize(ord(high(T)) - ord(low(T)) + 1))
  for i in low(T)..high(T):
      # TODO: skip holes in enum
      result[$i] = i


proc parseEnum*[T: enum](s: string): T =
  
  proc hash(s: string): Hash =
    # BUG: you can't mixin new Hash function into table
    result = hashIgnoreStyle(s)
  
  const enumTable = genTableForEnum[T]()
  result = enumTable[s]

This one looks really efficient except hash table is bigger than it could have been. I couldn't make tables module to work because I couldn't replace a standard hash with hashIgnoreStyle. I guess tables needs "mixin hash" is several places. Module gentabs did work, but it is deprecated.

Let me know your thoughts.

Araq (orginal) [2017-05-13T22:21:24+02:00] view original

O(1) is a hash table, yes. Nim's case str of ... construct uses hashing (for the C backend, unfortunately, that is not an AST to AST transformation).

cdome (orginal) [2017-05-13T22:29:45+02:00] view original

Would you accept the pull request with compile time hashtable implementation as in example I have provided? Of course in the finished state with handling of all edge cases, tests cases and stuff.

IMO, Parashurama's macro and internal Nim's string case transformation to hash table in order to handle cmpIgnoreCase would require extra allocation of temporary string.

cdome (orginal) [2017-05-14T22:50:46+02:00] view original

Ideal implementation would have hash and == inside parseEnum, but mixin stops working in this way. Looks like yet another bug

import tables
import hashes
import strutils

proc genTableForEnum[T: enum](): Table[string, T] =
  result = initTable[string, T](rightSize(ord(high(T)) - ord(low(T)) + 1))
  for i in items(T):
    # TODO: ignore holes
    result[$i] = i

proc hash(s: string): Hash {.inline.} =
  hashIgnoreStyle(s)

proc `==`(s1, s2: string): bool {.inline.} =
  cmpIgnoreStyle(s1, s2) == 0

proc parseEnum*[T: enum](s: string): T =
  const enumTable = genTableForEnum[T]()
  enumTable[s]

cdome (orginal) [2017-05-14T22:52:28+02:00] view original

Araq, could you please opine on this one?

Mirror of forum.nim-lang.org

2949 :: strutils.parseEnum is slow.