nimforum mirror - Sequence holding references to object with an attribute of generic type T

REQU (orginal) [2018-09-12T15:47:47+02:00] view original

Hi all,

I am learning Nim, coming from Python and R as a data scientist. As I am a big proponent of learning by doing, I am implementing, or at least trying to implement, some small libraries.

My current project is a converter from deeply nested JSON objects to a DataFrame/flat-file format. I have a working Python implementation, but it is too slow for 'larger' JSON files of 10k+ lines.

The flat-file variant is intended to have rows that each represent a level in the JSON. For this to work I need an object that can grow in size and hold mixed types.

My approach is to use a Row object with an attribute, elements, that is a sequence. As, to the best of my knowledge, sequences cannot hold more than one type, I want to create a sequence that holds references to Element objects, which have an attribute of a generic type.

Each element is essentially a single JSON value (int, float, bool, string) plus the name of its closest parent, i.e. the dict key. My approach is inspired by the (Py)Spark Row implementation, https://spark.apache.org/docs/1.1.1/api/python/pyspark.sql.Row-class.html.

However, I can't seem to get this working. I have tried making multiple type-specific elements and a reference to them, but I could not get the seq to accept a ref to those elements as its type.

Ideally, I would like to have something like the code below:

type
  Element = ref ElementObj
  ElementObj*[T] = object
      parent: string
      data: T

type
  Row* = ref object
      columns: seq[string]
      elements: seq[Element]

proc newElement*[T](parent: string, data: T): ref ElementObj[T] =
    new(result)
    result.parent = parent
    result.data = data

var element = newElement("index", 12)
echo element.parent
echo element.data

# The above compiles, when including the below it does not

proc newRow*(): Row =
    new(result)
    result.columns = @[]
    result.elements = @[]

var row = newRow()

I get the following compile error:

Error: invalid type: 'ElementObj' in this context: 'proc (): Row' for proc

I have tried different variations and objects but I can't seem to get it to work. Help would be much appreciated. I am very open to changing my approach if this is impossible, inefficient or otherwise ill-advised.

Thanks,

Ralph

DeletedUser (orginal) [2018-09-12T17:15:10+02:00] view original

You cannot have a seq with both ElementObj[int] and ElementObj[string] in it, if that's what you're asking, but there are a few ways you can make it possible.

Firstly you can use object variants to your advantage. The implementation would be like:

type
  ElementKind* = enum
    ekInt, ekString
  Element = ref ElementObj
  ElementObj = object
      parent: string
      case kind: ElementKind
      of ekInt:
        intData: int
      of ekString:
        stringData: string

proc newElement(parent: string, data: int): Element =
    # Build the variant in one step: current Nim versions restrict
    # changing `kind` once the object exists.
    result = Element(parent: parent, kind: ekInt, intData: data)

proc newElement(parent: string, data: string): Element =
    result = Element(parent: parent, kind: ekString, stringData: data)

var element = newElement("index", 12)

The second way is object inheritance, which isn't a very common approach but works for OOP purposes.

type
  Element = ref ElementObj
  ElementObj = object of RootObj
      parent: string
  
  StringElement = ref StringElementObj
  StringElementObj = object of ElementObj
    data: string
  
  IntElement = ref IntElementObj
  IntElementObj = object of ElementObj
    data: int

proc newElement(parent: string, data: int): IntElement =
    new(result)
    result.parent = parent
    result.data = data

proc newElement(parent: string, data: string): StringElement =
    new(result)
    result.parent = parent
    result.data = data

var element = newElement("index", 12)
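
To actually mix these in one sequence, the elements have to be stored under the base type and checked again when read. A minimal sketch of that, assuming condensed versions of the types above (the field values and the `elements` variable are made up for illustration):

```nim
type
  Element = ref object of RootObj
    parent: string
  StringElement = ref object of Element
    data: string
  IntElement = ref object of Element
    data: int

# Both subtypes convert implicitly to the base type, so one seq can hold them.
var elements: seq[Element] = @[]
elements.add(IntElement(parent: "index", data: 12))
elements.add(StringElement(parent: "name", data: "Ralph"))

for e in elements:
  # `of` checks the runtime type; the conversion then recovers the subtype.
  if e of IntElement:
    echo e.parent, ": ", IntElement(e).data
  elif e of StringElement:
    echo e.parent, ": ", StringElement(e).data
```

The explicit `of` checks are the price of storing everything as Element; methods avoid them.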

Note that if you want dynamic dispatch you will have to use methods instead of procs. This is what I mean:

type
  Fruit = ref object of RootObj # needs to be a ref type for dynamic dispatch
  Apple = ref object of Fruit
  Pear = ref object of Fruit

method newElement(parent: string, fruit: Fruit) =
#proc newElement(parent: string, fruit: Fruit) =
  echo "fruit"

method newElement(parent: string, apple: Apple) =
#proc newElement(parent: string, apple: Apple) =
  echo "apple"

var fruit: Fruit
fruit = Apple()
newElement("index", fruit) # "fruit" if you used the proc keyword, "apple" if you used the method keyword

Finally, you can also use the RTTI-based typeinfo.Any. This option isn't limited to a few types, but it is unsafe. I haven't seen anyone actually use it, though.

import typeinfo

type
  Element = ref ElementObj
  ElementObj = object
      parent: string
      data: Any

proc newElement(parent: string, data: Any): Element =
    new(result)
    result.parent = parent
    result.data = data

var num = 12
var element = newElement("index", num.toAny)
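
Reading a value back out of an Any means checking its kind at run time. A minimal sketch, assuming the typeinfo accessors kind, getInt and getString; the variable names and values are made up for illustration. As far as I know, an Any only points at the original variable, so that variable has to stay alive while the Any is in use:

```nim
import typeinfo

# Illustrative values; toAny needs a variable, not a literal.
var num = 12
var name = "Ralph"

var values: seq[Any] = @[]
values.add(num.toAny)
values.add(name.toAny)

for v in values:
  # The stored kind has to be inspected before a typed getter is called.
  case v.kind
  of akInt: echo "int: ", v.getInt
  of akString: echo "string: ", v.getString
  else: echo "unhandled kind: ", v.kind
```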

mratsim (orginal) [2018-09-12T17:22:14+02:00] view original

Hello fellow data scientist.

Your approach is not ideal but let's first correct your code.

This is the fixed one:

type
  Element[T] = ref ElementObj[T]
  ElementObj*[T] = object
      parent: string
      data: T

type
  Row*[T] = object
    columns: seq[string]
    elements: seq[Element[T]]

proc newElement*[T](parent: string, data: T): Element[T] =
    new(result)
    result.parent = parent
    result.data = data

var element = newElement("index", 12)
echo element.parent
echo element.data

# With the generic parameter in place, the row code below compiles as well

proc newRow*(T: typedesc): Row[T] =
  result.columns = @[]
  result.elements = @[]

var row = newRow(int)

# Output:
#   - index
#   - 12

Elements and Rows need the T marker to be generic. Also, a generic sequence such as seq[Element[T]] can be instantiated as either seq[Element[int]] or seq[Element[float]], but one sequence cannot hold both.

Why? Because at a low level, different types take up different amounts of memory. Python supports heterogeneous lists because each element is hidden behind a pointer (32 bits on a 32-bit architecture, 64 bits on a modern one); we say that the types are boxed.

This is a form of type erasure.

Here is an example:

import typetraits

type
  Element = ref object of RootObj
  ElementString = ref object of Element
    data: string
  ElementInt = ref object of Element
    data: int

type
  Row* = object
    columns: seq[string]
    elements: seq[Element]

proc newElement*[T](parent: string, data: T): Element =
  when data is string:
    result = ElementString(data: data)
  elif data is int:
    result = ElementInt(data: data)
  else:
    {.fatal: "Unsupported type: " & T.name .}

proc initRow*(): Row =
  result.columns = @[]
  result.elements = @[]

method `$`(x: Element): string {.base.} =
  raise newException(ValueError, "Overload me!")

method `$`(x: ElementString): string =
  x.data

method `$`(x: ElementInt): string =
  $x.data

let element = newElement("index", 12)
let row = initRow()

echo element
echo row

# Output:
#   - 12
#   - (columns: @[], elements: @[])

The other form of type erasure in Nim is through object variants, also called tagged unions:

import typetraits

type
  ElementKind = enum
    ekString, ekint
  Element = object
    case kind: ElementKind
    of ekString:
      sData: string
    of ekInt:
      iData: int

type
  Row* = object
    columns: seq[string]
    elements: seq[Element]

proc initElement*[T](parent: string, data: T): Element =
  when data is string:
    result = Element(kind: ekString, sData: data)
  elif data is int:
    result = Element(kind: ekInt, iData: data)
  else:
    {.fatal: "Unsupported type: " & T.name .}

proc initRow*(): Row =
  result.columns = @[]
  result.elements = @[]

let element = initElement("index", 12)
let row = initRow()

echo element
echo row

# Output:
#   - (kind: ekint, iData: 12)
#   - (columns: @[], elements: @[])

Now on the differences:

As you can see, the first kind (boxing) uses ref and inheritance. ref means that the data is allocated on the heap, and allocation is very expensive when done in a loop. The advantage is that if you write a library that uses inheritance, your types can be extended by its users.

The second kind, object variants, is allocated on the stack, so it is much faster, but it cannot be extended by users without forking the library. Also, while you always have to use a case statement to check the "tag"/"kind", branch predictors nowadays are very good at that, and it is much less costly than memory accesses. The main issue is ergonomic: the fields are not named the same, but you can always write a data wrapper proc that selects the proper field.
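
A minimal sketch of such a wrapper, assuming the variant type from above with a parent field added; `dataAsString` is a hypothetical helper name and the sample values are made up:

```nim
type
  ElementKind = enum
    ekString, ekInt
  Element = object
    parent: string
    case kind: ElementKind
    of ekString:
      sData: string
    of ekInt:
      iData: int

# Hypothetical helper: one accessor that hides the per-branch field names
# by rendering whichever field is active as a string.
proc dataAsString(e: Element): string =
  case e.kind
  of ekString: e.sData
  of ekInt: $e.iData

let row = @[
  Element(parent: "name", kind: ekString, sData: "Ralph"),
  Element(parent: "index", kind: ekInt, iData: 12)
]

for e in row:
  echo e.parent, " = ", e.dataAsString
```

Returning a string sidesteps the question of what a single `data` proc should return for different branches; a typed accessor per kind would be the alternative.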

In conclusion, the json module already does half the work for you: just reuse the JsonNode type. Also be sure to check out NimData.
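
As a rough sketch of what reusing JsonNode could look like: JsonNode is itself an object variant, so a seq of (key, JsonNode) pairs already holds mixed value types. The sample document and the `elements` variable are made up for illustration:

```nim
import json

# Made-up sample document, for illustration only.
let doc = parseJson("""{"index": 12, "name": "Ralph", "score": 1.5, "ok": true}""")

# Each row element pairs the parent key with the untouched JsonNode.
var elements: seq[(string, JsonNode)] = @[]
for key, value in doc.pairs:
  elements.add((key, value))

for e in elements:
  let key = e[0]
  let value = e[1]
  # The node's kind plays the role of the "tag" discussed above.
  case value.kind
  of JInt: echo key, ": int ", value.getInt
  of JFloat: echo key, ": float ", value.getFloat
  of JString: echo key, ": string ", value.getStr
  of JBool: echo key, ": bool ", value.getBool
  else: echo key, ": nested ", value.kind
```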

REQU (orginal) [2018-09-12T22:43:51+02:00] view original

Hi @Hlaaftana, @mratsim,

Thank you both for the elaborate answers, really appreciated. I actually had a look at object variants but was turned off by the different field names. Judging from your answers, though, it seems to be the best approach.

@Hlaaftana, the typeinfo.Any variant is interesting but perhaps not ideal.

@mratsim, I am using the json module, but not very effectively, it seems. I was planning on using NimData as the final object and the rows to build the columns. I also appreciate the mention of the performance costs of both approaches.

@both, thanks again. The whole "no heterogeneous data structures" idea takes a bit of a change in mindset.

gemath (orginal) [2018-09-14T09:53:54+02:00] view original

All the practical advice has been given, so here comes a general rant against how generic types are named and defined:

The OP's problem started here:

type
  Row* = ref object
    columns: seq[string]
    elements: seq[Element]      # <---- Problem

So many perfectly smart people who start using generics fall into this pit: Element is not a "concrete type". And it's no wonder that they fall, because the definition of Element looks like this:

type
  Element = ref ElementObj
  ElementObj*[T] = object
    parent: string
    data: T

It uses the type keyword, so ElementObj and, by extension, Element must be some kind of type, right? No, they aren't. Each is the class of types that can be created by calling the type constructor ElementObj[] with all possible values of the type parameter T. Element is a type class, not a "generic type".
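
A small sketch of the distinction, using a non-ref ElementObj to keep it short; `describe` and the sample values are hypothetical. The bare name is usable where a class of types makes sense (a parameter constraint), while storage needs one concrete member of that class:

```nim
type
  ElementObj[T] = object
    parent: string
    data: T

# As a parameter constraint the bare name acts as a type class:
# the proc is instantiated once per concrete T it is called with.
proc describe(e: ElementObj): string =
  e.parent & ": " & $e.data

let a = ElementObj[int](parent: "index", data: 12)
let b = ElementObj[string](parent: "name", data: "Ralph")
echo describe(a)
echo describe(b)

# Storage needs one concrete type, so the parameter has to be spelled out.
# A field such as `elements: seq[ElementObj]` names no single type, which is
# what the error in the original post reports.
var ints: seq[ElementObj[int]] = @[a]
echo ints.len
```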

Nim should drop this confusing nomenclature inherited from C, call things what they are and introduce a typeclass keyword for "generic types", concepts and everything else which is not a "concrete type".

REQU (orginal) [2018-09-14T14:29:38+02:00] view original

Fair point. This 'trap' is probably more of an issue for people coming from dynamic languages, who are naive enough to think that if it looks like a duck and is called a duck, it must be a duck.

Mirror of forum.nim-lang.org

4192 :: Sequence holding references to object with an attribute of generic type T