Hi all,
I am learning Nim, coming from Python and R as a data scientist. As I am a big proponent of learning by doing, I am implementing, or trying to implement, some small libraries.
My current project is a converter from deeply nested JSON objects to a DataFrame/flatfile format. I have a working Python implementation, but it is too slow for 'larger' JSON files (10k+ lines).
The flatfile variant is intended to have rows that each represent a level in the JSON. For this to work I need an object that can grow in size and hold mixed types.
My approach is to use a Row object with an attribute elements that is a sequence. As, to the best of my knowledge, sequences cannot hold more than one type, I want to create a sequence that holds references to Element objects, which have an attribute of a generic type.
Each element is essentially a single JSON value (int, float, bool, string) plus the name of the closest parent, i.e. the dict key. My approach is inspired by the (py)Spark Row implementation, https://spark.apache.org/docs/1.1.1/api/python/pyspark.sql.Row-class.html.
However, I can't seem to get this working. I have tried making multiple type-specific elements and a reference to them, but I could not get the seq to accept a ref to those elements as a type.
Ideally, I would like to have something like the code below:
type
  Element = ref ElementObj
  ElementObj*[T] = object
    parent: string
    data: T

type
  Row* = ref object
    columns: seq[string]
    elements: seq[Element]

proc newElement*[T](parent: string, data: T): ref ElementObj[T] =
  new(result)
  result.parent = parent
  result.data = data

var element = newElement("index", 12)
echo element.parent
echo element.data

# The above compiles, when including the below it does not
proc newRow*(): Row =
  new(result)
  result.columns = @[]
  result.elements = @[]

var row = newRow()
I get the following compile error:
Error: invalid type: 'ElementObj' in this context: 'proc (): Row' for proc
I have tried different variations and objects but I can't seem to get it to work. Help would be much appreciated. I am very open to changing my approach if this is impossible, inefficient or otherwise ill-advised.
Thanks,
Ralph
You cannot have a seq with both ElementObj[int] and ElementObj[string] in it, if that's what you're asking, but there are a few ways you can make it possible.
Firstly you can use object variants to your advantage. The implementation would be like:
type
  ElementKind* = enum
    ekInt, ekString
  Element = ref ElementObj
  ElementObj = object
    parent: string
    case kind: ElementKind
    of ekInt:
      intData: int
    of ekString:
      stringData: string

# The kind is set in the object constructor: assigning to kind afterwards
# must not switch the active branch.
proc newElement(parent: string, data: int): Element =
  Element(parent: parent, kind: ekInt, intData: data)

proc newElement(parent: string, data: string): Element =
  Element(parent: parent, kind: ekString, stringData: data)

var element = newElement("index", 12)
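To tie this back to the original question, here is a minimal sketch of how such variant elements can share one seq. The describe helper and the sample values are made up for illustration; they are not part of any library.

```nim
type
  ElementKind = enum
    ekInt, ekString
  Element = ref object
    parent: string
    case kind: ElementKind
    of ekInt:
      intData: int
    of ekString:
      stringData: string

proc newElement(parent: string, data: int): Element =
  Element(parent: parent, kind: ekInt, intData: data)

proc newElement(parent: string, data: string): Element =
  Element(parent: parent, kind: ekString, stringData: data)

# Hypothetical helper: render any element as text by branching on its kind.
proc describe(e: Element): string =
  case e.kind
  of ekInt:
    result = e.parent & "=" & $e.intData
  of ekString:
    result = e.parent & "=" & e.stringData

# Both kinds have the same static type, so one seq can hold them.
var elements: seq[Element] = @[]
elements.add newElement("index", 12)
elements.add newElement("name", "Ralph")

for e in elements:
  echo describe(e)
```

Every element has the static type Element, so the seq stays homogeneous from the compiler's point of view; the mixed payload is only visible through the kind field at run time.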
The second one is using object inheritance, which isn't a very common approach but works for OOP purposes.
type
  Element = ref ElementObj
  ElementObj = object of RootObj
    parent: string
  StringElement = ref StringElementObj
  StringElementObj = object of ElementObj
    data: string
  IntElement = ref IntElementObj
  IntElementObj = object of ElementObj
    data: int

proc newElement(parent: string, data: int): IntElement =
  new(result)
  result.parent = parent
  result.data = data

proc newElement(parent: string, data: string): StringElement =
  new(result)
  result.parent = parent
  result.data = data

var element = newElement("index", 12)
Note that if you want dynamic dispatch you will have to use methods instead of procs. This is what I mean:
type
  Fruit = ref object of RootObj # needs to be a ref type for dynamic dispatch
  Apple = ref object of Fruit
  Pear = ref object of Fruit

# {.base.} marks the root of the method hierarchy; drop it if you switch to proc
method newElement(parent: string, fruit: Fruit) {.base.} =
#proc newElement(parent: string, fruit: Fruit) =
  echo "fruit"

method newElement(parent: string, apple: Apple) =
#proc newElement(parent: string, apple: Apple) =
  echo "apple"

var fruit: Fruit
fruit = Apple()
newElement("index", fruit) # "fruit" if you used the proc keyword, "apple" if you used the method keyword
And finally, you can also use the RTTI-based typeinfo.Any. This one isn't limited to a few types, but it is unsafe, and I haven't seen anyone actually use it.
import typeinfo

type
  Element = ref ElementObj
  ElementObj = object
    parent: string
    data: Any

proc newElement(parent: string, data: Any): Element =
  new(result)
  result.parent = parent
  result.data = data

var num = 12
var element = newElement("index", num.toAny)
Hello fellow data scientist.
Your approach is not ideal but let's first correct your code.
This is the fixed one:
type
  Element[T] = ref ElementObj[T]
  ElementObj*[T] = object
    parent: string
    data: T

type
  Row*[T] = object
    columns: seq[string]
    elements: seq[Element[T]]

proc newElement*[T](parent: string, data: T): Element[T] =
  new(result)
  result.parent = parent
  result.data = data

var element = newElement("index", 12)
echo element.parent
echo element.data

# Row is generic as well, so newRow takes the element type as a typedesc
proc newRow*(T: typedesc): Row[T] =
  result.columns = @[]
  result.elements = @[]

var row = newRow(int)

# Output:
# - index
# - 12
Element and Row need the [T] type parameter to be generic. Also, a generic sequence such as seq[Element[T]] can be instantiated as seq[Element[int]] or as seq[Element[float]], but a single seq cannot hold both.
Why? Because at a low level, different types take up different amounts of memory. Python supports heterogeneous lists because each element is hidden behind a pointer (32 bits on a 32-bit architecture, 64 bits on a modern one); we say that the types are boxed.
This is a form of type erasure.
Here is an example that uses ref objects and inheritance:
import typetraits

type
  Element = ref object of RootObj
  ElementString = ref object of Element
    data: string
  ElementInt = ref object of Element
    data: int

type
  Row* = object # not generic: the element type is erased behind Element
    columns: seq[string]
    elements: seq[Element]

# parent is accepted but not stored in this minimal example
proc newElement*[T](parent: string, data: T): Element =
  when data is string:
    result = ElementString(data: data)
  elif data is int:
    result = ElementInt(data: data)
  else:
    {.fatal: "Unsupported type: " & T.name .}

proc initRow*(): Row =
  result.columns = @[]
  result.elements = @[]

method `$`(x: Element): string {.base.} =
  raise newException(ValueError, "Overload me!")

method `$`(x: ElementString): string =
  x.data

method `$`(x: ElementInt): string =
  $x.data

let element = newElement("index", 12)
let row = initRow()
echo element
echo row

# Output:
# - 12
# - (columns: @[], elements: @[])
The other form of type erasure in Nim is through object variants, also called tagged unions:
import typetraits

type
  ElementKind = enum
    ekString, ekInt
  Element = object
    case kind: ElementKind
    of ekString:
      sData: string
    of ekInt:
      iData: int

type
  Row* = object
    columns: seq[string]
    elements: seq[Element]

# parent is accepted but not stored in this minimal example
proc initElement*[T](parent: string, data: T): Element =
  when data is string:
    result = Element(kind: ekString, sData: data)
  elif data is int:
    result = Element(kind: ekInt, iData: data)
  else:
    {.fatal: "Unsupported type: " & T.name .}

proc initRow*(): Row =
  result.columns = @[]
  result.elements = @[]

let element = initElement("index", 12)
let row = initRow()
echo element
echo row

# Output:
# - (kind: ekInt, iData: 12)
# - (columns: @[], elements: @[])
Now on the differences:
As you can see, the first kind (boxing) uses ref and inheritance. ref means that the data is allocated on the heap, and allocation is very expensive when done in a loop. The advantage is that if you write a library that uses inheritance, your types can be extended by its users.
For the second kind, object variants, the data is stored inline (on the stack, or directly inside the seq's buffer), with no separate heap allocation for the element itself, so it is much faster, but it cannot be extended by users without forking the library. You do always have to use a case statement to check the "tag"/"kind", but branch predictors are very good at that nowadays, and a branch costs much less than a memory access. The main ergonomic issue is that the fields are not named the same, but you can always write a proc data wrapper that selects the proper field.
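One possible shape for such a wrapper is sketched below. The accessor name data and its typedesc parameter are my own choice for illustration; the types mirror the object variant example above.

```nim
type
  ElementKind = enum
    ekString, ekInt
  Element = object
    case kind: ElementKind
    of ekString:
      sData: string
    of ekInt:
      iData: int

# Hypothetical accessor: T picks the expected payload type at compile time,
# and the element's kind is checked at run time.
proc data(e: Element, T: typedesc): T =
  when T is string:
    doAssert e.kind == ekString
    result = e.sData
  elif T is int:
    doAssert e.kind == ekInt
    result = e.iData
  else:
    {.fatal: "Unsupported type".}

let a = Element(kind: ekInt, iData: 12)
let b = Element(kind: ekString, sData: "twelve")
echo a.data(int)
echo b.data(string)
```

The caller still has to know which type to ask for, so this only hides the differing field names; it does not remove the need to check kind when the type is not known in advance.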
Now in conclusion, the json module already did half the work for you, just reuse the JsonNode type. Also be sure to check NimData.
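As a rough sketch of what reusing JsonNode could look like: the flatten helper below, its dotted path format and the sample document are all assumptions made for illustration. It walks a parsed document and collects one (path, value) pair per leaf, keeping each value as a JsonNode so no custom Element type is needed.

```nim
import json

# Hypothetical helper: collect (path, value) pairs for every leaf node.
# JsonNode already acts as a tagged union over the JSON types.
proc flatten(node: JsonNode, prefix: string,
             acc: var seq[(string, JsonNode)]) =
  case node.kind
  of JObject:
    for key, value in node.pairs:
      let path = if prefix.len == 0: key else: prefix & "." & key
      flatten(value, path, acc)
  of JArray:
    var i = 0
    for item in node.items:
      flatten(item, prefix & "[" & $i & "]", acc)
      inc i
  else:
    # Leaf: string, int, float, bool or null
    acc.add((prefix, node))

let doc = parseJson("""{"index": 12, "user": {"name": "Ralph", "tags": ["a", "b"]}}""")
var leaves: seq[(string, JsonNode)] = @[]
flatten(doc, "", leaves)

for leaf in leaves:
  echo leaf[0], " = ", leaf[1]
```

Typed values can then be pulled out per column with getInt, getStr, getFloat or getBool once the kind of each leaf is known.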
Hi @Hlaaftana, @mratsim,
Thank you both for the elaborate answers, really appreciated. I actually had a look at object variants but was turned off by the different field names. Judging from your answers, though, it seems that this is the best approach.
@Hlaaftana, the typeinfo.Any variant is interesting but perhaps not ideal.
@mratsim, I am using the json module, but not very effectively it seems. I was planning on using NimData as the final object and the rows to build the columns. I also appreciate the mention of the performance costs of both approaches.
@both, thanks again. The whole "no heterogeneous data structures" thing takes a bit of a change in mindset.
All the practical advice has been given, so here comes a general rant against how generic types are named and defined:
The OP's problem started here:
type
  Row* = ref object
    columns: seq[string]
    elements: seq[Element] # <---- Problem
So many perfectly smart people who start using generics fall into this pit: Element is not a "concrete type". And it's no wonder that they fall, because the definition of Element looks like this:
type
  Element = ref ElementObj
  ElementObj*[T] = object
    parent: string
    data: T
It uses the type keyword, so ElementObj and, by extension, Element must be some kind of type, right? No, it isn't. It's the class of types which can be created by calling the type constructor ElementObj[] with all possible values of the type parameter T. Element is a type class, not a "generic type".
Nim should drop this confusing nomenclature inherited from C, call things what they are and introduce a typeclass keyword for "generic types", concepts and everything else which is not a "concrete type".