Hi,
I have implemented KNN in Nim. For now I am storing the distance from each test point to every train point, which I won't keep in future versions as it is memory-consuming. I am new to Nim; I am more of an R and Python guy and have run into serious issues with static typing.
I just want to understand what you think about the code.
Below is the code:
import csvtools
import times, macros

var fileName = "mtcars.csv"

type
  mtcars = object
    mpg : float64
    cyl : float64
    disp : float64
    hp : int
    drat : float64
    wt : float64
    qsec : float64
    vs : int
    am : int
    gear : int
    carb : int

const cols : seq[string] = @["mpg","cyl","disp","hp","drat","wt","qsec","vs","am","gear","carb"]
echo cols

macro dif: stmt =
  result = newStmtList()
  var str : string
  for col in cols:
    str = "result." & col & "=x." & col & "-y." & col
    result.add parseStmt(str)

macro sqrm: stmt =
  result = newStmtList()
  var str : string
  for col in cols:
    str = "result." & col & "=x." & col & "*x." & col
    result.add parseStmt(str)

macro difsqrsmm: stmt =
  result = newStmtList()
  var str : string
  for col in cols:
    str = "result=result + float(x." & col & ")"
    result.add parseStmt(str)
  echo result.repr

proc `-`(x: mtcars, y: mtcars): mtcars =
  dif()

proc sqr(x: mtcars): mtcars =
  sqrm()

proc difsqrsm(x: mtcars): float =
  difsqrsmm()

proc distance(x: mtcars, y: mtcars): float =
  var dif = x - y
  dif = sqr(dif)
  result = difsqrsm(dif)

var
  mtrw : seq[mtcars] = @[]

for mtcar in csv[mtcars](fileName, skipHeader = true):
  #echo mtcar
  for i in 1..100_000:
    mtrw.add(mtcar)

echo mtrw.high

var test = mtrw[0..31]
var train = mtrw[32..mtrw.high]
mtrw = @[]

var dist : float
var k : int
k = 5
var eknn : seq[float]
var knn : seq[seq[float]] = @[]

let t0 = epochTime()
for tst in test:
  eknn = @[]
  for trn in train:
    dist = distance(tst, trn)
    eknn.add(dist)
  #echo eknn
  knn.add(eknn)
echo epochTime() - t0
#echo knn
First of all, avoid using const sequences; use arrays if possible. Also, stmt is deprecated; use untyped instead.
You don't really need to store the field names anywhere, actually. Nim's macros do code transformations at compile time, when the field names are known. Try it with:
echo treeRepr(x.getType)
where x is any variable you put into the macro. It will show you the AST for the type.
How do you utilize this information? When you run the above command on your type, you'll see that it (and any object type, actually) has three children, with the field symbols in the third one. This means the information you need can be accessed through obj.getType[2]. It also means the macro can generate code for any object type, not just mtcars.
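As a self-contained illustration of that getType[2] access (using a small two-field type as a stand-in for mtcars; the macro name is mine, not from the thread):

```nim
import macros

type
  Point = object
    x: float64
    y: float64

# Collect the field names of an object type by walking obj.getType[2],
# the RecList of field symbols.
macro fieldNames(obj: typed): untyped =
  var names: seq[string] = @[]
  for sym in obj.getType[2]:
    names.add $sym
  result = newLit(names)

var p: Point
echo fieldNames(p)   # the field names, known at compile time
```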
import strutils
macro dif(obj: typed): untyped =
  result = newStmtList()
  for col in obj.getType[2]:
    result.add parseStmt("result.$1 = x.$1 - y.$1" % [$col])
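To check that this generic macro really works for any object type, here is a runnable sketch with a hypothetical two-field Point type standing in for mtcars:

```nim
import macros, strutils

type
  Point = object
    x, y: float64

# Generic field-wise subtraction: the macro reads the field names from
# the typed argument, so no list of column names is needed.
macro dif(obj: typed): untyped =
  result = newStmtList()
  for col in obj.getType[2]:
    result.add parseStmt("result.$1 = x.$1 - y.$1" % [$col])

proc `-`(x, y: Point): Point =
  dif(x)

let d = Point(x: 3.0, y: 5.0) - Point(x: 1.0, y: 2.0)
echo d   # field-wise difference
```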
You can also pass a typedesc instead; then typeName.getType converts the typedesc into a NimNode (a BracketExpr, to be specific). It has two elements, the second of which is a Sym corresponding to the type, so you need typeName.getType[1].getType[2].
Macros can also be used to create the needed procedure itself, so you don't have to write proc `-`(x, y: A): B = dif(x). You can do it either by using parseExpr/quote or by hand, using newProc.
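A sketch of that idea, generating the whole `-` proc from a typedesc (the macro name and Point type are mine; this relies on the typeName.getType[1].getType[2] access described above):

```nim
import macros, strutils

type
  Point = object
    x, y: float64

# Generate `proc `-`(x, y: T): T` for any object type T.
# T.getType is a BracketExpr; its second child is the Sym of the type.
macro makeDiff(T: typedesc): untyped =
  var body = ""
  for sym in T.getType[1].getType[2]:
    body.add("  result.$1 = x.$1 - y.$1\n" % [$sym])
  result = parseStmt("proc `-`(x, y: $1): $1 =\n$2" % [T.repr, body])

makeDiff(Point)

let d = Point(x: 3.0, y: 5.0) - Point(x: 1.0, y: 2.0)
echo d
```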
Of course I advise you to write a more universal macro, so that it can generate any given code for all of a given type's fields. Remember that the code is currently duplicated, not generalized.
Thanks a lot for your feedback.
I looked into using macros to get the type. It got a bit messy, but yes, to make it more generic I will need to use a macro.
I will be creating a macro which will generate all of this code.
As usual it will take time to learn the best methods in Nim, as I might not be aware of them :).
However, with a bit of direction I am sure I will get there.
Thanks
I think this is an unnecessary use of macros. Here is my version without them.
type
  mtcars = object
    data : array[11, float64]

const cols : array[11, string] = ["mpg","cyl","disp","hp","drat","wt","qsec","vs","am","gear","carb"]
echo(@cols)

proc `-`(x: mtcars, y: mtcars): mtcars =
  for i in low(result.data) .. high(result.data):
    result.data[i] = x.data[i] - y.data[i]

proc sqr(x: mtcars): mtcars =
  for i in low(result.data) .. high(result.data):
    result.data[i] = x.data[i] * x.data[i]

# this only computes a sum, why don't you call it sum?
proc difsqrsm(x: mtcars): float =
  for i in low(x.data) .. high(x.data):
    result = result + x.data[i]

proc distance(x: mtcars, y: mtcars): float =
  var dif = x - y
  dif = sqr(dif)
  result = difsqrsm(dif)
  # this is the squared distance at this point, not the distance
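Following the two comments above, the three procs can also collapse into a single loop, with sqrt added for the actual distance (a sketch over plain arrays, using math from the stdlib):

```nim
import math

# One pass: accumulate squared differences, then take the square root.
proc distance(x, y: openArray[float64]): float =
  for i in 0 ..< x.len:
    result += (x[i] - y[i]) * (x[i] - y[i])
  result = sqrt(result)

echo distance([0.0, 3.0], [4.0, 0.0])   # a 3-4-5 triangle
```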
Unfortunately your CSV reader depended on the layout with the named values, and my change broke that. I am sorry that I don't have a solution for this at the moment. If you had a good math library with vectors, like those from e.g. Matlab or Eigen (C++), you could use vector arithmetic and wouldn't even need to write a single loop. But I don't feel in a position to recommend a specific one from the current Nim options; I guess you would have to evaluate them on your own. I changed your mtcars to contain float64 values only. I don't know if the int type is essential to your algorithm; I ignorantly assumed it is not (sorry if I am wrong here).
If you had a good math library with vectors, like those from e.g. Matlab or Eigen (C++), you could use vector arithmetic and wouldn't even need to write a single loop.
I would recommend the linalg library for that. It supports statically sized / dynamically sized vectors and matrices. I've used it for building the prototype of a project at work; it's missing some things (like dot product) that I had to add in myself, but it has CUDA support and support for basic operations (vector addition, subtraction, multiplication by a scalar, etc.).
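As an aside, even without a dedicated library, operator overloading already gives call sites that read like vector arithmetic. This is only a hand-rolled sketch, not linalg's actual API:

```nim
import math

type Vec = seq[float64]

# Element-wise subtraction of two vectors of equal length.
proc `-`(a, b: Vec): Vec =
  result = newSeq[float64](a.len)
  for i in 0 ..< a.len:
    result[i] = a[i] - b[i]

proc dot(a, b: Vec): float64 =
  for i in 0 ..< a.len:
    result += a[i] * b[i]

proc norm(a: Vec): float64 = sqrt(dot(a, a))

# The call site has no explicit loops:
echo norm(@[0.0, 3.0] - @[4.0, 0.0])
```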
It could be possible to get the column names from the csv itself. You can read files at compile time, and build your object type according to the first line.
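The compile-time read itself would use staticRead; since that requires the csv file to be present when compiling, the sketch below substitutes a literal header line for the staticRead call:

```nim
import strutils

# In real use this would be:
#   const firstLine = staticRead("mtcars.csv").splitLines[0]
const firstLine = "mpg,cyl,disp,hp"
const cols = firstLine.split(",")

echo cols   # column names known at compile time
```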
Peter
@andrea: Yes, and we should learn from the big players, and see how Pandas and data.table (in R) deduce the types.
Edit: Moreover, we could have a common interface, e.g., data["hp"], where "hp" could be translated to data.columns[3] (because "hp" is the fourth column) at compile time. And if the csv file is not available at compile time (or you choose not to read it at compile time), we could use a dictionary to do the same data["hp"] lookup at runtime.
Also, with some special flag, the data could be loaded at compile time making the binary independent of the csv file.
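A minimal sketch of that data["hp"] idea, resolving the column name to an index at compile time via a static string parameter (the names and the 4-column layout are hypothetical):

```nim
const cols = ["mpg", "cyl", "disp", "hp"]

# name is a static string, so cols.find(name) can be forced to evaluate
# at compile time and data["hp"] compiles down to a plain indexed access.
template `[]`(data: array[4, float64], name: static string): float64 =
  data[static(cols.find(name))]

let row = [21.0, 6.0, 160.0, 110.0]
echo row["hp"]
```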
@Krux02 Yes, it is ugly. But having a flag to do it at compile time instead of runtime sounds awesome :)
We had a topic here where someone built a table from JSON at compile time, which eventually ran on a microcontroller.
Thanks a lot for the responses.
I wanted it to be generic.
I don't know if the int type is essential to your algorithm; I ignorantly assumed it is not (sorry if I might be wrong here).
I can have strings there as well. In the case of a string, I will have a difference of 2 if the strings are not the same; otherwise it will be 0. Your approach makes sense to me, especially if I use the linear algebra libraries. In Python, I used numpy for this.
I used csvtools as it gives you easy access to read data from files.
It could be possible to get the column names from the csv itself. You can read files at compile time, and build your object type according to the first line.
I tried this as well but ended up messing things up with my limited knowledge of Nim. In fact, csvtools gives you that option as well, where you can get a seq of CSV rows, and from there I guess you can create the type object.
I guess my issue is that I am thinking in Python and then trying to implement it in Nim. I will need to develop my thinking in Nim.
Thanks