Hi,
I have implemented KNN in Nim. For now I am storing the distance from each test point to every train point, which I won't keep in future versions as it is memory-consuming. I am new to Nim; I am more of an R and Python guy and have run into serious issues with static typing.
I just want to understand what you think about the code.
Below is the code:
import csvtools
import times, macros

var fileName = "mtcars.csv"

type
  mtcars = object
    mpg : float64
    cyl : float64
    disp : float64
    hp : int
    drat : float64
    wt : float64
    qsec : float64
    vs : int
    am : int
    gear : int
    carb : int

const cols : seq[string] = @["mpg","cyl","disp","hp","drat","wt","qsec","vs","am","gear","carb"]
echo cols

macro dif: stmt =
  result = newStmtList()
  var str : string
  for col in cols:
    str = "result." & col & "=x." & col & "-y." & col
    result.add parseStmt(str)

macro sqrm: stmt =
  result = newStmtList()
  var str : string
  for col in cols:
    str = "result." & col & "=x." & col & "*x." & col
    result.add parseStmt(str)

macro difsqrsmm: stmt =
  result = newStmtList()
  var str : string
  for col in cols:
    str = "result=result + float(x." & col & ")"
    result.add parseStmt(str)
  echo result.repr

proc `-`(x: mtcars, y: mtcars): mtcars =
  dif()

proc sqr(x: mtcars): mtcars =
  sqrm()

proc difsqrsm(x: mtcars): float =
  difsqrsmm()

proc distance(x: mtcars, y: mtcars): float =
  var dif = x - y
  dif = sqr(dif)
  result = difsqrsm(dif)

var
  mtrw : seq[mtcars] = @[]

for mtcar in csv[mtcars](fileName, skipHeader = true):
  #echo mtcar
  for i in 1..100_000:
    mtrw.add(mtcar)

echo mtrw.high

var test = mtrw[0..31]
var train = mtrw[32..mtrw.high]
mtrw = @[]

var dist : float
var k : int
k = 5
var eknn : seq[float]
var knn : seq[seq[float]] = @[]

let t0 = epochTime()
for tst in test:
  eknn = @[]
  for trn in train:
    dist = distance(tst, trn)
    eknn.add(dist)
  #echo eknn
  knn.add(eknn)
echo epochTime() - t0
#echo knn
First of all, avoid using const sequences; use arrays if possible. Also, stmt is deprecated; use untyped instead.
You don't really need to store the field names anywhere, actually. Nim's macros do code transformations at compile time, when the field names are known. Try it with:
echo treeRepr(x.getType)
where x is any variable you put into the macro. It will show you the AST for the type.
How do you utilize this information? When you run the above command on your type, you'll see that it (and any object type, actually) has three children, with the field symbols in the third one. This means the information you need can be accessed through obj.getType[2]. It also means the macro can generate code for any object type, not just mtcars.
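As a self-contained illustration of that getType[2] access (using a small two-field type as a stand-in for mtcars; the macro name is mine, not from the thread):

```nim
import macros

type
  Point = object
    x: float64
    y: float64

# Collect the field names of an object type by walking obj.getType[2],
# the RecList of field symbols.
macro fieldNames(obj: typed): untyped =
  var names: seq[string] = @[]
  for sym in obj.getType[2]:
    names.add $sym
  result = newLit(names)

var p: Point
echo fieldNames(p)   # the field names, known at compile time
```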
import strutils
macro dif(obj: typed): untyped =
  result = newStmtList()
  for col in obj.getType[2]:
    result.add parseStmt("result.$1 = x.$1 - y.$1" % [$col])
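To check that this generic macro really works for any object type, here is a runnable sketch with a hypothetical two-field Point type standing in for mtcars:

```nim
import macros, strutils

type
  Point = object
    x, y: float64

# Generic field-wise subtraction: the macro reads the field names from
# the typed argument, so no list of column names is needed.
macro dif(obj: typed): untyped =
  result = newStmtList()
  for col in obj.getType[2]:
    result.add parseStmt("result.$1 = x.$1 - y.$1" % [$col])

proc `-`(x, y: Point): Point =
  dif(x)

let d = Point(x: 3.0, y: 5.0) - Point(x: 1.0, y: 2.0)
echo d   # field-wise difference
```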
You can also pass a typedesc instead; then typeName.getType converts the typedesc into a NimNode (a BracketExpr, to be specific). It has two elements, the second of which is a Sym corresponding to the type, so you need typeName.getType[1].getType[2].
Macros can also be used to create the needed procedure itself, so you don't have to write proc `-`(x, y: A): B = dif(x). You can do it either by using parseExpr/quote or by hand, using newProc.
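A sketch of that idea, generating the whole `-` proc from a typedesc (the macro name and Point type are mine; this relies on the typeName.getType[1].getType[2] access described above):

```nim
import macros, strutils

type
  Point = object
    x, y: float64

# Generate `proc `-`(x, y: T): T` for any object type T.
# T.getType is a BracketExpr; its second child is the Sym of the type.
macro makeDiff(T: typedesc): untyped =
  var body = ""
  for sym in T.getType[1].getType[2]:
    body.add("  result.$1 = x.$1 - y.$1\n" % [$sym])
  result = parseStmt("proc `-`(x, y: $1): $1 =\n$2" % [T.repr, body])

makeDiff(Point)

let d = Point(x: 3.0, y: 5.0) - Point(x: 1.0, y: 2.0)
echo d
```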
Of course I advise you to write a more universal macro, so that it can generate any given code for all of a given type's fields. Remember that the code is currently duplicated, not generalized.
Thanks a lot for your feedback.
I looked into using macros to get the type. It got a bit messy, but yes, to make it more generic I will need to use a macro.
I will be creating a macro which will generate all of this code.
As usual it will take time to learn the best methods in Nim, as I might not be aware of them :).
However, with a bit of direction I am sure I will get there.
Thanks
I think this is an unnecessary use of macros. Here is my version without them.
type
  mtcars = object
    data : array[11, float64]

const cols : array[11, string] = ["mpg","cyl","disp","hp","drat","wt","qsec","vs","am","gear","carb"]
echo(@cols)

proc `-`(x: mtcars, y: mtcars): mtcars =
  for i in low(result.data) .. high(result.data):
    result.data[i] = x.data[i] - y.data[i]

proc sqr(x: mtcars): mtcars =
  for i in low(result.data) .. high(result.data):
    result.data[i] = x.data[i] * x.data[i]

# this only computes a sum, why don't you call it sum?
proc difsqrsm(x: mtcars): float =
  for i in low(x.data) .. high(x.data):
    result = result + x.data[i]

proc distance(x: mtcars, y: mtcars): float =
  var dif = x - y
  dif = sqr(dif)
  result = difsqrsm(dif)
  # this is the squared distance at this point, not the distance
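Following the two comments above, the three procs can also collapse into a single loop, with sqrt added for the actual distance (a sketch over plain arrays, using math from the stdlib):

```nim
import math

# One pass: accumulate squared differences, then take the square root.
proc distance(x, y: openArray[float64]): float =
  for i in 0 ..< x.len:
    result += (x[i] - y[i]) * (x[i] - y[i])
  result = sqrt(result)

echo distance([0.0, 3.0], [4.0, 0.0])   # a 3-4-5 triangle
```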
Unfortunately your CSV reader depended on the layout with the named values, and my change broke that. I am sorry that I don't have a solution for this at the moment. If you had a good math library with vectors, like those from e.g. Matlab or Eigen (C++), you could use vector arithmetic and wouldn't even need to write a single loop. But I don't feel in a position to recommend a specific one from the current Nim options; I guess you would have to evaluate them on your own. I changed your mtcars to contain float64 values only. I don't know if the int type is essential to your algorithm; I ignorantly assumed it is not (sorry if I am wrong here).
If you had a good math library with vectors, like those from e.g. Matlab or Eigen (C++), you could use vector arithmetic and wouldn't even need to write a single loop.
I would recommend the linalg library for that. It supports statically sized / dynamically sized vectors and matrices. I've used it for building the prototype of a project at work; it's missing some things (like dot product) that I had to add in myself, but it has CUDA support and support for basic operations (vector addition, subtraction, multiplication by a scalar, etc.).
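As an aside, even without a dedicated library, operator overloading already gives call sites that read like vector arithmetic. This is only a hand-rolled sketch, not linalg's actual API:

```nim
import math

type Vec = seq[float64]

# Element-wise subtraction of two vectors of equal length.
proc `-`(a, b: Vec): Vec =
  result = newSeq[float64](a.len)
  for i in 0 ..< a.len:
    result[i] = a[i] - b[i]

proc dot(a, b: Vec): float64 =
  for i in 0 ..< a.len:
    result += a[i] * b[i]

proc norm(a: Vec): float64 = sqrt(dot(a, a))

# The call site has no explicit loops:
echo norm(@[0.0, 3.0] - @[4.0, 0.0])
```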
It could be possible to get the column names from the csv itself. You can read files at compile time, and build your object type according to the first line.
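The compile-time read itself would use staticRead; since that requires the csv file to be present when compiling, the sketch below substitutes a literal header line for the staticRead call:

```nim
import strutils

# In real use this would be:
#   const firstLine = staticRead("mtcars.csv").splitLines[0]
const firstLine = "mpg,cyl,disp,hp"
const cols = firstLine.split(",")

echo cols   # column names known at compile time
```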
Peter
@andrea: Yes, and we should learn from the big players, and see how Pandas and data.table (in R) deduce the types.
Edit: Moreover, we could have a common interface, e.g., data["hp"], where "hp" could be translated to data.columns[3] (because "hp" is the fourth column) at compile time. And if the csv file is not available at compile time (or you choose not to read it at compile time), we could use a dictionary to do the same data["hp"] lookup at runtime.
Also, with some special flag, the data could be loaded at compile time making the binary independent of the csv file.
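A minimal sketch of that data["hp"] idea, resolving the column name to an index at compile time via a static string parameter (the names and the 4-column layout are hypothetical):

```nim
const cols = ["mpg", "cyl", "disp", "hp"]

# name is a static string, so cols.find(name) can be forced to evaluate
# at compile time and data["hp"] compiles down to a plain indexed access.
template `[]`(data: array[4, float64], name: static string): float64 =
  data[static(cols.find(name))]

let row = [21.0, 6.0, 160.0, 110.0]
echo row["hp"]
```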
@Krux02 Yes, it is ugly. But having a flag to do it at compile time instead of runtime sounds awesome :)
We had a topic here where someone built a table from JSON at compile time, which eventually ran on a microcontroller.
Thanks a lot for the responses.
I wanted it to be generic.
I don't know if the int type is essential to your algorithm; I ignorantly assumed it is not (sorry if I might be wrong here).
I can have strings there as well. In the case of a string, I will have a difference of 2 if the strings are not the same; otherwise it will be 0. Your approach makes sense to me, especially if I use the linear algebra libraries. In Python, I used numpy for this.
I used csvtools as it gives you easy access to read data from files.
It could be possible to get the column names from the csv itself. You can read files at compile time, and build your object type according to the first line.
I tried this as well but ended up messing things up with my limited knowledge of Nim. In fact, csvtools gives you that option as well, where you can get a seq of CSV rows, and from there I guess you can create the type object.
I guess my issue is that I am thinking in Python and then trying to implement it in Nim. I will need to develop my thinking in Nim.
Thanks