I had to write a quick script for the Caltech machine learning class: quick-and-dirty code implementing a simple algorithm on generated data to answer problem set questions. I wrote pretty sketchy code in Python. The data set is not big enough to benefit from Numpy (I tried, and the overhead swamped the benefits of vectorized calculations over looping).
After I was done I "ported" the code to Nim. I literally pasted the Python code into Sublime Text and started hacking away: replaced lists with seq's, changed loops from "for i in range(x)" to "for i in x .. y", declared and typed all variables, and swapped the name of Python's random lib for Nim's. It was all pretty easy. The result runs in about 1/30 the time of the Python code. That, of course, is not surprising. More delightful is that, with just a little familiarity with Nim's data types and idioms, I can write code as easily as I write Python (at least), and the code is just as clear.
The code is below. Please don't lambaste me--it is junk thrown together late at night to get the problem set done. There are too many loops over the same stuff; yes, it could be fixed. The point is I DID throw it together; it works; and it runs nice and fast for what it is. The problem asked students to do a hundred runs and to try sample sizes of 10 and 100. For kicks, I did 1000 runs with a sample size of 1000. This took nearly 4 minutes in Python (MacBook Pro Retina with an i7 or i5? I forget...) and under 5 seconds in Nim. Nim's relative speedup increases a lot with larger sample sizes because of the nature of the (non-wonderful!) perceptron algorithm.
I would say that the only things I found a teensy bit awkward were:
2. Also, it seems that macros must slow the compiler way down. Being a lazy-butt, I wanted to use Python format codes for the whopping two lines of output; OK, I just wanted to try it. So I downloaded the strfmt library and imported it. Apparently the fmt function is implemented as a macro. Either the library itself is expensive to import or the macro I used is expensive: adding the import and the single function call more than doubled the compilation time. Mind you, the whole time is still in the realm of unnoticeable at 0.4 or 0.5 seconds. It just surprised me that the small time doubled, and I wondered whether it has to do with macros.
All in all, I was very pleased.
import math
import os
import strutils # has fmtFloat to format individual values--using strfmt instead
import strfmt   # to use Python-style format codes

proc simper(bign: int): tuple[avg_iters, avg_disagree_pct: float] =
  ## Simulate data for perceptron learning.
  ## Calculate the perceptron weights.
  ## Simulate a cross-validation data set and evaluate perceptron weights.
  type
    fpoint = tuple[x1: float, x2: float]
  var
    cnt: int = 0
    disagree: int = 0
    crossn: int
    pick: int
    x: seq[fpoint]
    y: seq[float]
    h: seq[float]
    misclassified: seq[int]
    w: array[0..2, float]
    pa: fpoint
    pb: fpoint
    fx: float
    slope: float
    intercept: float
  const
    runs = 1000
  randomize()
  for k in 1 .. runs:
    # create a cutting line for the "true" f(x) from 2 arbitrary points
    pa = (random(2.0)-1.0, random(2.0)-1.0)
    pb = (random(2.0)-1.0, random(2.0)-1.0)
    slope = (pb[1] - pa[1]) / (pb[0] - pa[0])
    intercept = pb[1] - pb[0] * slope
    # echo k, " ", slope, " ", intercept
    # create the simulated dataset
    newSeq(x, bign)
    newSeq(y, bign)
    # label each sample of x with its "true" y label
    for i in 0 .. bign-1:
      x[i] = (random(2.0)-1.0, random(2.0)-1.0)
      fx = slope * x[i][0] + intercept
      y[i] = if x[i][1] >= fx: 1 else: -1
      # echo x[i], " ", fx, " ", y[i]
    # initial pla hypothesis with weights equal to 0
    w = [0.0, 0.0, 0.0]
    newSeq(h, bign)
    for i in 0 .. bign-1:
      h[i] = w[0]*1.0 + w[1]*x[i][0] + w[2]*x[i][1]
    # perform pla to determine g
    while true:
      misclassified = @[]
      for i in 0 .. bign-1:
        if (y[i] < 0) != (h[i] < 0) or h[i] == 0:
          misclassified.add(i)
      if len(misclassified) == 0:
        # echo "****** converged"
        break
      else:
        cnt += 1
        if len(misclassified) == 1:
          pick = misclassified[0]
        else:
          # random(n) yields a value in 0 .. n-1, so pass the full length
          pick = misclassified[random(len(misclassified))]
        w[0] += y[pick]
        w[1] += y[pick] * x[pick][0]
        w[2] += y[pick] * x[pick][1]
        h[pick] = w[0]*1.0 + w[1]*x[pick][0] + w[2]*x[pick][1]
    # simulate a cross-validation set
    crossn = 10 * bign
    newSeq(x, crossn) # probably faster than setting to empty (@[]) and adding
    newSeq(y, crossn) # one element each time through the loop
    for i in 0 .. crossn-1:
      x[i] = (random(2.0)-1.0, random(2.0)-1.0)
      fx = slope * x[i][0] + intercept
      y[i] = if x[i][1] >= fx: 1 else: -1 # "true" label for each x
    # calculate the hypothesis for the cross-validation set
    newSeq(h, crossn)
    for i in 0 .. crossn-1:
      h[i] = w[0]*1.0 + w[1]*x[i][0] + w[2]*x[i][1]
    for i in 0 .. crossn-1:
      if (y[i] < 0) != (h[i] < 0) or h[i] == 0:
        disagree += 1
  # calculate summary stats for the entire run
  var avg_iters = toFloat(cnt) / toFloat(runs)
  var avg_disagree_pct = toFloat(disagree) / (toFloat(runs) * toFloat(crossn))
  echo "{:.3f} {:.3f}".fmt(avg_iters, avg_disagree_pct)
  return (avg_iters, avg_disagree_pct)

# run it and ignore the return value
var xt: int
if paramCount() < 1:
  discard simper(10)
else:
  try:
    xt = parseInt(paramStr(1))
  except:
    # let
    #   e = getCurrentException()
    #   msg = getCurrentExceptionMsg()
    # echo "Got exception ", repr(e), " with message ", msg
    quit("Input argument must be an integer. Exiting.")
  discard simper(xt)
Yes.
Instead of multidimensional arrays I used seq's of tuples. I didn't need slicing, and I doubt it would work given the limitations of Nim slicing; access in loops is fine, though. Slicing would obviously be necessary for real work.
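As a minimal sketch of that representation (point values invented for illustration), a seq of named tuples also reads a bit more clearly than index pairs:

```nim
# Hypothetical sketch of the seq-of-tuples representation described
# above; the point values are made up for illustration.
type FPoint = tuple[x1, x2: float]

var pts: seq[FPoint] = @[(0.5, -0.25), (-1.0, 0.75)]
for p in pts:
  # field access by name instead of p[0]/p[1]
  echo p.x1 + p.x2
```

Field access by name is a small readability win over a 2-D array, at the cost of not being able to slice along the second dimension.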
Are you doing a wrapper for Eigen? That seems like a good way to use a deeply tested library with vectorized calculations. You would just need to invent a nice type for multidimensional arrays and overload operators to keep matrix algebra syntactically clean. Syntactically, Octave/Matlab looks good, and its array literals make it easy to create new matrices. Numpy is more awkward, but once you get used to it, it does what you need pretty quickly.
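For the surface syntax, Nim's operator overloading is already enough; here is a hedged sketch with an invented seq-based Matrix type and an overloaded `*` (a real Eigen wrapper would store an Eigen object and forward the operators to it instead of looping):

```nim
# Invented Matrix type for illustration only; not a library API.
type Matrix = object
  rows, cols: int
  data: seq[float]   # row-major storage

proc `[]`(m: Matrix; i, j: int): float = m.data[i * m.cols + j]
proc `[]=`(m: var Matrix; i, j: int; v: float) = m.data[i * m.cols + j] = v

proc newMatrix(rows, cols: int): Matrix =
  Matrix(rows: rows, cols: cols, data: newSeq[float](rows * cols))

proc `*`(a, b: Matrix): Matrix =
  # naive triple loop; the overload is what keeps call sites clean
  assert a.cols == b.rows
  result = newMatrix(a.rows, b.cols)
  for i in 0 ..< a.rows:
    for j in 0 ..< b.cols:
      for k in 0 ..< a.cols:
        result[i, j] = result[i, j] + a[i, k] * b[k, j]

var a = newMatrix(2, 2)
a[0, 0] = 1.0; a[1, 1] = 1.0   # identity matrix
var b = newMatrix(2, 2)
b[0, 1] = 3.0
echo (a * b)[0, 1]             # multiplying by identity preserves b
```

With `[]`, `[]=`, and the arithmetic operators overloaded, call sites look as clean as Octave even though the storage and kernels live elsewhere.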
I'd be interested in helping out though my coding skills may not be up to what you are trying to do.
I probably won't have a first pass out for at least a few months (there are some compiler bugs blocking my design), but I'll post an announcement when an initial version is ready, and I'd welcome your feedback/bug reports.
it's just missing multidimensional arrays
Maybe you meant something different, but this works:
var a: array[0..2, array[0..1, int]]
a = [[11,12],[21,22],[31,32]]
var b = [[1,2,3],[11,22,33]]
# outputting some values for a test
echo a[0][1], " ", a[1][1], " ", a[2][0]
echo b[0][1], " ", b[0][2], " ", b[1][1]
lewisl: It's just a funny keyword that doesn't fit the rest of Nim. Not sure there is anything else to do; not suggesting anything.
For what it's worth, newSeq isn't actually a keyword; it's a procedure.
Demos: Unfortunately, implementing really good multidimensional arrays in a style like Eigen is not going to happen until Nimrod's static[T] feature works.
You don't need static[T] for this particular use case, as ranges are types also. You can even use macros to get a nicer syntax, e.g.:
import macros
template makeMatrix(r1, r2: expr, el: typedesc): stmt =
  type temp = array[range[r1], array[range[r2], el]]

macro matrix(r1, r2: expr, el: typedesc): auto =
  var t = getAst(makeMatrix(r1, r2, el))
  result = t.last.last.last

type M = matrix(1..10, 1..10, int)
Templates would work, too, except that templates can't actually return types, hence the workaround with getAst and stripping the extraneous stuff.
What Nim doesn't have, and what you do need, e.g., for matrix multiplication of non-square matrices, is higher-kinded types.
Jehan What Nim doesn't have, and what you do need, e.g., for matrix multiplication of non-square matrices, is higher-kinded types.
Looking through the documentation for the Eigen C++ library, I see a lot of template constant parameters but no template template parameters. For the use case of bounds checking matrix multiplication, I don't see where higher-kindedness comes in; you're just checking that the #columns of the left matrix equals the #rows of the right one. What did you have in mind?
That said, I'm curious about how one would translate idiomatic template-heavy C++11 or D2 libraries to Nim. There are not only template constant params, but also template template parameters and variadic template argument lists. Nim macros?
brianrogoff: For the use case of bounds checking matrix multiplication, I don't see where higher-kindedness comes in; you're just checking that the #columns of the left matrix equals the #rows of the right one. What did you have in mind?
Let M[a,b] be an a-by-b matrix. Then we need a signature of the form:
proc `*`(m1: M[a, b], m2: M[b, c]): M[a, c]
To derive the return type generically, you need some form of type constructor as a parameter, i.e. higher-kinded types. (And yes, there are various ways to hack your way around it, none of them pretty.)
Varriount: Will do!
brianrogoff: For my particular case, I haven't had to use template template parameters, but I was planning to use tuple template parameters in place of variadic templates (which I think is slightly cleaner, anyway).
Jehan Let M[a,b] be an a-by-b matrix. Then we need a signature of the form:
proc `*`(m1: M[a, b], m2: M[b, c]): M[a, c]
I didn't spend a lot of time reading the Eigen code, but it looks like matrices with known compile-time sizes simply use template constant parameters to do the size checking. If a, b, and c in your pseudo-Nim (:-)) above are integers or ranges with values known at compile time, I would expect that Nim could do the same, with no constructor parameters.
To derive the return type generically, you need some form of type constructor as a parameter, i.e. higher-kinded types. (And yes, there are various ways to hack your way around it, none of them pretty.)
I must be dense; I still don't see why HKTs are so much better for this specific example than template constant params.
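For the compile-time-size case, the signature can indeed be written with constant parameters alone once static generics work. A hedged sketch, assuming working static int parameters (which current Nim does support); Matrix here is an invented alias, not a standard library type:

```nim
# Sketch only: Matrix is an invented fixed-size alias built from
# static int parameters; the size mismatch check falls out of the
# types, with no type-constructor parameter needed.
type Matrix[R, C: static int] = array[R, array[C, float]]

proc `*`[A, B, C: static int](m1: Matrix[A, B], m2: Matrix[B, C]): Matrix[A, C] =
  for i in 0 ..< A:
    for j in 0 ..< C:
      for k in 0 ..< B:
        result[i][j] += m1[i][k] * m2[k][j]

let
  a: Matrix[1, 2] = [[1.0, 2.0]]
  b: Matrix[2, 1] = [[3.0], [4.0]]
echo (a * b)[0][0]   # 1*3 + 2*4 = 11.0
# a * a would be rejected at compile time, since Matrix[1, 2]
# doesn't match the Matrix[2, C] expected for the second operand
```

The limitation Jehan points at is different: this works for one concrete representation, but you cannot abstract over the Matrix constructor itself.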
IMO, the case for HKTs (template template params in C++ parlance) is things like
template<class T> class Seq {
private:
    std::shared_ptr<std::vector<T>> data;
};

template<class K, class V, template<typename> class C = Seq>
class Map {
    C<K> key;
    C<V> value;
};
where they truly are parameterizing over the constructor.
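For comparison, one of the unpretty Nim workarounds is to pre-apply the constructor and pass the resulting concrete types. A hedged sketch with invented names (Sequ and Map are made up for illustration):

```nim
# Invented types mimicking the C++ Map example above without
# template-template parameters: the caller passes already-applied
# container types rather than the container constructor itself.
type
  Sequ[T] = object
    data: seq[T]
  Map[KC, VC] = object   # KC/VC are pre-applied, e.g. Sequ[int]
    key: KC
    value: VC

var m: Map[Sequ[int], Sequ[string]]
m.key.data.add 42
m.value.data.add "hello"
echo m.key.data[0], " ", m.value.data[0]
# Nothing stops Map[Sequ[int], SomeOtherContainer[string]] -- the
# "both fields use the same constructor" constraint is exactly
# what HKTs would let you express.
```

It compiles and works, but the relationship between the two container fields is no longer enforced by the type system, which is the point of the HKT version.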
Are there any plans to add these to Nimrod? How about variadic generics?
brianrogoff: I must be dense, I still don't see why HKTs are so much better for this specific example than template constant params.
With constant parameters you can't parameterize over M above. Whether it's "better" depends on your needs, of course.