I had to write a quick script for the Caltech machine learning class: quick-and-dirty code implementing a simple algorithm on generated data to answer problem set questions. I wrote pretty sketchy code in Python. The data set is not big enough to benefit from Numpy (I tried, and the overhead swamped the benefits of vectorized calculations over looping).
After I was done I "ported" the code to Nim. I literally pasted the Python code into Sublime Text and started hacking away: replaced lists with seq's, changed loops from "for i in range(x)" to "for i in x .. y", declared and typed all variables, and swapped the name of Python's random lib for Nim's. It was all pretty easy. The result runs in about 1/30 the time of the Python code. That, of course, is not surprising. More delightful is that, with just a little familiarity with Nim's data types and idioms, I can write code as easily as I write Python (at least), and the code is just as clear.
The code is below. Please don't lambaste me--it is junk thrown together late at night to get the problem set done. There are too many loops over the same stuff; yes, it could be fixed. The point is I DID throw it together; it works; and it runs nice and fast for what it is. The problem asked students to do a hundred runs and to try sample sizes of 10 and 100. For kicks, I did 1000 runs with a sample size of 1000. This took nearly 4 minutes in Python (MacBook Pro Retina with an i7 or i5? I forget...) and under 5 seconds in Nim. Nim's relative speedup increases a lot with larger sample sizes because of the nature of the (non-wonderful!) perceptron algorithm.
I would say that the only things I found a teensy bit awkward were:
2. Also, it seems that macros must slow the compiler way down. Being a lazy-butt, I wanted to use Python format codes for the whopping two lines of output; OK, I just wanted to try it. So I downloaded the strfmt library and imported it. Apparently the fmt function is implemented as a macro. Either the library itself is expensive to import or the macro I used is expensive: adding the import and the single function call more than doubled the compilation time. Mind you, the whole time is still in the realm of unnoticeable at 0.4 or 0.5 seconds. It just surprised me that the small time doubled, and I wondered whether it has to do with macros.
All in all, I was very pleased.
import math
import os
import strutils # has fmtFloat to format individual values--using strfmt instead
import strfmt   # to use Python-style format codes

proc simper(bign: int): tuple[avg_iters, avg_disagree_pct: float] =
  ## Simulate data for perceptron learning.
  ## Calculate the perceptron weights.
  ## Simulate a cross-validation data set and evaluate perceptron weights.
  type
    fpoint = tuple[x1: float, x2: float]
  var
    cnt: int = 0
    disagree: int = 0
    crossn: int
    pick: int
    x: seq[fpoint]
    y: seq[float]
    h: seq[float]
    misclassified: seq[int]
    w: array[0..2, float]
    pa: fpoint
    pb: fpoint
    fx: float
    slope: float
    intercept: float
  const
    runs = 1000
  randomize()
  for k in 1 .. runs:
    # create a cutting line for the "true" f(x) from 2 arbitrary points
    pa = (random(2.0)-1.0, random(2.0)-1.0)
    pb = (random(2.0)-1.0, random(2.0)-1.0)
    slope = (pb[1] - pa[1]) / (pb[0] - pa[0])
    intercept = pb[1] - pb[0] * slope
    # echo k, " ", slope, " ", intercept
    # create the simulated dataset
    newSeq(x, bign)
    newSeq(y, bign)
    # label each sample of x with its "true" y label
    for i in 0 .. bign-1:
      x[i] = (random(2.0)-1.0, random(2.0)-1.0)
      fx = slope * x[i][0] + intercept
      y[i] = if x[i][1] >= fx: 1 else: -1
      # echo x[i], " ", fx, " ", y[i]
    # initial pla hypothesis with weights equal to 0
    w = [0.0, 0.0, 0.0]
    newSeq(h, bign)
    for i in 0 .. bign-1:
      h[i] = w[0]*1.0 + w[1]*x[i][0] + w[2]*x[i][1]
    # perform pla to determine g
    while true:
      misclassified = @[]
      for i in 0 .. bign-1:
        if (y[i] < 0) != (h[i] < 0) or h[i] == 0:
          misclassified.add(i)
      if len(misclassified) == 0:
        # echo "****** converged"
        break
      else:
        cnt += 1
        if len(misclassified) == 1:
          pick = misclassified[0]
        else:
          # random(n) yields a value in 0 .. n-1, so pass the full length
          pick = misclassified[random(len(misclassified))]
        w[0] += y[pick]
        w[1] += y[pick] * x[pick][0]
        w[2] += y[pick] * x[pick][1]
        h[pick] = w[0]*1.0 + w[1]*x[pick][0] + w[2]*x[pick][1]
    # simulate a cross-validation set
    crossn = 10 * bign
    newSeq(x, crossn) # probably faster than setting to empty (@[]) and adding
    newSeq(y, crossn) # one element each time through the loop
    for i in 0 .. crossn-1:
      x[i] = (random(2.0)-1.0, random(2.0)-1.0)
      fx = slope * x[i][0] + intercept
      y[i] = if x[i][1] >= fx: 1 else: -1 # "true" label for each x
    # calculate the hypothesis for the cross-validation set
    newSeq(h, crossn)
    for i in 0 .. crossn-1:
      h[i] = w[0]*1.0 + w[1]*x[i][0] + w[2]*x[i][1]
    for i in 0 .. crossn-1:
      if (y[i] < 0) != (h[i] < 0) or h[i] == 0:
        disagree += 1
  # calculate summary stats for the entire run
  var avg_iters = toFloat(cnt) / toFloat(runs)
  var avg_disagree_pct = toFloat(disagree) / (toFloat(runs) * toFloat(crossn))
  echo "{:.3f} {:.3f}".fmt(avg_iters, avg_disagree_pct)
  return (avg_iters, avg_disagree_pct)

# run it and ignore the return value
var xt: int
if paramCount() < 1:
  discard simper(10)
else:
  try:
    xt = parseInt(paramStr(1))
  except:
    # let
    #   e = getCurrentException()
    #   msg = getCurrentExceptionMsg()
    # echo "Got exception ", repr(e), " with message ", msg
    quit("Input argument must be an integer. Exiting.")
  discard simper(xt)
Yes.
Instead of multidimensional arrays I used seq's of tuples. I didn't need slicing, and I doubt it would work given the limitations of Nim slicing; access in loops is fine, though. Slicing would obviously be necessary for real work.
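As a minimal sketch of that representation (point values invented for illustration), a seq of named tuples also reads a bit more clearly than index pairs:

```nim
# Hypothetical sketch of the seq-of-tuples representation described
# above; the point values are made up for illustration.
type FPoint = tuple[x1, x2: float]

var pts: seq[FPoint] = @[(0.5, -0.25), (-1.0, 0.75)]
for p in pts:
  # field access by name instead of p[0]/p[1]
  echo p.x1 + p.x2
```

Field access by name is a small readability win over a 2-D array, at the cost of not being able to slice along the second dimension.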
Are you doing a wrapper for Eigen? That seems like a good way to use a deeply tested library with vectorized calculations. You would just need to invent a nice type for multidimensional arrays and overload operators to keep matrix algebra syntactically clean. Syntactically, Octave/Matlab looks good, and its array literals make it easy to create new matrices. Numpy is more awkward, but once you get used to it, it does what you need pretty quickly.
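For the surface syntax, Nim's operator overloading is already enough; here is a hedged sketch with an invented seq-based Matrix type and an overloaded `*` (a real Eigen wrapper would store an Eigen object and forward the operators to it instead of looping):

```nim
# Invented Matrix type for illustration only; not a library API.
type Matrix = object
  rows, cols: int
  data: seq[float]   # row-major storage

proc `[]`(m: Matrix; i, j: int): float = m.data[i * m.cols + j]
proc `[]=`(m: var Matrix; i, j: int; v: float) = m.data[i * m.cols + j] = v

proc newMatrix(rows, cols: int): Matrix =
  Matrix(rows: rows, cols: cols, data: newSeq[float](rows * cols))

proc `*`(a, b: Matrix): Matrix =
  # naive triple loop; the overload is what keeps call sites clean
  assert a.cols == b.rows
  result = newMatrix(a.rows, b.cols)
  for i in 0 ..< a.rows:
    for j in 0 ..< b.cols:
      for k in 0 ..< a.cols:
        result[i, j] = result[i, j] + a[i, k] * b[k, j]

var a = newMatrix(2, 2)
a[0, 0] = 1.0; a[1, 1] = 1.0   # identity matrix
var b = newMatrix(2, 2)
b[0, 1] = 3.0
echo (a * b)[0, 1]             # multiplying by identity preserves b
```

With `[]`, `[]=`, and the arithmetic operators overloaded, call sites look as clean as Octave even though the storage and kernels live elsewhere.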
I'd be interested in helping out though my coding skills may not be up to what you are trying to do.
I probably won't have a first pass out for at least a few months (there are some compiler bugs blocking my design), but I'll post an announcement when an initial version is ready, and I'd welcome your feedback/bug reports.
it's just missing multidimensional arrays
Maybe you meant something different, but this works:
var a: array[0..2, array[0..1, int]]
a = [[11,12],[21,22],[31,32]]
var b = [[1,2,3],[11,22,33]]
# outputting some values for a test
echo a[0][1], " ", a[1][1], " ", a[2][0]
echo b[0][1], " ", b[0][2], " ", b[1][1]
lewisl: It's just a funny keyword that doesn't fit the rest of Nim. Not sure there is anything else to do; not suggesting anything.
For what it's worth, newSeq isn't actually a keyword; it's a procedure.
Demos: Unfortunately, implementing really good multidimensional arrays in a style like Eigen is not going to happen until Nimrod's static[T] feature works.
You don't need static[T] for this particular use case, as ranges are types also. You can even use macros to get a nicer syntax, e.g.:
import macros
template makeMatrix(r1, r2: expr, el: typedesc): stmt =
  type temp = array[range[r1], array[range[r2], el]]

macro matrix(r1, r2: expr, el: typedesc): auto =
  var t = getAst(makeMatrix(r1, r2, el))
  result = t.last.last.last

type M = matrix(1..10, 1..10, int)
Templates would work, too, except that templates can't actually return types, hence the workaround with getAst and stripping the extraneous stuff.
What Nim doesn't have, and what you do need, e.g., for matrix multiplication of non-square matrices, is higher-kinded types.
Jehan What Nim doesn't have, and what you do need, e.g., for matrix multiplication of non-square matrices, is higher-kinded types.
Looking through the documentation for the Eigen C++ library, I see a lot of template constant parameters but no template template parameters. For the use case of bounds checking matrix multiplication, I don't see where higher-kindedness comes in; you're just checking that the #columns of the left matrix equals the #rows of the right one. What did you have in mind?
That said, I'm curious about how one would translate idiomatic template-heavy C++11 or D2 libraries to Nim. There are not only template constant params, but also template template parameters and variadic template argument lists. Nim macros?
brianrogoff: For the use case of bounds checking matrix multiplication, I don't see where higher-kindedness comes in; you're just checking that the #columns of the left matrix equals the #rows of the right one. What did you have in mind?
Let M[a,b] be an a-by-b matrix. Then we need a signature of the form:
proc `*`(m1: M[a, b], m2: M[b, c]): M[a, c]
To derive the return type generically, you need some form of type constructor as a parameter, i.e. higher-kinded types. (And yes, there are various ways to hack your way around it, none of them pretty.)
Varriount: Will do!
brianrogoff: For my particular case, I haven't had to use template template parameters, but I was planning to use tuple template parameters in place of variadic templates (which I think is slightly cleaner, anyway).
Jehan Let M[a,b] be an a-by-b matrix. Then we need a signature of the form:
proc `*`(m1: M[a, b], m2: M[b, c]): M[a, c]
I didn't spend a lot of time reading the Eigen code, but it looks like matrices with known compile-time sizes simply use template constant parameters to do the size checking. If a, b, and c in your pseudo-Nim (:-)) above are integers or ranges with values known at compile time, I would expect that Nim could do the same, with no constructor parameters.
To derive the return type generically, you need some form of type constructor as a parameter, i.e. higher-kinded types. (And yes, there are various ways to hack your way around it, none of them pretty.)
I must be dense; I still don't see why HKTs are so much better for this specific example than template constant params.
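For the compile-time-size case, the signature can indeed be written with constant parameters alone once static generics work. A hedged sketch, assuming working static int parameters (which current Nim does support); Matrix here is an invented alias, not a standard library type:

```nim
# Sketch only: Matrix is an invented fixed-size alias built from
# static int parameters; the size mismatch check falls out of the
# types, with no type-constructor parameter needed.
type Matrix[R, C: static int] = array[R, array[C, float]]

proc `*`[A, B, C: static int](m1: Matrix[A, B], m2: Matrix[B, C]): Matrix[A, C] =
  for i in 0 ..< A:
    for j in 0 ..< C:
      for k in 0 ..< B:
        result[i][j] += m1[i][k] * m2[k][j]

let
  a: Matrix[1, 2] = [[1.0, 2.0]]
  b: Matrix[2, 1] = [[3.0], [4.0]]
echo (a * b)[0][0]   # 1*3 + 2*4 = 11.0
# a * a would be rejected at compile time, since Matrix[1, 2]
# doesn't match the Matrix[2, C] expected for the second operand
```

The limitation Jehan points at is different: this works for one concrete representation, but you cannot abstract over the Matrix constructor itself.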
IMO, the case for HKTs (template template params in C++ parlance) is things like
template<class T> class Seq {
private:
    std::shared_ptr<std::vector<T>> data;
};

template<class K, class V, template<typename> class C = Seq>
class Map {
    C<K> key;
    C<V> value;
};
where they truly are parameterizing over the constructor.
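For comparison, one of the unpretty Nim workarounds is to pre-apply the constructor and pass the resulting concrete types. A hedged sketch with invented names (Sequ and Map are made up for illustration):

```nim
# Invented types mimicking the C++ Map example above without
# template-template parameters: the caller passes already-applied
# container types rather than the container constructor itself.
type
  Sequ[T] = object
    data: seq[T]
  Map[KC, VC] = object   # KC/VC are pre-applied, e.g. Sequ[int]
    key: KC
    value: VC

var m: Map[Sequ[int], Sequ[string]]
m.key.data.add 42
m.value.data.add "hello"
echo m.key.data[0], " ", m.value.data[0]
# Nothing stops Map[Sequ[int], SomeOtherContainer[string]] -- the
# "both fields use the same constructor" constraint is exactly
# what HKTs would let you express.
```

It compiles and works, but the relationship between the two container fields is no longer enforced by the type system, which is the point of the HKT version.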
Are there any plans to add these to Nimrod? How about variadic generics?
brianrogoff: I must be dense, I still don't see why HKTs are so much better for this specific example than template constant params.
With constant parameters you can't parameterize over M above. Whether it's "better" depends on your needs, of course.