Hi, version 0.2.1 of my linear algebra library is published!
I have added support for matrices and vectors whose size is not tracked at compile time, and they integrate nicely with those of static size. I also found some ways to better leverage Nim's type inference in order to improve the public API. Finally, I added universal functions, that is, functions that are applied element-wise to vectors and matrices.
When I am back at work - where I have my Nvidia card - I plan to bring support for GPU types on par with those on the CPU.
;-)
Not yet. If you look here there are a LOT of things that need to be done.
One of those is to add a fallback Nim implementation that is valid over other rings, since BLAS only works for floating-point computations.
If you want to help, though, I can give you some directions ;-)
I would love to, but not immediately. It looks like there is already a zip library, so it won't be too bad.
Here's the reference implementation: https://github.com/numpy/numpy/blob/v1.10.1/numpy/lib/npyio.py#L257
Value types have a real advantage, but the problem is that when you make them too big, they can overflow the stack (IIRC 2 MB on Windows). Still, you should use value-type arguments, because a ref type can always be converted to a value type with the [] operator:
proc foo1(arg: Matrix): Matrix
var matref1, matref2: ref Matrix
matref2[] = foo1(matref1[])
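To make the idea above concrete, here is a minimal, self-contained sketch. The `Matrix` type and the body of `foo1` are stand-ins invented for illustration, not the library's actual definitions; the point is only the `[]` dereference at the call site.

```nim
# Hypothetical toy Matrix type; the real library's type differs.
type Matrix = object
  rows, cols: int
  data: seq[float]

proc foo1(arg: Matrix): Matrix =
  # toy body: just return a copy of the argument
  result = arg

var matref1, matref2: ref Matrix
new matref1
matref1[] = Matrix(rows: 1, cols: 2, data: @[1.0, 2.0])
new matref2
matref2[] = foo1(matref1[])   # deref with [], call the value proc, store back
doAssert matref2.data == @[1.0, 2.0]
```

So a value-based API loses nothing for callers that keep their matrices behind refs.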
The other direction is not as easy: the result will be a new allocation, no matter whether you use ref types or not. So I suggest also providing an API that takes a result argument:
proc foo1(dst: var Matrix, arg: Matrix): void
proc foo1(arg: Matrix): Matrix =
  result.foo1(arg)
This allows to reuse temporary matrices more efficiently.
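A runnable sketch of this destination-passing pattern, using a toy `Matrix` type and a made-up `scaled` operation (both hypothetical, chosen only to show the two-overload shape):

```nim
type Matrix = object
  data: seq[float]

# the var-result version does the actual work
proc scaled(dst: var Matrix, arg: Matrix, k: float) =
  dst.data.setLen(arg.data.len)
  for i in 0 ..< arg.data.len:
    dst.data[i] = arg.data[i] * k

# the value-returning overload just forwards to it
proc scaled(arg: Matrix, k: float): Matrix =
  result.scaled(arg, k)

# reuse one temporary across iterations instead of allocating each time
var tmp: Matrix
for step in 1 .. 3:
  tmp.scaled(Matrix(data: @[1.0, 2.0]), float(step))
doAssert tmp.data == @[3.0, 6.0]
```

The value-returning overload costs one extra line per operation, so exposing both is cheap for the library author.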
The fact is, matrices and vectors already contain a seq, which is allocated on the heap (at least in the common case), so there is not much more cost in adding an additional ref.
The only case where avoiding a ref would actually save time is when many matrices or vectors refer to the same underlying seq, but point at different places. A common case is iterating over the rows or columns of a matrix.
I have decided to optimize just this case for now, by reusing the same ref and therefore avoiding allocations (well, there is just a single allocation for the whole iteration).
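The single-allocation iteration could look roughly like the following sketch. The `Matrix` and `RowView` types here are invented stand-ins, not the library's actual ones; the point is that the iterator allocates one ref up front and only repoints it on each step.

```nim
type
  Matrix = object
    rows, cols: int
    data: seq[float]
  RowView = ref object       # hypothetical view type
    m: ptr Matrix
    row: int

iterator rowViews(m: var Matrix): RowView =
  var view = RowView(m: addr m)   # single allocation for the whole loop
  for r in 0 ..< m.rows:
    view.row = r                  # reuse the same ref, just move it
    yield view

var m = Matrix(rows: 2, cols: 2, data: @[1.0, 2.0, 3.0, 4.0])
var sums: seq[float]
for v in m.rowViews:
  var s = 0.0
  for c in 0 ..< v.m.cols:
    s += v.m.data[v.row * v.m.cols + c]
  sums.add s
doAssert sums == @[3.0, 7.0]
```

The trade-off is that the yielded view is only valid inside the loop body, since it is recycled on the next iteration.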
I've been scratching my head a lot for my tensor library between using the default seq value semantics or ref/shallowCopy.
For now I'm using the default seq value semantics at least until benchmarks or usage shows that it poses a memory/performance issue.
Here are the relevant Nim quotes:
From Araq: "Parameter passing doesn't copy; var x = foo() doesn't copy but moves; let x = y doesn't copy but moves; var x = y does copy, but I can use shallowCopy instead of = for that."
From Jehan: "First, it's important to understand that most of the time, you won't need shallowCopy at all. Copying is shallow by default if (1) the left-hand side of an assignment is a let variable or (2) the right-hand side is a function call."
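The default semantics the two quotes describe can be checked with a tiny example (no library types needed, just a plain seq):

```nim
proc makeSeq(): seq[int] = @[1, 2, 3]

var a = @[1, 2, 3]
var b = a            # var x = y: a deep copy, b is independent of a
b[0] = 99
doAssert a[0] == 1   # a is unchanged

var c = makeSeq()    # RHS is a function call: the result is moved, not copied
doAssert c == @[1, 2, 3]
```

So plain value semantics only pay for a copy on `var x = y`-style assignments, which is why they are a reasonable default until profiling says otherwise.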
One thing that I would argue for with any new linear algebra library is some sort of 'out' parameter, like numpy has for its operations that produce a vector. In an inner loop of a project using linalg, I had an operation like:
# simplified
const word2vec_size = 300
var oldVector: Vector64[word2vec_size]
for i in 0 ..< NUM_STEPS:
  oldVector += someOtherVector
when I changed this to:
const word2vec_size = 300
var oldVector: Vector64[word2vec_size]
for i in 0 ..< NUM_STEPS:
  oldVector += learning_rate * someOtherVector
I had an immediate slowdown, which (duh, my fault) was because, in order to scale someOtherVector and add it to the original, a temporary vector had to be created to hold the result of scaling it by the learning rate. The easiest workaround was to make some calls into nimblas that gave a pre-allocated vector to store the result of the vector-scalar multiplication. Having two multiply procs (with/without 'out'), or one with an optional 'out' parameter like:
import nimblas

proc multiply*(scale_val: float, input: VectorType,
               output: VectorType = nil): VectorType =
  # VectorType has to be a reference type so that this doesn't result in a copy
  result = if output.isNil: newVectorTypeLike(input) else: output
  copy(len(input), input.fp, 1, result.fp, 1)
  scal(len(input), scale_val, result.fp, 1)
would really help users avoid allocations.
@perturbation2
This is actually something that would help a lot in some places. Users should not need to call BLAS manually, although that is always an option. It helps that in Neo I make all fields public - this is not very safe, but at least if a user needs to access a raw pointer in situations such as the one above, it is there.
That said, I could not find the time to expose two versions of each operation yet. I may eventually get to it, but if you want to contribute, a PR is always welcome!
Even better, there should be rewrite rules that match your slow call and transform it into the BLAS version without intermediate results. I will add some feature that lets me test whether rewrite rules are being applied, and then start adding some of the most common ones.
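As a sketch of what such a rewrite rule could look like, here is a term-rewriting template over plain seqs. Everything here is a toy invented for illustration: the `axpy` proc stands in for a real BLAS axpy call, and the `+`/`*` procs stand in for the library's operators. The assertion holds whether or not the compiler applies the pattern, since the rewrite only fuses the computation.

```nim
# toy "fused" operation standing in for a BLAS axpy call
proc axpy(a: seq[float], k: float, b: seq[float]): seq[float] =
  result = newSeq[float](a.len)
  for i in 0 ..< a.len:
    result[i] = a[i] + k * b[i]

# toy elementwise operators standing in for the library's
proc `*`(k: float, b: seq[float]): seq[float] =
  result = newSeq[float](b.len)
  for i in 0 ..< b.len: result[i] = k * b[i]

proc `+`(a, b: seq[float]): seq[float] =
  result = newSeq[float](a.len)
  for i in 0 ..< a.len: result[i] = a[i] + b[i]

# term-rewriting template: a + k * b is fused into one pass, no temporary
template fuse{a + k * b}(a, b: seq[float], k: float): seq[float] =
  axpy(a, k, b)

let x = @[1.0, 2.0]
let y = @[3.0, 4.0]
doAssert x + 2.0 * y == @[7.0, 10.0]
```

Because rewrite rules are semantically invisible, a dedicated test feature (e.g. counting how often a pattern fires) is indeed needed to guard against regressions.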
Well, rewrite rules need to be able to target these var-result operations (as I call them), so they need to exist anyway.
This is, btw, the reason result is not a keyword: it allows easy refactorings from
proc foo(): T =
  result = T()
to
proc foo(result: var T) =
  result = T()
;-)
In fact, this problem is so severe (every proc returning a string is affected) that I thought about more language support for it:
Every routine foo(result: var T; args) can be used as an expression of type T via the transformation into (var res = default(T); foo(res, args); res).
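The proposed transformation, spelled out by hand for a concrete proc (the names `greet`/`greetExpr` are made up for the example; today the expression form has to be written manually, e.g. as a template):

```nim
# var-result routine: writes its output into the first parameter
proc greet(result: var string, name: string) =
  result = "hello, " & name

# the transformation from the quote, done by hand:
# (var res = default(T); foo(res, args); res)
template greetExpr(name: string): string =
  var res = default(string)
  greet(res, name)
  res

doAssert greetExpr("world") == "hello, world"
```

With language support, the template boilerplate would disappear and `greet` itself could be used directly in expression position.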
Sure, rewrite rules have to be written, and this is an ongoing work. I just added a way to test that they are in fact applied, since otherwise it is easy to get regressions.
Your suggestion for out parameters is neat, although I would not be able to use it directly, since many operations in these libraries are operators, and that form would not work for the versions with more parameters.
Operators would work like anything else, but it might be ugly:
proc `+`(result: var T; x, y: T) = ...
`+`(res, a, b) # optimized call; store into 'res'
echo x + b # uses temporary when used as expression
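The operator form above can be made concrete with a toy vector type (the `Vec` type is a stand-in invented for the example): the three-argument var-result `+` is the optimized entry point, and the usual two-argument `+` builds on it.

```nim
type Vec = object
  data: seq[float]

# optimized form: writes the sum into res, no allocation beyond setLen
proc `+`(res: var Vec, x, y: Vec) =
  res.data.setLen(x.data.len)
  for i in 0 ..< x.data.len:
    res.data[i] = x.data[i] + y.data[i]

# expression form: allocates a temporary, forwards to the var-result version
proc `+`(x, y: Vec): Vec =
  `+`(result, x, y)

var res: Vec
let a = Vec(data: @[1.0, 2.0])
let b = Vec(data: @[3.0, 4.0])
`+`(res, a, b)                       # optimized call; stores into res
doAssert res.data == @[4.0, 6.0]
doAssert (a + b).data == @[4.0, 6.0] # expression form, uses a temporary
```

Overload resolution disambiguates the two by arity, so the ugliness is confined to the explicit `` `+`(res, a, b) `` call sites where the allocation actually matters.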