Hello everyone,
It's been a while since I worked on Arraymancer. First of all, thanks to everyone who has contributed to and maintained the library, for example making it Nim v2 compatible.
I've collated the big items I think are critical for the library in this issue: https://github.com/mratsim/Arraymancer/issues/616
Arraymancer has become a key piece of the Nim ecosystem. Unfortunately, I do not have the time to develop it further, for several reasons:
- family: the birth of a family member, and the death of hobby time.
- competing hobby: I've been focusing on cryptography for the last couple of years. I feel Nim also has a unique niche there, and I'm even accelerating Rust libraries with a Nim backend.
- pace of development: the deep learning community was moving rapidly in 2012–2018; today it moves very fast and is hard to compete with. Not to say it's impossible, but you need better infrastructure to catch up.
Furthermore, Nim v2 has since introduced interesting new features, like builtin memory management that works with multithreading, and views, which are quite relevant to Arraymancer.
Let's go over the longstanding missing features that would improve Arraymancer: first the tensor library, then the neural network library.
Tensor backend (~NumPy, ~SciPy)
- Mutable operations on slices (see the sketch after this list)
- Nested parallelism
- Doc generation
- Versioning / releases
- Slow transcendental functions (exp, log, sin, cos, tan, tanh, ...)
- Windows: BLAS and Lapack deployment woes
- MacOS: OpenMP woes
- MacOS: Tensor cores usage
Also: the need for untyped Tensors.
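To make the slice item concrete, here is a short sketch of the pain point. The calls below use Arraymancer's slicing syntax; the exact behavior of in-place updates on slices is the incomplete part, so treat this as illustrative rather than guaranteed to work as-is:

```nim
import arraymancer

var t = zeros[float32](4, 4)

# Assigning a tensor into a slice works:
t[0..1, 0..1] = ones[float32](2, 2)

# The pain point: in-place updates through a slice currently tend to
# go through a full temporary on the right-hand side:
t[_, 2] = 2'f32 * t[_, 2]

# What this roadmap item asks for is direct, copy-free mutation,
# e.g. a hypothetical `t[_, 2] *= 2'f32`.
```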
Neural network backend (~PyTorch)
- Nvidia CUDA
- Implementation woes: every layer needs CPU forward, CPU backward, GPU forward, and GPU backward implementations, all optimized
- Ergonomic serialization and deserialization of models
- Slowness of reduction operations like sigmoid or softmax as the core count increases (see the sketch below)
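To illustrate the softmax point, here is a minimal single-threaded sketch in plain Nim (not Arraymancer's actual implementation): softmax needs two full reduction passes (max and sum) on top of the elementwise work, and each reduction is a synchronization point once parallelized, so extra cores mostly add coordination overhead on a memory-bound operation:

```nim
import std/[math, sequtils]

proc softmax(x: seq[float32]): seq[float32] =
  # Reduction pass 1: global max, for numerical stability.
  var m = x[0]
  for v in x: m = max(m, v)
  # Elementwise pass: exponentiate with the max subtracted.
  let exps = x.mapIt(exp(it - m))
  # Reduction pass 2: global sum.
  var s = 0'f32
  for v in exps: s += v
  # Elementwise pass: normalize.
  exps.mapIt(it / s)

echo softmax(@[1'f32, 2, 3])
```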
I've detailed the problems and solutions in the issue, and here is the summary of actions:
Summary
In summary, here is how I see making further progress on Arraymancer, tensor libraries, and deep learning in Nim.
- Create a compiler for tensor arithmetic that supports CPU, OpenCL, and Nvidia backends.
- Optimization at the compiler level is likely out of reach and would require a polyhedral optimizer, which is why Lux (like Halide) exposes low-level optimization details (tiling, parallelism, intermediate storage).
- https://www.youtube.com/watch?v=UeyWo42_PS8
- https://blog.minhazav.dev/write-fast-and-maintainable-code-with-halide/
- https://cacm.acm.org/magazines/2018/1/223877-halide/fulltext
- https://dl.acm.org/doi/10.1145/3150211
- Note: for the autogenerated backpropagation, we will need an optimizer.
- Use Weave or Constantine's threadpool instead of OpenMP for the CPU parallel runtime.
- Implement a pure Nim BLAS & Lapack replacement. This may or may not use the compiler.
- Port Arraymancer primitives to that compiler, threadpool, and BLAS/Lapack replacement.
- Create a new library with type-erased tensors and (de)serializable models, focused on scientific computing interop (.csv, .onnx, .tfrecords) and deep learning (a sketch of type-erased tensors follows this list).
- Deprecate the deep learning part of Arraymancer, keeping only the NumPy/SciPy-like tensor functionality, and redirect deep learning needs to that new library.
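On the type-erased tensor point: Arraymancer's Tensor[T] bakes the element type into the static type, which is awkward when loading models or files whose dtype is only known at runtime. Here is a minimal sketch of one possible design, a variant object with a runtime dtype tag (hypothetical, not an existing API):

```nim
type
  DType = enum dtFloat32, dtFloat64, dtInt32
  AnyTensor = object
    shape: seq[int]
    # Dispatch on a runtime tag instead of a compile-time generic.
    case dtype: DType
    of dtFloat32: f32: seq[float32]
    of dtFloat64: f64: seq[float64]
    of dtInt32:   i32: seq[int32]

proc numel(t: AnyTensor): int =
  # Shape-only operations need no dtype dispatch at all.
  result = 1
  for d in t.shape: result *= d

let t = AnyTensor(dtype: dtFloat32, shape: @[2, 2],
                  f32: @[1'f32, 2, 3, 4])
echo t.numel  # 4
```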
Feedback welcome.
Note that I do not have the time to do this though I can certainly mentor.
This is a really ambitious plan. It can probably only be accomplished by several people working together.
Personally I think this is an area where Nim could excel given sufficient community investment. While it's true that the deep learning community is moving very fast, the arrival of C-based libraries like ggml proves that there is still a place for low-level, really fast libraries, and that even a single, dedicated, brilliant developer can still make a difference. I wonder what would have happened if ggml and llama.cpp had been written in Nim…
Anyway, I have a couple of more specific comments:
Regarding running BLAS and LAPACK on Windows, it seems to me that the situation is a little better now: download the right DLL, place it in the path, and it works. Would there be other benefits to reimplementing BLAS and LAPACK in Nim? Do you have an estimate of how much effort that would require? Do those libraries change a lot over time?
The hardest BLAS function, GEMM (matrix multiplication), is already implemented and competitive.
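For scale, here is what a naive pure-Nim GEMM looks like; the hard part of a BLAS replacement is everything this sketch omits: packing, cache and register tiling, per-architecture SIMD microkernels, and threading:

```nim
# Naive reference GEMM: C = A * B for row-major n×n matrices.
# Illustrative only; competitive kernels add tiling, packing,
# SIMD microkernels, and parallelism on top of this.
proc gemmNaive(a, b: seq[float32], n: int): seq[float32] =
  result = newSeq[float32](n * n)
  for i in 0 ..< n:
    for k in 0 ..< n:
      let aik = a[i*n + k]   # hoist A[i,k]; the k-in-middle loop order
      for j in 0 ..< n:      # keeps B and C accesses contiguous
        result[i*n + j] += aik * b[k*n + j]
```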
Regarding the switch from OpenMP to one of the native Nim threadpool libraries: how hard do you think that would be?
It's mostly changing `for i in 0||len-1` to `parallelFor i in 0 ..< len` and doing explicit captures; see the sketch below.
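A minimal before/after sketch using Weave (syntax as in Weave's README; Constantine's threadpool is similar). Note the explicit captures block and the raw pointer, since Weave tasks don't implicitly capture surrounding variables:

```nim
# Compile with --threads:on
import weave

proc doubleAll(buf: var seq[float32]) =
  let p = cast[ptr UncheckedArray[float32]](buf[0].addr)
  let n = buf.len
  # Before (OpenMP):
  #   for i in 0 || (n - 1):
  #     p[i] *= 2'f32
  # After (Weave): same loop shape, plus explicit captures.
  parallelFor i in 0 ..< n:
    captures: {p}
    p[i] *= 2'f32

init(Weave)
var data = newSeq[float32](1000)
for i in 0 ..< data.len: data[i] = float32(i)
doubleAll(data)
syncRoot(Weave)  # wait for outstanding tasks before reading results
exit(Weave)
```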
Some of the items you mention in the GitHub issue seem pretty hard to do…
Which ones?