Hello everyone,
It's been a while since I worked on Arraymancer. First of all, thanks to everyone who has contributed to and maintained the library, for example making it Nim v2 compatible.
I've collated the big items I think are critical for the library in this issue: https://github.com/mratsim/Arraymancer/issues/616
Arraymancer has become a key piece of the Nim ecosystem. Unfortunately, I do not have the time to develop it further, for several reasons:
- family: the birth of a family member, and the death of hobby time.
- competing hobby: I've been focusing on cryptography for the last couple of years. I feel Nim also has a unique niche there, and I'm even accelerating Rust libraries with a Nim backend.
- pace of development: the deep learning community was moving rapidly in 2012–2018; today it moves very fast and is hard to compete with. Not to say it's impossible, but you need better infrastructure to catch up.
Furthermore, Nim v2 has since introduced interesting new features, like builtin memory management that works with multithreading, and views, which are quite relevant to Arraymancer.
Let's go over the longstanding missing features that would improve Arraymancer: first the tensor library, then the neural network library.
Tensor backend (~NumPy, ~SciPy)
- Mutable operations on slices (see the sketch after this list)
- Nested parallelism
- Doc generation
- Versioning / releases
- Slow transcendental functions (exp, log, sin, cos, tan, tanh, ...)
- Windows: BLAS and Lapack deployment woes
- MacOS: OpenMP woes
- MacOS: Tensor cores usage
Also: the need for untyped Tensors.
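To make the slice item concrete, here is a short sketch of the pain point. The calls below use Arraymancer's slicing syntax; the exact behavior of in-place updates on slices is the incomplete part, so treat this as illustrative rather than guaranteed to work as-is:

```nim
import arraymancer

var t = zeros[float32](4, 4)

# Assigning a tensor into a slice works:
t[0..1, 0..1] = ones[float32](2, 2)

# The pain point: in-place updates through a slice currently tend to
# go through a full temporary on the right-hand side:
t[_, 2] = 2'f32 * t[_, 2]

# What this roadmap item asks for is direct, copy-free mutation,
# e.g. a hypothetical `t[_, 2] *= 2'f32`.
```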
Neural network backend (~PyTorch)
- Nvidia CUDA
- Implementation woes: every layer needs CPU forward, CPU backward, GPU forward, and GPU backward implementations, all optimized
- Ergonomic serialization and deserialization of models
- Slowness of reduction operations like sigmoid or softmax as the core count increases (see the sketch below)
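To illustrate the softmax point, here is a minimal single-threaded sketch in plain Nim (not Arraymancer's actual implementation): softmax needs two full reduction passes (max and sum) on top of the elementwise work, and each reduction is a synchronization point once parallelized, so extra cores mostly add coordination overhead on a memory-bound operation:

```nim
import std/[math, sequtils]

proc softmax(x: seq[float32]): seq[float32] =
  # Reduction pass 1: global max, for numerical stability.
  var m = x[0]
  for v in x: m = max(m, v)
  # Elementwise pass: exponentiate with the max subtracted.
  let exps = x.mapIt(exp(it - m))
  # Reduction pass 2: global sum.
  var s = 0'f32
  for v in exps: s += v
  # Elementwise pass: normalize.
  exps.mapIt(it / s)

echo softmax(@[1'f32, 2, 3])
```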
I've detailed the problems and solutions in the issue, and here is the summary of actions:
Summary
In summary, here is how I see making further progress on Arraymancer, tensor libraries, and deep learning in Nim.
- Create a compiler for tensor arithmetic that supports CPU, OpenCL, and Nvidia backends.
- Optimization at the compiler level is likely out of reach and would require a polyhedral optimizer, which is why Lux (like Halide) exposes low-level optimization details (tiling, parallelism, intermediate storage).
- https://www.youtube.com/watch?v=UeyWo42_PS8
- https://blog.minhazav.dev/write-fast-and-maintainable-code-with-halide/
- https://cacm.acm.org/magazines/2018/1/223877-halide/fulltext
- https://dl.acm.org/doi/10.1145/3150211
- Note: for the autogenerated backpropagation, we will need an optimizer.
- Use Weave or Constantine's threadpool instead of OpenMP for the CPU parallel runtime.
- Implement a pure Nim BLAS & Lapack replacement. This may or may not use the compiler.
- Port Arraymancer primitives to that compiler, threadpool, and BLAS/Lapack replacement.
- Create a new library with type-erased tensors and (de)serializable models, focused on scientific computing interop (.csv, .onnx, .tfrecords) and deep learning (a sketch of type-erased tensors follows this list).
- Deprecate the deep learning part of Arraymancer, keeping only the NumPy/SciPy-like tensor functionality, and redirect deep learning needs to that new library.
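On the type-erased tensor point: Arraymancer's Tensor[T] bakes the element type into the static type, which is awkward when loading models or files whose dtype is only known at runtime. Here is a minimal sketch of one possible design, a variant object with a runtime dtype tag (hypothetical, not an existing API):

```nim
type
  DType = enum dtFloat32, dtFloat64, dtInt32
  AnyTensor = object
    shape: seq[int]
    # Dispatch on a runtime tag instead of a compile-time generic.
    case dtype: DType
    of dtFloat32: f32: seq[float32]
    of dtFloat64: f64: seq[float64]
    of dtInt32:   i32: seq[int32]

proc numel(t: AnyTensor): int =
  # Shape-only operations need no dtype dispatch at all.
  result = 1
  for d in t.shape: result *= d

let t = AnyTensor(dtype: dtFloat32, shape: @[2, 2],
                  f32: @[1'f32, 2, 3, 4])
echo t.numel  # 4
```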
Feedback welcome.
Note that I do not have the time to do this though I can certainly mentor.
This is a really ambitious plan. It can probably only be accomplished by several people working together.
Personally I think this is an area where Nim could excel given sufficient community investment. While it's true that the deep learning community is moving very fast, the arrival of C-based libraries like ggml proves that there is still a place for low-level, really fast libraries, and that even a single, dedicated, brilliant developer can still make a difference. I wonder what would have happened if ggml and llama.cpp had been written in Nim…
Anyway, I have a couple of more specific comments:
Regarding running BLAS and LAPACK on Windows, it seems to me that the situation is a little better now: download the right DLL, place it in the path, and it works. Would there be other benefits to reimplementing BLAS and LAPACK in Nim? Do you have an estimate of how much effort that would require? Do those libraries change a lot over time?
The hardest BLAS function, GEMM (matrix multiplication), is already implemented and competitive.
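For scale, here is what a naive pure-Nim GEMM looks like; the hard part of a BLAS replacement is everything this sketch omits: packing, cache and register tiling, per-architecture SIMD microkernels, and threading:

```nim
# Naive reference GEMM: C = A * B for row-major n×n matrices.
# Illustrative only; competitive kernels add tiling, packing,
# SIMD microkernels, and parallelism on top of this.
proc gemmNaive(a, b: seq[float32], n: int): seq[float32] =
  result = newSeq[float32](n * n)
  for i in 0 ..< n:
    for k in 0 ..< n:
      let aik = a[i*n + k]   # hoist A[i,k]; the k-in-middle loop order
      for j in 0 ..< n:      # keeps B and C accesses contiguous
        result[i*n + j] += aik * b[k*n + j]
```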
Regarding the switch from OpenMP to one of the native Nim threadpool libraries: how hard do you think that would be?
It's mostly changing `for i in 0||len-1` to `parallelFor i in 0 ..< len` and doing explicit captures; see the sketch below.
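A minimal before/after sketch using Weave (syntax as in Weave's README; Constantine's threadpool is similar). Note the explicit captures block and the raw pointer, since Weave tasks don't implicitly capture surrounding variables:

```nim
# Compile with --threads:on
import weave

proc doubleAll(buf: var seq[float32]) =
  let p = cast[ptr UncheckedArray[float32]](buf[0].addr)
  let n = buf.len
  # Before (OpenMP):
  #   for i in 0 || (n - 1):
  #     p[i] *= 2'f32
  # After (Weave): same loop shape, plus explicit captures.
  parallelFor i in 0 ..< n:
    captures: {p}
    p[i] *= 2'f32

init(Weave)
var data = newSeq[float32](1000)
for i in 0 ..< data.len: data[i] = float32(i)
doubleAll(data)
syncRoot(Weave)  # wait for outstanding tasks before reading results
exit(Weave)
```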
Some of the items you mention in the GitHub issue seem pretty hard to do…
Which ones?