nimforum mirror - Write Nim in Matlab/Julia style using macros while still deploy to Cloud/PC/GPU/embedded?

eyes (orginal) [2020-05-21T18:58:25+02:00] view original

I am a newbie in Nim, impressed by its powerful yet easy-to-understand grammar, high performance, and low resource usage. Compiling to C is also brilliant in order to target numerous platforms with minimal reinventing-the-wheels effort.

I've been a user of Julia since end of 2013. It has laser focus on scientific/numerical computing, with grammar similar to Matlab/Python, but by default you get pretty good performance (after first-pass of JIT), and when needed one can perform deep fine tuning to achieve amazing performance. In addition, parallel computing (both multi-threading and multi-processing) and GPU computing are so easy to begin with. Therefore I enjoy using it a lot.

Julia aims to solve the 2-language problem, unifying model development and production running of the model. I believe it can achieve the goal for most researchers /data scientists / academic users. Those who are school-trained in Matlab and/or Python will find Julia easy to start with, while providing sufficient performance and scalability for production in PC or cloud.

However, Julia is a dynamically typed language, so for engineers who need to bring the model down to embedded hardware (DSP / microcontroller, or even FPGA / ASIC) where power consumption and resource conservation requirements are competitive, I still see Julia limited, maybe fundamentally.

Certainly today one can always buy more powerful things e.g. microcontrollers to run anything - Python can run on microcontroller, so Julia can as well. However, if one goes beyond hobby projects and wants to develop products in either mobile, wearable, IoT, industrial control, telecom, infrastructure ... etc. areas, ease of development and speed to market are nice, but power consumption + size + cost also need to be on par with state of the art or the product will probably lose in competition. So for embedded development the 2 language problem is still there: researchers and engineers like to develop their models/algorithms in Matlab/Python/Julia, but firmware developers need to convert them to C (not even real C++).

Due to Nim's wide range of capabilities, including the latest ARC memory management approach, I am just wondering if it can play a great role there: use its powerful meta programming capability to develop macros so that for those who are experienced in Matlab/Python/Julia, they can just happily code in the familiar "scientific" styles. (Of course some aspects of the dynamical/interpreted programming will not be available, but I see an acceptable REPL is already here, and as an engineer who uses REPL a lot I care a lot more about being able to try something quickly in REPL than Duck typing). Once the R&D part of the code is done, the same modules, after more rigorous test, can be directly used in the end target for production, being it cloud (including web), PC, mobile device, or embedded systems.

Of course I'm not expecting everything to be optimal with exactly the same code - each execution environment has very different architecture and very different requirements on performance/resources, but at least sharing key algo modules is a great starting point, in addition during the development process those who designed the model and those who implement the model in the product are talking in the same language (pun intended :-)). This can make things a lot smoother: the model designer can try small tweaks in algo and see how it works right away, while those who implement the model may send pieces of the implementation back to the designer to validate/debug. And in small teams the model designer and implementer may be the same person, in that case using the same programming language for both can save a lot of wasted brain cycles. :-)

I was also considering Rust for this, but Rust doesn't feel "natural" to scientists and engineers familiar with Matlab/Python, and I doubt it would ever be. Plus I guess Nim with C as its backend can reach more embedded systems quicker, since C is supported by almost everything out there.

But is this within Nim's capabilities? Or will Nim also be fundamentally limited in some way? Again I'm a newbie, so I'd love to learn what others think of such possibility.

Araq (orginal) [2020-05-21T23:15:51+02:00] view original

But is this within Nim's capabilities?

Well yes, I think so. But maybe I'm a little biased. ;-)

eyes (orginal) [2020-05-21T23:41:23+02:00] view original

Biased, yes; but also most qualified to answer this question. :-) Thanks for the confirmation!

Right now the scientific/numerical computing ecosystem of Nim is still tiny compared to Matlab/Python/Julia. However, since there is no fundamental limit to reach from cloud to PC to GPU & embedded devices, adding the friendly grammar (which can be friendlier with the help of macros), high performance, and REPL support, I don't see any reason why Nim wouldn't make a great language for scientific/numerical computing.

Maybe that's not Nim's focus as a system programming language, but I see things connected: scientists and engineers need to do research, but some of them also need to make production ready code, and make web and/or other interface for others to access their work.

I don't have the expertise to create such libraries myself (I'm more like a daily user of Matlab/Python/Julia with some old knowledge of C/C++), but for those who will author Nim scientific libraries, maybe one big consideration should be easy on-boarding for people from other science-friendly languages (Matlab/Python/Julia, maybe also R) without losing the strength of Nim as a system programming language (i.e. can reach down to hardware, so just wrapping a Python or R library wouldn't do).

Just my biased 2-cents. :-)

I'll definitely follow the progress in this area and try to make contribution when I am able to.

mratsim (orginal) [2020-05-22T01:55:21+02:00] view original

You're welcome to join the SciNim chat: https://gitter.im/SciNim/community

Regarding embedded and metaprogramming example, you might be interested in my Synthesis repo. It's a state machine generator implemented as a custom DSL with Nim and Graphviz backend. It's very high performance, you probably can't beat it with pure C (no allocation at all, no indirect dispatch via tables or switch, the generated code is pure goto-based and avoids branch misprediction due to having a single dispatch point that confuses the hardware predictors).

Regarding science, you probably came across Arraymancer and ggplotnim

Be sure to check the Are we scientist yet?<https://github.com/nim-lang/needed-libraries/issues/77>_ thread.

And if you want to see an example of metaprogramming of Nim vs Julia, you can check my submission to the Julia metaprogramming challenge

I.e. in 200 lines of code, you have a multidimensional array/tensor type with supports for any number of dimensions, broadcasting (the julia dot operator) and iterations on a variadic number of tensors.

I've also made the code about 40% faster when iterating on strided tensors resulting from slices in Laser.

Depending on your embedded devices, you might also want to develop an assembler, Nim macros makes it possible to create a DSL to map the instructions, for example for x86:

# Notes:
#   - The imm64 version will generate a proc for uint64 and int64
#     and another one for pointers immediate
#   - The dst64, imm32 version will generate a proc for uint32 and int32
#     and a proc for int literals (known at compile-time)
#     that will call proc(reg, imm32) if the int is small enough.
#     ---> (dst64, imm64) should be defined before (dst64, imm32)

op_generator:
  op MOV: # MOV(dst, src) load/copy src into destination
    ## Copy 64-bit register content to another register
    [dst64, src64]: [rex(w=1), 0x89, modrm(Direct, reg = src64, rm = dst64)]
    ## Copy 32-bit register content to another register
    [dst32, src32]: [          0x89, modrm(Direct, reg = src32, rm = dst32)]
    ## Copy 16-bit register content to another register
    [dst16, src16]: [    0x66, 0x89, modrm(Direct, reg = src16, rm = dst16)]
    
    ## Copy  8-bit register content to another register
    [dst8,  src8]:  [          0x88, modrm(Direct, reg = src8, rm = dst8)]
    
    ## Copy 64-bit immediate value into register
    [dst64, imm64]: [rex(w=1), 0xB8 + dst64] & imm64
    ## Copy 32-bit immediate value into register
    [dst64, imm32]: [          0xB8 + dst64] & imm32
    ## Copy 16-bit immediate value into register
    [dst64, imm16]: [    0x66, 0xB8 + dst64] & imm16
    
    ## Copy 32-bit immediate value into register
    [dst32, imm32]: [          0xB8 + dst32] & imm32
    ## Copy 16-bit immediate value into register
    [dst32, imm16]: [    0x66, 0xB8 + dst32] & imm16
    
    ## Copy 16-bit immediate value into register
    [dst16, imm16]: [    0x66, 0xB8 + dst16] & imm16
    ## Copy  8-bit immediate value into register
    [dst8,  imm8]:  [          0xB0 + dst8, imm8]
  
  op LEA:
    ## Load effective address of the target label into a register
    [dst64, label]: [rex(w=1), 0x8D, modrm(Direct, reg = dst64, rm = rbp)]
  
  op CMP:
    ## Compare 32-bit immediate with 32-bit int at memory location stored in adr register
    [adr, imm64]: [ rex(w=1), 0x81, modrm(Indirect, opcode_ext = 7, rm = adr[0])] & imm64
    ## Compare 32-bit immediate with 32-bit int at memory location stored in adr register
    [adr, imm32]: [           0x81, modrm(Indirect, opcode_ext = 7, rm = adr[0])] & imm32
    ## Compare 16-bit immediate with 16-bit int at memory location stored in adr register
    [adr, imm16]: [     0x66, 0x81, modrm(Indirect, opcode_ext = 7, rm = adr[0])] & imm16
    ## Compare 8-bit immediate with byte at memory location stored in adr register
    [adr, imm8]:  [           0x80, modrm(Indirect, opcode_ext = 7, rm = adr[0]), imm8]
  
  op JZ:
    ## Jump to label if zero flag is set
    [label]: [0x0F, 0x84]
  op JNZ:
    ## Jump to label if zero flag is not set
    [label]: [0x0F, 0x85]
  
  op INC:
    ## Increment register by 1. Carry flag is never updated.
    [dst64]: [rex(w=1), 0xFF, modrm(Direct, opcode_ext = 0, rm = dst64)]
    [dst32]: [          0xFF, modrm(Direct, opcode_ext = 0, rm = dst32)]
    [dst16]: [    0x66, 0xFF, modrm(Direct, opcode_ext = 0, rm = dst16)]
    [dst8]:  [          0xFE, modrm(Direct, opcode_ext = 0, rm = dst8)]
    ## Increment data at the address by 1. Data type must be specified.
    [adr, type(64)]: [rex(w=1), 0xFF, modrm(Indirect, opcode_ext = 0, rm = adr[0])]
    [adr, type(32)]: [          0xFF, modrm(Indirect, opcode_ext = 0, rm = adr[0])]
    [adr, type(16)]: [    0x66, 0xFF, modrm(Indirect, opcode_ext = 0, rm = adr[0])]
    [adr, type(8)]:  [0xFE, modrm(Indirect, opcode_ext = 0, rm = adr[0])]
  
  op DEC:
    ## Increment register by 1. Carry flag is never updated.
    [dst64]: [rex(w=1), 0xFF, modrm(Direct, opcode_ext = 1, rm = dst64)]
    [dst32]: [          0xFF, modrm(Direct, opcode_ext = 1, rm = dst32)]
    [dst16]: [    0x66, 0xFF, modrm(Direct, opcode_ext = 1, rm = dst16)]
    [dst8]:  [          0xFE, modrm(Direct, opcode_ext = 1, rm = dst8)]
    ## Increment data at the address by 1. Data type must be specified.
    [adr, type(64)]: [rex(w=1), 0xFF, modrm(Indirect, opcode_ext = 1, rm = adr[0])]
    [adr, type(32)]: [          0xFF, modrm(Indirect, opcode_ext = 1, rm = adr[0])]
    [adr, type(16)]: [    0x66, 0xFF, modrm(Indirect, opcode_ext = 1, rm = adr[0])]
    [adr, type(8)]:  [0xFE, modrm(Indirect, opcode_ext = 1, rm = adr[0])]

And usage for a brainfuck JIT assembler (complete with clobbered registers cleanup):

while not stream.atEnd():
      case stream.readChar()
      of '>': a.inc rbx          # Pointer increment
      of '<': a.dec rbx          # Pointer decrement
      of '+': a.inc [rbx], uint8 # Memory increment
      of '-': a.dec [rbx], uint8 # Memory decrement
      of '.': a.os_write()       # Print
      of ',': a.os_read()        # Read from stdin
      of '[':                    # If mem == 0, Skip block to corresponding ']'
        let
          loop_start = initLabel()
          loop_end   = initLabel()
        a.cmp [rbx], uint8 0
        a.jz loop_end
        a.label loop_start
        stack.add (loop_start, loop_end)
      of ']':
        let (loop_start, loop_end) = stack.pop()
        a.cmp [rbx], uint8 0
        a.jnz loop_start
        a.label loop_end
      else:
        discard

I have plenty of other metaprogramming examples:

Neural network DSL

Simulating classes with ADTs

Creating a compiler for linear algebra and deep-learning

Creating matrix multiplication kernels as fast or faster than (pure assembly) OpenBLAS

Recreating the OpenMP syntax for multithreading

Implementing Einstein Summation

so ask away

eyes (orginal) [2020-05-22T03:11:06+02:00] view original

Thanks! Yes I did notice Arraymancer - it looks very exciting. Also I noticed Neo. I haven't tried them yet - will do that in the next few weeks.

For plotting I am most used to Python's matplotlib and Julia's PyPlot wrapper. I am also playing with gnuplot since it's supported by almost any language out there. There is a nim package for gnuplot, too, which is great. Will try it out.

Just a question about Arraymancer + Neo (or any other linear algebra package). For a most useful quick demo to be compared to Matlab and Julia, these are the most useful functions I would need:

nd-arrays (vectors and matrices) of integer, floating number, or complex values.

Slicing, concatenation, transposing... these sorts of array operations.

Broadcast basic math functions to any array.

Linear algebra (e.g. matrix multiplication, solving linear equations).

Complex 1D FFT and IFFT - this one is very important.

Some digital signal processing functions, but that's less important.

All above, running in CPU only (with MKL and/or automated multi-threading e.g. for large FFT/IFFT); plus running in GPU with minimal code change - mostly only converting CPU arrays to GPU arrays, then converting them back after the computation.

(In near future: Basic descriptive statistical functions; 1D and 2D linear and spline interpolation; 1D and 2D polynomial fitting; numerical integration; ODE solver.)

I'm not expecting a comprehensive set of solutions in Nim today. However, would be really nice to know what already exists. :-)

Thanks again for the answer, and for the amazing Nim & Arraymancer package!

andrea (orginal) [2020-05-22T09:19:31+02:00] view original

I am not sure about the current status of Arraymancer, but I believe it supports most of your requests. Neo, instead, supports only:

Slicing, concatenation, transposing... these sorts of array operations.

Broadcast basic math functions to any array.

Linear algebra (e.g. matrix multiplication, solving linear equations).

All above, running in CPU only (with MKL and/or automated multi-threading e.g. for large FFT/IFFT); plus running in GPU with minimal code change (well, some Lapack things are CUP only for now)

The fact is that when I was developing Neo, @mratsim started Arraymancer and improved it a lot. Nowadays, Arraymancer is more advanced and fast: @mratsim has implemented Laser, Weave for multithreading and more, and I don't really see a point in adding many new features to Neo. It will surely be maintained, and what it does, it does decently, but I don't want to duplicate the great effort of @mratsim.

mratsim (orginal) [2020-05-22T11:08:23+02:00] view original

Note: the docgen for API is still not ideal, there are internal stuff listed

nd-arrays (vectors and matrices) of integer, floating number, or complex values.

Yes

Slicing, concatenation, transposing... these sorts of array operations.

https://mratsim.github.io/Arraymancer/tuto.shapeshifting.html

CPU: https://mratsim.github.io/Arraymancer/shapeshifting.html

CUDA: https://mratsim.github.io/Arraymancer/shapeshifting_cuda.html

Linear algebra (e.g. matrix multiplication, solving linear equations).

Matrix multiplication

CPU: https://mratsim.github.io/Arraymancer/operators_blas_l2l3.html

CUDA: https://mratsim.github.io/Arraymancer/operators_blas_l2l3_cuda.html

OpenCL: https://mratsim.github.io/Arraymancer/operators_blas_l2l3_opencl.html

Solvers, matrix decomposition, PCA ..., CPU only at the moment

https://mratsim.github.io/Arraymancer/least_squares.html

https://mratsim.github.io/Arraymancer/linear_systems.html

https://mratsim.github.io/Arraymancer/decomposition.html

https://mratsim.github.io/Arraymancer/pca.html

https://mratsim.github.io/Arraymancer/decomposition_rand.html

1D FFT, IFFT

Not implemented, wrapping MKL FFT can be a weekend project with c2nim or nimterop https://software.intel.com/content/www/us/en/develop/documentation/mkl-developer-reference-c/top/appendix-e-code-examples/fourier-transform-functions-code-examples/fft-code-examples.html

Implemening a pure Nim FFT is something I want to do at one point but lack of time.

All above, running in CPU only (with MKL and/or automated multi-threading e.g. for large FFT/IFFT)

You can use OpenBLAS or MKL in both Neo or Arraymancer.

That said, you can write pure Nim code that has performance similar to both OpenBLAS and MKL. I track benchmarks of pure Nim implementation with threading via Laser (using the Nim OpenMP operators) and Weave here: https://github.com/mratsim/weave/tree/master/benchmarks/matmul_gemm_blas

iterator `||`[S, T](a: S; b: T; annotation: static string = "parallel for"): T
  ## See https://nim-lang.org/docs/system.html#%7C%7C.i%2CS%2CT%2Cstring
iterator `||`[S, T](a: S; b: T; step: Positive; annotation: static string = "parallel for"): T

Last time I optimized this, I could reach 2.8 TFlops with Weave, 2.8 TFlops with Laser + OpenMP, 2.7 TFlops on OpenMP, 3 TFlops for MKL and 3.1 TFlops with Intel oneDNN (https://github.com/mratsim/weave/pull/94#issuecomment-571751545) but i started from single-threaded performance of 160GFlops vs Intel and OpenBLAS 200GFlops on a 18-core machine.

GPU

Yes but minimal, Cuda and OpenCL at the moment

Statistical functions

PCA and SVD are well developped and actually 2x to 10x faster than in any other language (including Sklearn latest optimizations and Facebook's PCA)

https://github.com/mratsim/Arraymancer/pull/384#issuecomment-536682906

SPline, Numerical integration and ODE

https://github.com/HugoGranstrom/numericalnim

eyes (orginal) [2020-05-22T18:42:05+02:00] view original

Thanks a lot for this information and for the Neo package! I'll take your advice.

eyes (orginal) [2020-05-22T18:43:02+02:00] view original

Wonderful! Arraymancer is definitely doing more than I thought. I will go through the links carefully and try these features. Thanks!

Mirror of forum.nim-lang.org

6363 :: Write Nim in Matlab/Julia style using macros while still deploy to Cloud/PC/GPU/embedded?