Is Nim suitable for statistics, i.e. could it replace tools like R and Julia and their statistics-oriented libraries?
I haven't used Nim yet, but am already a kind of fan. I am curious to see the state of its ecosystem in the future.
There don't seem to be many libraries in that direction yet, but Nim certainly seems capable of being a good choice.
Nim would be great at these kinds of tasks; unfortunately, such libraries have not been written yet.
Only https://github.com/mratsim/Arraymancer comes to mind.
Consider, however, that most of the work statisticians do with data involves interactive sessions: load the data, do some plots, combine columns, quickly compute a few estimators, redo the plots, etc. R and Julia have the advantage of being REPL-oriented and having good support for Jupyter, which makes this kind of workflow feel natural.
Statically-typed languages like C++ and Nim are perfect if you want to implement a data analysis pipeline whose blocks are already settled, but they are not great for interactive data exploration, where you do not know exactly what you are going to find in your data.
Nim is not ready; even Python is missing several key R statistics packages.
I do try to implement what people find useful in Arraymancer; for example, I implemented randomized SVD and randomized PCA, with the results here: http://home.chpc.utah.edu/~u6000771/somalier-ancestry.html
And while we are at it, I have an API RFC for column preprocessing, on how to indicate a common transformation for all stats routines like PCA or logistic regression: https://github.com/mratsim/Arraymancer/issues/385
type FeaturePreprocessing = enum
  Auto
  NoPreprocessing
  MeanCentering
  MinMaxScaling
  StandardScaling
  RobustScaling
  ...
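To make the intent concrete, here is a minimal self-contained sketch (stdlib Nim only; the `preprocess` proc and the reduced enum are illustrative, not Arraymancer's actual API from the RFC) of how such an enum could drive a common column transformation:

```nim
import std/[sequtils, stats]

type FeaturePreprocessing = enum
  NoPreprocessing
  MeanCentering
  StandardScaling

proc preprocess(col: seq[float], mode: FeaturePreprocessing): seq[float] =
  ## Apply one common transformation to a single column.
  case mode
  of NoPreprocessing:
    result = col
  of MeanCentering:
    let m = col.mean
    result = col.mapIt(it - m)
  of StandardScaling:
    let m = col.mean
    let s = col.standardDeviation
    result = col.mapIt((it - m) / s)

let col = @[1.0, 2.0, 3.0, 4.0]
echo preprocess(col, MeanCentering)  # mean removed from each entry
```

The point of the RFC is that a stats routine such as PCA could take one such enum value and apply the same policy to every column, instead of every caller hand-rolling the scaling.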
Regarding exploration: Nim compiles fast, and you can play with the following to run Nim in Jupyter: https://github.com/apahl/nim_magic (similar to the Cython magic in Jupyter).
Lastly, Status sponsored a developer to add hot code reloading to Nim, so that Nim code can be modified while running. Running in Jupyter is typically one of the use cases; we just Nim someone to write the package: https://nim-lang.org/docs/hcr.html (There are other experiments: https://github.com/stisa/jupyternim)
we just Nim someone to write the package
Assuming you meant "need", this is the best-ever slip of the keyboard and/or auto-correct. "Nim".."need"..all the same. ;-)
Very interesting, very interesting. Thanks.
@zio_tom78, I also thought about the lack of a REPL. Once Nim has one, that should certainly make things better.
I don't know much about statistics, but this project might help you: Nim4Colab is an IPython extension for using the Nim language on Google Colaboratory.
Github repo: https://github.com/demotomohiro/nim4colab
Related Nim forum thread: https://forum.nim-lang.org/t/4944
Probability
There is alea, though not integrated with the rest: https://github.com/unicredit/alea I do need very efficient sampling as well for natural language modelling so I already have a prototype of a fast sampler without replacement for the multinomial distribution. Actually I'm confident that it's currently state-of-the-art: https://github.com/numforge/laser/blob/master/benchmarks/random_sampling/fenwicktree.nim
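For intuition, here is what multinomial sampling without replacement looks like with a naive O(A) linear scan over the weight table; the Laser sampler linked above replaces both the search and the weight update with a Fenwick tree, making them O(log A). The proc name and structure below are an illustrative sketch, not the Laser code:

```nim
import std/random

proc sampleWithoutReplacement(weights: var seq[float], rng: var Rand): int =
  ## Draw one index with probability proportional to its weight,
  ## then zero that weight so it cannot be drawn again.
  ## Precondition: at least one weight is > 0.
  var total = 0.0
  for w in weights: total += w
  var r = rng.rand(total)          # uniform in [0, total]
  for i, w in weights:
    r -= w
    if r <= 0.0 and w > 0.0:
      weights[i] = 0.0             # removed from future draws
      return i
  # Fallback for floating-point round-off: last nonzero weight.
  for i in countdown(weights.high, 0):
    if weights[i] > 0.0:
      weights[i] = 0.0
      return i

var rng = initRand(42)
var w = @[0.1, 0.7, 0.2]
echo sampleWithoutReplacement(w, rng)
```

Each draw here costs O(A) for the scan plus O(1) for the removal; the Fenwick-tree version keeps partial sums so both steps become logarithmic in the alphabet size, which is what matters with 140000+ unique words.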
Statistical summary of data
NimData probably has it, otherwise open an issue there.
Overarching lib for plotting
Well, we need good plotting packages first, but Vindaar is working on them all.
Vega Lite
I have a prototype here: https://github.com/numforge/monocle. It works and the code is very simple (50 LOC); I just don't have the time to integrate it with Arraymancer or NimData.
Central tendency and mean (...)
All are implemented in Arraymancer and parallelized
Loading formats
NimData supports CSV, Arraymancer supports CSV, Numpy, HDF5
Stable binary storage
Arraymancer can load Numpy and HDF5
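As a side note, even without NimData or Arraymancer, basic CSV loading is covered by the standard library's parsecsv module; a minimal sketch parsing an in-memory CSV into float columns:

```nim
import std/[parsecsv, streams, strutils]

# Parse a small in-memory CSV into a column-oriented table of floats,
# using only the Nim standard library.
let csvData = "x,y\n1.0,2.0\n3.0,4.0\n"
var parser: CsvParser
parser.open(newStringStream(csvData), "in-memory.csv")
parser.readHeaderRow()                 # consumes the "x,y" header line
var columns = newSeq[seq[float]](parser.headers.len)
while parser.readRow():
  for i, field in parser.row:
    columns[i].add field.parseFloat
parser.close()
echo columns  # @[@[1.0, 3.0], @[2.0, 4.0]]
```

For real work you would of course reach for the NimData/Arraymancer loaders mentioned above, which also give you typed dataframes and tensors rather than bare seqs.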
Querying Dataframes
Well there is only one dataframe library at the moment
Voyager
AFAIK Vega-Lite format can directly interop with Voyager
REPL
Already mentioned
In short, the ecosystem is not there yet. There is a lot of work left, but there are proofs of concept in all the important areas: ndarrays, plotting, dataframes, even a REPL and binary formats. There are also a couple of contributors, not just one person, and some actually use it for their research and contribute tooling.
So it's certainly not R or Python or Julia, but among all the fast statically typed languages (including D, Rust, Zig, V, Crystal, Fortran, OCaml, Haskell, ...), I don't see any other language coming close in terms of:
Now, I've left aside C and C++: they have REPLs (cling, xeus) and they have everything needed for stats, as they serve as the backends for Python and R, but they would feel heavyweight for iterative exploration of a dataset, especially C++, because of the busy syntax and the awful compile times. Another thing C and C++ are missing is a good package manager.
Lastly, I think the main advantage of Nim over Python or R is maintenance and deployment. Obviously, for a scientist, as long as it works on their machine it's good; but when you need to deploy a model, dealing with all the dependencies is very painful, whereas with Nim or another compiled language you can ship a single binary instead of a Docker image.
Mostly academic points to correct the record, but I feel like mratsim's linked "Fenwick Tree" uses up to 2x the cache memory of the trees of Fenwick 1993,1994,1995. So, maybe not quite even state of the art for the mid-1990s (depending on metrics/data sizes)?
Also, maybe not really even Fenwick trees? The Yu 2015 paper referenced in the code very questionably re-brands what Wong 1980 had called simply "binary trees" as the seemingly new "F+ trees", while removing the space optimization that was Fenwick 1994's contribution over Wong 1980. All quite weird.
Don't get me wrong - that sampler is good code. Maybe not optimal, though.
@cblake: I'm pretty sure I don't. For a tree of size N (a power of 2) I use 2N-1 slots of memory; if it uses more, it's a bug. Also, AFAIK traditional Fenwick trees often use one-based indexing, but that requires 2x the memory when N is a power of 2.
I agree with the F+ tree sentiment; I was a bit confused between the two when doing my research. The main benefit the paper provides is a baseline to measure against and related datasets like https://archive.ics.uci.edu/ml/datasets/Bag+of+Words: up to 140000 unique words and 730M words to parse into a frequency table. Here are my research references: https://github.com/numforge/laser/blob/master/research/random_sampling_optimisation_resources.md
Anyway, we can take this to Laser: feel free to open an issue if there are bugs, or a PR with new research. Unfortunately this will probably take a back seat until I actually have time or refocus on NLP (in a year or so).
For an alphabet size A (any int), an actual Fenwick tree needs only n = 2^ceil(log_2(A)) array slots. If A is a power of two, it needs no extra memory over a simple histogram. Your fenwicktree.newSampler uses 2*n-1 slots (for that same n). And it's not some 0/1-origin indexing threshold thing, but a legit 2x through the whole last power of 2. So, we don't disagree on what you use, but on what you could use.
That (admittedly confusing) paper even says "In fact, Fenwick Tree can be regarded as a compressed version of the F+ tree" (which itself kind of exhibits their terminological weirdness). It is to this "compressed" that I refer. If you don't want to think about it for a year that's ok. It is pretty far off topic. There are even further off topic things like https://arxiv.org/pdf/1612.09083.pdf that warrants a mention for the curious.
Oh, and while I'm sure mratsim already knows all this: because this is a thread about Nim and statistics and R and other things, and most passers-by / search-result finders probably do not know it, it bears noting that these binary indexed sum trees are often the fastest way to compute things like medians, other quantiles, or other order statistics over moving data windows, at least up to a certain precision, aka number of bins.
Such CDF operations are almost the same as sampling without replacement. In both cases one is maintaining a CDF and/or its inverse as data comes & goes (with moving data windows the data both comes and goes while with sampling it only goes). With Wong/Fenwick/binary/Fibonacci/whatever decomposition indexed sum trees, the per-datum scaling becomes about the alphabet size or precision not the window size. (There is always an overall O(N data points) factor for the whole data set not just each window.)
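To make the moving-window idea concrete, here is a minimal Fenwick (binary indexed sum) tree in Nim: bin counts are updated as data enters and leaves the window, and a quantile is read back from prefix sums. This is a simplified illustrative sketch of the technique described above, not code from Laser or R:

```nim
# Fenwick tree over `bins` integer value bins: O(log bins) updates,
# O(log bins) prefix sums, so per-datum cost scales with the number
# of bins, not with the window size w.
type Fenwick = object
  tree: seq[int]   # 1-based internal array of partial sums

proc initFenwick(bins: int): Fenwick =
  Fenwick(tree: newSeq[int](bins + 1))

proc add(f: var Fenwick, i, delta: int) =
  ## Add `delta` to the count of bin `i` (0-based).
  var j = i + 1
  while j < f.tree.len:
    f.tree[j] += delta
    j += j and -j

proc prefixSum(f: Fenwick, i: int): int =
  ## Number of items in bins 0..i.
  var j = i + 1
  while j > 0:
    result += f.tree[j]
    j -= j and -j

proc quantileBin(f: Fenwick, k: int): int =
  ## Smallest bin whose prefix sum exceeds k, i.e. the k-th order
  ## statistic (0-based). Linear scan for clarity; the classic trick
  ## descends the tree directly in O(log bins).
  result = -1
  for b in 0 ..< f.tree.len - 1:
    if f.prefixSum(b) > k: return b

# Moving-window median over a tiny integer stream, window w = 3:
let data = @[5, 1, 4, 2, 8]
const w = 3
var f = initFenwick(10)
var medians: seq[int]
for t, x in data:
  f.add(x, 1)                         # value entering the window
  if t >= w: f.add(data[t - w], -1)   # value leaving the window
  if t >= w - 1: medians.add f.quantileBin(w div 2)
echo medians  # @[4, 2, 4]
```

Sampling without replacement uses the same structure with `add(bin, -1)` on each draw and an inverse-CDF search in place of the fixed-rank query.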
By comparison, the R runmed implemented in Srunmed.c, for example, has atrocious O(w) scaling in the window size w. Srunmed.c, while better than the most naive sort-every-window O(w*log(w)) approach (not even with a radix sort), is very poor work. It does not even do the binary-search-of-an-ordered-array-with-AVX-optimized-memmove adjustment, which would be both simpler than what Srunmed.c already does and a significant constant-factor speedup, as well as generalizing to arbitrary quantiles.
These index decomposition sum trees' (probably the most suggestive name) w-independence is even better unless you need too many bins. 64..256 KiBins is a lot of resolution and fits in L3 these days for even 8 byte counter sizes, though 4B or even 2B are often enough. This is obviously easier with the 2x space optimization I mentioned above. True full precision for index decomposition sum trees needs a bin for every possible number or say 2^32 bins for a float32 which can get very memory intensive or require expensive-in-the-small sparse arrays/hashing.
At some loss of precision with numbers (as opposed to mratsim's words case), you can use, e.g., an exponentially spaced bin mapping from the full range down to a more manageable number of bins. If you need full precision / exact answers, then usually the best performer is a B-tree augmented to be an order-statistics tree, which gets you "full precision" with O(lg(w)) per-datum scaling. Even a ranked binary search tree as per Knuth Vol. 3 in the late 1960s is better than R's Srunmed.c, though.
w can easily be in the 1000s and N in the millions or more, so R's current core implementation is potentially leaving many orders of magnitude of performance on the table. Of course, people using R are rarely very sensitive to performance.
@Lecale, in that case, thanks.
@mikebelanger, no, I haven't, due to lack of time for it. It does look good.
@mratsim, then we more or less concur in general: Nim and its ecosystem are not ready at the moment, but they look very promising. I wish I could help out with these endeavours.