There seem to be quite a few data/computational science people lurking around here in the Nim community (@mratsim among others). Is there any interest in creating a place where such topics can be discussed? I think it would be helpful to have a dedicated place, especially for people coming from Python (the main data science language nowadays), who might (as I did) see Nim as a "faster Python". A dedicated place to discuss is also less intimidating than getting into the Nim IRC/Gitter/Discord, especially if you only check it a few times a day and miss most of the discussions.
Is there anyone who would be interested? If so, what platform should we use? (Gitter/IRC, a subreddit, etc.)
Why not just start on the Nim forum and Nim IRC, and switch to a separate area when the number of participants becomes too large, say a few dozen?
I really wonder why some people like to separate so much. We all know that the Nim community is still small, and there is no evidence that many people are really working in data/computational science. Data science discussion on the Nim forum is interesting for many, and it is good advertising for Nim.
For other topics it is similar -- for example, I have had some requests to create a Nim-GTK forum, channel, or IRC, the last one recently, see https://github.com/StefanSalewski/gintro/issues/54#issuecomment-532497567 Well, there would be one user if I subscribed myself, and maybe one more a few times a month.
I get your point ;) The forum works fine, but it's the IRC part that bothers me a bit. I usually check the Gitter once every 2 hours, and by then there has usually been too much going on for me to go through it all. If there has been anything about data science, it has most probably been drowned out by the core devs' discussions or by people asking more general Nim questions (which is really good :-D don't get me wrong, I love that part of the Nim community and its engagement).
I'm also starting this thread to see IF there are people interested ;) If not, I would see no use in a separate community.
I get your point too :-)
Currently there is a lot of discussion on IRC due to the 1.0 release, but generally traffic is lower. I look at the IRC logs sometimes, and I have never seen much discussion about data science. Most of it is compiler dev, bugs, game dev, crypto. And some noise.
My feeling is that when someone starts a discussion and there is someone else interested in it, a longer discussion follows. zcharter does this often, for his game engines. So you may just try to start a data science discussion when there is no other important discussion on IRC and see if other people join in.
But maybe some other people will join this thread, so we will see. I am interested in science too (working on R-tree bulk loading and k nearest neighbor search just now), but I will never join a separate channel.
Hi, +1 to discussing scientific Nim in this forum until the community grows big enough. I do a number of scientific calculations with Nim, but from the angle of stochastic differential equations and Monte Carlo simulations for finance and insurance rather than neural networks and data mining.
The DataTable package was my pilot project to get more familiar with Nim. IMO, finishing it would require a couple of improvements in Nim itself, namely chaining of iterators and static[T] type improvements.
I think we can replace iterator chaining by objects that represent the transformations.
Then it can be applied lazily like Dask (i.e., build a compute graph) or applied like D ranges, a bit like @timotheecour's PoC. AFAIK the new C++20 ranges work like this as well.
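To make the idea concrete, here is a minimal sketch of what "objects that represent the transformations" could look like, with lazy application in the Dask style. All the names here (`LazySeq`, `lazy`, `collect`, etc.) are hypothetical, invented for illustration — this is not from DataTable or any existing library:

```nim
# Each transformation step is recorded as a closure; nothing runs
# until `collect` materializes the result. This replaces chained
# iterators with a small "compute graph" (here just a linear list).
import std/sequtils

type
  Step = proc (s: seq[float]): seq[float]
  LazySeq = object
    source: seq[float]
    steps: seq[Step]

proc lazy(s: seq[float]): LazySeq =
  LazySeq(source: s)

proc map(l: LazySeq, f: proc (x: float): float): LazySeq =
  ## Record a map step; do not execute it yet.
  result = l
  result.steps.add(proc (s: seq[float]): seq[float] = s.mapIt(f(it)))

proc filter(l: LazySeq, p: proc (x: float): bool): LazySeq =
  ## Record a filter step; do not execute it yet.
  result = l
  result.steps.add(proc (s: seq[float]): seq[float] = s.filterIt(p(it)))

proc collect(l: LazySeq): seq[float] =
  ## Materialize: run all recorded steps in order.
  result = l.source
  for step in l.steps:
    result = step(result)

when isMainModule:
  let r = lazy(@[1.0, 2.0, 3.0, 4.0])
            .map(proc (x: float): float = x * x)
            .filter(proc (x: float): bool = x > 2.0)
            .collect()
  echo r  # @[4.0, 9.0, 16.0]
```

A real implementation would of course want to fuse the steps into a single pass instead of allocating an intermediate seq per step, but the chaining ergonomics stay the same.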
I do, in Arraymancer: https://github.com/mratsim/Arraymancer/
For NLP there were some wrappers here: https://github.com/Nim-NLP with a focus on the Chinese language.
For FSMs for NLP, I've come across BlingFire from Microsoft Research, but I guess the most flexible tokenizer is SentencePiece by Google, which does unsupervised training and does not assume anything about the language (e.g., whitespace); you can just give it text to read.
Couldn't agree more! A plotting library that can both do a simple `plot(x, y)` and more advanced customization. Are there any maintained attempts at a Jupyter kernel one could try to help out with?
The path through Python seems like a good one. When Python programmers realize there is a better Cython, things will get fun here ;)
@chemist69 Jupyternim predates hot-code reloading, which was also written with a Jupyter kernel in mind and should be less hacky. No idea though on how to use it in practice.
Docs if someone wants to play with it: https://nim-lang.org/docs/hcr.html
@miran In terms of plotting we have nim-plotly and ggplotnim, which is written from scratch.
On my side I'm still convinced that the Vega ecosystem is probably one of the best ways forward, especially because they provide an open-source Tableau called Lyra (built with feedback from Tableau people) and, most impressively, a tool that does automatic suggestions of data visualizations called Voyager.
This is the video that sold me on Vega from the OpenVis 2015 conference. Focus on Voyager at 19:15 - https://youtu.be/GdoDLuPe-Wg?t=1155.
I have a PoC of calling Vega-Lite from Nim here: https://github.com/numforge/monocle but I have no time to work on it for the foreseeable future.
Something else to put on the radar:
"Pyodide brings the Python runtime to the browser via WebAssembly, along with the Python scientific stack including NumPy, Pandas, Matplotlib, parts of SciPy, and NetworkX"
<https://github.com/iodide-project/pyodide>
Maybe a Nim language plugin in the future?
<https://iodide-project.github.io/docs/language_plugins/>
And WebAssembly is not limited to browsers (although the browser is the new black):
<https://github.com/intel/wasm-micro-runtime/issues/85> <https://github.com/CraneStation/wasmtime>
In Nim, float is currently always 64-bit, the same as float64. float32 is smaller, which means less memory bandwidth is needed, as well as wider vector instructions on modern CPUs (e.g., with AVX you can fit 8 float32 values in one register but only 4 float64). Some neural net people want to use a couple of variants of 16-bit (or even 8-bit) floating point formats, while for certain very high precision cases 128-bit is becoming less rare.
So I don't think there is a perfect choice of size/accuracy, and your SomeFloat idea sounds smarter to me. You might also benefit from having "iterative" numerical routines accept some kind of "precision" parameter that defaults to something sensible (probably dependent upon the concrete floating point type in play), but allows users to trade off speed and accuracy themselves. I.e., if you have some series approximation with an error bound, then allow users to pass the maximum tolerable error (in either absolute or relative terms). Nim's named parameter passing with defaults should make it easy to establish a common convention in your library.
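A small sketch of that convention, assuming nothing beyond Nim's built-in SomeFloat type class: a generic iterative routine whose tolerance defaults per concrete float type but can be overridden by name. `newtonSqrt` and `defaultTol` are made-up names for illustration, not part of any library:

```nim
# Generic over SomeFloat: the same code instantiates for float32 and
# float64, with a type-dependent default tolerance the caller can override.
import std/math

proc defaultTol[T: SomeFloat](): T =
  ## Coarser default for float32 than for float64.
  when T is float32: T(1e-6) else: T(1e-12)

proc newtonSqrt[T: SomeFloat](x: T, tol: T = defaultTol[T]()): T =
  ## Newton's iteration for sqrt(x); stops when the relative update
  ## falls below `tol`.
  assert x >= T(0)
  if x == T(0): return T(0)
  result = x
  while true:
    let next = T(0.5) * (result + x / result)
    if abs(next - result) <= tol * abs(next):
      return next
    result = next

when isMainModule:
  echo newtonSqrt(2.0)              # close to 1.4142135623730951
  echo newtonSqrt(2.0'f32)          # float32 instantiation, coarser default
  echo newtonSqrt(2.0, tol = 1e-3)  # caller trades accuracy for speed
```

The named `tol = ...` override at each call site is exactly the kind of library-wide convention the paragraph above suggests.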