Hey!
As many of you will be aware by now, I started writing a port of ggplot2 sometime in the middle of last year:
https://github.com/Vindaar/ggplotnim
After many sometimes frantic sessions working on this, I'm finally approaching a first personal milestone: Essentially all features I consider essential for a plotting library (for my personal use cases!) are (or are about to be) implemented. This will mark the release of version v0.3.0.
The remaining features I will implement in the next few days are:
Now, the main reason I open this topic is to ask all of you about what I should focus on once the above is done.
There are several ideas I have in mind, but definitely not the time to tackle them all at once. They are:
One of the main goals I had in mind when starting this whole project was to provide two different plotting backends. One native target to produce plots locally, fast and statically.
On the other hand, originally inspired by @mratsim's monocle, a Vega-Lite backend to scratch that interactive / web based itch, which allows for easy sharing of plots including data!
I wrote a proof of concept and by now I have a pretty good idea (barring a lack of Vega experience) on how to implement this.
Essentially the whole processing of the plot remains the same as it is now. This makes it possible to use the full functionality of ggplotnim without a lot of duplication. Only the drawing code will be replaced by a mapping to JSON.
The major work involved would be defining said mapping. If I'm lucky I can even write it as a ginger backend with a - for Vega pretty obscure - API (drawPoint, drawLine, etc. essentially just adding data to a JsonNode). More likely it'll involve replacing the drawing portion of ggplotnim with Vega related drawing equivalents.
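To give an idea, here's a minimal sketch of how such a backend could accumulate JSON (VegaBackend, drawPoint and toVegaLite are hypothetical names for illustration, not the actual ginger API):

import json

type
  VegaBackend = object
    marks: JsonNode # accumulates one entry per draw call

proc initVegaBackend(): VegaBackend =
  VegaBackend(marks: newJArray())

proc drawPoint(b: var VegaBackend, x, y: float) =
  # "drawing" just records the data point as JSON
  b.marks.add %*{"mark": "point", "x": x, "y": y}

proc toVegaLite(b: VegaBackend): JsonNode =
  # wrap the accumulated marks in a Vega-Lite schema skeleton
  result = %*{"$schema": "https://vega.github.io/schema/vega-lite/v4.json"}
  result["layer"] = b.marks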
The included data frame in ggplotnim is - for many operations anyways - abysmally slow.
While performance is nice, I mainly wanted something to work with "right now" instead of spending a lot of time writing a performant data frame.
The reasons for the poor performance are threefold, as far as I can tell:
One way to improve performance would be to distinguish between a pure column of a single data type and Value columns (which are somewhat similar to object columns in numpy / pandas, if my superficial understanding of those is correct).
While I'm not certain, I believe that distinction alone would make the code a lot more complex and would definitely require heavy use of generics. Generics are something I specifically wanted to avoid in the context of a data frame, because each time I played around with toy data frames they became a headache.
The only idea I have to avoid generics would be to extend a Value to also have a case for vector like data, similar to JsonNode's JArray. That would double the number of fields though.
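Roughly, such an extension might look like this (a sketch with made-up kind names, not ggplotnim's actual Value definition):

type
  ValueKind = enum
    VInt, VFloat, VString, VBool, VNull,
    # vector like counterparts, akin to JsonNode's JArray,
    # doubling the number of cases:
    VIntSeq, VFloatSeq, VStringSeq, VBoolSeq

  Value = object
    case kind: ValueKind
    of VInt: num: int
    of VFloat: fnum: float
    of VString: str: string
    of VBool: bval: bool
    of VNull: discard
    of VIntSeq: nums: seq[int]
    of VFloatSeq: fnums: seq[float]
    of VStringSeq: strs: seq[string]
    of VBoolSeq: bvals: seq[bool]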
In any case, if I were to seriously attempt to improve performance of the data frames, I would stop messing around myself and first do some research into how data frames are handled elsewhere.
Again, if anyone is familiar with resources, feel free to share them!
This is pretty self explanatory. The main documentation is definitely lacking as it is right now.
I hope that the recipes provide anyone of you who tried to play around with the library with a reasonable alternative for the time being!
There's a lot of functionality in ggplot2 that makes it a proper R package. Namely a lot of stats related functionality. Simple things like box and violin plots, smoothing and error bands and probably a lot more I'm not even aware of, since I don't really use that stuff.
If that's something people want, I could definitely work on that.
At least box and violin plots and simple loess smoothing are something I'll implement at some point anyways. If there's something else you consider essential, let me know.
This is a fun idea I had a while back. As far as I'm aware our poor fellows who are stuck working with C and C++ don't really have a great plotting library to work with.
Some of them are very powerful but hard to use, some produce plots that don't look very nice, and some others (looking at you, ROOT) bring along an oil tanker of dependencies.
Maybe I'm missing something, but as far as I can tell there are many people who do their calculations in C/C++, dump the data and use python for plotting.
I'm not sure if people would be interested in such a thing, but Nim being awesome would allow for a shared library that makes essentially all of ggplotnim's functionality usable from C. I wrote a small scatter plot function and it worked perfectly.
Maybe this wouldn't work out as well as I think right now, but it feels like maybe a great opportunity for Nim to shine.
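As a rough sketch of what that wrapper could look like (scatterPlot is an illustrative name, not an existing ggplotnim API):

# compile with: nim c --app:lib plotlib.nim
import ggplotnim

proc scatterPlot(x, y: ptr UncheckedArray[cdouble],
                 n: cint, fname: cstring) {.exportc, dynlib.} =
  # copy the raw C buffers into Nim seqs
  var xs = newSeq[float](n)
  var ys = newSeq[float](n)
  for i in 0 ..< n.int:
    xs[i] = x[i]
    ys[i] = y[i]
  let df = seqsToDf({"x": xs, "y": ys})
  ggplot(df, aes("x", "y")) + geom_point() + ggsave($fname)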
Currently dates and times are not supported in ggplotnim. I'm being made aware of this every day at the moment, thanks to many COVID-19 plots I see daily, which have dates on their x axis…
This is definitely a must have, but I haven't really thought about how I want to implement this.
Once the time comes for me to write my thesis, I will probably want more power over the typesetting on the plots, especially to choose arbitrary fonts, set equations etc.
One main feature matplotlib still has over ggplotnim for me is to easily put LaTeX onto plots.
This is something I will attempt to implement at some point this year. I don't yet know how to best do it, but I have a few ideas.
I could go crazy and write a tikz backend for ginger I suppose. I'm not familiar enough with tikz to know how flexible it is, but it seems doable.
Or I could split the non text based plot and the text based stuff into two outputs, dump the text into a LaTeX template, compile it and merge the two files.
I'll figure something out.
If you have any other ideas or maybe I'm missing something important, feel free to let me know. Either post it here, or open an issue on the repository.
Sorry for rambling so long. ;)
In general I want to encourage anyone who tries out ggplotnim to feel free to open issues on the repository freely. Please don't think it's strictly for bug reports. If you struggle using the library chances are I'm at fault. Either the documentation sucks, you're using it in ways I didn't foresee, which may thus be cumbersome, etc. I'll try to help out as best as I can!
Thanks for reading!
I'll use this thread to update you on releases and post changelogs in the future.
This is awesome! I have not really used Nim for signal processing or data science work yet, but I think it would work very well with the right libraries, and this is one of the libraries that could make that happen.
Is this compatible with other libraries, such as arraymancer, etc? I think that one of the biggest strengths of the python numerical ecosystem is the good inter-operability of most plotting libraries with numpy. So if that is not already the case I would suggest making that your highest priority.
Other than that, I didn't see mention of support for contour plots in the docs. It is surprising how often those come in handy in many scenarios so I'd like for you to add that if it is not available yet. Another thing I like to do is to combine line plots with histograms and/or kernel density plots on the X and Y axis (to get a quick idea of the distribution of the values, particularly in time series). It would be neat to support for that too.
Finally, in signal processing work you are often working with complex samples. In that context it is often handy to plot the I and Q components vs time, placing 2 subplots on top of each other, and linking the X (time) zoom/pan of the two subplots. It would be really nice if that were supported.
@Vindaar ggplotnim is a very nice library!
Is this compatible with other libraries, such as arraymancer, etc? I think that one of the biggest strengths of the python numerical ecosystem is the good inter-operability of most plotting libraries with numpy. So if that is not already the case I would suggest making that your highest priority.
I second that, along with using a common DataFrame. numpy is the foundation of Python data libraries such as pandas, but we don't have such a recognized data foundation in Nim yet. Having a coordinated effort would benefit many scientific Nim projects.
See the Data Science section for other DataFrame libraries.
Is this compatible with other libraries, such as arraymancer, etc? I think that one of the biggest strengths of the python numerical ecosystem is the good inter-operability of most plotting libraries with numpy. So if that is not already the case I would suggest making that your highest priority.
The answer to that is "sort of". I'll need to explain a little to answer the why and what I mean by "sort of".
The long answer
Originally when I started the library I never planned to write a data frame library to go with this. I quickly realized however that (at least with a library like ggplot2) one doesn't work well without the other. In a normal plotting library every plotting function is a special case. Essentially each kind of plot wants data in a specific form / of a specific data type.
So in the beginning I specifically didn't want to use arraymancer internally. I love that library, but given that all I wanted to write was a "plotting library", this meant two things specifically for me:
For this reason I decided to avoid having arraymancer as a dependency, because all its strengths are mostly useless for the intended purpose, while it would mean introducing an unnecessary dependency.
If a user is using arraymancer for calculations, it's easy to convert the required data to ggplotnim's data types. I felt the overhead of copying the data was not a big deal under the assumption mentioned above.
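Such a conversion is only a couple of lines; a minimal sketch (assuming a rank 2 tensor with the x / y data in its columns):

import arraymancer, ggplotnim, sequtils

# some (n, 2) tensor with x in column 0 and y in column 1
let data = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]].toTensor
let xs = toSeq(data[_, 0]) # copies the column into a seq[float]
let ys = toSeq(data[_, 1])
let df = seqsToDf({"x": xs, "y": ys})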
But, things did somewhat change when I started to write the data frame.
My first idea was actually to use NimData, since I really like that library. However, the (depending on viewpoint) advantage / disadvantage that its type is entirely defined via a schema at compile time didn't appeal to me. I didn't want to end up with a ggplot2 clone that was super restrictive, because everything had to be known at compile time.
I was actually hoping that @bluenote would pick up his development of Kadro again:
https://github.com/bluenote10/kadro
That sounded perfectly suited. But since he didn't, I simply started to hack together something that suits the needs of the library.
Originally in fact the DataFrame type was generic and my goal was to write the code in such a way that the underlying type does not matter. This made things complicated though. In fact I even thought about an arraymancer backend from the start:
https://github.com/Vindaar/ggplotnim/blob/master/playground/arraymancer_backend.nim
which however never progressed from there, mainly because I couldn't figure out how to make use of arraymancer's performance when the majority of data frame operations I did ended up copying data around. That is how I ended up with @PMunch's persistent vector from Clojure. It kind of allowed me to "copy as much as I want" without the performance penalty.
This is how we got to the current situation. The data frame is okayish fast for the simple things needed to prepare a plot. For anything else, I can't recommend it (also because it's extremely lenient about types!).
tl;dr
Compatibility with the "rest of the ecosystem" isn't there for practical reasons.
The thing is I'd love to profit from @mratsim's amazing work on arraymancer and laser!
Once I go back and reconsider performance of the data frame, I hope I will end up using as much of arraymancer as I can to be honest. I just need to figure out how to do it. :)
Other than that, I didn't see mention of support for contour plots in the docs. It is surprising how often those come in handy in many scenarios so I'd like for you to add that if it is not available yet. Another thing I like to do is to combine line plots with histograms and/or kernel density plots on the X and Y axis (to get a quick idea of the distribution of the values, particularly in time series). It would be neat to support for that too.
Good point. Contour plots are something I simply didn't think about.
I've never actually thought about how those are implemented before. I guess it's just a 2 dimensional KDE, right?
Since I will be implementing geom_density, for which I need KDEs anyways, I might as well implement N dimensional KDEs. That makes performance an even bigger issue though.
This is a case where implementing this in arraymancer would definitely be helpful and then just pass the required data from a DF to arraymancer. Maybe in a few months time, we can just write:

import ggplotnim, arraymancer
let df: DataFrame = someDataFrame()
let dfKde = df.kde("x", "y", "z")

where kde would be an arraymancer proc; or something along those lines…
Finally, in signal processing work you are often working with complex samples. In that context it is often handy to plot the I and Q components vs time, placing 2 subplots on top of each other, and linking the X (time) zoom/pan of the two subplots. It would be really nice if that were supported.
When you say "on top of each other", do you mean essentially a plot with a secondary axis? These are already supported, but are somewhat limited right now.
There's no recipe for these at the moment though. An example:
https://gist.github.com/Vindaar/5292d4d9b8fb667e3eb27061627dbbfe#gistcomment-3225761
The downside of secondary axes at the moment is that the secondary axis is just a fake axis. It draws ticks and labels on the RHS of the plot, but the underlying data is still drawn into the coordinate system defined by the main axis.
I know that ggplot2 explicitly does not allow completely independent axes (only those which can be calculated from one another, e.g. unit conversions), because Hadley Wickham thinks other cases are easily misleading. And to an extent I agree. However, I do think there is a place for them, so I will provide better support for them in the future.
real subplots
Or do you mean a normal subplot consisting of several (in principle not connected) plots in a single graphic? Those are also supported, but their use is not perfectly nice yet. One has to make use of ginger functionality directly.
An example inspired by: https://staff.fnwi.uva.nl/r.vandenboomgaard/SP20162017/SystemsSignals/plottingsignals.html
import ggplotnim, seqmath, math, sequtils, complex, ginger

let t = linspace(-0.02, 0.05, 1000)
let y1 = t.mapIt(exp(im(2'f64) * Pi * 50 * it).re)
let y2 = t.mapIt(exp(im(2'f64) * Pi * 50 * it).im)
let df = seqsToDf({ "t" : t,
                    "Re x(t)" : y1,
                    "Im x(t)" : y2 })
let plt1 = ggcreate(
  ggplot(df, aes("t", "Re x(t)")) +
    geom_line() +
    xlim(-0.02, 0.05) +
    ggtitle("Real part of x(t)=e^{j 100 π t}"),
  width = 800, height = 300
)
let plt2 = ggcreate(
  ggplot(df, aes("t", "Im x(t)")) +
    geom_line() +
    xlim(-0.02, 0.05) +
    ggtitle("Imaginary part of x(t)=e^{j 100 π t}"),
  width = 800, height = 300
)
# combine both into a single viewport to draw as one image
var plt = initViewport(wImg = 800, hImg = 600)
plt.layout(1, rows = 2)
# embed the finished plots into the new viewport
plt.embedAt(0, plt1.view)
plt.embedAt(1, plt2.view)
plt.draw("real_imag_subplot.pdf")
Which produces the following plot: https://gist.github.com/Vindaar/5292d4d9b8fb667e3eb27061627dbbfe#gistcomment-3225762
Another example can be found here: https://gist.github.com/Vindaar/fc158afbc75627260aed90264398e473
If you have something else in mind, let me know!
Thank you for your detailed explanation of the current situation and how you got there. I hope you can find a way to improve the interoperability story of this promising library.
Regarding the performance of the library I'd say that to me the most important requirement is to be able to plot huge amounts of data. Ideally, hundreds of thousands, even millions of points should not be a problem and should not take more than a few seconds to plot (the less time the better of course :-) ). Being able to update the plot a few times per second (with a reasonably small dataset and a simple plot) would be nice too (but no need to achieve tens of FPS or something like that IMHO).
As for the “complex signal” plot, I was referring to a plot with two separate subplots (placed one above the other) where the Y axis zoom is independent but the X axis zoom and pan are linked. The idea is that if I want to view the real component in the range of the 3000th to the 5000th sample, I'd also like to view the imaginary component in the same range.
Ideally I’d like to be able to simply pass a complex number data frame to the plotting functions and the library would just know what to do with it (i.e. the creation of the 2 subplots would be done implicitly and automatically).
I sometimes need to plot the real and imaginary components of a complex signal on same plot, so supporting that would also be nice but I don’t need it as often.
Would https://github.com/Araq/packedjson or the ideas it uses be of help as a common ground for data processing?
With ARC the data doesn't have to be copied between threads, and without ARC the copy is a single allocation + copyMem, which is quite cheap.
So I did a thing today… (which is why I haven't answered yet).
This morning I took another look at a rewrite of the DataFrame using an arraymancer backend. Turns out by rethinking a bunch of things and especially the current implementation of the FormulaNode, I managed to come up with a seemingly working solution.
This is super WIP and I've only implemented mutate, transmute and select so far, but first results are promising.
Essentially the FormulaNode from before is now compiled into a closure, which returns a full column.
So the following formula:
f{"xSquared" ~ "x" * "x"}
will assume that each string is a column of a data frame and create the following closure:
proc(df: DataFrame): Column =
  var
    colx_47075074 = toTensor(df["x"], float)
    colx_47075075 = toTensor(df["x"], float)
    res_47075076 = newTensor[float](df.len)
  for idx in 0 ..< df.len:
    []=(res_47075076, idx, colx_47075075[idx] * colx_47075074[idx])
  result = toColumn res_47075076
The data types of the columns and of the result are currently determined by heuristics based on what appears in the formula. E.g. if math operators appear it's float, if boolean operators it's bool, etc.
The data frame now looks like:
DataFrame* = object
  len*: int
  data*: Table[string, Column]
  case kind: DataFrameKind
  of dfGrouped:
    # a grouped data frame stores the keys of the groups and maps them to
    # a set of the categories
    groupMap: OrderedTable[string, HashSet[Value]]
  else: discard
where a Column is:
Column* = object
  case kind*: ColKind
  of colFloat: fCol*: Tensor[float]
  of colInt: iCol*: Tensor[int]
  of colBool: bCol*: Tensor[bool]
  of colString: sCol*: Tensor[string]
  of colObject: oCol*: Tensor[Value]
colObject is the fallback for columns, which contain more than one data type.
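Reading a column back out then just means dispatching on the kind. Roughly (a sketch building on the Column type above, not the actual implementation):

import arraymancer

proc toTensor(c: Column, dtype: typedesc[float]): Tensor[float] =
  case c.kind
  of colFloat: result = c.fCol # native float column, no conversion needed
  of colInt: result = c.iCol.map_inline(x.float) # convert element-wise
  else:
    raise newException(ValueError, "cannot read " & $c.kind & " as float")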
So I only wrote a super simple for loop to get a rough idea of how fast/slow this might be:
import arraymancer_backend
import seqmath, sequtils, times
#import ggplotnim # for comparison with current implementation

proc main(df: DataFrame, num: int) =
  var df = df # shadow the parameter so we can reassign it
  let t0 = cpuTime()
  for i in 0 ..< num:
    df = df.mutate(f{"xSquared" ~ "x" * "x"})
  let t1 = cpuTime()
  echo "Took ", t1 - t0, " for ", num, " iter"

proc rawTensor(df: DataFrame, num: int) =
  var t = newTensor[float](df.len)
  let xT = df["x"].toTensor(float)
  let t0 = cpuTime()
  for i in 0 ..< num:
    for j in 0 ..< df.len:
      t[j] = xT[j] * xT[j]
  let t1 = cpuTime()
  echo "Took ", t1 - t0, " for ", num, " iter"

when isMainModule:
  const num = 1_000_000
  let x = linspace(0.0, 2.0, 1000)
  let y = x.mapIt(0.12 + it * it * 0.3 + 2.2 * it * it * it)
  var df = seqsToDf(x, y)
  main(df, num)
  rawTensor(df, num)
Gives us: new DF:
raw arraymancer tensor:
While the old DF took 23.3 seconds for only 100,000 iterations! So it's about a factor of 23 slower than the new code.
A probably really bad comparison with pandas:
import time
import numpy as np
import pandas as pd

x = np.linspace(0.0, 2.0, 1000)
y = (0.12 + x * x * 0.3 + 2.2 * x * x * x)
df = pd.DataFrame({"x" : x, "y" : y})

def call():
    t0 = time.time()
    num = 100000
    for i in range(num):
        df.assign(xSquared = df["x"] * df["x"])
    t1 = time.time()
    print("Took", (t1 - t0), "for", num, "iterations")

call()
Took 60.24467134475708 for 100,000 iterations. I suppose using assign and accessing the columns like this is probably super inefficient in pandas?
And a (also not very good) comparison with NimData
import nimdata
import seqmath, sequtils, times, sugar

proc main =
  let x = linspace(0.0, 2.0, 1000)
  let y = x.mapIt(0.12 + it * it * 0.3 + 2.2 * it * it * it)
  var df = DF.fromSeq(zip(x, y))
  df.take(5).show()
  echo df.count()
  const num = 1_000_000
  let t0 = cpuTime()
  for i in 0 ..< num:
    df = df.map(x => (x[0], x[0] * x[0])).cache()
  let t1 = cpuTime()
  echo "Took ", t1 - t0, " for ", num, " iter"

when isMainModule:
  main()
Took 16.322826325 for 1,000,000 iter
I'm definitely not saying the new code is faster than NimData or pandas, but it's definitely promising!
I'll see where this takes me. I think though I managed to implement the main things I was worried about. The rest should just be tedious work.
Will keep you all posted.
Some simple benchmarks comparing the new backend to pandas at:
https://github.com/Vindaar/ggplotnim/tree/arraymancerBackend/benchmarks/pandas_compare
Note that I ran the code with a default pandas installation on my Void Linux machine, without BLAS. But I also compiled the Nim code without BLAS support.
It's just a port of a pandas / numpy comparison from here:
https://github.com/mm-mansour/Fast-Pandas
All in all the new backend (let's call it datamancer from now on, heh) is significantly faster for all operations which essentially just rely on @mratsim's work.
For a few others, specifically unique and sorting, it's slightly slower. But given the implementation of those I'm actually rather happy with that.
And especially for small data frame sizes, the function call / looping overhead Python has to bear is ridiculous.
I'll focus on finishing up the open PR (ridgelines and a bit more) and then finish this.
Just copy-paste or submodule it into your code, Laser is more of a research repo right now.
Compatibility with Arraymancer will be much easier once the following PR is merged, and Arraymancer can use raw buffers coming from any libraries (including Numpy) without copy: https://github.com/mratsim/Arraymancer/pull/420
It's unfortunately blocked by a Nim bug in the GC.
Thanks, maybe I'll give it a try and include it manually in the repo!
improve performance and usability on complex apply/map
It will definitely help, but I'm already creating a single loop for each formula, no matter how many tensors are involved.
E.g.
let df = ... # some DF w/ cols A, B, C, D
df.mutate(f{"Foo" ~ `A` * `B` - `C` / `D`})
will already be rewritten to:
var
  col0_47816020 = toTensor(df["A"], float)
  col1_47816021 = toTensor(df["B"], float)
  col2_47816022 = toTensor(df["C"], float)
  col3_47816023 = toTensor(df["D"], float)
  res_47816024 = newTensor[float](df.len)
for idx in 0 ..< df.len:
  []=(res_47816024, idx, col0_47816020[idx] * col1_47816021[idx] -
    col2_47816022[idx] / col3_47816023[idx])
result = toColumn res_47816024
which is indeed a little slower than a manual map_inline, but still pretty fast. Compare the first plot from here:
https://github.com/Vindaar/ggplotnim/tree/arraymancerBackend/benchmarks/pandas_compare
Not sure where the variations map_inline sees are coming from though. Effects of OpenMP?
Small aside about the types
The data types are determined as floats from the usage of *, / etc. They can be overridden by giving type hints:

f{int -> float: ...}
  ^--- type of the involved tensors
         ^--- type of the resulting tensor
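For example, to keep integer columns as integers instead of the float default (a made-up formula, assuming columns a and b exist):

# `+` would normally make the formula default to float; the hint
# forces reading and writing int tensors instead
df.mutate(f{int -> int: "total" ~ `a` + `b`})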
AFAIK it would allow combining complex transformations and doing them in a single pass instead of allocating many intermediate data frames, so performance can be an order of magnitude faster on zip/map/filter chains.
While this is certainly exciting to think about, I think it'd be pretty hard (for me in the near future anyways) to achieve while:
But this is just me speculating from the not all that simple code of zero-functional. I guess having a custom operator like it does would allow us to replace the user given proc names though.
If you have a better idea of how to do efficient chaining that seems reasonable to implement, I'm all ears.
what I'm working on
Right now I'm rather worrying about having decent performance for group_by and inner_join though. I'm looking at https://h2oai.github.io/db-benchmark/ since yesterday. It's a rather brutal reality check, hehe.
Comparing my current code on the first of the 0.5 GB group_by examples to pandas and data.table was eye opening. In my current implementation of summarize for grouped data frames I actually return the sub data frames for each group and apply a simple reduce operation based on the user's formula. Well, what a surprise, that's slow. I haven't dug deep into data.table or pandas yet, but as far as I can tell they essentially special case group_by + other operation and handle these by just aggregating over all groups in a single pass.
So I've implemented the same and even for a single key with a single sum I'm 2 times slower than running the code with pandas on my machine. To be fair, performing operations on sub groups individually is a nice 100x slower than pandas.
Still, the biggest performance hit I have to take is in order to allow grouping by columns with multiple data types. I need some way to check which subgroup a row belongs to. Since I can't create a tuple at runtime in order to just use normal comparison operators, I decided to calculate a hash for each row and compare that. It works well, but gives me that 2x speed penalty.
For the time being though, I think I'm happy with that, unless I have a better idea or someone can point me to something that works in a typed language and doesn't involve a huge amount of boilerplate code.
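For illustration, the row hashing idea boiled down to a minimal sketch (assumes some df[key][idx] element accessor returning a hashable Value; not the actual implementation):

import hashes, tables

# bucket row indices by the combined hash of their group-by key values
proc groupRows(df: DataFrame, by: seq[string]): Table[Hash, seq[int]] =
  result = initTable[Hash, seq[int]]()
  for idx in 0 ..< df.len:
    var h: Hash = 0
    for key in by:
      h = h !& hash(df[key][idx]) # mix in each key column's value
    h = !$h # finalize the combined hash
    result.mgetOrPut(h, @[]).add idx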
So I'm currently working on an implementation that allows using user defined formulas for aggregation without having to call a closure for each row.
Ok, so I just merged the arraymancer backend PR, which includes the PR for version v0.2.0.
v0.2.0 mainly added ridgeline plots and scale_*_reverse. Note that there is currently no recipe for a ridgeline plot; that will be added in the next few days. Also they are not yet as nice as they should be (essentially the top ridge doesn't change its height depending on the max values in the ridge, if overflowing of ridges into one another is allowed).
scale_*_reverse just allows to reverse scales as the name suggests.
Aside from that a few smaller things were added (theme_void) and a few recipes that use geom_tile (annotated heatmap and plotting the periodic table).
I'm not entirely happy with the state of version v0.3.0 though, since the formula mechanism introduces several breaking changes. Arguably reading formulas is now clearer (see the beginning of the README and especially the recipes, since they all have to be compliant with the new mechanism!), but it still requires code to be changed.
I think the amount of breakage is probably not that large, since not that many people will have used formulas for things anyways yet. Also because using the DF was discouraged before, since it was slow.
Simple formulas, e.g. f{"hwy"}, remain unchanged anyways, same as f{5} to set some constant value for an aesthetic. Previously, formulas were only required for numbers not referring to columns, since the aes proc took string | FormulaNode. Now plain numbers are supported directly, so to set some constant value you can just write aes(width = 0.5) instead of aes(width = f{0.5}).
In any case, I wanted to get this PR off my chest, since it was way too large. I tried to avoid breaking changes as much as possible via macro magic, but this issue:
https://github.com/nim-lang/Nim/issues/13913
was the nail in the coffin. So I'm just releasing it now.
Feel free to open issues in case I broke your code. :)
I'm happy to say that facet_wrap is finally back with version v0.3.5.
Normal classification by (in this case 2) discrete variable(s):
Classification by discrete variable with free scales:
See the code for these two here: https://github.com/Vindaar/ggplotnim/blob/master/recipes.org#facet-wrap-for-simple-grid-of-subplots
Other notable changes of the last few versions include:
See the full changelog for all recent changes:
https://github.com/Vindaar/ggplotnim/blob/master/changelog.org
Sorry about that. When I started writing this I had no idea cairo would be such a pain on Windows.
There's an issue about it here: https://github.com/Vindaar/ggplotnim/issues/57
I haven't updated the README yet, mostly because I don't have a good solution either yet. The easiest for me on a practical level was to just install emacs and add it to my PATH (which is I guess equivalent to you using the Inkscape libraries).
I guess I can think about either adding working versions of the required libraries to the repository for Windows (at least win64), or a script which clones the cairo repository and builds it locally. I haven't built cairo locally yet, so I don't know how well that works.
Now regarding your actual question. If you want to ship a program, which uses ggplotnim internally, you have to do what people do on Windows as far as I know: bundle all required DLLs with the program.
The other alternative would be a static build of cairo. I'll see what I can do to improve the situation. Thanks for the input!
I wasn't aware of the GR framework. It certainly looks interesting. However, it does not look more lightweight than cairo. Just having Qt as a dependency is an immediate no-go for me, at least for a default backend (unless I'm missing something and you can easily get binaries without the Qt dependency and build it without Qt).
Also it obviously does a lot more than cairo. It's a full fledged visualization library.
For ggplotnim's purposes the only advantage it would have would be access to more backends, as far as I can see.
Adding a new backend to ginger is in principle as easy as providing these procs:
https://github.com/Vindaar/ginger/blob/master/src/ginger/backendDummy.nim
And see the actual cairo backend:
https://github.com/Vindaar/ginger/blob/master/src/ginger/backendCairo.nim
So feel free to add a new GR backend to ginger if you'd like!
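Just to sketch the shape of it (purely illustrative proc signatures; the real required set is defined in backendDummy.nim linked above):

type
  GrBackend = object # would wrap the GR framework's drawing state

proc drawLine(b: var GrBackend, start, stop: (float, float)) =
  discard # map to the corresponding GR line primitive

proc drawText(b: var GrBackend, text: string, at: (float, float)) =
  discard # map to GR's text rendering

proc drawPoint(b: var GrBackend, at: (float, float)) =
  discard # map to a GR marker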
To me the most important features I want from backends are:
I can totally see how GR could be a great foundation to build a powerful visualization library on, if used from the onset. It seems to take care of a lot of annoying details I had to get right myself.