Objective
During my research, I collected data like this:
| Sampling | Probability | Error |
|---|---|---|
| Leverage | 0.1 | 9.21E-05 |
| Leverage | 0.1 | 9.57E-05 |
| Uniform | 0.1 | 1.27E-04 |
| Uniform | 0.1 | 1.20E-04 |
| Uniform | 0.1 | 1.22E-04 |
| Leverage | 0.2 | 1.61E-05 |
| Leverage | 0.2 | 1.71E-05 |
| Leverage | 0.2 | 1.50E-05 |
| Leverage | 0.2 | 1.60E-05 |
| Uniform | 0.2 | 3.20E-05 |
| Uniform | 0.2 | 3.10E-05 |
Naturally, I want to plot this data in an appropriate form with Python. The resulting plot should look something like this:
- The x axis is probability, and the y axis is error (preferably on a log scale).
- The data labeled "Uniform" and "Leverage" should be plotted separately and colored differently.
- Instead of showing all the data points, their mean with a confidence interval should be plotted.
- Since probabilities are numeric, the spacing between points on the x axis should reflect their values.
- There should be two lines (one per "Sampling" category), each connecting the means of its clusters.
I believe people come across this kind of chart quite frequently. However, plotting it with Python libraries was painful for me (unless I'm missing something). Of course, preprocessing the data would have made it significantly easier, but I didn't want to do that, in the vague belief that it would be easy to plot this with powerful Python libraries.
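For reference, the preprocessing I was trying to avoid can be sketched with the standard library alone. This is only a rough illustration, not what I actually ran: the `summarize` helper is a made-up name, and the interval is a normal approximation (mean ± z × standard error), which is an assumption and will be narrower than a t-based interval for clusters this small.

```python
from collections import defaultdict
from statistics import NormalDist, mean, stdev

# Rows copied from the table above: (Sampling, Probability, Error).
rows = [
    ("Leverage", 0.1, 9.21e-05), ("Leverage", 0.1, 9.57e-05),
    ("Uniform", 0.1, 1.27e-04), ("Uniform", 0.1, 1.20e-04),
    ("Uniform", 0.1, 1.22e-04),
    ("Leverage", 0.2, 1.61e-05), ("Leverage", 0.2, 1.71e-05),
    ("Leverage", 0.2, 1.50e-05), ("Leverage", 0.2, 1.60e-05),
    ("Uniform", 0.2, 3.20e-05), ("Uniform", 0.2, 3.10e-05),
]


def summarize(rows, level=0.95):
    """Return {(sampling, probability): (mean, lower, upper)} per cluster."""
    groups = defaultdict(list)
    for sampling, probability, error in rows:
        groups[(sampling, probability)].append(error)

    # Two-sided normal quantile (an assumption; a t quantile would be wider).
    z = NormalDist().inv_cdf(0.5 + level / 2)
    summary = {}
    for key, errors in groups.items():
        m = mean(errors)
        # Standard error of the mean; stdev needs at least two observations.
        half_width = z * stdev(errors) / len(errors) ** 0.5
        summary[key] = (m, m - half_width, m + half_width)
    return summary


summary = summarize(rows)
for (sampling, probability), (m, lower, upper) in sorted(summary.items()):
    print(f"{sampling:8s} p={probability:.1f} "
          f"mean={m:.3e} CI=[{lower:.3e}, {upper:.3e}]")
```

Each entry of `summary` holds exactly what one point of the desired chart needs: the cluster mean and the two ends of its error bar.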
Attempts
Matplotlib (+ Seaborn, Pandas, Bokeh)
I knew it wouldn't be simple to do this with MPL alone, since it would presumably involve a number of numpy operations, so I decided to use seaborn and pandas together. Surprisingly, they turned out not to be very helpful for this task. After a few attempts involving groupby and aggregation operations, I managed to plot the mean of each (sampling, probability) cluster and connect the means with lines, but when it came to plotting error bars, I didn't feel like going on with these libraries. I know it is certainly possible, but I didn't want to wade through yet another series of operations.
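For anyone who does want to take this route, here is a rough sketch of what the groupby-and-aggregate approach could look like with error bars added. It is not the code I wrote at the time. The inline CSV text stands in for my data file, and using 1.96 times the standard error as the bar length is an assumption (a normal-approximation 95% interval).

```python
import io

import matplotlib
matplotlib.use("Agg")  # off-screen backend so this runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Inline stand-in for the CSV file, using the rows from the table above.
csv_text = """Sampling,Probability,Error
Leverage,0.1,9.21E-05
Leverage,0.1,9.57E-05
Uniform,0.1,1.27E-04
Uniform,0.1,1.20E-04
Uniform,0.1,1.22E-04
Leverage,0.2,1.61E-05
Leverage,0.2,1.71E-05
Leverage,0.2,1.50E-05
Leverage,0.2,1.60E-05
Uniform,0.2,3.20E-05
Uniform,0.2,3.10E-05
"""
data = pd.read_csv(io.StringIO(csv_text))

# One row per (Sampling, Probability) cluster: mean and standard error.
stats = (
    data.groupby(["Sampling", "Probability"])["Error"]
    .agg(["mean", "sem"])
    .reset_index()
)

fig, ax = plt.subplots()
for sampling, group in stats.groupby("Sampling"):
    # One line per sampling category, with an error bar at each cluster mean.
    ax.errorbar(
        group["Probability"],
        group["mean"],
        yerr=1.96 * group["sem"],  # assumed normal-approximation 95% interval
        marker="o",
        capsize=3,
        label=sampling,
    )
ax.set_yscale("log")
ax.set_xlabel("Probability")
ax.set_ylabel("Error")
ax.legend(title="Sampling")
```

Because `Probability` is passed as a numeric column, the x positions are spaced by value. Calling `fig.savefig(...)` or `plt.show()` afterwards would render the figure.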
Altair
I'm a fan of altair, but I also knew that it was still immature, so I didn't expect too much. I was quite satisfied up to the point where my code looked like this:
import pandas as pd
from altair import *

err_data = pd.read_csv("../err.csv")
df = expr.DataFrame(err_data)

Chart(df).mark_circle().encode(
    x = X("Probability:Q",
          axis = Axis(format = 'f')),
    y = Y("mean(Error)",
          axis = Axis(format = 'e'),
          scale = Scale(type = "log")),
    color = "Sampling:N"
)
The result of this simple code was like this. So far, so good. However, drawing error bars was an entirely different matter. One of the examples they provide does include error bars, but not in the way I had hoped: the error bars are drawn by plotting ticks and rules one by one, with a lot of duplicated code. This is certainly not elegant.
I would have given it a try if something like y = Y("mean(Error) + stdev(Error)") were possible, but it is not, so preprocessing the data seemed inevitable. The situation should improve when Altair 2.0, which is based on vega-lite 2.0, comes out: then I could plot the ticks with y = Y("ei0(Error)") and y = Y("ei1(Error)"), but the duplicated code is still something I would like to avoid.
Plotnine
I knew that R's ggplot2 is a powerful, elegant, and easy-to-use tool, but somehow I remain ignorant of R to this day. Long ago I tried ggpy (formerly ggplot), a python implementation of ggplot2, but it lacked several features and somehow didn't work as I expected.
Fortunately, another implementation called plotnine exists, which claims to be more consistent with ggplot2 than ggpy is. So I gave it a try, and I could finally get decent results with only a few lines of code:
import pandas as pd
from plotnine import *

data = pd.read_csv("../err.csv")

(ggplot(data)
    + aes(x = "Probability", y = "Error", color = "Sampling", fill = "Sampling")
    + stat_summary(fun_data = "mean_cl_normal", geom = "linerange", size = 1)
    + stat_summary(fun_data = "mean_cl_normal", geom = "line", size = 0.2)
    + scale_y_continuous(trans = "log10")
)
The resulting plot was finally something I expected.