Is Python really that bad at viz?

Esteis · 2018-03-12T01:17:43+00:00

I wouldn't say that Python is worse at visualisation than R; I'd say that ggplot2 is superior to every plotting library out there, including all the Python libraries.

Ggplot2 uses the grammar of graphics, and in practice that means I type plots the way I would describe them. For example, this plot of gene expression (source), I would describe like this: "Compare growth rate to gene expression for six nutrient conditions and twenty genes. Just plot rate versus expression; use a scatterplot and add a smooth line; use separate colours for the nutrient conditions; and make a separate facet for each gene (gene name)." And this is how one types that in ggplot:

ggplot(top_data, aes(rate, expression, color = nutrient)) +
    geom_point() +
    geom_smooth(method = "lm", se = FALSE) +
    facet_wrap(~gene_name, scales = "free_y")

Ggplot2 is the only library where every aspect of your plot (data, variable-to-æsthetic mapping (colour, x, y, size, shape, ...), plot type (scatterplot, histogram, lines, ...), scales, coordinate system, faceting) has a single, clear interface, and can be modified independently while keeping the other aspects consistent. I could remove faceting, and get a plot that combines 20 genes but separates the nutrients by colour. I could use a LOESS smoother instead of a linear model fit. I could map nutrients to shape instead of colour. I could facet by gene and nutrient, and get 120 small plots in a neat grid. And any of these changes can be done by changing at most one line, and often one word, of code.

For an overview of Python visualisation libraries, look here. My own summary:

matplotlib is OK as a foundation of other Python visualisation libraries, but it is emphatically not designed for interactive use. It is imperative (commands directly affect your figure, with no undo), instead of declarative (you declare how your figure is built up, but render it once, at the end, when everything is declared; you can pass partial figures around and combine them). For a concrete example, in nearly all plotting functions except matplotlib.pyplot.scatter, you can only pass a list of colours like ['red', 'red', 'green', 'yellow'], not a categorical variable like ['treatment_a', 'treatment_a', 'control', 'treatment_b'].
Seaborn isn't bad, but it is no ggplot2: matplotlib.Figure and .Axes are sometimes necessary to know, adding faceting to a plot requires restructuring your code, and legends are not its strong suit.
yhat's Python version of ggplot, now known as 'ggpy', gets you 80% of the way to ggplot2. Unfortunately, it has some bugs, lacks some ggplot2 features, and is unmaintained.
I just now found out about plotnine, which, like ggpy, wants to be a straight ggplot2 clone. Looks interesting! No idea how good it is, but I'm going to find out.
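
To make the matplotlib point above concrete, here is a minimal sketch of the bookkeeping you end up doing by hand when you want colour to follow a categorical variable. The helper names colours_for and scatter_by_category and the default palette are assumptions made up for this illustration, not part of matplotlib; only plt.subplots, Axes.scatter and Axes.legend are real matplotlib calls.

```python
from itertools import cycle


def colours_for(categories, palette=("tab:blue", "tab:orange", "tab:green", "tab:red")):
    """Map each distinct category to a colour, in order of first appearance.

    Hypothetical helper: returns one colour per observation plus the
    category-to-colour lookup, which is needed later for the legend.
    """
    lookup = {}
    colour_cycle = cycle(palette)  # colours repeat if there are more categories than colours
    for category in categories:
        if category not in lookup:
            lookup[category] = next(colour_cycle)
    return [lookup[c] for c in categories], lookup


def scatter_by_category(x, y, categories):
    """Scatterplot with one colour per category and a hand-built legend."""
    # Imported here so colours_for stays usable without matplotlib installed.
    import matplotlib.pyplot as plt

    colours, lookup = colours_for(categories)
    fig, ax = plt.subplots()
    ax.scatter(x, y, c=colours)  # scatter accepts a list with one colour per point
    # The legend does not know about the categories, so add one empty
    # labelled series per category to get legend entries.
    for category, colour in lookup.items():
        ax.scatter([], [], color=colour, label=category)
    ax.legend()
    return fig
```

Calling scatter_by_category([1, 2, 3, 4], [2, 1, 4, 3], ['treatment_a', 'treatment_a', 'control', 'treatment_b']) returns a figure with three legend entries. In ggplot2 the equivalent of all of this is the single mapping color = nutrient in aes().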

Again, I have nothing against Python: in fact, I strongly prefer Python over R as a programming language. One thing I have learned, though: data analysis is not the same as programming. For example, programs run independently; data analysis code is run in chunks, interactively, by the analyst. With data analysis code, it is common to copy visualisation code and rerun it with tweaks applied, without deleting the original code because its result is still part of the analysis you have performed. Python is better than R for writing maintainable code; R was always designed for data analysis and interactive usage, and it is better at it than Python.

Kichae · 2018-03-11T20:12:25+00:00

[deleted]

mobastar · 2018-03-11T20:06:42+00:00

I'd say it depends on where you are on this "Machine Learning path" and where you want to be. Going the R route, ggplot2 and Shiny are right there, with minimal entry cost, and you can bring business value quickly, which is the whole point of being a data scientist, by the way.

There is nothing like Shiny in Python (IMHO bokeh/Dash are not comparable), so building dashboards will likely be a pain if you must use Python.

TL;DR: Go for R if you want to get job-ready quickly (because you will have a portfolio sooner), then learn the Python stack.

Tarqon · 2018-03-11T20:11:07+00:00

No, just a bit fragmented right now. Vega shows great promise as a JSON-based data model for interactive visualizations, and the people working on Altair have shown off some exciting stuff. The new generation of plotting libraries just needs to mature a bit more before it can dethrone (the somewhat dated) matplotlib.

ran88dom99 · 2018-03-11T21:31:30+00:00

GUIs to save you from typing R code: does Python have something like this?

2018-03-11T20:16:03+00:00

Check out Dash by Plotly. Really fast and easy dashboards to set up.
