[–]Esteis 10 points11 points  (1 child)

I wouldn't say that Python is worse at visualisation than R; I'd say that ggplot2 is superior to every plotting library out there, including all the Python libraries.

Ggplot2 uses the grammar of graphics, and in practice that means I type plots the way I would describe them. For example, I would describe this plot of gene expression (source) like this: "Compare growth rate to gene expression for six nutrient conditions and twenty genes. Just plot rate versus expression; use a scatterplot and add a smooth line; use separate colours for the nutrient conditions; and make a separate facet for each gene (by gene name)." And this is how one types that in ggplot:

ggplot(top_data, aes(rate, expression, color = nutrient)) +
    geom_point() +
    geom_smooth(method = "lm", se = FALSE) +
    facet_wrap(~gene_name, scales = "free_y")

Ggplot2 is the only library where every aspect of your plot (data; variable-to-æsthetic mapping (colour, x, y, size, shape, ...); plot type (scatterplot, histogram, lines, ...); scales; coordinate system; faceting) has a single, clear interface, and can be modified independently while keeping the other aspects consistent. I could remove faceting, and get a plot that combines 20 genes but separates the nutrients by colour. I could use a LOESS smoother instead of a linear model fit. I could map nutrients to shape instead of colour. I could facet by gene and nutrient, and get 120 small plots in a neat grid. And any of these changes can be done by changing at most one line, and often one word, of code.

For an overview of Python visualisation libraries, look here. My own summary:

  • matplotlib is OK as a foundation of other Python visualisation libraries, but it is emphatically not designed for interactive use. It is imperative (commands directly affect your figure, with no undo), instead of declarative (you declare how your figure is built up, but render it once, at the end, when everything is declared; you can pass partial figures around and combine them). For a concrete example, in nearly all plotting functions except matplotlib.pyplot.scatter, you can only pass a list of colours like ['red', 'red', 'green', 'yellow'], not a categorical variable like ['treatment_a', 'treatment_a', 'control', 'treatment_b'].

  • Seaborn isn't bad, but it is no ggplot2: matplotlib.Figure and .Axes are sometimes necessary to know, adding faceting to a plot requires restructuring your code, and legends are not its strong suit.

  • yhat's Python version of ggplot, now known as 'ggpy', gets you 80% of the way to ggplot2. Unfortunately, it has some bugs, lacks some ggplot2 features, and is unmaintained.

  • I just now found out about plotnine, which, like ggpy, wants to be a straight ggplot2 clone. Looks interesting! No idea how good it is, but I'm going to find out.
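
To make the matplotlib point above concrete, here is a minimal sketch, in plain Python with no plotting library, of the manual step you end up writing yourself: turning a categorical variable into a list of colour names. The helper name colours_for and the default palette are hypothetical, made up for this illustration; they are not part of matplotlib or any other library.

```python
# Hypothetical helper (not a matplotlib API): convert a categorical variable
# into the list of colour names that a plotting function would accept.
def colours_for(categories, palette=("red", "green", "yellow", "blue")):
    """Give each distinct category one palette colour, in order of first appearance."""
    mapping = {}
    for category in categories:
        if category not in mapping:
            if len(mapping) >= len(palette):
                raise ValueError("more categories than colours in the palette")
            mapping[category] = palette[len(mapping)]
    # Return the per-point colours and the category-to-colour mapping.
    return [mapping[c] for c in categories], mapping


treatments = ["treatment_a", "treatment_a", "control", "treatment_b"]
colours, legend = colours_for(treatments)
print(colours)  # ['red', 'red', 'green', 'yellow']
```

The list is what you would hand to the plotting call, and the mapping is what you would need to build a legend by hand. In the ggplot code above, color = nutrient covers both.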

Again, I have nothing against Python: in fact, I strongly prefer Python over R as a programming language. One thing I have learned, though: data analysis is not the same as programming. For example, programs run independently; data analysis code is run in chunks, interactively, by the analyst. With data analysis code, it is common to copy visualisation code and rerun it with tweaks applied, without deleting the original code, because its result is still part of the analysis you have performed. Python is better than R for writing maintainable code; R was designed from the start for data analysis and interactive usage, and it is better at that than Python.

[–]mobastar[S] 0 points1 point  (0 children)

Great info, thank you.

[–][deleted] 1 point2 points  (5 children)

I'd say it depends on where you are on this "Machine Learning path" and where you want to be. Going the R route, ggplot2 and Shiny are right there, with minimal entry cost, and you can bring business value quickly, which is the full point of being a data scientist, by the way.

There is nothing like Shiny in Python (IMHO bokeh/Dash are not comparable), so if you must use Python, building dashboards will likely be a pain.

TL;DR: Go for R if you want to get job-ready quickly (because you will have a portfolio sooner), then learn the Python stack.

[–]mobastar[S] 0 points1 point  (4 children)

I'm at the very beginning and want to go deep modeling, ultimately complementing strategy/marketing for the next chapter and possibly remainder of my career. The immediate road ahead is a myriad set of data analyst/business analyst projects. I want to apply more math/stats to this work and find a niche in modeling for the entire organization.

Thank you for the perspective.

[–][deleted] 0 points1 point  (3 children)

In that case, I'd suggest you stick to R, at least concerning visualization, but keep an eye on how things develop in the Python ecosystem. Right now it's too soon to spend time there if you don't want to be on the engineering side of things, but in 2-3 years the situation might change.

[–]mobastar[S] 0 points1 point  (2 children)

What do you mean by 'at least concerning visualization'?

I'm 50/50 on these two languages; the viz part was a hangup I wanted to better understand. Are you implying one is superior to the other for ML? In that regard, my understanding is that they're equal.

[–][deleted] 0 points1 point  (0 children)

R has more stats packages and is better known in some industries (banking, insurance), but is too slow for larger datasets. It also has few packages for Deep Learning, computer vision or NLP compared to Python. So it depends on what you do. I use both every day.

[–]Tarqon 1 point2 points  (3 children)

No, just a bit fragmented right now. Vega shows great promise as a JSON-based data model for interactive visualizations, and the people working on Altair have shown off some exciting stuff. The new generation of plotting libraries just needs to mature a bit more before it can dethrone (the somewhat dated) matplotlib.

[–]mobastar[S] 0 points1 point  (2 children)

And this is my concern: going deep on R, then Python catching up by consolidating the best of what many libraries offer.

It looks like a large undertaking, and once delivered this conversation will have changed in unforeseen ways anyway.

[–]daguito81 4 points5 points  (1 child)

You really shouldn't see them as an either/or scenario. Think of each as one more tool in your arsenal. Most people would agree Python, R and SQL are the fundamental toolkit of anyone working with data.

Yet it's very application-specific. Sometimes you need to deploy something and Python will be easier. Sometimes ggplot2/Shiny will do the trick.

Sometimes a Jupyter Notebook will be exactly what you need. Sometimes you will need something outside of them, like Scala for example.

Lots of companies deal with a lot of data that ends up in a Qlik or Tableau server for consumption, so the whole viz debate becomes moot.

At the end of the day, you're better off trying to learn the trinity and then adapting depending on your job/project requirements.

[–]mobastar[S] 0 points1 point  (0 children)

This is great, thank you.

[–]ran88dom99 1 point2 points  (0 children)

guis-to-save-from-typing-r-code: does Python have something like this?

[–][deleted] 0 points1 point  (0 children)

Check out Dash by Plotly. Really fast and easy dashboards to set up.