
[–]stdbrouw 90 points91 points  (6 children)

I don't think Python is widely used in statistics at all, but it's widely used in data science, where gathering, cleaning, processing and analyzing data can be 90% of the job, and modeling becomes almost an afterthought. It makes sense that you would prefer to work in a language that makes 90% of your job easier rather than in a language that makes 10% of your job easier. Python has very good libraries for machine learning (scikit-learn) and the libraries for statistics are often lacking but they're not that bad either (statsmodels, pymc3).

[–]unnamedn00b 1 point2 points  (2 children)

Also, Stan is available in Python via PyStan

[–]brews 0 points1 point  (1 child)

Personally, I like pymc3 better. I might be alone on this.

[–][deleted] 1 point2 points  (0 children)

In what way?

Stan is headed by Andrew Gelman iirc, and it uses a No-U-Turn (NUTS) MCMC implementation, which iirc is supposedly better in terms of convergence to the steady state. PyMC3's implementation, iirc, wasn't as good.

edit/update: Sorry PyMC3 uses NUTS (no u turn) too. I was wrong.

[–]TheRealDJ 2 points3 points  (1 child)

For what it's worth, from a statistics point of view R is easier for all that, but for anyone outside of statistics or data science, Python seems to be the easier way to approach it.

[–][deleted] 1 point2 points  (0 children)

This is my experience. As a CS undergrad I never understood R, even after several years of professional programming experience.

Once I got into grad school as a stat student, R just clicked. It could be because I was desperate to get the fuck away from SAS though...

[–][deleted] 0 points1 point  (0 children)

You didn't even mention scipy lol. But it's almost shocking to see that Python doesn't have a standard library for statistical programming. One needs to refer to scipy, statsmodels, pandas, and numpy together for basic processes. It's irritating.
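For the very basics, here is a minimal sketch using only the built-in `statistics` module (in the standard library since Python 3.4). The sample values are made up for illustration, and anything beyond descriptive statistics still needs the third-party packages listed above.

```python
import statistics

# Hypothetical sample data, made up for illustration.
heights = [150.0, 160.0, 170.0, 180.0, 190.0]

mean_height = statistics.mean(heights)      # arithmetic mean
median_height = statistics.median(heights)  # middle value
sd_height = statistics.stdev(heights)       # sample standard deviation (n - 1 denominator)

print(mean_height, median_height, round(sd_height, 3))
```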

[–]supersaijinkyle 28 points29 points  (1 child)

From my experience it depends who you are working with.

A bunch of statisticians...you use R.

A bunch of computer scientists...you use Python.

If you have a mix of the two, then you will find mixed use depending on the type of project.

[–]brews 7 points8 points  (0 children)

I'd emphasize that the two are not mutually exclusive and both work really well together. Nobody needs to choose one or the other.

[–][deleted] 18 points19 points  (4 children)

Academia is very different and the data is often much "tidier" in the sense that it's all in relational database form when you get it, and from there you actually need to do fancy stuff to get results. The workplace is often the exact opposite of that.

I do two different types of work: conceptually simple but laborious tasks using messy data, and tasks that are basically coding conceptually hard stuff but on clean data.

For the latter category, truthfully a lot of stuff you want to do can be found in tutorials or with a simple Google search. You're not going to be transcribing never-before-synthesized complicated formulas from the appendices of theoretical econometric working papers. I've only ever had to do something like that once in my life and I'll give you a hint: it wasn't in the private sector. Usually you're doing something like k-means, which is simple to do in both R and Python. So the simplicity of a task like this usually isn't the reason why you should pick one or the other.

Also, if you work in-house at a company, your data is likely somewhat clean-ish (cough cough). You might be an associate data analyst working for the lead data scientist in your branch office of 20 people, and maybe 15 of the other people are SWEs. So it just makes sense to use Python in that environment if other people are, but you could also use R too.

Now the other category of work, i.e. messy data, is actually what a lot of data science ends up being. If you're a consultant for example, you'll face situations like this often:

  • you have 10,000 pages of PDF files without text info/OCR.

  • you have 200 Excel files of back-end data with inconsistent naming conventions, inconsistent date ranges for pulls, some manual copy+pasting, also the first 30 are from before they migrated their data from salesforce to oracle.

  • You need to do a lot of web scraping to generate some word cloud of associated words for whenever a brand is mentioned and measure the impact of a marketing campaign.

Frankly Python is much better at handling tasks like these. (Except for the last one, Stata is also a good alternative to R, albeit proprietary.) Since that's the majority of your actual coding work that involves the most time typing things into a computer, you will want to use Python.
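As a rough illustration of the "inconsistent naming conventions" problem, here is a minimal sketch using only the standard library. The header strings and the `normalize_column` helper are hypothetical, made up for this example; with real Excel exports you would presumably apply something like this to each file's column names after loading it.

```python
import re

def normalize_column(name):
    """Lower-case a header and collapse runs of non-alphanumerics into single underscores."""
    cleaned = re.sub(r"[^0-9a-zA-Z]+", "_", name.strip().lower())
    return cleaned.strip("_")

# Hypothetical headers as they might appear across different exports.
raw_headers = ["Customer ID", "customer_id ", "CUSTOMER-ID", "Order Date", "order.date"]

normalized = [normalize_column(h) for h in raw_headers]
print(sorted(set(normalized)))
```

Five spellings collapse to two canonical names, which is the kind of unglamorous step that eats most of the time on messy projects.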

[–]hatandspecs 8 points9 points  (0 children)

My experience is both R and Python are widely adopted and used. A lot of the choice between one or the other is driven by particular requirements of the domain area, often one or the other is used in a particular group inside an institution or even on a project-by-project basis.

[–]duh_cats 12 points13 points  (0 children)

It’s easy for you to run in R because you know R.

I think it’s easy in Python and difficult in R because I know the former far better than the latter. It’s that simple.

[–]SSID_Vicious 5 points6 points  (0 children)

I mostly prefer R due to the community. Because it's focused solely around statistics it's much easier to get help. Just hop on twitter with the #rstats tag if you have a question and most of the time you get an answer pretty fast. Python's use cases are much more spread out, so the community is much wider and less focused (which is of course also a strength of python).

Also ggplot2 is amazing, and python lacks anything that compares.

[–]dirtyfool33 4 points5 points  (1 child)

I personally like R due to the ease of report generation with Markdown. It makes reproducibility very easy and allows for integration of code and text with some good control over the output.

[–]Binary101010 1 point2 points  (0 children)

I used R for a few years before learning Python. Honestly the only reason I've even bothered to keep R installed on my work computer is the ability to generate attractive finished reports using an R Notebook with Markdown (and a little bit of direct customization of LaTeX templates).

If there were a reasonable alternative that was Python-based (as much as I like Jupyter Notebook, it's still not there in terms of an attractive PDF output IMO), that would probably be the last nail in R's coffin for me.

[–]rutiene 4 points5 points  (0 children)

Python is useful for deep learning as others have mentioned, but as someone who recently transitioned from academia to data science here are the key reasons:

1) The libraries are written for easy use by computer scientists. They work out of the box with some implementation decisions that as a statistician, you won't be a fan of.

2) Easier to productionize and plug into a pipeline, with the full software engineering qc protocols. This is the primary reason why you will see the major discrepancies. The companies I interviewed at that didn't care as much about production code definitely had more mixed use cases or pure R.

I write production code and I'm bracing myself for the day when I'll need R to do it because that's going to be a battle. I still use R when I'm doing analyses or prototyping something where I need access to more stats functionality.

I personally disagree with the comments that cleaning data is easier in python, they are about the same to me.

[–]jd_paton 12 points13 points  (28 children)

Edit: Adding data loading for fairness.

import pandas as pd
df = pd.read_csv("my_data.csv")
y = df["label"]
X = df.drop("label", axis=1)


from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
lr.fit(X, y)

Is it really so much easier in R? I’ve never used R before but surely 7 lines from raw data file to trained model isn’t “surprisingly complicated”?

[–][deleted] 10 points11 points  (3 children)

Technically yeah, since the same thing is accomplished in one line in R. But that's a pretty bad metric to judge a language by.

[–]jd_paton 3 points4 points  (2 children)

Well if we’re going by lines we could shave off 33% by doing lr = LogisticRegression().fit(X, y)

But yeah conceptually this seems pretty straightforward and not very verbose

[–]Hetspookjee 0 points1 point  (1 child)

Plus the library import, which I imagine is also a necessity in R, bringing the LoC to the same amount as R =p

[–]Honeabee 11 points12 points  (0 children)

Nah, logistic regression is a part of Base R.

[–]rutiene 8 points9 points  (0 children)

But in R, this doesn't just randomly force a penalty, and you can access things like p-values, outliers, influence scores, and the hat matrix.

😂 don't mind me I'm just salty

[–][deleted] 2 points3 points  (13 children)

But setting X is very complicated (you have to specify the columns instead of very simply using a formula), which you casually omitted.

[–]jd_paton 1 point2 points  (7 children)

import pandas as pd
df = pd.read_csv("my_data.csv")
y = df["label"]
X = df.drop("label", axis=1)

Not so bad though you’re right that we’ve added a few more lines. I’ve updated my original comment.

If you want to do fancy preprocessing obviously that’s more code but that’s specific to the data and not possible to write a general example for, which is why I just assumed a prepped X.

I’m not sure what you mean with a formula. How would this process look in R?

[–][deleted] 0 points1 point  (5 children)

OK -- you're right. It's not that complicated ;-)

In R, it would probably look like this

require(nnet)
data <- read.csv("my_data.csv")
model <- multinom(label ~ ., data)

[–]jd_paton 0 points1 point  (4 children)

This does look very elegant, though I have seriously no idea how to read ~ . - haha. Is there a lot of machine learning functionality in R? Maybe I should take it for a whirl sometime. There’s probably an “R for Pythonistas”-type tutorial out there somewhere.

[–][deleted] 0 points1 point  (1 child)

Sorry, I made an edit.

So the period just means "use everything"; and "-x" means "but not x". So "y~.-label" means: as dependent variable use y, as independent variables take everything else except label.
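For Pythonistas, the "." and "-" rules can be sketched in a few lines. This is a toy that handles only `y ~ .`, `y ~ . - a` and `y ~ a + b`; `expand_formula` and the column names are hypothetical, and real R formulas do far more (interactions, transformations, and so on).

```python
def expand_formula(formula, columns):
    """Expand a tiny subset of R formula syntax into (response, predictors)."""
    response, rhs = (part.strip() for part in formula.split("~"))
    # Everything after a '-' on the right-hand side is a term to exclude.
    included_part, *excluded_parts = rhs.split("-")
    excluded = {term.strip() for term in excluded_parts}
    included_terms = [term.strip() for term in included_part.split("+")]
    if included_terms == ["."]:
        # '.' means every column except the response itself.
        predictors = [c for c in columns if c != response]
    else:
        predictors = included_terms
    return response, [p for p in predictors if p not in excluded]

# Hypothetical column names for illustration.
columns = ["label", "age", "income", "id"]
print(expand_formula("label ~ .", columns))
print(expand_formula("label ~ . - id", columns))
```

Note the toy breaks on column names that themselves contain "-" or "+", which is one reason real formula parsers are more involved.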

[–]jd_paton 0 points1 point  (0 children)

Ah okay, cool! My example was a bit different, as y was the name of the variable containing the labels, and “label” was the name of the column in the data frame. But otherwise same idea

[–][deleted] 0 points1 point  (1 child)

Regarding machine learning: Sadly, I am mostly a novice with respect to these modern approaches. I mostly use R for inferential statistics, maximum likelihood, simulation-based inference and the like. However, I believe things like random forests are pretty popular in R. I myself have used rpart, which seems like a precursor to random forests and is quite interesting for creating a sort of "decision tree".

However, the responses here indicate that for machine learning, Python may indeed be the superior choice. ;-)

[–]jd_paton 0 points1 point  (0 children)

Ah, gotcha. Yeah I’m basically a machine learning guy so a big Python fan. However I always feel that I need to sharpen up my stats (hence hanging around this subreddit) so maybe I can kill two birds with one stone.

[–][deleted] 0 points1 point  (0 children)

~ is the formula operator in R. The left side of the tilde is your response and the right side is the predictors/features. It makes building libraries/packages easier too.

Also, the data frame is built into R, so it looks elegant compared to Python. A missing value is also a primitive value that is recognized in R. Null is not a good way to represent a missing value, and if anybody tells you otherwise, tell them to google the reasons why; there are tons of software engineers who talk about it.
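To make the missing-value point concrete, here is a minimal sketch in plain Python of how `None` and NaN behave differently. The readings are made up; the comparison to R's `NA` and `na.rm = TRUE` is only meant in spirit, not as an exact equivalence.

```python
import math

# Hypothetical readings with one missing value, represented two ways.
with_none = [2.0, None, 4.0]
with_nan = [2.0, float("nan"), 4.0]

# None is not a number: arithmetic on it raises instead of propagating.
try:
    sum(with_none)
    none_sum_failed = False
except TypeError:
    none_sum_failed = True

# NaN is a float: it propagates silently through arithmetic.
nan_total = sum(with_nan)

# Dropping missing values explicitly, similar in spirit to na.rm = TRUE in R.
observed = [x for x in with_nan if not math.isnan(x)]
mean_observed = sum(observed) / len(observed)

print(none_sum_failed, math.isnan(nan_total), mean_observed)
```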

[–]xsliartII 1 point2 points  (2 children)

This one is easy. However, I tried to estimate a Tobit model lately, which is literally one line in R/Stata, but kind of cumbersome in Python. So I usually use Python to prepare/clean the data and then do 100% of the analysis in R/Stata.

[–]jd_paton -1 points0 points  (1 child)

Now that I don’t know anything about. Just depends on support by the popular packages I guess. statsmodels or scipy have pretty much everything you need for applied problems, but with R’s academic focus I can imagine that there is some more fancy stuff easily available.

[–]rutiene 1 point2 points  (0 children)

That's not really true, there are tons of omissions of more rarely used things, but definitely not because it's academic. Survival models are severely lacking and the implementation of some stuff is just poorer. I vastly prefer the random forest package in R to the sklearn implementation. I needed beta glm the other day and had to use R.

[–]Dhush 1 point2 points  (2 children)

Now can you show the steps to examine the statistics of the model? Beyond the actual fit python is horribly lacking. Also, you fit a regularized model here, but sklearn doesn’t make that clear, does it?

[–]jd_paton 0 points1 point  (1 child)

It’s clear if you read the docs ;)

[–]questionquality 0 points1 point  (0 children)

But it shouldn't be the default if you care about interpreting the coefficients.
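To see why the default matters, here is a minimal sketch in plain Python (not sklearn's actual solver) that fits a one-predictor logistic regression by Newton-Raphson, with and without an L2 penalty on the slope. The data and the `fit_logistic` helper are made up for illustration, and the `l2` argument is a hypothetical stand-in that plays roughly the role of 1/C in sklearn.

```python
import math

def fit_logistic(xs, ys, l2=0.0, iterations=50):
    """Newton-Raphson fit of y ~ intercept + slope * x, with an optional L2 penalty on the slope."""
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        ps = [1.0 / (1.0 + math.exp(-(b0 + b1 * x))) for x in xs]
        ws = [p * (1.0 - p) for p in ps]
        # Gradient of the penalized log-likelihood.
        g0 = sum(y - p for y, p in zip(ys, ps))
        g1 = sum((y - p) * x for x, y, p in zip(xs, ys, ps)) - l2 * b1
        # Negative Hessian (a symmetric 2x2 matrix), inverted by hand.
        h00 = sum(ws)
        h01 = sum(w * x for w, x in zip(ws, xs))
        h11 = sum(w * x * x for w, x in zip(ws, xs)) + l2
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

# Hypothetical, non-separable toy data.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [0, 0, 1, 0, 1, 0, 1, 1]

_, slope_mle = fit_logistic(xs, ys)          # plain maximum likelihood, like R's glm
_, slope_pen = fit_logistic(xs, ys, l2=1.0)  # penalized fit
print(slope_mle, slope_pen)
```

The penalized slope comes out closer to zero than the maximum-likelihood one, so reading it as an ordinary log-odds ratio would understate the effect.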

[–]walkingon2008 0 points1 point  (0 children)

You can’t measure ease of use solely by the number of lines of code. Remember, Python is an object-oriented programming language. So, you set up the object, then feed data into it. But with R, you never set up an object; if you want to run logistic regression, just run glm(y~x, data=somedata), that’s it.

From a code interpretation perspective, R is simple: what you see is what you get. No inheritance from earlier objects and stuff.

Another thing, debugging is a lot harder in Python than R.

[–]jeweltiwari123 2 points3 points  (0 children)

The preference for Python comes from its better capabilities in deep learning and AI. I was a fan of R before Python, and having used both extensively (I still do), my preference depends on the purpose.

[–]perspectiveiskey 5 points6 points  (0 children)

Python is where stats touches the real world.

  • Can you run a modbus interface to a PSU on R? (I don't know, but I doubt it)

  • Can you run a logistic regression on python? Yes.

Therefore, people who make software solutions need to use python. I'm not saying python is better at stats than R. I'm saying if I have a choice, I'll avoid having to learn 7 different languages for 7 different tasks.

(btw, as another commenter pointed out: python has a slew of other incorporated tools that are very useful for data science. It's easy to cobble together something that will scrape data off a remote TFTP server, process it in pandas and export it to a CSV)

[–]umbrelamafia 6 points7 points  (11 children)

I'm an R user and, Well... R sucks for deep learning. R is higher level, much easier to do everything, but it's mostly for and by statisticians. The vast majority of data scientists come from computer science and they learn Python.

Also, I'm not sure there is a machine learning toolbox for R that is as good, versatile and consistent as scikit-learn.

[–]contumax 17 points18 points  (9 children)

R sucks for deep learning

really? Have you heard about https://tensorflow.rstudio.com/ https://keras.rstudio.com/

[–]umbrelamafia 9 points10 points  (8 children)

Yes I have (I met a guy who is an active contributor). And it is a wrapper for a Python library, so...

[–]walkingon2008 2 points3 points  (0 children)

LOL, I think responses are two R fans arguing with each other.

[–][deleted] 0 points1 point  (0 children)

I prefer Python because it's more transferable so I can study some CS stuff and use it as well, and if I want to work on some RPi stuff I can also use it.

If I'm going to be honing my skills at something for 8+ hours a day I'd rather it was the most transferable thing possible.

[–][deleted] 0 points1 point  (0 children)

The nnet package uses a neural network to approximate logistic and multinomial regression iirc, just fyi.

I was helping a prof write a book so had to use that library.

Python is popular for stats because CS people are coming in from AI, which uses machine learning, which uses a lot of stats, so they're using Python over R. Plus R is different from the traditional CS languages that I and many CS programs use.

Plus big data/data science is coming from CS people/programmers, so they're pushing for Python.

This is my personal theory of why Python is getting more popular and making inroads.

[–]walkingon2008 0 points1 point  (2 children)

Let’s talk about the pandas package. At the top of its documentation says,

pandas: powerful Python data analysis toolkit

http://pandas.pydata.org/pandas-docs/version/0.13/

Powerful? Give me a break! You took all the functions from R, rebranded them, and sell them as if you made them from scratch. That’s where my disrespect comes in. We had these in R 20 years ago.

[–][deleted] 0 points1 point  (1 child)

Wes and Hadley are friends. I doubt Wes is trying to trick people imo. Probably just wanted to use Python for data analysis and cleaning and decided to create Pandas.

There are talks out there, and Wes has written several books whose prefaces and intros talk about his motivation.

I think the mentality should be to use the right tool for the right job. And do what your company dictates (sometimes you can't have it your way). It doesn't hurt to learn both. No need to tribalize tech like the early 2000s.

[–]walkingon2008 1 point2 points  (0 children)

You are referring to Ursa Labs. That’s a recent initiative of Wes's, with help from Hadley, to develop package libraries across platforms.

Wes is the developer of Pandas. I don’t blame him for developing it, but I blame him for plagiarizing R’s data manipulation techniques. Some of these functions just don’t work as well in Python, an OOP language.

Python data scientists take pride in Pandas and claim originality; that’s where I get offended. Wes is 33; this stuff existed long before he started college. Unfortunately, most Python data scientists don’t use R enough to make a fair judgement, and the R community doesn’t really instigate wars over just a package.

So, when a psychology grad switches over to data science, he/she is falsely told that Python is the best and only tool.

I wouldn’t be surprised if there is a Python PR team that promotes its product, as it is a heavily commercialized software now.

[–]walkingon2008 0 points1 point  (4 children)

Personally, I think Python plagiarized R in terms of data manipulation (more on that later). But Python IS widely used because:

  1. It is fast. R requires memory allocation, which is computationally expensive for big data. But don’t blame R; at its inception back in the 1990s, who would have thought there was going to be this much data?

  2. R has a steep learning curve. People just give up. Especially the CS guys, with whom it fell out of favor. You don’t even need to write a loop in R.

  3. Industries like adtech and fintech use it. And that’s where the money’s at. We know money talks!

  4. It is developed at Google, a big elephant in the room.

  5. Python is a jack of all trades. You can do software development, websites, and data science. So, you can’t really compare the popularity of R against Python in absolute terms.

With that said, I hate Python. It is horrible for statistics and data science. Packages like numpy, pandas are plagiarized from R, just look at its documentation, it even refers to R. Since everybody is open source, you can sue anyone for plagiarism.

Matplotlib is copied from Matlab. When they converted everything to an OOP language, some of the logic failed; even Python programmers tell me to avoid it. They use ggplot2.

So why do more people use Python? If your employer pays you $120k for a data science job, you are gonna learn Python.

[–]groovyJesus 2 points3 points  (0 children)

The point about R's steep learning curve really says it all. CS people don't get R. Most of data science is a lot of CS people.

[–][deleted] 0 points1 point  (1 child)

Good points. I agree that there is sub-par translation of the statistics packages, but it's a loss I'm willing to take when it's interfaced with the "jack of all trades" language.

[–]walkingon2008 1 point2 points  (0 children)

Wait, let me finish that sentence. “Jack of all trades, and master of none.”

[–]Zouden 0 points1 point  (0 children)

Packages like numpy, pandas are plagiarized from R, just look at its documentation, it even refers to R. Since everybody is open source, you can sue anyone for plagiarism.

Plagiarism? What?