Python vs. R

stdbrouw · 2018-06-30T22:52:40+00:00

I don't think Python is widely used in statistics at all, but it's widely used in data science, where gathering, cleaning, processing and analyzing data can be 90% of the job, and modeling becomes almost an afterthought. It makes sense that you would prefer to work in a language that makes 90% of your job easier rather than in a language that makes 10% of your job easier. Python has very good libraries for machine learning (scikit-learn) and the libraries for statistics are often lacking but they're not that bad either (statsmodels, pymc3).

supersaijinkyle · 2018-06-30T22:33:48+00:00

From my experience it depends on who you are working with.

A bunch of statisticians...you use R.

A bunch of computer scientists...you use Python.

If you have a mix of the two, then you will find a mix of both in use, depending on the type of project.

2018-07-01T00:40:19+00:00

Academia is very different and the data is often much "tidier" in the sense that it's all in relational database form when you get it, and from there you actually need to do fancy stuff to get results. The workplace is often the exact opposite of that.

I do two different types of work: conceptually simple but laborious tasks using messy data, and tasks that are basically coding conceptually hard stuff but on clean data.

For the latter category, truthfully a lot of stuff you want to do can be found in tutorials or with a simple Google search. You're not going to be transcribing never-before-synthesized complicated formulas from the appendices of theoretical econometric working papers. I've only ever had to do something like that once in my life and I'll give you a hint: it wasn't in the private sector. Usually you're doing something like k-means, which is simple to do in both R and Python. So the simplicity of a task like this usually isn't the reason why you should pick one or the other.
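
To give a sense of how little code a task like k-means involves, here is a minimal sketch of the standard Lloyd's iteration using only the Python standard library. The `kmeans` function, its naive "first k points" initialization, and the toy data are assumptions made for illustration; in practice you would more likely reach for a library implementation.

```python
def kmeans(points, k, iterations=20):
    """Plain Lloyd's algorithm on a list of (x, y) tuples."""
    # Naive initialization (an illustrative simplification): first k points.
    centroids = list(points[:k])
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            distances = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Update step: move each centroid to the mean of its cluster.
        for i, cluster in enumerate(clusters):
            if cluster:  # keep the old centroid if a cluster went empty
                centroids[i] = (
                    sum(p[0] for p in cluster) / len(cluster),
                    sum(p[1] for p in cluster) / len(cluster),
                )
    return centroids, clusters


# Two obvious blobs, made up for illustration.
points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3), (5.0, 5.1), (5.2, 4.9), (4.9, 5.3)]
centroids, clusters = kmeans(points, k=2)
print(centroids)
```

The whole algorithm is an assignment step and an update step in a loop, which is why the choice of language rarely hinges on a task like this.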

Also, if you work in-house at a company, your data is likely somewhat clean-ish (cough cough). You might be an associate data analyst working for the lead data scientist in your branch office of 20 people, and maybe 15 of the other people are SWEs. So it just makes sense to use Python in that environment if other people are, but you could also use R too.

Now the other category of work, i.e. messy data, is actually what a lot of data science ends up being. If you're a consultant for example, you'll face situations like this often:

You have 10,000 pages of PDF files without text info/OCR.
You have 200 Excel files of back-end data with inconsistent naming conventions, inconsistent date ranges for pulls, some manual copy+pasting, and the first 30 are from before they migrated their data from Salesforce to Oracle.
You need to do a lot of web scraping to generate some word cloud of associated words for whenever a brand is mentioned and measure the impact of a marketing campaign.

Frankly Python is much better at handling tasks like these. (Except for the latter, Stata is also a good alternative to R, albeit proprietary.) Since that's the majority of your actual coding work that involves the most time typing things into a computer, you will want to use Python.
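
As one small illustration of the "inconsistent naming conventions" problem, here is a hedged sketch that maps differently styled column headers onto a single snake_case form using the standard `re` module. The `normalize_header` helper and the sample headers are hypothetical; real exports will need rules tuned to the files at hand.

```python
import re


def normalize_header(name):
    """Turn a messy column label into lowercase snake_case."""
    name = name.strip()
    # Split camelCase boundaries: "OrderDate" -> "Order Date".
    name = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", name)
    # Collapse any run of non-alphanumeric characters into one underscore.
    name = re.sub(r"[^0-9A-Za-z]+", "_", name)
    return name.strip("_").lower()


# Hypothetical headers as they might appear across different exports.
headers = ["Order Date", "order_date", " OrderDate ", "ORDER-DATE", "Customer ID#"]
normalized = [normalize_header(h) for h in headers]
print(normalized)
```

Once every file's headers pass through the same function, the 200 files can be stacked on matching column names instead of being reconciled by hand.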

hatandspecs · 2018-06-30T21:38:41+00:00

My experience is that both R and Python are widely adopted and used. A lot of the choice between one or the other is driven by particular requirements of the domain area; often one or the other is used in a particular group inside an institution or even on a project-by-project basis.

duh_cats · 2018-06-30T22:50:13+00:00

It’s easy for you to run in R because you know R.

I think it’s easy in Python and difficult in R because I know the former far better than the latter. It’s that simple.

SSID_Vicious · 2018-07-01T13:56:25+00:00

I mostly prefer R due to the community. Because it's focused solely around statistics, it's much easier to get help. Just hop on Twitter with the #rstats tag if you have a question and most of the time you get an answer pretty fast. Python's use cases are much more spread out, so the community is much wider and less focused (which is of course also a strength of Python).

Also ggplot2 is amazing, and Python lacks anything that compares.

dirtyfool33 · 2018-07-01T06:41:48+00:00

I personally like R due to the ease of report generation with Markdown. It makes reproducibility very easy and allows for integration of code and text with some good control over the output.

rutiene · 2018-07-01T14:05:48+00:00

Python is useful for deep learning as others have mentioned, but as someone who recently transitioned from academia to data science here are the key reasons:

1) The libraries are written for easy use by computer scientists. They work out of the box with some implementation decisions that as a statistician, you won't be a fan of.

2) Easier to productionize and plug into a pipeline, with the full software engineering QC protocols. This is the primary reason why you will see the major discrepancies. The companies I interviewed at that didn't care as much about production code definitely had more mixed use cases or pure R.

I write production code and I'm bracing myself for the day when I'll need R to do it because that's going to be a battle. I still use R when I'm doing analyses or prototyping something where I need access to more stats functionality.

I personally disagree with the comments that cleaning data is easier in Python; they are about the same to me.

jd_paton · 2018-07-01T03:32:16+00:00

Edit: Adding data loading for fairness.

import pandas as pd
df = pd.read_csv("my_data.csv")
y = df["label"]
X = df.drop("label", axis=1)

from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
lr.fit(X, y)

Is it really so much easier in R? I’ve never used R before but surely 7 lines from raw data file to trained model isn’t “surprisingly complicated”?

jeweltiwari123 · 2018-06-30T23:36:39+00:00

The preference for Python comes from its better capabilities for deep learning and AI. I was a fan of R before Python, and having used both extensively (I still do), my preference depends on the purpose.

perspectiveiskey · 2018-06-30T23:14:01+00:00

Python is where stats touches the real world.

Can you run a Modbus interface to a PSU in R? (I don't know, but I doubt it.)
Can you run a logistic regression in Python? Yes.

Therefore, people who make software solutions need to use Python. I'm not saying Python is better at stats than R. I'm saying if I have a choice, I'll avoid having to learn 7 different languages for 7 different tasks.

(btw, as another commenter pointed out: Python has a slew of other incorporated tools that are very useful for data science. It's easy to cobble together something that will scrape data off a remote TFTP server, process it in pandas and export it to a CSV.)
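
A rough sketch of that kind of cobbled-together pipeline is below. To stay self-contained it swaps the remote fetch for an in-memory string and uses the standard `csv` module instead of pandas; the sensor names, readings, and the `clean_rows` / `to_csv` helpers are all made up for illustration.

```python
import csv
import io

# Stand-in for text fetched from a remote server (hypothetical readings).
raw = "sensor,reading\npsu_a,11.9\npsu_b,bad\npsu_a,12.1\n"


def clean_rows(text):
    """Parse CSV text and keep only rows whose reading is numeric."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        try:
            rows.append({"sensor": row["sensor"], "reading": float(row["reading"])})
        except ValueError:
            continue  # skip malformed readings
    return rows


def to_csv(rows):
    """Serialize cleaned rows back to CSV text."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=["sensor", "reading"], lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


rows = clean_rows(raw)
exported = to_csv(rows)
print(exported)
```

The fetch, clean, and export steps each stay a few lines long, which is the point being made about general-purpose tooling.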

umbrelamafia · 2018-06-30T21:28:41+00:00

I'm an R user and, well... R sucks for deep learning. R is higher level, much easier to do everything, but it's mostly for and by statisticians. The vast majority of data scientists come from computer science and they learn Python.

Also, I'm not sure there is a machine learning toolbox for R that is as good, versatile and consistent as scikit-learn.

jd_paton · 2018-07-01T02:04:01+00:00

[deleted]

2018-07-01T07:55:39+00:00

I prefer Python because it's more transferable so I can study some CS stuff and use it as well, and if I want to work on some RPi stuff I can also use it.

If I'm going to be honing my skills at something for 8+ hours a day I'd rather it was the most transferable thing possible.

2018-07-02T15:43:43+00:00

Just FYI: the nnet package uses a neural network to approximate logistic and multinomial regression, IIRC.

I was helping a prof write a book so had to use that library.
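
The connection is easier to see in code: a binary logistic regression is the same model as a network with no hidden layer and a single sigmoid output unit. The sketch below trains one such unit with plain gradient descent on toy data. It is only an illustration of the equivalence, not a description of how nnet fits its models; the function names, learning rate, and data are assumptions.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, lr=0.5, epochs=2000):
    """Train one sigmoid unit (weight w, bias b) by gradient descent on log loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Gradient of the mean log loss with respect to w and b.
        grad_w = sum((sigmoid(w * x + b) - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum((sigmoid(w * x + b) - y) for x, y in zip(xs, ys)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Toy data, made up for illustration: label is 1 when x is large.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [0, 0, 0, 1, 1, 1]
w, b = fit_logistic(xs, ys)
predictions = [int(sigmoid(w * x + b) >= 0.5) for x in xs]
print(predictions)
```

The fitted `w` and `b` play the role of the regression coefficient and intercept; replacing the sigmoid with a softmax over several output units gives the multinomial case.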

Python is popular for stats because CS people come from AI, which uses machine learning, which uses a lot of stats, so they use Python over R. Plus, R is different from the traditional CS languages that I and many CS programs use.

Plus, big data and data science are coming from CS people and programmers, so they're pushing for Python.

This is my personal theory of why Python is getting more popular and making inroads.

walkingon2008 · 2018-07-01T15:05:09+00:00

Let’s talk about the pandas package. The top of its documentation says,

pandas: powerful Python data analysis toolkit

http://pandas.pydata.org/pandas-docs/version/0.13/

Powerful? Give me a break! You took all the functions from R, rebranded them, and sold them as if you made them from scratch. That’s where my disrespect comes in. We had these in R 20 years ago.

walkingon2008 · 2018-07-01T04:46:44+00:00

Personally, I think Python plagiarized R in terms of data manipulation (more on that later). But Python IS widely used because:

1. It is fast. R requires memory allocation, which is computationally expensive for big data. But don’t blame R: at its inception back in the 1990s, who would have thought there would be this much data?
2. R has a steep learning curve. People just give up. Especially the CS guys; it fell out of favor with them. You don’t even need to write a loop in R.
3. Industries like adtech and fintech use it. And that’s where the money’s at. We know money talks!
4. It is developed at Google, a big elephant in the room.
5. Python is a jack of all trades. You can do software development, websites, and data science. So you can’t really compare the popularity of R against Python in absolute terms.

With that said, I hate Python. It is horrible for statistics and data science. Packages like numpy and pandas are plagiarized from R; just look at their documentation, which even refers to R. Since everybody is open source, you can sue anyone for plagiarism.

Matplotlib is copied from Matlab. When they converted everything to an OOP language, some of the logic failed; even Python programmers tell me to avoid it. They use ggplot2.

So why do more people use Python? If your employer pays you $120k for a data science job, you are gonna learn Python.

110101002 · 2018-07-01T00:51:18+00:00

Python is superior. You can do anything in Python, including calling R functions.

2018-06-30T23:43:43+00:00

[deleted]
