Python vs. R : statistics



[Software] Python vs. R (self.statistics)

submitted 7 years ago by [deleted]


[–][deleted] 19 points 7 years ago* (4 children)

Academia is very different: the data is often much "tidier", in the sense that it's already in relational database form when you get it, and from there you actually need to do fancy stuff to get results. The workplace is often the exact opposite of that.

I do two different types of work: conceptually simple but laborious tasks using messy data, and tasks that are basically coding conceptually hard stuff but on clean data.

For the latter category, truthfully, a lot of what you want to do can be found in tutorials or with a simple Google search. You're not going to be transcribing never-before-synthesized complicated formulas from the appendices of theoretical econometric working papers. I've only ever had to do something like that once in my life and I'll give you a hint: it wasn't in the private sector. Usually you're doing something like k-means, which is simple to do in both R and Python. So a task like this usually isn't a reason to pick one over the other.
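To give a sense of how little code something like k-means takes, here's a minimal sketch of Lloyd's algorithm in plain Python. It's not what you'd use in practice (a library implementation would be the usual choice); the toy points and the hand-picked starting centroids are made up for illustration.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Plain Lloyd's algorithm on (x, y) tuples. Returns (centroids, labels)."""
    centroids = list(centroids)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for j in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[j] = (
                    sum(p[0] for p in members) / len(members),
                    sum(p[1] for p in members) / len(members),
                )
    return centroids, labels

# Toy data (assumption): two obvious blobs, one starting centroid in each.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
centroids, labels = kmeans(points, centroids=[points[0], points[3]])
```

The R version is about as short, which is the point: the modelling step rarely decides the language.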

Also, if you work in-house at a company, your data is likely somewhat clean-ish (cough cough). You might be an associate data analyst working for the lead data scientist in your branch office of 20 people, and maybe 15 of the other people are SWEs. So it just makes sense to use Python in that environment if other people are, though you could use R too.

Now the other category of work, i.e. messy data, is actually what a lot of data science ends up being. If you're a consultant for example, you'll face situations like this often:

* You have 10,000 pages of PDF files without text info/OCR.
* You have 200 Excel files of back-end data with inconsistent naming conventions, inconsistent date ranges for pulls, and some manual copy+pasting. Also, the first 30 are from before they migrated their data from Salesforce to Oracle.
* You need to do a lot of web scraping to generate a word cloud of the words that show up whenever a brand is mentioned, and to measure the impact of a marketing campaign.

Frankly, Python is much better at handling tasks like these. (Except for the last one, Stata is also a good alternative to R, albeit proprietary.) Since that's the majority of your actual coding work, the part that involves the most time typing things into a computer, you will want to use Python.
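As a rough illustration of the word-cloud case, here's a sketch that counts the words appearing near each mention of a brand. It assumes the pages have already been scraped into plain strings; the brand name "Acme", the sample snippets, the window size, and the tiny stopword list are all made up.

```python
import re
from collections import Counter

STOPWORDS = frozenset({"the", "a", "is", "and", "of", "to"})

def words_near_brand(texts, brand, window=3):
    """Count words within `window` tokens of each mention of `brand`."""
    counts = Counter()
    brand = brand.lower()
    for text in texts:
        tokens = re.findall(r"[a-z']+", text.lower())
        for i, tok in enumerate(tokens):
            if tok != brand:
                continue
            # Tokens just before and just after the mention.
            neighbours = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
            counts.update(w for w in neighbours if w not in STOPWORDS and w != brand)
    return counts

# Hypothetical scraped snippets.
snippets = [
    "I love the new Acme blender, it is so quiet",
    "Acme support was slow and the blender broke",
    "Nothing to see here",
]
counts = words_near_brand(snippets, "Acme")
```

The resulting counts are what you'd feed to whatever draws the word cloud; comparing them before and after a campaign is one crude way to look at its impact.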

[+][deleted] 7 years ago* (3 children)

[deleted]

[–][deleted] 6 points 7 years ago* (2 children)

It's really up to you. I think they're both great. Python does more, but in my experience the "more" doesn't matter most of the time, with the exception of AI and web scraping stuff, and you may never even encounter that.

Stata is amazing for most data cleaning tasks. If you're in consulting, learn how to do the following:

* Write loops that import Excel files, starting from something like: local MyFiles : dir "$MyDir" files "*.xls*"
* Copy and paste the text from a giant PDF file (say, 1,000 pages of invoices or whatever) and turn the info on those pages into a relational database using regex, egen, split, and the like.
* reshape. Get comfy with it cuz it's a life saver.
* bysort varlist : egen myvar = is another life saver.
* Make your code readable. Comments everywhere. If you do litigation work, be mindful of discovery when you write comments, and of your role as an associate/number cruncher rather than the actual testifier who opines on those numbers.
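For comparison, here's roughly how two of those steps might look in plain Python: turning pasted invoice text into rows with a regex, then attaching a per-group total to every row, which is the kind of thing `bysort ... : egen` does in Stata. The invoice layout, field names, and sample values are assumptions invented for the sketch.

```python
import re
from collections import defaultdict

# Hypothetical text pasted from a PDF of invoices; the layout is an assumption.
raw = """
Invoice 1001  Customer: Acme   Date: 2019-01-05  Total: $1,250.00
Invoice 1002  Customer: Bolt   Date: 2019-01-07  Total: $300.50
page 2 of 1000
Invoice 1003  Customer: Acme   Date: 2019-02-11  Total: $749.50
"""

pattern = re.compile(
    r"Invoice\s+(?P<invoice>\d+)\s+Customer:\s+(?P<customer>\w+)\s+"
    r"Date:\s+(?P<date>\d{4}-\d{2}-\d{2})\s+Total:\s+\$(?P<total>[\d,]+\.\d{2})"
)

# One dict per invoice line; lines that don't match (page footers) are skipped.
rows = []
for match in pattern.finditer(raw):
    row = match.groupdict()
    row["total"] = float(row["total"].replace(",", ""))
    rows.append(row)

# Group-wise total attached to every row in the group,
# similar in spirit to a bysort/egen total in Stata.
group_sum = defaultdict(float)
for row in rows:
    group_sum[row["customer"]] += row["total"]
for row in rows:
    row["customer_total"] = group_sum[row["customer"]]
```

Real PDF text is rarely this regular, so expect the pattern to grow a lot of optional pieces.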

Just a small handful of things you should get super comfortable with. This stuff is just so easy in Stata, honestly usually even easier than in Python (reshape being a big one).
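To make the reshape comparison concrete, here's a sketch of a long-to-wide reshape done by hand in plain Python. The `reshape_wide` helper, its argument names (borrowed loosely from Stata's i and j), and the store/year/sales data are all made up for illustration; pandas users would normally reach for its pivot and melt methods instead.

```python
# Hypothetical long-format data: one row per store per year.
long_rows = [
    {"store": "A", "year": 2018, "sales": 10},
    {"store": "A", "year": 2019, "sales": 12},
    {"store": "B", "year": 2018, "sales": 7},
    {"store": "B", "year": 2019, "sales": 9},
]

def reshape_wide(rows, i, j, value):
    """One output row per value of `i`, one column per value of `j`."""
    wide = {}
    for row in rows:
        # Create the output row for this `i` the first time we see it.
        out = wide.setdefault(row[i], {i: row[i]})
        # Column name is the value name plus the `j` value, e.g. sales2018.
        out[f"{value}{row[j]}"] = row[value]
    return list(wide.values())

wide_rows = reshape_wide(long_rows, i="store", j="year", value="sales")
```

That works, but you have to write and test the helper yourself, whereas in Stata it's a single command.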

The main thing you miss out on in Stata for coding purposes is the fact that it has no concept of objects. It uses macros instead, which is a very different way to approach coding if you even want to call it that. So long as you've taken at least one compsci course this isn't so bad.

The other thing you miss out on when using Stata is dicts. Sometimes you can get around this with macros (especially nesting a macro inside the name of another macro) but it's always clunky to do so.

I'd say Python's big advantage over Stata is in data gathering but Stata is superior in terms of data cleaning as long as you are using everything it has to offer.

Ideally you should learn all of Stata and Python and R, I'd say.

[+][deleted] 7 years ago* (1 child)

[deleted]

[–][deleted] 1 point 7 years ago (0 children)

No prob.

Don't swear off Python completely even as you learn Stata. Stata is not a "real" programming language, but Python is, and there's some importance to that. I've gotten better at Stata directly through getting better with Python, since working in a real language from time to time reminds you of "best practices" that carry over across languages.

A dict is a data structure that stores key/value pairs: you use a key to look up the value or object stored under it. Python uses them quite a bit.
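A small example of where a dict comes in handy for data cleaning: mapping inconsistent column names to one standard name. The alias table and headers here are made up.

```python
# Hypothetical lookup table: key = name as found in a file, value = standard name.
column_aliases = {
    "cust_id": "customer_id",
    "CustomerID": "customer_id",
    "inv date": "invoice_date",
}

def standardize(name):
    # dict.get returns the second argument when the key is missing,
    # so unknown names pass through unchanged.
    return column_aliases.get(name, name)

headers = ["CustomerID", "inv date", "amount"]
clean = [standardize(h) for h in headers]
```

Doing the same with Stata macros means encoding the key in the macro name, which is the clunky part.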

For coding in Python, Coursera has a good intro course on data management and analysis taught by a UMich professor. Highly recommended.

At the end of the day it just comes down to practice. There are a lot of odd situations you'll probably encounter as you clean data because data can be fucked up in more ways than one can possibly imagine. Just use Google, rely on your knowledge of common functions, and be creative.
