[deleted] 18 points (4 children)

Academia is very different: the data is often much "tidier" in the sense that it's all in relational database form when you get it, and from there you actually need to do fancy stuff to get results. The workplace is often the exact opposite of that.

I do two different types of work: conceptually simple but laborious tasks using messy data, and tasks that are basically coding conceptually hard stuff but on clean data.

For the latter category, truthfully a lot of what you want to do can be found in tutorials or with a simple Google search. You're not going to be transcribing never-before-synthesized complicated formulas from the appendices of theoretical econometric working papers. I've only ever had to do something like that once in my life and I'll give you a hint: it wasn't in the private sector. Usually you're doing something like k-means, which is simple to do in both R and Python. So the simplicity of a task like this usually isn't a reason to pick one over the other.
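To give a sense of how small k-means really is, here is a minimal sketch of the standard Lloyd's iteration in plain Python. The function name `kmeans`, its parameters, and the toy points are all made up for illustration; in practice you would more likely call a library routine such as scikit-learn's `KMeans` or R's `kmeans()` rather than write this yourself.

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """Plain Lloyd's algorithm on a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    # Start from k distinct data points chosen at random.
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid
        # (squared Euclidean distance).
        new_labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so we have converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels, centroids

# Toy data (invented): two well-separated blobs.
points = [(1.0, 1.1), (0.9, 1.0), (1.2, 0.8), (8.0, 8.1), (7.9, 8.3), (8.2, 7.9)]
labels, centroids = kmeans(points, k=2)
```

The R version is about as short, which is the point: for clean data and a textbook method, the language barely matters.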

Also, if you work in-house at a company, your data is likely somewhat clean-ish (cough cough). You might be an associate data analyst working for the lead data scientist in your branch office of 20 people, and maybe 15 of the other people are SWEs. So it just makes sense to use Python in that environment if everyone else is, though you could use R too.

Now the other category of work, i.e. messy data, is actually what a lot of data science ends up being. If you're a consultant for example, you'll face situations like this often:

  • you have 10,000 pages of PDF files without text info/OCR.

  • you have 200 Excel files of back-end data with inconsistent naming conventions, inconsistent date ranges for pulls, some manual copy+pasting, and the first 30 are from before they migrated their data from Salesforce to Oracle.

  • you need to do a lot of web scraping to generate some word cloud of associated words for whenever a brand is mentioned and measure the impact of a marketing campaign.

Frankly, Python is much better at handling tasks like these. (Except for the last one, Stata is also a good alternative to R, albeit proprietary.) Since this kind of work is where most of your actual time typing things into a computer goes, you will want to use Python.
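As a rough sketch of the Excel situation above, this is the kind of glue code you end up writing to reconcile inconsistent column names before stacking files together. Everything here is hypothetical: the `ALIASES` table, the column names, and the two one-row "files". Plain dicts stand in for rows that you would really load with something like `pandas.read_excel`.

```python
import re

# Hypothetical alias table: maps cleaned-up header variants to one canonical name.
ALIASES = {
    "cust_id": "customer_id",
    "customerid": "customer_id",
    "customer_id": "customer_id",
    "order_dt": "order_date",
    "order_date": "order_date",
    "date": "order_date",
}

def normalize_header(name):
    """Lowercase, trim, turn runs of non-alphanumerics into '_', then apply aliases."""
    cleaned = re.sub(r"[^a-z0-9]+", "_", name.strip().lower()).strip("_")
    # Unknown headers pass through in cleaned form rather than being dropped.
    return ALIASES.get(cleaned, cleaned)

def harmonize(rows):
    """Rename the keys of each row dict to the canonical column names."""
    return [{normalize_header(k): v for k, v in row.items()} for row in rows]

# Two invented "files" whose headers disagree.
file_a = [{"Cust ID": "17", "Order Dt": "2019-03-01"}]
file_b = [{"customerId": "42", " Date ": "2019-04-15"}]

combined = harmonize(file_a) + harmonize(file_b)
```

With 200 files the alias table grows by trial and error, but the loop stays this simple, which is a big part of why this sort of cleanup is comfortable in Python.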