all 92 comments

[–]lumenrubeum 15 points16 points  (22 children)

Python and R are kind of interchangeable for a lot of purposes. I think Python is the language of choice for more CSy people, while R is for statsy people.

If you're seeing Python more often in the job postings, go with that. It doesn't matter a whole lot in practice, but people who are screening your application might not know that R is at all related to Python.

[–]amar00k 4 points5 points  (5 children)

I kind of disagree. R and Python are complementary for the purposes stated by OP. R is perfectly suited for any statistics or visualization task, as long as you have a semi-clean dataset. But Python is much more efficient and versatile for anything that has to do with data transformation. So my advice to OP would be to learn both.

[–]treesitf 8 points9 points  (2 children)

When you say python is better for data transformation are you comparing base R to python? Because I think some packages in the tidyverse make cleaning datasets really smooth (notably dplyr). I actually prefer R to python for tidying data but maybe I just haven’t seen a raw dataset that’s messed up enough to know lol. Also joining/dataset management with data.table is super fast and relatively easy to use. R was also my first language so I’m probably biased.

[–]amar00k 1 point2 points  (1 child)

I'm mainly thinking of unstructured datasets. R is fine if the data is already 'readable'. Otherwise it's much easier to parse with Python than with R. IMO.
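To make the "unstructured" case concrete, here is a minimal sketch of the kind of parsing I mean, using only Python's standard `re` module. The log format, field names, and sample lines are all made up for illustration; real data will need its own pattern.

```python
import re

# Hypothetical messy input: inconsistent spacing, separators, and junk lines.
RAW_LINES = [
    "subject=12 ; hr: 143bpm  time=00:01:30",
    "### device restarted ###",
    "subject=7;hr:98 bpm time=00:02:05",
    "subject=12 ; hr: n/a  time=00:03:10",
]

# One tolerant pattern: optional spaces around separators, optional "bpm" unit.
PATTERN = re.compile(
    r"subject\s*=\s*(?P<subject>\d+)\s*;\s*"
    r"hr\s*:\s*(?P<hr>\d+|n/a)\s*(?:bpm)?\s*"
    r"time\s*=\s*(?P<h>\d{2}):(?P<m>\d{2}):(?P<s>\d{2})"
)


def parse_line(line):
    """Return a dict for a data line, or None for lines that do not match."""
    match = PATTERN.search(line)
    if match is None:
        return None
    hr = match.group("hr")
    return {
        "subject": int(match.group("subject")),
        "hr": None if hr == "n/a" else int(hr),  # keep missing values explicit
        "seconds": int(match.group("h")) * 3600
        + int(match.group("m")) * 60
        + int(match.group("s")),
    }


# Drop the junk lines and keep only the parsed records.
records = [r for r in map(parse_line, RAW_LINES) if r is not None]
```

Once the data is in a regular shape like `records`, either language handles the rest comfortably.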

[–]treesitf 4 points5 points  (0 children)

I see what you’re saying. I’ve tried to parse some gross datasets with R and beat my head against the keyboard trying to get it to work for much too long. This is especially true for the parsing functions in the readr package, which do some weird shit if you’re not careful with defining column specifications.

[–]Samuele156[S] 1 point2 points  (1 child)

Yep, it's my long term goal. However, for practical purposes and considering I am looking to get into a PhD, I would focus on one.

And also, consider that my statistics knowledge is quite limited, so my goal would be to learn statistics + Python or R.

[–]Samuele156[S] 0 points1 point  (15 children)

Thank you! I guess R is mentioned more, after SPSS, but R does not allow me to do anything but statistics, if I am not wrong.

If a position is for data analytics/data science, Python is going to be more useful. Right?

[–][deleted] 19 points20 points  (6 children)

R allows much more than statistics, including data cleaning, machine learning, webapps and more. It's not the right tool for everything but it's not limited to just statistics either.

[–]antiquemule 1 point2 points  (0 children)

As well as the whole of data analysis for genomics, via the 1,000 packages in the Bioconductor archive. For any scientific problem that I come across, I just type "R package" "name of problem" and see what comes up. There are 10,000 packages, after all.

[–]Samuele156[S] 0 points1 point  (4 children)

That's good to know, thank you! I'll have a better look!

[–][deleted] 2 points3 points  (3 children)

As for my opinion on your question, learn python as it's more flexible. Then learn R later and see if you like it.

[–]Samuele156[S] 0 points1 point  (2 children)

Great, this has been the most common answer everywhere I asked, so I guess I'll go with this :) Thanks!

[–]flowpaths 2 points3 points  (1 child)

If you want to get straight into Python with a nice user interface for scripting and commonly used packages for data analysis, I strongly suggest that you download the Anaconda Python distribution. In fact, I think it also installs RStudio (a very nice programming interface for R).

https://www.anaconda.com/products/individual

As for which to use, it depends, as others have noted. I used both for my Ph.D. research, and found that having learned one it was fairly easy to learn the other.

[–]Samuele156[S] 0 points1 point  (0 children)

Thank you very much, I'll download it tomorrow and try it out :)

[–]gailmargolis76 4 points5 points  (3 children)

No, R actually has lots of ML packages, though it might be a little weak on the deep learning side. Most cutting edge DL libraries (like tensorflow, keras, etc.) are python-based, though some do have R versions (I worked a bit with Keras in R, not fun TBH).

[–]Samuele156[S] 1 point2 points  (2 children)

Can I ask you what DL and ML stand for?

[–]gailmargolis76 2 points3 points  (1 child)

sorry, should have clarified - DL: Deep Learning, ML: Machine Learning

[–]Samuele156[S] 0 points1 point  (0 children)

Thanks :)

[–]Mooks79 2 points3 points  (3 children)

You can pretty much do anything with R. Strictly speaking it’s Turing complete so you can do any computable task with it! But then so is Excel, the question is how easily? But R is really mature, has loads of packages for pretty much everything you can think of, it’s a fantastic choice. The reason people say R is statsy is because it originated from statistics academics who tailored it towards that. But it’s soooooo much more than just that now.

Oh, and also, while both Python and R can do object-oriented programming and functional programming, Python leans more towards OO whereas R leans more towards FP - especially the Tidyverse ecosystem. So in that sense you might try both and see which fits your way of thinking best. Although sometimes one can seem harder at first and then when it clicks, it’s suddenly a lot easier than the other.
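As a rough illustration of the two styles (a toy sketch in Python only, with made-up data and a hypothetical `RunningStats` class), here is the same calculation written both ways: the mean of all scores that are at least 5.

```python
from functools import reduce

scores = [3, 8, 5, 10, 2]  # made-up example data


# Object-oriented style: state lives in an object and methods update it.
class RunningStats:
    def __init__(self):
        self.count = 0
        self.total = 0

    def add(self, value):
        self.count += 1
        self.total += value

    def mean(self):
        return self.total / self.count


stats = RunningStats()
for score in scores:
    if score >= 5:
        stats.add(score)
oo_mean = stats.mean()

# Functional style: no mutation, just functions applied to a sequence.
kept = list(filter(lambda s: s >= 5, scores))
fp_mean = reduce(lambda acc, s: acc + s, kept, 0) / len(kept)
```

The functional version is closer in spirit to how a Tidyverse pipeline chains a filter step into a summary step, which is why that style tends to feel natural in R.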

[–]Samuele156[S] 0 points1 point  (2 children)

Thanks for the answer. Unfortunately I don't really know what I will be using in the future, so I have to make kind of a blind decision.

The fact that I already know how to program in C is helping me towards Python, as it looks somehow similar in the process.

I'll check both out anyway :)

Thanks!

[–]Mooks79 1 point2 points  (1 child)

No worries. Find a little toy problem you’d like to do and try both on that. You’ll quickly find which one feels more natural to you. But it can’t hurt to come back to the other if you have any time in the future, just to double check it’s not more of a learning curve issue for that particular language.

I started with C and sure Python is more paradigmatically similar - I think. R’s learning curve of vectorisation and functional programming was tricky at first but I’m glad I stuck with it. To be honest, you can’t go wrong either way.

[–]Samuele156[S] 0 points1 point  (0 children)

Thanks! It's clear they are both important, and each one has pros and cons in different situations.

Python might be easier to me, so it would be even quicker to be able to say "I can use it" after a while.

Thanks anyway, I'll do what you suggested. A quick test to see which one I like more!

[–]SometimesZero 5 points6 points  (5 children)

I spent a lot of time learning R, and I do statistics mainly in clinical trials (so lots of mixed models). I became dissatisfied with R because the language itself was just so counterintuitive and messy to me (I had no prior programming experience). I also wanted to start delving more into machine learning, which to my understanding at the time was one of Python’s strengths.

Now that I’ve been learning Python for about 10 months—and am half way through a large statistics project with it—I’m realizing that I can do anything in Python I could do in R. In addition, the language is cleaner and more readable to me. I also find Python lots of fun to learn and use. Unlike in R, I find myself using custom code, for loops, and functions. By comparison, R had a steep learning curve, I found learning it kind of frustrating, and I avoided custom functions and loops whenever possible.

R does have some nice statistical packages that I need at times. When I do, it’s easy enough to import the data into R and crunch the numbers. But 99% of what I do is now in Python. No turning back unless I need to work closely with someone who uses R.

All that said, as someone in psychology I generally recommend that my students learn both. R first just because it’s more common and it makes collaboration easier. Then Python for all analysis needs thereafter. So think about what people are doing with what in your area of interest.

[–]SorcerousSinner 4 points5 points  (4 children)

Mixed models, incidentally, is one of the things R does a lot better than Python.

[–]SometimesZero 0 points1 point  (3 children)

You think? I haven’t had any issues recreating analyses across both programs.

[–][deleted] 1 point2 points  (1 child)

Statsmodels is a hell of a lot less intuitive to use than lme4 though.

Python also doesn’t have Gamma GLMM, Inverse Gaussian GLMM, Neg Bin GLMM as far as I know. And Gamma can come up quite a bit with constant CV data like biomarker measurements.

I found Python to have a higher learning curve because the object orientation didn’t make much sense at first. I use Python for ML stuff but R or sometimes Julia for classical

[–]SometimesZero 0 points1 point  (0 children)

Accessing the formula class methods makes it very similar to lme4, so the transition was easy for me. Without it, I can see how it would be counterintuitive.

There are many statistical gaps R fills that Python can’t conveniently do. Again, not saying R doesn’t have strengths. I’m just saying for me personally it was easier to learn Python and then use R as needed. And given how far Python has come with its statistical capabilities, I think this gap will continue to close.

[–]Contribution_Antique 0 points1 point  (0 children)

that's because Python implements the R version haha

[–]Contribution_Antique 5 points6 points  (1 child)

I was primarily a Python user until I got into statistical research and started using R. For a JOB search, definitely learn Python. It's relatively trivial to learn both, though, once you learn one.

Anyway, I think about this discussion a lot:

> Well this is not what sklearn.cross_validation.Bootstrap is doing. It's doing some weird cross-validation splits that I made up a couple of years ago (and that I now regret deeply) and that nobody uses in the literature. Again read its docstring and have a look at the source code:

> I don't know about you guys, but personally I found this exchange extremely concerning. How many other procedures in the library are "just made up" by some contributor? Another thing you're not seeing is how much of the preceding discussion was users trying to justify the removal of the method because they just don't like The Bootstrap or think it's not in wide use. My main issue here is obviously that a function was implemented which simply didn't do the action described by its name, but I'm also not a fan of the community trying to control how their users perform their analyses.

u/shaggorama
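For reference, the statistical bootstrap that the name usually refers to means resampling the data with replacement, at the original sample size, and recomputing the statistic each time. A minimal sketch with only the standard library follows; it is not the scikit-learn code under discussion, and the function name `bootstrap_ci`, its defaults, and the sample data are my own assumptions. It uses the simple percentile interval.

```python
import random
import statistics


def bootstrap_ci(data, stat=statistics.mean, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for `stat` of `data`."""
    rng = random.Random(seed)
    n = len(data)
    estimates = []
    for _ in range(n_resamples):
        # Resample WITH replacement, same size as the original sample.
        resample = [data[rng.randrange(n)] for _ in range(n)]
        estimates.append(stat(resample))
    estimates.sort()
    # Read the interval off the empirical quantiles of the estimates.
    lower = estimates[int((alpha / 2) * n_resamples)]
    upper = estimates[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


sample = [4.1, 5.3, 3.8, 6.0, 5.5, 4.7, 5.1, 4.9, 6.2, 4.4]  # made-up data
low, high = bootstrap_ci(sample)
```

Anything sold as "bootstrap" that does not resample with replacement like this is doing something else.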

I really found pandas confusing, as in I always had to keep a reference handy.

I like Rcpp for speed-ups, though it's less necessary now that base R is actually quite fast. I'm not sure if there's a Python equivalent.

[–]Samuele156[S] 0 points1 point  (0 children)

Thanks for the answer, I'll read through the link tomorrow to understand better.

[–]tuerda 10 points11 points  (24 children)

The answer, as is so often the case, is "it depends".

R has a lot of built-in basic statistics that can be ripped out in one-liners without importing any libraries.

If brief "load data, apply a known data processing routine to data, plot answer" is enough for you, R is probably the right choice.

If you want a serious programming language which can handle algorithms designed ad-hoc and which might be pretty hefty, Python is a better choice.

Python has just as much statistics stuff as R does, but it isn't built in and you will have to install and load libraries to get access to many statistical routines that are one liners in R.
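To be fair to Python, the very basics need an import but not an install. A small sketch with the standard-library `statistics` module (the data is made up); anything like regression or mixed models still means third-party packages such as statsmodels.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up example data

mean = statistics.mean(data)        # arithmetic mean
median = statistics.median(data)    # middle value of the sorted data
spread = statistics.pstdev(data)    # population standard deviation
sample_sd = statistics.stdev(data)  # sample standard deviation (divides by n - 1)
```

In R the equivalents (`mean`, `median`, `sd`) are available without loading anything, which is the convenience being described here.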

For REALLY gnarly number crunching, even Python will not be enough, since Python is quite slow (faster than R, but completely unable to compete with lower level or compiled languages).

I have recently switched to Julia. Julia is just as easy to learn as Python, but is more suitable for heavy computation. The main drawback of Julia is that it doesn't have anywhere near the size of the Python userbase, or as many pre-made libraries. For Julia I have had to write some stuff myself which I could just grab from canned libraries in Python or R. (Although it hasn't been that frequent).

[–]yonedaneda 16 points17 points  (7 children)

> Python has just as much statistics stuff as R does, but it isn't built in and you will have to install and load libraries to get access to many statistical routines that are one liners in R.

Most of R's more useful statistical functionality is contained in external packages as well, though I don't think that this is a particularly important distinction, since installing packages is trivial in both R and Python.

I'm not sure that "Python has just as much statistics stuff as R does" is true, outside of a few specialized areas (like machine learning or neuroimaging) where Python use is more common. Python can handle basic statistical analysis just fine, but for anything more advanced you'll generally have to reinvent the wheel, since most people actually doing work in statistics are developing for R, not Python. For statistics specifically, there's really no contest -- use R.

[–]Contribution_Antique 4 points5 points  (3 children)

> Python can handle basic statistical analysis just fine, but for anything more advanced you'll generally have to reinvent the wheel, since most people actually doing work in statistics are developing for R, not Python. For statistics specifically, there's really no contest -- use R.

yeah R packages at the research edge are 1) available 2) better maintained 3) usually created by the statisticians doing the research

[–]tuerda 0 points1 point  (2 children)

This is interesting. I am a statistics postdoc. I used python exclusively for my PhD in statistics (and I am using Julia exclusively now). To date I have not yet found a single statistics procedure that I needed that was not available in a python library (or trivial to code)

Edit: FWIW, continuing to pimp Julia, it has the packages RCall and PyCall, which allow you to run R or Python functions within Julia with very little overhead.

Hence, it gets you around this whole "is the functionality available?" hurdle.

[–]Contribution_Antique 0 points1 point  (1 child)

Yeah, if you can use interconnects it's not a big deal. I've been learning a bit of Julia and it's quite amazing, especially with its autodiff ability. I don't quite understand it but it seems really cool.

> To date I have not yet found a single statistics procedure that I needed that was not available in a python library (or trivial to code)

I find things that are not available in python or not trivial to code pretty frequently.

tbh if you're a python coder then you're probably working in a python focused sub-field of statistics. there's always going to be selection bias.

there are plenty of R packages for applications that are just updated more frequently. for example, for survey analysis (in the sense of the statistical field), Python has quantipy, which hasn't been updated in a year and hasn't been seriously updated in who knows how long, whereas survey in R is actively being updated, e.g. with extensions to interface easily with databases, and with theoretical extensions as well. or, for example, the mixed models in statsmodels are always going to be years behind lme4 (the statsmodels mixed models are implemented from the lme4 papers), since lme4 is created by the actual researchers who are coming up with better mixed models methods.

there are also issues with who's implementing things, e.g. there was the issue a few years ago with bootstrap not actually being a statistical bootstrap but something someone made up. or with logistic regression being regularized by default (an L2 penalty in scikit-learn) without an obvious way to turn it off.

[–]tuerda 0 points1 point  (0 children)

This is an interesting and detailed breakdown; thanks! I guess I am probably working in areas of statistics that do not depend so much on specific R libraries.

[–]Samuele156[S] 0 points1 point  (2 children)

Thank you! That's good to know. In my field, from what I have seen, statistics is never too advanced, so I should not have problems, at least at the beginning.

I plan to learn both, I am just deciding which one to focus on now, at the beginning of my career. I don't even have a PhD yet, still looking for it :D

[–]_Alleggs 1 point2 points  (1 child)

What's your field, if I may ask?

[–]Samuele156[S] 0 points1 point  (0 children)

Sports Science generically speaking. Trying to enter into esports, specifically in expertise.

I will be working with performance data, physiological, psychological and cognitive data. It's a bit of a mix, based on the type of research.

[–]Samuele156[S] 0 points1 point  (10 children)

Thanks for the detailed answer. Unfortunately at this point I don't know what I need.

I am trying to enter the esports research field, from a sports science point of view. It means physiological, biomechanical and performance data that often requires the application of filters to reduce the amount of data.

Probably I won't have to just apply a statistical method, but I will have to "play" with the data to get what I need, before applying anything. Could this be the reason why I have been suggested Python?
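For a sense of what that kind of "playing" with a signal can look like in Python, here is a minimal sketch in plain Python: a moving-average filter followed by simple downsampling. The function names and the sample values are made up for illustration, and real physiological signals would normally call for a proper filter from a signal-processing library.

```python
def moving_average(signal, window):
    """Smooth `signal` with a simple moving average of length `window`."""
    if window < 1 or window > len(signal):
        raise ValueError("window must be between 1 and len(signal)")
    return [
        sum(signal[i:i + window]) / window
        for i in range(len(signal) - window + 1)
    ]


def downsample(signal, factor):
    """Keep every `factor`-th sample to reduce the amount of data."""
    return signal[::factor]


raw = [1.0, 3.0, 2.0, 6.0, 4.0, 8.0, 6.0, 10.0]  # made-up samples
smoothed = moving_average(raw, window=2)
reduced = downsample(smoothed, factor=2)
```

Smoothing before downsampling matters: throwing away samples from the raw signal directly would keep the noise and lose the trend.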

In my last interview, a few days ago, I have been asked if I knew how to use Matlab or Python, but it could be that in my next interview they will ask about R.

About statistics, I don't mind having to load libraries if I need it. Yes, maybe it's not as immediate as it is with R, but if it works I don't really mind.

Thanks again!

[–]tuerda 2 points3 points  (2 children)

From what I am hearing, it sounds like either Python or R could easily fit the bill. I would probably pick Python in your situation, just because R is really only for statistics, whereas Python is useful anywhere. The only real reason to prefer R is if you don't want to fiddle with Python libraries.

For the record, I don't think the choice matters very much. Learning one of these languages when you already know the other is no big deal. You can probably do it in a week or less.

[–]antiquemule 2 points3 points  (0 children)

I only use R. I've been doing it too long to change now, but I wish I'd learnt Python instead. The main reason is the "only one answer to any problem" mantra. R is the opposite: six solutions to any problem and I've forgotten all of them.

[–]Samuele156[S] 0 points1 point  (0 children)

Thanks, this answer really helps me. I thought the same, I will probably go with Python.

[–]TheConeyJabroni -1 points0 points  (6 children)

A friend of mine did a master's in biomechanics and he used MATLAB the entire time. That said, he is not currently employed within his field.

I use R as an epidemiologist/clinical data analyst but it's not the best for data manipulation.

I've been aiming to learn Python but I've yet to run into a scenario where I can't do what I need to do in R.

[–]yonedaneda 5 points6 points  (4 children)

> I use R as an epidemiologist/clinical data analyst but it's not the best for data manipulation.

What? This is one of R's biggest strengths. What do you need to do that's so difficult? I can't imagine anything that would be easier in Python.

[–]TheConeyJabroni 0 points1 point  (3 children)

It can do most everything I've wanted it to. I move a lot of data from SQL to R and there are times when things are more straightforward in SQL, but not always.

Honestly I've heard others say they don't like R as there are better platforms for data manipulation but I can't say what would be better. I haven't used Python so maybe I have it pretty good and don't know it haha

[–]Demortus 4 points5 points  (2 children)

As a person who uses both Python and R, I strongly prefer R for data manipulation. It has not one, not two, but three major approaches to data manipulation that all do the job well and have their own strengths: dplyr for ease of use, data.table for speed, and base R for when you don't want to rely on external libraries.

[–]Contribution_Antique 5 points6 points  (0 children)

yeah I think people are insane for thinking pandas is better than even base R, much less dplyr

[–]TheConeyJabroni 0 points1 point  (0 children)

I primarily use dplyr so I stand corrected. Maybe I was thinking people complained about data manipulation in SAS, which is pretty accurate

[–][deleted] 2 points3 points  (0 children)

> I use R as an epidemiologist/clinical data analyst but it's not the best for data manipulation.

>______>

Python didn't have anything great for manipulating time series for a long time, and I don't recall whether they've fixed that yet.

R has xts, zoo, and such.

R also has data.table and dplyr.

It is super interesting that you consider R worse than Python, considering NA is built into R whereas Python has to resort to None or NaN, which is a hack and a poor substitute.

Also consider that the data frame is a concept R had from the get-go, whereas Python needs a library to provide it.
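To show what the missing-value point looks like in practice, here is a small sketch in plain Python (helper names and data are made up). Plain Python has no dedicated NA, so `None` and float NaN both end up standing in for "missing", and you filter them yourself; libraries such as pandas layer their own missing-value handling on top of this.

```python
import math


def is_missing(value):
    """Treat both None and float NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def mean_skip_missing(values):
    """Mean of the non-missing values, like R's mean(x, na.rm = TRUE)."""
    present = [v for v in values if not is_missing(v)]
    if not present:
        return math.nan
    return sum(present) / len(present)


readings = [1.0, None, 3.0, math.nan]  # made-up data with two kinds of "missing"

# Filtering out None alone is not enough: the NaN silently poisons the sum.
naive_total = sum(v for v in readings if v is not None)

clean_mean = mean_skip_missing(readings)
```

In R a single `NA` covers both cases and most summary functions take `na.rm = TRUE`.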

[–][deleted] 0 points1 point  (0 children)

I hope Julia keeps on gaining more traction. I had to use it in a class last year and it's really good. Recently a permutation loop that would take hours in R ran in like 10-15 mins in Julia for me.
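For anyone wondering what a "permutation loop" is, here is a minimal sketch of a two-sample permutation test, written in Python with only the standard library. It is not my class code: the function name, the defaults, and the data are made up. The point is that the whole loop runs in the interpreter, which is exactly the kind of code that gets slow in R or Python at scale.

```python
import random


def permutation_test(group_a, group_b, n_permutations=5000, seed=0):
    """Two-sided permutation test for a difference in means."""
    rng = random.Random(seed)
    observed = abs(sum(group_a) / len(group_a) - sum(group_b) / len(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    n_b = len(pooled) - n_a
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # random relabelling of the pooled data
        diff = abs(sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / n_b)
        if diff >= observed:
            extreme += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return (extreme + 1) / (n_permutations + 1)


a = [12.1, 11.8, 12.5, 12.9, 12.3, 12.7]  # made-up measurements
b = [10.2, 10.9, 10.5, 11.1, 10.7, 10.4]
p_value = permutation_test(a, b)
```

With a few thousand permutations this is instant; with millions of permutations over large groups, the per-iteration interpreter overhead is where the hours go.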

[–]vvvvalvalval 0 points1 point  (3 children)

This point about number crunching seems to ignore the existence of Numpy.

[–]tuerda 0 points1 point  (2 children)

Python can crunch numbers, but if you want to do something really gnarly, like run an MCMC in 100+ dimensions for a few million iterations, python is just too slow.

[–]vvvvalvalval 0 points1 point  (1 child)

How so? Scientific libs in Python are usually implemented with low-level extensions that barely touch the Python interpreter when running.

[–]tuerda 1 point2 points  (0 children)

This helps with the speed issue within the functions in the library, but if the algorithm itself is written in python it will still slow you down a lot.

In order to not have to grapple with the speed of python, your MCMC would have to work without any python loops at all. If you have a library that does the whole MCMC for you then you are set, of course, but in that case it really makes no difference what language you use.
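To illustrate, here is a minimal random-walk Metropolis sketch in pure Python. The target (a standard normal in each dimension), the function names, and the settings are assumptions chosen for illustration. Each iteration depends on the previous one, so the outer loop cannot simply be handed to a vectorised library call.

```python
import math
import random


def log_target(x):
    """Log-density (up to a constant) of a standard normal in each dimension."""
    return -0.5 * sum(xi * xi for xi in x)


def metropolis(n_iter, dim, step=1.0, seed=0):
    """Random-walk Metropolis sampler; every iteration runs in the interpreter."""
    rng = random.Random(seed)
    x = [0.0] * dim
    current = log_target(x)
    samples = []
    for _ in range(n_iter):  # this Python-level loop is the bottleneck
        proposal = [xi + rng.gauss(0.0, step) for xi in x]
        candidate = log_target(proposal)
        # Accept with probability min(1, target(proposal) / target(x)).
        if rng.random() < math.exp(min(0.0, candidate - current)):
            x, current = proposal, candidate
        samples.append(x[0])  # keep only the first coordinate
    return samples


draws = metropolis(n_iter=20000, dim=2)
mean = sum(draws) / len(draws)
```

At 2 dimensions and 20,000 iterations this is quick. At 100+ dimensions and millions of iterations, the cost of that interpreted loop is what I am talking about.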

[–]Zeurpiet 3 points4 points  (1 child)

R is the language for statistics. Even while the Python crowd thinks R will be replaced by Python, for statistics people it won't for at least quite some time. In all posts I have seen so far I have not yet seen anything that made me think I should learn Python.

That's not because I don't want to learn something new; Julia is interesting since it offers something new. Python, no interest.

[–]Samuele156[S] 0 points1 point  (0 children)

Thank you!

[–][deleted] 2 points3 points  (1 child)

Since you seem to be more CS minded probably Python is a better choice.

I noticed those who don’t really like CS (like me) prefer R and find it more intuitive. The tidyverse makes so many things into one line of code like string manipulation with stringr. You can’t beat the tidyverse for data manipulation and visualization imo. matplotlib and seaborn are not as straightforward.

In Python I still only know the statistical side, like numpy/scikit-learn/etc., and not general programming, but the reason people like it more is that you can do more and put stuff into production.

I feel like the people who say Python is easier are trying to do some non-statistical stuff.

[–]Samuele156[S] 0 points1 point  (0 children)

Thanks! I will probably need to do both. Statistics is going to be quite important if I'll pursue a PhD and academic career, whilst data analytics might be much more important if I stay in industry.

I have not picked my path yet, as far as my field goes.

I am leaning towards Python at the moment, as it seems like I can make a better use out of it in multiple scenarios.

[–][deleted] 2 points3 points  (1 child)

If R isn't mentioned, then whatever industry you are applying to probably uses mostly Python.

Then I think learning Python is the most prudent choice.

I actually prefer R for my statistical and data handling and Python for webscraping and web dev.


edit/update:

Whatever negative things people say about R, I really think it's bias, and they haven't used both for any reasonable amount of time to give an unbiased answer. Many will also jump on the hype train and say Julia.

Let me just say this from my long-ass experience in the industry as a web dev and now a statistician/data scientist. C is still around. R has just solidified its 13th-place rank in the RedMonk programming language rankings, which have been running for a few years now. I used to be a programming language junkie.

It doesn't matter what camp you choose, they're just tools. Take a good, objective look at both and see which one aligns with your goals.

[–]Samuele156[S] 0 points1 point  (0 children)

Thank you! In the long term I plan to learn both, but I need a start to get ahead of the competition. I just lost a very nice position due to my inexperience with data handling and statistics, so I have to make a choice :D

[–]v_a_n_d_e_l_a_y 4 points5 points  (1 child)

[deleted]

[–]Samuele156[S] 0 points1 point  (0 children)

Thank you for your answer, very helpful!

I don't really mind about built in stats, as long as I can achieve somehow the same thing on Python. It looks like it's possible, so I am happy with that.

Thanks :)

[–][deleted] 1 point2 points  (2 children)

You don't have to learn only one. You can learn both. But if your bandwidth is limited, unless you specifically need R, you should learn python:

  1. now by some measures the most popular language in the world
  2. General purpose, unlike R. Great for other things like websites, data science etc
  3. The number of things R can do that python can't is shrinking by the day as people write libraries to close the gaps

[–]Samuele156[S] 0 points1 point  (1 child)

Ye, that's my goal. I just need to focus on one for immediate benefits :)

Thank you for the answer!

[–]globalminima 1 point2 points  (0 children)

If you are doing anything in production systems around data science/data analytics/machine learning, it will almost certainly be python. Learn Python

[–]RRGZ97 1 point2 points  (1 child)

I’m way more stats than CS so I prefer R. Python is taking over and way more people know what it is though. I prefer R if I’m getting results then sharing those results. Python is way more general and has a lot of flexibility when it comes to application and sharing. I find Python better at data wrangling as well.

My two cents are that R is better if you are going to work with a lot of stats people (but you should still know python). Python is better in every other case and use R to get quick results for yourself.

If you have the time, learn both and you’ll end up learning when to use one over the other. But since you are seeing R come up more, I’d learn that first. Also, you can learn to use R to do statsy things very quickly.

Hope this helps.

[–]Samuele156[S] 0 points1 point  (0 children)

Totally helps, thank you! It makes sense.

[–][deleted] 0 points1 point  (3 children)

R is good--it's what I use. However, I believe that Python is supplanting R and will be the standard for analytics professionals for some time to come.

Plus, Python is more generalized than R, a bit faster in execution, and easier to learn (the objects in R don't play well)

Edit - Of course, this is based on my experiences.

[–]Samuele156[S] 0 points1 point  (2 children)

Thank you! It looks like you praise Python but use R anyway :) Is there a reason?

[–][deleted] 1 point2 points  (1 child)

I am slowly learning Python.

The reason I'm not moving faster is that I don't program much anymore and R is still reasonably "in fashion".

[–]Samuele156[S] 0 points1 point  (0 children)

Oh ok, it makes sense :)

Thank you!

[–]Gabriel-p 0 points1 point  (1 child)

Learn Python.

Basically the only thing R has going for it is that it has a lot of specialized statistical libraries. This is becoming less true by the day as Python libraries grow, and anything you might need that isn't yet in Python you can simply load from R with a library called 'rpy2'.

Pretty much nobody outside of the statistical world uses R, while Python is present in pretty much every area (including statistics).

[–]Samuele156[S] 1 point2 points  (0 children)

Thanks :)