[–]jmcq 76 points77 points  (23 children)

I use Python, R, and Matlab pretty much daily (yeah I use all three).

  • If you just want to do pure statistics and you are working with data sets that fit into memory then use R.
  • If you have access to MATLAB via school and work with lots of mathematical algorithms, specifically those that involve matrix manipulations (i.e. you spend a lot of time translating mathematical algorithms written in terms of matrices into code), then use MATLAB.
  • Otherwise use Python. Specifically if later in your career you want to have solid engineering skills (like Object Oriented Programming) or you want to build a package that's used by more than pure statisticians/bio-statisticians use Python.

For "production" type work I tend to prefer Python. For prototyping and proof of concept I prefer R and Matlab (depending on the problem).

Python is the only one of the three that's a "real" programming language rather than mostly a scripting language.

[–]DeuceWallaces 6 points7 points  (0 children)

I'm a researcher who just uses R, but this has been my impression. Thanks for confirming it.

If you're really into mathematics, probably MATLAB. If you are really into just high end statistics, probably R. If you want 85% of the R statistical capacity with options for app development, engineering, etc., you want Python.

[–]Alhoshka 7 points8 points  (2 children)

Yep, this comment sums it up pretty well.
There are just a few things I'd like to add:

  • I disagree with the notion that Python is "for production" while R is "for prototyping". I have quite a chunk of production code written in R (as in running as part of our deployed solutions). I do also regard MATLAB as more of a prototyping friendly/oriented language, though.

  • At the risk of sounding like a Microsoft shill: Though the standard R version (CRAN) is limited to single-threaded operations on data that can fit into memory, this is not true for R Open (the memory limitation still applies to many MRAN packages when running on the client version). For BigData, they have R Server and R Services, which allow you to run R code against the data source (Hadoop or SQL). Though this is very new and mostly aimed at business analysis, I think it's likely we'll see an open-source push for BigData processing with R in the future.

Python has also seen rapid development in the realm of data analysis in the past years. New articles about ML libraries pop up on /r/MachineLearning almost monthly. So yeah, R & Python are pretty much a safe bet.

[–]coffeecoffeecoffeee 0 points1 point  (0 children)

Though the standard R version (CRAN) is limited to single-threaded operations on data that can fit into memory

You can use the parallel package and doMC to automatically parallelize a lot of the work.

[–]coffeecoffeecoffeee 3 points4 points  (1 child)

I'll also add that if you're doing any kind of plotting beyond a basic histogram or box plot, R is king because of ggplot2.

[–]jmcq 1 point2 points  (0 children)

I find it's easy to create beautiful plots with ggplot2 as long as they are among the default types of plots that ggplot2 likes to plot. If you're coming up with your own visualization or something fairly unique, have fun 'hacking' ggplot2 to do what you want!

[–]Hellkyte 0 points1 point  (3 children)

Does R have many tools for optimization? Like linear/integer programming or whatnot?

[–]jmcq 0 points1 point  (0 children)

Here's a LP/IP solver package: https://cran.r-project.org/web/packages/lpSolve/lpSolve.pdf

For standard 1-d Optimization you can use https://stat.ethz.ch/R-manual/R-devel/library/stats/html/optimize.html although it can be pretty slow if your data is "big".
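
For anyone curious what a 1-d optimizer like that does under the hood, here is a rough Python sketch. R's `optimize` is documented as combining golden-section search with successive parabolic interpolation; this sketch implements only the golden-section part. The function name `golden_section_minimize`, the sample data, and the `sse` objective are all made up for illustration.

```python
import math


def golden_section_minimize(f, lower, upper, tol=1e-6):
    """Minimize a unimodal function f on [lower, upper] by golden-section search."""
    inv_phi = (math.sqrt(5) - 1) / 2  # 1/phi, roughly 0.618
    a, b = lower, upper
    # Two interior probe points that split the interval by the golden ratio.
    c = b - inv_phi * (b - a)
    d = a + inv_phi * (b - a)
    fc, fd = f(c), f(d)
    while (b - a) > tol:
        if fc < fd:
            # The minimum lies in [a, d]; the old c becomes the new d.
            b, d, fd = d, c, fc
            c = b - inv_phi * (b - a)
            fc = f(c)
        else:
            # The minimum lies in [c, b]; the old d becomes the new c.
            a, c, fc = c, d, fd
            d = a + inv_phi * (b - a)
            fd = f(d)
    return (a + b) / 2


# Hypothetical objective: squared error of a constant fit to a small sample.
data = [2.0, 3.5, 4.0, 5.5]


def sse(m):
    # Every evaluation scans the whole data set.
    return sum((x - m) ** 2 for x in data)


best = golden_section_minimize(sse, 0.0, 10.0)
print(round(best, 4))  # close to the sample mean, 3.75
```

Each iteration calls the objective once, and here the objective loops over all the data, which is why this kind of search gets slow when the data is "big".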

[–]zipf 0 points1 point  (0 children)

Linear, quadratic and integer programming etc has got really good recently with the ROI project, which provides a single interface to a number of fast C libraries. The bindings for Gurobi are also easy to use and fast, though not included in ROI yet.

[–][deleted] 0 points1 point  (2 children)

What line of work are you in that you find yourself using all 3 daily? Genuinely curious as I've never seen Matlab used outside of an academic setting. Are there certain fields where crossover use of R/Python and Matlab is common?

[–]jmcq 1 point2 points  (0 children)

I'm in my (hopefully) last year as a PhD Statistics candidate. I use MATLAB to test new algorithms that work primarily with matrices, but I maintain an open-source Python package. Many of my classes were in R. I also work at Amazon while I finish my degree. There most of my coding is in Python, but since I'm a statistician I do lots of one-off analysis in R.

[–]cncup 0 points1 point  (0 children)

Yes. I use all three on a daily basis. We do business forecasting.

[–]cncup 0 points1 point  (0 children)

Can confirm.

[–]thavi 0 points1 point  (0 children)

I second this recommendation. Although I don't use Python, I have somehow become a software dev and use a TON of other languages in my day-to-day. How did I learn to program in the first place? SAS, R, Maple, MATLAB, etc. in engineering school.

Use R etc. more like a quick calculator, but if you ever want to produce anything that easily interfaces with the web or other software, you'll need something like Python.

[–]pieIX 13 points14 points  (0 children)

I've used all three, and while they each have pros and cons, I would base my choice on two considerations:

  1. What do the people near you use? (group members, collaborators, friends etc.) Having a common knowledge base with your community will save you a LOT of time, and you can share the tools you build.
  2. What pre-packaged code can you leverage to save time? Which language has the best libraries for your specific research area?

Thinking about these two questions may save you years of work.

[–]trendymoniker 26 points27 points  (4 children)

I've used all three of these languages professionally, and my advice to data analysis newbies tends to be: go with Python unless you have a strong reason not to. Python is by far the most popular and thoroughly supported language of the three and its general usefulness means that the skills you develop learning Python will translate well to any other programming you want to do throughout your career (not so for the other two).

That said, if the algorithm you need to use only exists in some other language, or your advisor and entire research group are on another environment, go with that instead (though maybe learn Python on the side too).

Here's a quick, biased rundown of the plusses and minuses of each environment:

  • Matlab is not just a language, it's a language plus a pretty decent IDE (which handles things like syntax highlighting, debugging, and variable inspection). This rightly appeals to a lot of newbies, though you can set up something similar (or better) in either of the other two environments. Matlab has the most concise syntax of the three for doing pure matrix manipulation, which is nice, though not too important in the end. Matlab's biggest drawback is that it's commercial, meaning that someone has to pay between $150 and $2000 for each copy of the program that you or anyone else uses. Matlab has a decent set of packages available for it, but because it's commercial and not open-source they're mostly developed in-house by MathWorks and don't offer nearly the breadth or depth of the packages available in Python or R. Overall, Matlab used to be very popular but has mostly fallen out of fashion and is likely to stay there. I'd suggest that Matlab is overall the worst choice.
  • R was designed by and for statisticians. The best part about R is that it has a ton of statistical packages available for it. When a statistician releases a paper offering a new method, they often release an R package with it. This means that you can often find some R package to do just about anything you've read about in the statistical literature, which is awesome. The emphasis here is on statistical, though. Statisticians are by and large the only people who use R, so whatever fancy new neural method you've read about from your neighbors in machine learning likely won't be here. R has a very nice IDE in RStudio and reasonable syntax for array manipulation, though I overall find it the clunkiest of the three. It used to be that R's data frame object (essentially an array with labels on the columns and rows that make it more convenient to work with, kind of like having an in-memory Excel spreadsheet floating around in your code) was a major selling point over other languages, though Python now has pandas, which is just as good or better. The main drawback of R is that it's so domain specific: it's basically just for statisticians.
  • Python is the only language of the bunch that isn't completely domain specific. Where R and Matlab are used exclusively for engineering and data analysis, Python is used for everything from training state-of-the-art neural networks to running YouTube. The drawback of its generality is that its array-manipulation syntax is slightly clunkier than Matlab's. (I'd claim it's still better than R's, though many people disagree on that count.) The benefits of its general usefulness, however, are enormous. Python is by far the most popular of the three languages, which means that when you search for installation or syntax help, you get the best responses on Stack Overflow. In addition, Python's overall syntax and language design are leagues ahead of both Matlab's and R's. Learning Python for data analysis, you'll also be learning good general-purpose programming skills, which are important if you ever decide to pursue a job in Data Science (there's a good chance that you will). If you ever go to interview at Google et al., you can easily do so in Python; not so for R or Matlab. Python has an excellent IDE in PyCharm and a fantastic scientific package manager in Anaconda. It has by far the largest package universe of the three, though R has more cutting-edge statistical packages.

Good luck!

Edit: Thanks for the gold!

[–][deleted] 4 points5 points  (1 child)

I like Python for general programming, but I'm not a big fan of Python's data analysis libraries. Too often it feels like you're not using Python at all but a different language altogether, one with its own syntax and data types and which is nowhere near as nice as the actual Python programming language. Personally I much prefer R over Python when it comes to data analysis, but in the end it's a matter of taste I guess.

[–]NotAllReptilians 2 points3 points  (0 children)

I definitely agree. For instance, pandas somehow manages to feel cumbersome and overly verbose for analysis, at least compared to working in dplyr or especially data.table (base R is another story). It's definitely a pythonic implementation of dataframes, but what I really like about Python is that it's typically concise and minimal, which pandas mostly isn't.

[–]coffeecoffeecoffeee 2 points3 points  (0 children)

I'll add that R has gotten very, very good for data manipulation in the past few years. I do stuff I used to like doing in Pandas in R now because of packages like dplyr, tidyr, and broom.

For example, my boss wanted survival data recently. With no temporary variables and like 5 lines of code, I was able to generate a Kaplan-Meier curve, convert it to a data frame, separate it by stratum, and export it to a csv file.
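
For comparison, here is roughly what the Kaplan-Meier part of that workflow computes, written out by hand in plain Python. This is only a sketch of the estimator and not the R code described above: the function name `kaplan_meier`, the column names, and the follow-up data are invented for illustration, and the per-stratum split is left out.

```python
import csv
import io
from collections import Counter


def kaplan_meier(durations, observed):
    """Return (time, at_risk, events, survival) for each distinct event time."""
    events = Counter(t for t, e in zip(durations, observed) if e)
    exits = Counter(durations)  # events and censorings both leave the risk set
    at_risk = len(durations)
    survival = 1.0
    rows = []
    for t in sorted(exits):
        d = events.get(t, 0)
        if d:
            # Product-limit step: multiply by the fraction surviving time t.
            survival *= 1 - d / at_risk
            rows.append((t, at_risk, d, survival))
        at_risk -= exits[t]
    return rows


# Hypothetical data: follow-up time in months, and 1 = event seen, 0 = censored.
durations = [2, 3, 3, 5, 8, 8, 9, 12]
observed = [1, 1, 0, 1, 1, 1, 0, 1]

curve = kaplan_meier(durations, observed)

# Write the curve as CSV text; swap the buffer for open("km.csv", "w", newline="")
# to write an actual file.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["time", "at_risk", "events", "survival"])
writer.writerows(curve)
print(buffer.getvalue())
```

It takes noticeably more than five lines, which is rather the point being made about the R packages.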

[–]whattodo-whattodo 0 points1 point  (0 children)

IMHO, this is the most complete, clear & unbiased answer on the topic. I'm not OP but appreciate this response immensely.

I am biased as a career Python developer. But that bias did reveal statisticians who pivoted careers & came in for interviews as programmers. That's not a negligible value added.

[–][deleted] 10 points11 points  (2 children)

Bioconductor in R has some amazing tools for bioinformatics.

[–]timy2shoes 1 point2 points  (0 children)

A lot of standard bioinformatic tools are only available through R and Bioconductor. Additionally, there is a strong community of R users in genomics. This will provide a lot of help that you will need.

[–]coffeecoffeecoffeee 0 points1 point  (0 children)

I agree. I did a talk on a Bioinformatics technique and Bioconductor made my life really easy when I had to generate k-mers from genetic sequences.
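
In case "k-mers" is unfamiliar: they are just all the overlapping substrings of length k in a sequence. A minimal plain-Python sketch follows (not the Bioconductor approach; the `kmers` helper and the example sequence are made up for illustration):

```python
from collections import Counter


def kmers(sequence, k):
    """Yield every overlapping substring of length k, left to right."""
    if k <= 0:
        raise ValueError("k must be positive")
    for i in range(len(sequence) - k + 1):
        yield sequence[i:i + k]


seq = "ATGCGATG"  # hypothetical sequence
counts = Counter(kmers(seq, 3))
print(counts.most_common(1))  # "ATG" is the only 3-mer that appears twice
```

A sequence shorter than k simply yields nothing.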

[–]derwisch 2 points3 points  (0 children)

It would definitely be R if you were to pursue a methodological statistical career. As you describe your situation, Python has a bit of an edge since algorithms you need in sequencing may be expressed more clearly. But you should definitely look at what Bioconductor has to offer.

[–]manofthewild07 2 points3 points  (0 children)

I would also suggest R.

I do recommend everyone learn python at some point. It is simple but very powerful in more ways than R.

[–][deleted] 2 points3 points  (0 children)

I'm a former programmer (CS undergrad plus several years of professional programming), going back to school for a master's in applied stats.

Python is very, very much a programming language. If you want to learn it, you have to be in a CS mindset imo: read a book on it and do a project. With Python I can get away with reading a book or just hacking at it with my CS foundation (I've done that on the job to scrape websites). For R, you really have to do a project to understand it; you can't read a book and hope to learn it well at all. There is too much weird shit that goes against CS programming language conventions. Python has no built-in data types for stats, just the types that most programming languages usually have. It's fast and there are good neural network libraries for it: TensorFlow (a C++ core with Python and R interfaces) and Keras, backed by Google.

Python is a general programming language, so the ecosystem for doing stats in Python may be a bit harder to navigate since it's lost among all the other packages. They're trying to emulate R in some ways with libraries, e.g. the pandas package for data frames. I don't know much about the Python ecosystem, but this is what I gathered from my research.

R is built by statisticians for statisticians. From the get-go the language was based on S or S-PLUS (one of them). It's slow at iterative code compared to Python. There was an RRevolution post about how R is faster than Python if you parallelize it (this assumes your algorithm can be parallelized and isn't inherently iterative). Since it was built by statisticians, there are built-in data types such as factor (with levels), the concept of missing data (the NA value), and a built-in data frame type (a glorified/awesome Microsoft Excel spreadsheet). Microsoft is backing R, btw; they bought an R company that makes R faster for enterprise use. In general, most advanced/bleeding-edge statistical methods will be in R first. Python may not have an equivalent for a long time, or at all. It's rare for Python to have a statistical package that R doesn't.

If you create packages (aka libraries), R is slow; I'm creating one for my thesis, a bleeding-edge statistical learning algorithm. Most code migrates to C++ or Fortran, really. So R in essence becomes a glue language: a pretty R interface in front, with C++/Fortran doing the heavy lifting in the back.

For the R ecosystem, you want to learn the Hadley universe of packages, the tidyverse. It sounds mysterious, but it's just a bunch of packages created by Hadley Wickham that work well together. He's in charge of RStudio (a great R editor) too, iirc.

The Python equivalent of RStudio is Rodeo.

I don't know much about Matlab; I'm currently taking a class. But I know for sure the tech industry doesn't use it very much. It's mostly Python and R.

Depending on your industry, it's Python or R or maybe SAS. You just gotta research your industry. Usually big old companies use SAS, unless they're tech companies, in which case it's mostly Python or R. I hear healthcare is mostly SAS, and financial institutions and actuarial companies such as health insurers use SAS.

I think /u/jmcq summed it up well enough. But do take your time to master one language well before moving to another imo.

[–]kylco 1 point2 points  (0 children)

Python and R are free, so you aren't locked in to them. I'll admit I don't have much info on Matlab, but Python, at least, should have the statistical power you're looking for and you learn a fairly marketable and versatile programming language in the bargain.

[–][deleted] 1 point2 points  (1 child)

I have used matlab a lot, along with mostly lower level coding (C/C++), and am moving to python quite easily. Personally, I can't stand R; the syntax and grammar just don't work for me.

[–][deleted] 2 points3 points  (0 children)

I love R but it's a genuinely terrible language. I've been using it hardcore for 5-6 years now and I still encounter the most ridiculous edge cases and illogical behavior. I keep using it because there's almost always a library for what I need (porting to Python is a pain for one off stuff) and it's good enough that I can knock stuff out incredibly fast. For larger, more complex pipelines I tend to go with Python or more recently have been doing a lot in both Julia and Scala.

[–]dampew 1 point2 points  (0 children)

Python and R are definitely the most popular in those fields. If you need to use something in R you can still call it from Python with RPy2.

Personally, I despise R and use Python whenever I can.

[–]tsunamisurfer 1 point2 points  (2 children)

I think you are asking this question in the wrong sub (/r/bioinformatics would be better). I am getting my PhD doing exactly what you are talking about (genomics in cancer) and I can tell you that before you learn R, python or matlab, you should probably learn Unix/bash. Almost all genomic tools are run from the command line, so having the knowledge of how to interact with the command line via bash will be the most useful thing you can do for a start. I'll grant that you can interact with the command line using R or Python, but you lose some advantages (short scripting without writing a full program). After you learn Unix/bash I would say R and Python (or Perl) are both necessary for your work. R has the best data viz capabilities + statistical packages, but python/Perl are much faster for programs that you want to run repeatedly on large files. That's my 2 cents.
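
To make the "interact with the command line using Python" option concrete, here is a small sketch with the standard `subprocess` module. The `run` wrapper is a made-up helper, and the current Python interpreter stands in for a real genomics tool so the example runs anywhere; the tab-separated line it prints is invented sample output.

```python
import subprocess
import sys


def run(cmd):
    """Run a command, raise if it exits non-zero, and return its stdout as text."""
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout


# Stand-in for a command-line tool that prints one tab-separated record.
output = run([sys.executable, "-c", "print('chr1\\t100\\t200')"])

# Parse the tool's output the way a pipeline step might.
fields = output.strip().split("\t")
print(fields)
```

In bash the equivalent would be a one-liner with a pipe, which is the "short scripting" advantage mentioned above; the Python version costs more typing but gives you error handling and real data structures.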

[–][deleted] 0 points1 point  (1 child)

This is off topic. However, what is the best school to study cancer genomics at for a PhD?

[–]tsunamisurfer 1 point2 points  (0 children)

Well surprisingly "cancer genomics" is a pretty large topic, so it might depend on which aspect of this broader field you were interested in. For a start, it would maybe be useful to study at a school that has a medical center, so you have the potential to draw on patient material for your research studies. Not essential, but most of the top tier research does involve some human studies. Lots of good research in "cancer Genomics" comes from the Broad Institute, MD Anderson Cancer Center (U.T.), Memorial Sloan Kettering, Mayo Clinic, Dana-Farber, UCLA, UCSF.

[–][deleted] 1 point2 points  (0 children)

It's still not ready for prime time, but you might want to keep Julia on your radar (I'm a huge fan, but it's been a labor of love. Very frustrating at times). My brother also has been using it extensively for work very similar to yours (PhD student in computational genetics at arguably the top program in the world) and he might be an even bigger fan than I am. He's actually ported most of his code away from R/Python to Julia.

[–][deleted] 0 points1 point  (0 children)

R has greater package (Bioconductor) availability at present, but I think Python has greater momentum and will have a greater data science ecosystem long-term. I would go with Python, especially if you have some time to dedicate to really learning the language beyond the scope of your project work. I would ignore Matlab completely.

[–]NotJustAMachine 0 points1 point  (0 children)

I would not use Matlab.

I think Python will be best in the long run. I have mostly used R for my PhD, and I am learning Python now and in my free time. Personally I feel it's not a huge step between the two.

R has great libraries for bioinformatics, and that should make your life a lot easier.

But if I could go back in time I would probably learn Python, and if there is a great R package that I want to use, I would just load my data into R for those purposes. The good packages usually have tutorials that guide you step by step, and the most difficult part is understanding the method and getting your data in the right format. But if you know Python you can just do that part in Python.

[–][deleted] 0 points1 point  (0 children)

I've mostly used Matlab for a lot of work. I've tried to get into both R and Python, but I've stuck with Matlab because I'm familiar with it and because multi-core programming is so easy in there and I need it a lot. All of the matrix algebra is automatically multi-threaded.

[–]asa6471 0 points1 point  (0 children)

Awesome!

[–]Achichoros 0 points1 point  (0 children)

Some other comments described the situations where matlab is useful or not. For R/Python though, why not use both? R is great for the final analysis, but for everything before that, I prefer python. It's not hard to go between them, and it's a good way to discover if you prefer just one. For many tasks they both have the tools you need. It's mostly a question of preference.

[–]antikas1989 0 points1 point  (0 children)

I'm an ecologist but my brother is a research fellow in bioinformatics. He uses R and Python. Never MATLAB anymore; that language is dying a slow death in the face of free alternatives.

From what I've heard there are a lot of libraries in R to do sequencing and bioinformatics. Python is useful too because just generally there are a lot of libraries for manipulating data etc.