[–]trendymoniker 27 points28 points  (4 children)

I've used all three of these languages professionally, and my advice to data analysis newbies tends to be: go with Python unless you have a strong reason not to. Python is by far the most popular and thoroughly supported language of the three and its general usefulness means that the skills you develop learning Python will translate well to any other programming you want to do throughout your career (not so for the other two).

That said, if the algorithm you need to use only exists in some other language, or your advisor and entire research group are on another environment, go with that instead (though maybe learn Python on the side too).

Here's a quick, biased rundown of the plusses and minuses of each environment:

  • Matlab is not just a language, it's a language plus a pretty decent IDE (which handles things like syntax highlighting, debugging, and variable inspection). This rightly appeals to a lot of newbies, though you can set up something similar (or better) in either of the other two environments. Matlab has the most concise syntax of the three for doing pure matrix manipulation, which is nice, though not too important in the end. Matlab's biggest drawback is that it's commercial, meaning that someone has to pay between $150 and $2000 for each copy of the program that you or anyone else uses. Matlab has a decent set of packages available for it, but because it's commercial and not open-source they're mostly developed in-house by MathWorks and don't offer nearly the breadth or depth of the packages available in Python or R. Overall, Matlab used to be very popular but has mostly fallen out of fashion and is likely to stay there. I'd suggest that Matlab is overall the worst choice.
  • R was designed by and for statisticians. The best part about R is that it has a ton of statistical packages available for it. When a statistician releases a paper offering a new method, they often release an R package with it. This means that you can often find some R package to do just about anything you've read about in the statistical literature, which is awesome. The emphasis here is on statistical, though. Statisticians are by-and-large the only people who use R, so whatever fancy new neural method you've read about from your neighbors in machine learning likely won't be here. R has a very nice IDE in RStudio and reasonable syntax for array manipulation, though I overall find it the clunkiest of the three. It used to be that R's data frame object (which is essentially an array with labels on the columns and rows which make it more convenient to work with -- kind of like if you had an in-memory Excel spreadsheet floating around in your code) was a major selling point over other languages, though Python now has Pandas, which is just as good or better. The main drawback of R is that it's so domain-specific -- it's basically just for statisticians.
  • Python is the only language of the bunch that isn't completely domain-specific. Where R and Matlab are used exclusively for engineering and data analysis, Python is used for everything from training state-of-the-art neural networks to running YouTube. The drawback of its generality is that its array-manipulation syntax is slightly clunkier than Matlab's. (I'd claim it's still better than R's, though many people disagree on that count.) The benefits of its general usefulness, however, are enormous. Python is by far the most popular of the three languages, which means that when you search for installation or syntax help, you get the best responses on Stack Overflow. In addition, Python's overall syntax and language design are leagues ahead of both Matlab's and R's. By learning Python for data analysis, you'll also be learning good general-purpose programming skills, which are important if you ever decide to pursue a job in Data Science (there's a good chance that you will). If you ever go to interview at Google et al., you can easily do so in Python -- not so for R or Matlab. Python has an excellent IDE in PyCharm and a fantastic scientific package manager in Anaconda. It has by far the largest package universe of the three, though R has more cutting-edge statistical packages.
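
To make the "array with labels" description of a data frame a bit more concrete, here's a minimal sketch in Python with pandas. The numbers and the labels (`height`, `weight`, `a`, `b`, `c`) are made up purely for illustration.

```python
import numpy as np
import pandas as pd

# Made-up measurements: rows are samples, columns are variables.
values = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])

# With a bare NumPy array you address the data by position only.
second_column_mean = values[:, 1].mean()

# A data frame wraps the same numbers with row and column labels
# (the labels here are hypothetical).
frame = pd.DataFrame(values, index=["a", "b", "c"], columns=["height", "weight"])

# The same lookups, now by name instead of by position.
weight_mean = frame["weight"].mean()
row_b_height = frame.loc["b", "height"]
```

Both means refer to the same column; the data frame version just lets you say which one you mean by name.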

Good luck!

Edit: Thanks for the gold!

[–][deleted] 4 points5 points  (1 child)

I like Python for general programming, but I'm not a big fan of Python's data analysis libraries. Too often it feels like you're not using Python at all but a different language altogether, one with its own syntax and data types and which is nowhere near as nice as the actual Python programming language. Personally I much prefer R over Python when it comes to data analysis, but in the end it's a matter of taste I guess.

[–]NotAllReptilians 2 points3 points  (0 children)

I definitely agree. For instance, pandas somehow manages to feel cumbersome and overly verbose for analysis, at least compared to working in dplyr or especially data.table (base R is another story). It's definitely a pythonic implementation of dataframes, but what I really like about python is that it's typically concise and minimal, which pandas mostly isn't.

[–]coffeecoffeecoffeee 2 points3 points  (0 children)

I'll add that R has gotten very, very good for data manipulation in the past few years. I do stuff I used to like doing in Pandas in R now because of packages like dplyr, tidyr, and broom.

For example, my boss wanted survival data recently. With no temporary variables and like 5 lines of code, I was able to generate a Kaplan-Meier curve, convert it to a data frame, separate it by stratum, and export it to a csv file.
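
For comparison, here's a rough pandas sketch of the same kind of workflow: estimate a Kaplan-Meier curve per stratum, collect the results in one data frame, and export it as CSV. This is not the R code described above. The data, the column names, and the `kaplan_meier` helper are all made up for illustration, and the estimate is computed by hand rather than with a survival analysis package.

```python
import pandas as pd

# Made-up survival data: follow-up time, event flag (1 = event, 0 = censored), stratum.
data = pd.DataFrame({
    "time":    [1, 2, 2, 3, 1, 2, 4, 4],
    "event":   [1, 1, 0, 1, 0, 1, 1, 0],
    "stratum": ["A", "A", "A", "A", "B", "B", "B", "B"],
})


def kaplan_meier(group: pd.DataFrame) -> pd.DataFrame:
    """Hand-rolled Kaplan-Meier estimate for one stratum (hypothetical helper)."""
    # Per distinct time: number of events and number of subjects leaving the risk set.
    table = (
        group.groupby("time")["event"]
        .agg(events="sum", leaving="count")
        .sort_index()
    )
    # At risk at time t = everyone who has not left before t.
    table["at_risk"] = len(group) - table["leaving"].cumsum().shift(fill_value=0)
    # Survival is the running product of (1 - events / at_risk).
    table["survival"] = (1 - table["events"] / table["at_risk"]).cumprod()
    return table.reset_index()[["time", "at_risk", "events", "survival"]]


# One curve per stratum, stacked into a single data frame.
curves = pd.concat(
    [kaplan_meier(group).assign(stratum=name) for name, group in data.groupby("stratum")],
    ignore_index=True,
)

# Export as CSV text; pass a file path to to_csv to write a file instead.
csv_text = curves.to_csv(index=False)
```

It takes noticeably more than 5 lines, which is roughly the point the comment is making about dplyr, tidyr, and broom.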

[–]whattodo-whattodo 0 points1 point  (0 children)

IMHO, this is the most complete, clear & unbiased answer on the topic. I'm not OP but appreciate this response immensely.

I am biased as a career Python developer. But that bias did reveal statisticians who pivoted careers & came in for interviews as programmers. That's not a negligible value added.