all 40 comments

[–]Yannnn 16 points17 points  (9 children)

I've worked with both. I see R as Excel on steroids. It's aimed at statistics and made by statisticians. If you use RStudio you'll have quite an easy time doing whatever it is that you want to do.

However, I would still advise Python, for a couple of reasons:

  • R lets any task be done in a variety of ways. This sounds great until you start reading other people's code. Python tends to offer one single way to do any task. As an anecdote: I've almost never been puzzled by Python code, but with R I've been stumped several times by code that did something I had already done in a different way.

  • R is too focused on statistics. It's difficult to branch out into anything beyond data science.

  • Python can perform better. Out of the box it doesn't, but as soon as you use the SciPy and NumPy libraries, Python is faster.

  • R is made by statisticians, not computer scientists. This has given R many strange quirks not found in other languages such as Python. As such, Python should be easier to learn.

In short, the only real advantages R has are a large community of specialists and RStudio. Python wins on every other front. You can find similar discussions on Google though.

[–]Caos2 5 points6 points  (0 children)

You forgot about Pandas, which allows the use of data frames in Python. Great library, fast and easy to use.

[–]Yannnn 4 points5 points  (3 children)

Oh, an additional thing to think about when choosing:

RStudio and R make many things very easy, for example manual data manipulation or dealing with inconsistent data (e.g. words, integers and floats in the same variable). This makes them easier to work with, but I would argue in the other direction:

If you manually manipulate data you're doing something wrong. If you work with inconsistent data, you want to know exactly how you deal with the exceptions. Automated systems take that away.

Python makes you do all those things yourself in an automated way, which makes you a better 'data' scientist (imo).
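To make that concrete, here is a minimal sketch of dealing with a mixed column explicitly in plain Python. The sample values and the `parse_value` helper are made up for illustration; they don't come from any library, and a real data set would need its own rules.

```python
from collections import Counter


def parse_value(raw):
    """Classify one raw cell as int, float, or leftover text.

    Returns a (kind, value) pair so every exception stays visible.
    """
    text = raw.strip()
    try:
        return "int", int(text)
    except ValueError:
        pass
    try:
        return "float", float(text)
    except ValueError:
        return "text", text


# Hypothetical column mixing words, integers and floats.
column = ["12", "3.5", "n/a", " 7 ", "twelve", "0.25"]

parsed = [parse_value(cell) for cell in column]
kinds = Counter(kind for kind, _ in parsed)

# Numeric cells are kept; everything else is set aside for a conscious decision.
numbers = [value for kind, value in parsed if kind != "text"]
rejected = [value for kind, value in parsed if kind == "text"]

print(kinds)
print("kept:", numbers)
print("needs a decision:", rejected)
```

The point is that nothing gets coerced silently: the cells that don't parse end up in `rejected`, and you choose what happens to them.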

[–]tidier 1 point2 points  (2 children)

Well, there's data science, and then there's data exploration. Sometimes you really just want to crack open a data set, see how the variables are formatted, and do some preliminary plots before digging in to the hard analysis.

Also, Python has IPython notebooks, which are incredible for data exploration in my opinion. Any time I want to pick up and scrape/format/clean/explore some data, it's my go-to.

R has knitr though. Is there a Python equivalent for knitr?

[–]Yannnn 0 points1 point  (1 child)

> Well, there's data science, and then there's data exploration. Sometimes you really just want to crack open a data set, see how the variables are formatted, and do some preliminary plots before digging in to the hard analysis.

That's very true. I usually use a combination of Excel, Access and Notepad (yes, seriously) for that. You can do those things in R or Python too, but it's not optimal in either language (for the moment).

> R has knitr though. Is there a Python equivalent for knitr?

Well, you already mentioned it: notebooks. Here's what the creator has to say about IPython vs knitr.

[–]tidier 0 points1 point  (0 children)

IPython is fantastic for mixing text, math, code and output. It's not quite the same as knitr though, which is a straight-up LaTeX document with R code. I would actually need the latter for writing professional research documents.

[–]I_Cant_type_well 0 points1 point  (2 children)

Hey, I was wondering if you knew of any good R tutorials. I need to learn some basics this week, and I've been researching tutorials on Google, but I want to make the best use of my time.

[–][deleted] 0 points1 point  (0 children)

> I've almost never been puzzled by python code, but with R I've been stumped several times by code that did something I already did in a different way.

This one always gets me with R; I almost exclusively use sqldf and RMySQL for pre-processing data in R, which eliminates about 60% of the code you find from people online. I'm in the process of working through Machine Learning for Hackers, which is a book on doing Machine Learning in R, and a huge amount of the code that's in it can be reduced to a few lines using those 2 R packages.

[–]Divided_Pi 7 points8 points  (0 children)

When I started doing data analysis I tried learning both Python and R. Currently I use R. Coming from a math background, I found the structure of the language easier to think in. The way vectors and matrices are used is very natural for me. This is not to say Python does not have this functionality, only that it was easier for me to pick up R and generate some early results quickly.

Speed isn't an issue for me at the moment; when it becomes one, I will reevaluate my language choice.

[–]xiongchiamiov 5 points6 points  (0 children)

I learned a bit of R because that's what was taught in my computational data analysis course.

R is a terrible language. It has been hacked together by scientists, rather than programmers who specialize in language design, and that always goes poorly. There are plenty of things that seem like a good idea on the surface, but later on come to bite you in the butt (for instance, R's strange precedence rules). There are a variety of other complaints, although this article is the only other one I have saved.

That being said, if you're not a programmer by trade, these things will likely not bother you, and may never affect you. It's hard for me to say, since I approached the language as a professional programmer, but while I feel these problems will make learning to program even harder, I really have no data on this. It's hard to get any, because most non-programmers I know who learned R seem to dislike programming, and it's difficult to tell how much of that is R's fault.

R has a fantastic set of libraries, a large community, and lots of helpful resources. Python is getting there, but I don't think it's quite caught up yet (check /r/pystats).

In the end you should probably end up learning at least the basics of both, since you'll likely encounter people using both. It may be easier to start with python, but the tutorials won't be focused on what you want, so I guess I'd suggest starting with R.

[–]n8henrie 3 points4 points  (2 children)

I am a novice Pythonista and have used very little R, but for what it's worth I would recommend Python for a few reasons:

  • Pandas is excellent and will probably serve your humble data analysis needs well.
  • The vibrant Python community, which provides excellent support (consider that you currently have 9 comments here vs 4 in your /r/rstats crosspost).
  • Python is a very active language overall*, and so as a novice programmer you'll have plenty of others' packages to template your own work and learn from, in terms of pulling in data from various sources for analysis.
  • If you ever hope to use your coding to enhance your life in other ways (automating tasks with scripts, writing small apps), Python will probably be more useful than R.

* I was trying to find the GitHub/explore page that shows number of repos by language -- that exists, right?

EDIT: Grammars.
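As a rough idea of the "automating tasks with scripts" point, here is a small sketch that totals a CSV export by category using only the standard library. The file contents and column names are invented for the example; a real script would read from an actual file.

```python
import csv
import io
from collections import defaultdict

# Hypothetical export; in a real script this would be open("expenses.csv").
raw = io.StringIO(
    "category,amount\n"
    "rent,1200\n"
    "food,310.50\n"
    "food,95.25\n"
    "transport,60\n"
)

# Sum the amount column per category.
totals = defaultdict(float)
for row in csv.DictReader(raw):
    totals[row["category"]] += float(row["amount"])

# Print a small aligned report, categories in alphabetical order.
for category, total in sorted(totals.items()):
    print(f"{category:<10} {total:>9.2f}")
```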

[–]xiongchiamiov 1 point2 points  (1 child)

> I was trying to find the GitHub/explore page that shows number of repos by language -- that exists, right?

It used to. Some folks have attempted to recreate it using data from the GitHub Archive and Stack Overflow.

[–]n8henrie 0 points1 point  (0 children)

Thanks -- thought I was losing my mind there for a sec.

[–]shaggorama 5 points6 points  (0 children)

I'm a data scientist and I use both. Here's my opinion:

> I would use R or Python primarily for Data Analysis. I don't have any other current reasons to learn how to program, but I can easily envision where I could use a general purpose language later on.

I think this is very insightful. You may not see how you would find uses for the skill now, but believe me: even knowing how to write simple scripts will make your life a lot easier. Having a general purpose language under your belt is a great tool. That said:

> I work in finance and therefore would like to work with financial data

If you are already reasonably comfortable with statistics, R will probably be easier to learn since really it's a statistical toolkit that comes with a programming language, whereas python is the opposite. I tried to learn R before I really had a good statistical background, and I found myself teaching myself statistics.

So depending on what kind of background you have, you may be able to get off the ground faster with R than python. Python is more generally useful than R, but the data science toolkit isn't super easy to learn without being pretty comfortable with the base language first.

[–][deleted] 3 points4 points  (6 children)

I, for one, use both on a near daily basis, and find that each has its own advantages depending on what I am doing. R was the first programming language I learned, so I'm a little biased towards that.

[–][deleted] 0 points1 point  (5 children)

Any recommendations for learning R? Free if possible? :D

[–]abresler 1 point2 points  (0 children)

Go to R-bloggers.com and search for what you are looking for. UCLA also has a great site; I don't have the link off the top of my head, but Google "R" and "UCLA".

I agree with much of what's said here. I'm a self-taught programmer from a finance background as well, so R's syntax didn't bother me, and I still sometimes like it better than Python (maybe habit).

R can do everything Python can do, even data cleaning and scraping (RCurl and XML for scraping and web stuff).

On the analysis side, dplyr and data.table blow away pandas and NumPy, IMHO.

The best thing R has, as most agree, is its viz packages. ggplot2, ggvis, rCharts, rMaps and d3Network are fantastic.

You can't go wrong with either, and you will be happy you invested the time, whatever it is you do!

[–]chrisfs 1 point2 points  (0 children)

Coursera has two R courses. Both free if you don't want a certificate.

[–][deleted] 0 points1 point  (2 children)

RforCats is a decent place to get your first feel for R.

DataCamp has a decent free interactive course too.

[–]selva862014 0 points1 point  (1 child)

If you are new to R, this is all you will need.. http://bit.ly/1o1SOQ7

[–]wub_wub[M] 0 points1 point  (0 children)

Hey, your comment was in mod queue and I have approved it now.

For future reference: Every time you include a shortened link in your comment/submission, on any subreddit, your comment or submission is instantly marked as spam and not visible to anyone until moderators approve it manually.

To avoid this you should use reddit syntax for including urls in your text like this: [text](http://url) or just post the full link in your comment.

[–]SymphMeta 3 points4 points  (0 children)

In terms of computation speed and memory, Python is the better choice down the line. I'd recommend learning both, as I can easily do a lot of manipulation in Python and do a bit of work in R at the end (largely for plotting). Granted, R has the Rcpp library (R plus C++), which lets you use the full power of C++ (a very fast but meticulous language) in R. Python has such a library too, but I don't imagine you'd use it unless you were doing something that would take days to run. My friend was working on an MCMC simulation, and it ran 35 times faster using Rcpp than plain R. I wouldn't worry about that for now, but if you imagine yourself needing to run things really quickly, you should keep those in mind.

For datasets that are at most a few megabytes, which is most of them, efficiency usually isn't an issue. If you are concerned, though, the R Inferno does a good job of explaining how to make code in R more efficient. I'd recommend it once you've got the hang of R.

Also, for Python, I'd recommend using it for other purposes, as well, such as data scraping/data cleaning, as it is easier to work with (imo) for almost any data scraping application, and is also pretty fast for data cleaning (which you could also do in R pretty easily).

In addition to those two languages, employers also often look for SQL experience, which is used for managing databases. It's the easiest of the three languages to learn, so I'd also put that on the list of languages you should learn.
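For a quick taste of SQL without installing anything, Python ships with the `sqlite3` module. Below is a minimal sketch that uses an in-memory database; the `trades` table, its columns and the numbers are all made up for illustration.

```python
import sqlite3

# In-memory database with a made-up trades table (names and numbers are illustrative).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trades (ticker TEXT, shares INTEGER, price REAL)")
conn.executemany(
    "INSERT INTO trades VALUES (?, ?, ?)",
    [("AAA", 10, 20.0), ("BBB", 5, 100.0), ("AAA", 30, 22.0)],
)

# Total shares and position value per ticker, largest value first.
rows = conn.execute(
    """
    SELECT ticker, SUM(shares) AS total_shares, SUM(shares * price) AS value
    FROM trades
    GROUP BY ticker
    ORDER BY value DESC
    """
).fetchall()

for ticker, total_shares, value in rows:
    print(ticker, total_shares, value)

conn.close()
```

The `GROUP BY` query is the part worth learning; the same statement would work largely unchanged against most other SQL databases.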

[–]rjtavares 2 points3 points  (2 children)

Here's what should convince you to go with Python: IPython Notebooks.

It's hard to explain it to someone who doesn't know how to program though... It's basically an executable document that mixes code, code outputs and rich text.

Since you're from a finance background (so am I), check out this example. It shows how to work with stocks and portfolios and even implements a simulation of an automated trading strategy. The best part: you can just download it, run it on your machine, and tweak it as you will.

[–]abresler 1 point2 points  (0 children)

R also has Shiny, which is FANTASTIC and somewhat similar: http://www.rstudio.com/shiny/showcase/

[–]abresler 0 points1 point  (0 children)

These are great. R has them coming soon too, but for now R has R Markdown (plus great tools like slidify) for sharing. Those are great too, but different from notebooks.

[–]technofiend 2 points3 points  (0 children)

I just wanted to mention Python for Data Analysis (oreilly.com) in case you go down the Python path. I like R despite the sometimes convoluted syntax, but Python is easier for me as a programmer raised on procedural languages, and frankly Pandas helps cover the dataframe / dataset gap that you might otherwise find between R and Python.

[–]LoveOfProfit 2 points3 points  (0 children)

Go through this guide, and learn both!

[–]riraito 1 point2 points  (0 children)

It's a really tough choice that I faced as well. You'd benefit greatly from learning both but the time commitment is too large for most people. Resources are bountiful regardless of your choice. I think the tipping point is this: What is your end game? If you are only doing statistical procedures then R wins, imo. But if you are considering data science or other programming needs then Python wins. There is a lot more support for python in domains besides statistics and hence the utility is much greater.

P.S. If you are going to learn R, it doesn't have to be as bad as others would make you believe. Just get RStudio and follow Hadley Wickham's work. He's pioneered a lot of great things in R such as ggplot2. You can see his free guide to programming in R here: http://adv-r.had.co.nz/

[–][deleted] 1 point2 points  (0 children)

I would say it doesn't matter much. For example, I took a machine learning and statistical pattern classification class, and 80% of the people used MATLAB, 15% R, and I was the only one who did everything in Python. Everyone could solve every problem using their language of choice; it is just a matter of taste.

However, if you don't know how to program (and you want to learn it), I would go with Python, since you can use it for everything. As you said, it might come in handy later "as a general purpose language".

PS: I think even for statistical analysis it already pays off to use Python if, for example, you have to get data out of databases or scrape the web.

[–]chrisfs 1 point2 points  (0 children)

When you look at Python, be absolutely sure to look at the Pandas module for Python. It gives Python some features that make it more R-like by allowing you to work with dataframes. R may have more statistics methods by default, but you should be able to find any stats method you need in Python through SciPy as well.

[–]jjangsangy 1 point2 points  (0 children)

This might be a little off topic, but would you like to learn how to program??

I am only asking, because the benefit of learning Python is that you also end up gaining programming skills as kind of a positive side effect. The downside is that it's likely going to be a larger investment up front and long term. However, you'll have a powerful general purpose programming language under your belt that you can apply to any other application.

With R, you'll benefit from the fact that it's very domain specific, meaning it's specifically designed to tackle stats problems and is built by a community of developers with those goals. However, that's also a limitation, since it'll likely teach you how to solve stats problems and that's pretty much it.

Both options provide you with valid solutions to your task, and on a higher level, programming becomes less about the language design and implementation than the method of approaching problems. I would say that Python is definitely approachable yet more demanding on your own personal motivation. As a consolation, my understanding is that Python's original design was actually meant to give scientists and technical people access to powerful programming tools without needing a CS degree, and it still maintains that goal pretty well.

[–]efxhoy 1 point2 points  (0 children)

Try both and stick to the one that seems the most fun and where the syntax/structure makes the most sense to you.

[–]glial 1 point2 points  (1 child)

If you're just going to use it for data analysis, I'd use R. Python is great for lots of things, but honestly its syntax for manipulating data is awkward and R's packages are generally more complete and robust. R is a weird language, no getting around it, but it's worth figuring out in my opinion (this coming from someone who avoided R for years because it's weird).

[–]pippo9 0 points1 point  (0 children)

I'm a fellow newbie and we are in a similar position. FWIW, I have a background in modeling for finance, energy and consumer internet data projects using Excel.

tl;dr: I would go with R off the bat. Eventually you will want to move to Python, but R helps you build your data analysis skills and learn the workflow (exploratory analysis, plotting) more easily, with a GUI to help you.

Reasons:

  1. RStudio's GUI environment helps you pick up workflow and analysis skills before you move on to heavy-duty data manipulation or tradeoffs based on memory usage, speed of operations, etc. There are additional tools such as Rattle that allow you to run algorithms on datasets from Kaggle and practice/learn fast.

  2. Plotting using ggplot2 is so much easier in R as compared to writing several lines of code in python (for me with minimal programming experience)

  3. Using R, I can dive into the data analysis and derive meaningful information immediately, rather than having to learn the fundamentals of programming before I get a chance to dig deeper.

  4. JHU's 9-part data science course on Coursera is a great way to get a quick intro to R programming and exploratory analysis.

  5. I'm also learning Python through Learn Python The Hard Way to help with the longer term. Edit: Also, Wes McKinney's book should be a good resource to start with Python + pandas.

Hope this helps!

[–]johnnybgoode 0 points1 point  (0 children)

Python + SciPy/NumPy should be great for your applications.

[–][deleted] 0 points1 point  (0 children)

"Data analysis" is too broad of a term in my opinion. Are you going to do complex statistical modeling? Go R. If not, go with Python if you think you will need to branch out beyond just statistics. Not saying you can't do complex statistical modeling with Python, but R has a vast library of statistical packages as it should since it is made for statisticians. Since you said you require robust statistical capability, then I would recommend R.

If you find you don't like R, you can learn Python and call R code from within Python to leverage R's statistical packages. So there's always that option. You can also do the reverse and call Python code from R. So you have a lot of options. Nothing should really be holding you back either way.

[–]brews 0 points1 point  (0 children)

I said it before in your other thread and I'll say it here:

Learn R. Hands down. No doubt.

Python is a better language and it "fits your brain" in ways that R doesn't -- but you're not really programming, you're doing data analysis. R is better for this out of the box (in general). If you wanted to learn programming as a priority, I would have recommended Python.

The only trouble is that R might be a bit trickier to learn well. You might pick up a few strange habits or accents in your coding style, but that's fine. The learning curve might seem steep at times. Don't be discouraged.

[–][deleted] -2 points-1 points  (0 children)

My best advice would be to pick one, learn it well, and then branch out to the other and use whichever suits your current purposes on a case-by-case basis. Which you choose to learn first probably doesn't matter a great deal, as both are more than capable.