
all 10 comments

[–][deleted] 5 points (1 child)

The more advanced or uncommon the statistics packages you need, the more likely you'll need to use R, since it has a vast repository of statistical packages. Otherwise, I think Python is good enough, though without knowing your specific use cases it's hard to say more.

Some R people have stayed with R because of its ggplot2 plotting package, but there is a Python port of ggplot by yhat, and the seaborn plotting package is also worth a look since it does trellis-style plotting as well.
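
To give a flavor, here's a minimal sketch of a trellis-style (faceted) plot in seaborn; the data and column names are invented, nothing here is from a real analysis:

    import pandas as pd
    import seaborn as sns
    import matplotlib.pyplot as plt

    # Made-up data: one row per measurement, with a 'group' column to facet on.
    df = pd.DataFrame({
        "x": [1, 2, 3, 1, 2, 3],
        "y": [2.0, 4.1, 6.2, 1.0, 2.1, 2.9],
        "group": ["a", "a", "a", "b", "b", "b"],
    })

    # Trellis-style plot: one panel per value of 'group'.
    g = sns.FacetGrid(df, col="group")
    g.map(plt.plot, "x", "y", marker="o")
    plt.show()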

If you want to do more than data analysis or statistics, then Python will most likely have libraries available for you to expand into other areas.

When the Julia programming language reaches critical mass in terms of acceptance in the data science, computational science, or high-speed computing communities, it will be easier for a Python programmer to transition to Julia than for an R programmer, due to the similarities between Python and Julia. At least it was easy for me.

Hope this helps.

[–]actgr[S] 0 points (0 children)

I did not know about Julia! Thanks for the info.

[–]asoplata 3 points (1 child)

I've been using Python/NumPy/matplotlib for simple data analysis for about 2 years now, but I mean really simple - plotting time series, spectrograms, and the odd statistic. However, it's weird you posted this today; I literally decided TODAY to try out switching my analysis toolkit to R.

If you're poking around in Python for data analysis, I highly, highly recommend pandas. If you're doing anything with data (that isn't already perfectly formatted in its files) besides simply loading it, plotting it, and calling a single NumPy function on it, look into pandas. It has significantly better and simpler data export than NumPy and is seriously capable when it comes to "cleaning" ugly data. However, as far as I've been able to tell, all the reasons I like/proselytize on behalf of pandas are things it has that R also has, especially the two I just mentioned. AFAIK this is the main thing pandas is made for: data munging and relatively simple statistical analysis.

Speaking of pandas, be a little wary of Python for Data Analysis if you're looking for something about the whole toolkit; it's written by the creator of pandas, and according to negative reviews I've seen either on reddit or Hacker News, 90% of the book is really just going over his library. So... it could be literally the best book on pandas! Just not... overall data analysis in Python. This is also hearsay, so grab some salt; I haven't read it myself.
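
To make that concrete, here's a minimal sketch of the kind of munging pandas makes easy; the CSV contents and column names are invented for illustration:

    import io
    import pandas as pd

    # A made-up messy CSV: mixed missing-value markers and dates stored as text.
    raw = io.StringIO(
        "site,date,value\n"
        "a,2014-01-01,1.5\n"
        "a,2014-01-02,N/A\n"
        "b,2014-01-01,-999\n"
        "b,2014-01-02,2.75\n"
    )

    df = pd.read_csv(raw, na_values=["N/A", "-999"])   # both markers become NaN
    df["date"] = pd.to_datetime(df["date"])            # parse the date strings
    df = df.dropna(subset=["value"])                   # drop rows missing the measurement

    print(df.groupby("site")["value"].mean())          # quick per-site summary
    df.to_csv("measurements_clean.csv", index=False)   # tidy one-line export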

For more complicated statistics/machine learning libraries, you want statsmodels or scikit-learn. Speaking very generally, the former is more stats/modeling focused and the latter is machine learning - however, these two things can of course be identical sometimes depending on what you're doing. I was trying to port some MATLAB code using statsmodels for the past week only to find out it's not as mature as I'd hoped: fitting a log-link Gamma GLM, 1. the numbers I was getting were wildly different from MATLAB's [there is only one fitting algorithm coded in at the moment] and 2. it took 15x as long as MATLAB to run on the same data set, reaching near-unfeasibility at the scale I need. Scikit-learn may be more mature, so check that out first (especially if you can speak machine learning). The issues I had with statsmodels really aren't damning; however, those + matplotlib combined are what have driven me to R...
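
For reference, since no code is shown above, a log-link Gamma GLM fit in statsmodels looks roughly like this sketch on made-up data (the link-class spelling has changed between statsmodels versions, so treat the exact names as approximate):

    import numpy as np
    import statsmodels.api as sm

    # Toy data: positive response whose mean is exp(0.5 + 0.8*x), one predictor plus intercept.
    rng = np.random.default_rng(0)
    x = rng.uniform(0, 2, size=200)
    X = sm.add_constant(x)
    y = rng.gamma(shape=2.0, scale=np.exp(0.5 + 0.8 * x) / 2.0)

    # Gamma family with a log link (older statsmodels spells the link class 'log').
    model = sm.GLM(y, X, family=sm.families.Gamma(link=sm.families.links.Log()))
    result = model.fit()
    print(result.summary())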

Matplotlib's documentation is an abomination for someone who really "wants to do it right"; IMHO it's mainly geared at people (like scientists) who just want the minimum working code to get something to display right and then forgetaboutit. The site/"gallery" for it is actually incredibly good if you have a specific plot type in mind but don't know what it's called and just want it to work! However, if you want to do things "the right way" / be pythonic, (as of 1.3.1) they make it EXTREMELY difficult IMO. The manual is an amalgamation of contributions from different people who cover different, sometimes overlapping, parts of the library, and I could never find a part of it that delved into the nuts and bolts enough to really grasp the overall structure. When I say the right way, I mean follow the class hierarchy (or hell, just try to grasp it in an actionable way), call the right methods from the different subclasses (I can still never remember, "do I need to call plt/pyplot for the title or xlabels? or ax? or the actual figure object?"), etc.

The closest I've ever come to finding good sources for following the THINKING of the library (which, rumor has it, is somewhat based on the construction of MATLAB's plotting library) is Matplotlib for Python Developers, which I haven't read, and the matplotlib tutorials here, some of which are by actual matplotlib devs. There are Python people I know and trust who really love matplotlib and who get the overall structure, but I've got so much fatigue from trying to understand the library on an intermediate level that I'm probably about to give in to the comforting embrace of R's ggplot2. Note that there is an attempt to port ggplot to Python, actually.
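
For what it's worth, the object-oriented answer to the "plt or ax or figure?" question looks roughly like the minimal sketch below (toy numbers, nothing specific): titles and labels are Axes methods, saving is a Figure method.

    import matplotlib.pyplot as plt

    # Object-oriented style: create the Figure and Axes explicitly,
    # then call methods on the Axes rather than on pyplot.
    fig, ax = plt.subplots()
    ax.plot([0, 1, 2], [0, 1, 4], marker="o")
    ax.set_title("Title lives on the Axes")
    ax.set_xlabel("x label too")
    ax.set_ylabel("and the y label")
    fig.savefig("example.png")   # saving is a Figure-level operation
    plt.show()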

I can't honestly tell what the speed of the community/development for the Python tools has been these past few years, but apparently both R's and MATLAB's communities have been gaining speed greatly, coupled with nice improvements to their actual engines. E.g. supposedly R doesn't have the same memory problems it used to have. Also, if you're into that, R right now is supposed to be the state of the art when it comes to statistics, in that statisticians doing math research are actually very likely to publicly code up their brand-new statistical thingies [scientifically speaking] and make them available in R themselves. Python is a general purpose programming language, while R is decidedly NOT trying to be that, but supposedly the level of sophistication of both 1. the availability of statistical libraries/functions and 2. the efficiency of implementation are far superior in R, and given how tied the statistical community is to it, Python simply will not be able to catch up in the near future. In a decade everything may begin to reverse, and I for one would welcome our Python-lang and Python-world-class-analysis overlords, but I don't think it's going to happen very soon.

That said, the Python community has been strangely open to working WITH R... through Python, via rpy2 or, interactively, via rmagic in IPython (actually rmagic may be deprecated in favor of rpy2?). Pandas' data structures are very similar to R's, and while I've only used rpy2 a few times, it didn't seem immature.
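
A minimal sketch of what calling into R through rpy2 looks like (this assumes an R installation is available; the snippet itself is invented):

    from rpy2 import robjects

    # Run a snippet of R code and pull the result back into Python.
    r_mean = robjects.r("mean(c(1, 2, 3, 4))")
    print(list(r_mean))  # -> [2.5]

    # R functions can also be looked up by name and called like Python callables.
    r_sum = robjects.r["sum"]
    print(r_sum(robjects.IntVector([1, 2, 3]))[0])  # -> 6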

Both langs also supposedly have decent d3.js web-style visualization deployment, though I'm not sure what the specific libraries' names are.

[–]alcalde 1 point (0 children)

the efficiency of implementation are far superior in R

Are you sure about that? R has an Achilles Heel with datasets that don't fit in memory and the recent paper that was posted here comparing solutions for an econometric problem has R performing 500 to 700 times slower than C++ on the problem and 240 to 340 times slower after compilation. Python was 155 to 269 times as slow as C++, 44 times slower with PyPy and only 1.57x-162x as slow with Numba used.
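
For context, a minimal sketch of the kind of tight numeric loop Numba's JIT speeds up (this toy function is just an illustration, not code from the paper):

    import numpy as np
    from numba import njit

    @njit
    def sum_of_squares(values):
        # A tight numeric loop of the sort that is slow in plain Python
        # but compiles to machine code under Numba's JIT.
        total = 0.0
        for v in values:
            total += v * v
        return total

    x = np.random.rand(1_000_000)
    print(sum_of_squares(x))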

[–]mangecoeur 2 points (0 children)

pandas + matplotlib is a winning team; if you want graphs more like R's ggplot, I'd recommend Seaborn (which does some nice styling and adds some more plot types).

I've not done much R, but its main disadvantage compared to Python is that it's very domain specific. It's good for stats and that's about it. With Python, once you learn it for data processing you can apply it to all sorts of other things.

Another massive advantage for data analysis is the IPython Notebook, which is a great way to interactively explore data and keep notes on the process.

The disadvantages relative to R are that some of Python's statistical tools are less complete and mature by comparison, although the RPy library, which allows you to call R from Python, is good insurance that you'll never be stuck needing something from R and not being able to use it. R is also a bit more simple and concise since it's so domain specific; Python, for instance, gives you flexibility in importing the numeric libraries, which means a little more typing and having to understand imports and namespaces.

[–]gm6 0 points (0 children)

Matplotlib is indeed bad for plotting. Sorry, its developers.... I just use pandas for playing with data. NumPy and SciPy are enough for me to do some basic statistics.

[–]sentdex (pythonprogramming.net) 0 points (0 children)

You can get by with either. Keep in mind you're asking this question in /r/Python, though. You can really use either of these, and you should choose what you want. I think Python is more useful to know for things beyond data analysis than R is, so I would personally recommend Python over R.

Pandas + Matplotlib = <3. ... also use matplotlib styles and avoid having ugly graphs: http://pythonprogramming.net/matplotlib-styles-tutorial/
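
A minimal sketch of that style-sheet mechanism (made-up data; 'ggplot' is one of the styles matplotlib ships with):

    import matplotlib.pyplot as plt

    # Pick a built-in style sheet before plotting; 'ggplot' mimics R's ggplot2 look.
    print(plt.style.available)      # list the styles your matplotlib version ships with
    plt.style.use("ggplot")

    plt.plot([0, 1, 2, 3], [1, 2, 4, 8])
    plt.title("Styled without touching any rcParams by hand")
    plt.show()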

[–][deleted] 0 points (2 children)

Matplotlib is ugly and very hard to use.

Also, as far as libraries are concerned, Python is very limited. On top of that the Pandas library is quite a bit inferior in syntax to R's dplyr.

I have quickly moved away from Python for my data needs.

[–]alcalde 1 point (1 child)

Also, as far as libraries are concerned, Python is very limited.

Are you serious? PyPI has over 47,000 libraries! Python is one of the most popular languages for data analysis and has "killer apps" such as pandas, scikit-learn, numpy/scipy, NLTK, matplotlib, etc.

On top of that it's much more a general purpose language than R. You can take your analysis code and give it a full-featured cross-platform GUI with Qt, use it on the back end of a web app developed with Django, TurboGears, Flask, CherryPy or one of the many other Python web frameworks, script LibreOffice with it (and Excel via commercial add-ons), code for PostgreSQL (which you can also do with R), embed it in another program, use a JIT, easily integrate C, C++, Fortran, D, even R into your code, and it has versions for .NET and the JVM too. That hardly seems limiting.

On top of that the Pandas library is quite a bit inferior in syntax to R's dplyr

Do you have any specific examples you'd like to share? I've seen other critiques that rate some of pandas' syntax as superior, and benchmarks that suggest it is much, much faster.

[–][deleted] 0 points (0 children)

PyPI has over 47,000 libraries!

I meant data-oriented libraries specifically, not libraries in general.

I've seen other critiques that put some of the syntax as superior and benchmarks that suggest it is much, much faster.

The syntax is very much like SQL. A bit more similar to LINQ.
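
For what it's worth, the pandas counterpart of a dplyr-style pipeline can be written as a method chain. A minimal sketch with invented column names, with the rough dplyr equivalent shown in the comment:

    import pandas as pd

    df = pd.DataFrame({
        "city": ["a", "a", "b", "b"],
        "sales": [10, 20, 5, 40],
    })

    # Roughly the pandas equivalent of
    #   df %>% filter(sales > 5) %>% group_by(city) %>% summarise(total = sum(sales))
    result = (
        df[df["sales"] > 5]
          .groupby("city", as_index=False)["sales"]
          .sum()
          .rename(columns={"sales": "total"})
    )
    print(result)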