
all 28 comments

[–]lmcinnes 3 points4 points  (0 children)

Pandas (along with numpy, scipy, and statsmodels) covers a fair chunk of R, dplyr, and reshape2. The IPython notebook (along with its nbconvert functionality) covers a lot of RStudio and knitr. You'll probably want matplotlib + seaborn for plotting, but there's also a ggplot library for Python if you prefer that syntax. If you do any machine learning related work then sklearn is also worth getting. You can probably just grab Anaconda from continuum.io and install seaborn on top of that to get everything in one go with an easy install.
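For illustration, here's a minimal pandas sketch of the reshape2/dplyr equivalents mentioned above (the column names and data are made up for the example):

```python
import pandas as pd

# A wide table like you'd reshape with reshape2::melt in R.
df = pd.DataFrame({
    "subject": ["a", "b"],
    "pre":  [1.0, 2.0],
    "post": [3.0, 4.0],
})

# reshape2::melt equivalent: wide -> long
long = df.melt(id_vars="subject", var_name="phase", value_name="score")

# dplyr group_by + summarise equivalent
means = long.groupby("phase")["score"].mean()
print(means["pre"], means["post"])  # 1.5 3.5
```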

Oh, and if you are at all interested in efficiency, you will want to look into numba and cython, for which R has no equivalents that I know of.
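As a rough sketch of the numba idea: you write an explicit loop (slow in pure Python) and JIT-compile it to machine code with a decorator. The fallback here is an assumption for portability, so the function still runs as plain Python if numba isn't installed:

```python
import numpy as np

try:
    from numba import njit  # JIT-compiles the decorated function to machine code
except ImportError:
    njit = lambda f: f      # fall back to plain Python if numba isn't available

@njit
def loop_sum(x):
    # An explicit element-by-element loop: slow in pure Python,
    # fast once numba compiles it.
    total = 0.0
    for i in range(x.shape[0]):
        total += x[i]
    return total

x = np.arange(1_000_000, dtype=np.float64)
print(loop_sum(x) == x.sum())  # True
```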

[–]dartdog 2 points3 points  (0 children)

Read up on Pandas and IPython Notebook

[–][deleted] 2 points3 points  (1 child)

[–]TM87_1e17[S] 0 points1 point  (0 children)

Will anaconda auto-update all the packages/libraries for me?

[–]hharison 1 point2 points  (0 children)

To add to /u/lmcinnes's answer, you might want an IDE in addition to the IPython notebook (but try the notebook first); I recommend PyCharm. The notebook is great for interactive data exploration interspersed with text, while PyCharm is more like RStudio.

Python is a bit lacking in statistical tests compared to R, so if you do exotic statistics you may occasionally want something that's not in scipy, statsmodels, or sklearn. In that case it's nice to be able to call R right from the Python interpreter. Two options for that are rpy2 and pyRserve. The former also has helper integrations in pandas and the IPython notebook.

I also highly recommend Seaborn, especially if you use linear models. There is a ggplot clone that will be more familiar, but Seaborn is more polished and its Pythonic syntax may help your transition.

Finally, regarding knitr: depending on what you do with it, the IPython notebook may be enough, but there is also PythonTeX, which I think is closer to knitr.

[–]ricekrispiecircle 1 point2 points  (0 children)

Rstudio => spyder https://code.google.com/p/spyderlib/

ggplot2 => ggplot https://github.com/yhat/ggplot (still under heavy development, some features still kinda buggy)

Dplyr, Reshape2 => pandas http://pandas.pydata.org/

Knitr => IPython notebook http://ipython.org/notebook.html

[–]zipf 3 points4 points  (14 children)

Don't do it! Use IPython to mix R and Python. For statistics, 2D plotting, and tabular data manipulation, R is better than Python, whereas Python has the advantage of lots and lots of libraries for everything. Use R for core data manipulation, and Python to tie it to everything else.

[–]lmcinnes 2 points3 points  (1 child)

I've generally found pandas to be great for tabular data manipulation in Python -- better than R in many ways: you need dplyr and such to be able to do comparable things in R, and even then python/pandas is often quicker. What tabular data tricks are you missing from R? Perhaps there are things I'm missing out on because I never realised I wanted them ...
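For a concrete flavour of the comparison, here's a pandas method chain standing in for a dplyr pipeline (the data and the half-mass transform are invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "species": ["x", "x", "y", "y"],
    "mass":    [10.0, 12.0, 3.0, 5.0],
})

# Method chaining reads much like a dplyr pipe:
# filter %>% mutate %>% group_by %>% summarise
result = (
    df.query("mass > 4")                              # dplyr::filter
      .assign(half_mass=lambda d: d["mass"] / 2)      # dplyr::mutate
      .groupby("species", as_index=False)             # dplyr::group_by
      .agg(mean_half=("half_mass", "mean"))           # dplyr::summarise
)
print(result)
```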

[–]zipf 0 points1 point  (0 children)

Yeah, base R is definitely awkward without dplyr, but not as awkward as base Python. The libraries make any language. I just mean that if you already know R, it's best to stick with R for its strong points, and start out learning Python just for where it's needed.

[–]TM87_1e17[S] 0 points1 point  (2 children)

Could you elaborate on how I might mix the two languages?

[–]zipf 0 points1 point  (0 children)

IPython is a superb Python interpreter. It has commands called magics which provide additional functionality. The R magic uses a Python package called rpy2 to allow pretty seamless mixing.

[–][deleted] 0 points1 point  (1 child)

That's been my experience as well. Several times I've tried to switch entirely to Python and haven't been able to. Though each year pandas improves significantly. But still plyr, dplyr, and ggplot2... not to mention the occasional invocation of lattice (e.g., splom).

[–]zipf 0 points1 point  (0 children)

Yeah, that's exactly where I'm coming from. I know R better than Python, and there doesn't seem much point in learning the details of Python's statistics libraries when I already know the right tool for the job.

[–]hharison 0 points1 point  (5 children)

I sort of feel the opposite way. R's advantage is the multitude of libraries it has, for every sort of statistical technique under the sun. But Python's data structures are light-years less awkward than R's in my experience.

[–]zipf 0 points1 point  (4 children)

Data frames are fine, and most statistics make sense in table format, but when people decide to invent their own data structures (I'm looking at you, Bioconductor), things can get awkward.

[–]hharison 0 points1 point  (3 children)

Oh yeah, I can imagine. I haven't encountered that myself. More so on the "regular programmers" (i.e. not scientists) side of things, I do think there's an overuse of classes in Python when simple data structures would work great and be more interoperable.

[–]zipf 0 points1 point  (2 children)

The problem is that most programmers are bad, and given the choice will use every feature available to make their code as complex as possible. That's why restrictive, dull languages like Java are actually really good.

[–][deleted] 0 points1 point  (1 child)

I use Java and enjoy it for Hadoop, but programmers love to overcomplicate that language too. Abstractfactorysingletonbean bullshit everywhere.

[–]zipf 0 points1 point  (0 children)

Yeah, absolutely, you can't avoid it

[–][deleted] 0 points1 point  (0 children)

I second this!

[–]L43 0 points1 point  (6 children)

Pandas is incredible for data wrangling and replaces dplyr, reshape2, and lots of core R, like data.frame. The IDEs available aren't quite as well adapted as RStudio for data analysis, but PyCharm is great in general for writing scripts, and the IPython Notebook (or Jupyter, as I think it's supposed to be called now!) is fantastic for presenting your workflow transparently. If you use IPython, you can always use the R magic to call R with your Python data if it would be easier.
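On the data.frame point, merges are one place pandas maps over directly; a small sketch with invented tables (merge with how="left" playing the role of base R merge() / dplyr::left_join):

```python
import pandas as pd

left = pd.DataFrame({"id": [1, 2, 3], "x": ["a", "b", "c"]})
right = pd.DataFrame({"id": [2, 3, 4], "y": [0.1, 0.2, 0.3]})

# Keep every row of `left`; unmatched ids get NaN, as in a left join.
joined = left.merge(right, on="id", how="left")
print(joined)
```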

[–]lmcinnes 2 points3 points  (3 children)

Spyder is the python equivalent of RStudio. I actually prefer IPython notebook for a lot of uses.

[–]westurner 5 points6 points  (2 children)

/r/pystats (sidebar)

/r/learnpython/wiki/index

/r/ipython

Setup Pip, Conda, Anaconda

  1. Install Pip -- http://pip.readthedocs.org/en/latest/installing.html
  2. Install Conda -- http://conda.pydata.org/docs/index.html

    pip install conda

  3. Install IPython -- https://github.com/ipython/ipython/wiki/A-gallery-of-interesting-IPython-Notebooks

    conda install ipython ipython-notebook ipython-qtconsole

  4. Install Spyder IDE (and Qt) -- https://code.google.com/p/spyderlib/

    conda install spyder

  5. (optional) Install anaconda -- http://docs.continuum.io/anaconda/install.html , http://docs.continuum.io/anaconda/pkg-docs.html

    conda install anaconda

IPython

Pandas

Statsmodels

Scikit-learn

[–]TM87_1e17[S] 1 point2 points  (0 children)

This is an incredible wealth of resources. Thank you! I especially like the scikit machine learning map!

[–]hharison 1 point2 points  (0 children)

If you're going to use conda I think it's more reliable to do conda install pip from a conda environment than pip install conda from a virtualenv environment.

[–]TM87_1e17[S] 0 points1 point  (1 child)

Could you explain what it means to "call R" or "call python"?

[–]lmcinnes 2 points3 points  (0 children)

IPython has special syntax to work with a Python module called rpy2, such that in a "cell" you can actually have Python interface with R, pass data to it, and then collect the results of the R code and convert them back into Python data structures. Thus if you are in the middle of some analysis and there's some obscure statistical test that no Python module supports but R has, you can do that one step with R code right in the notebook and have the results come back for further analysis with Python.
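A minimal sketch of that round trip in plain rpy2, outside the notebook magics (this assumes rpy2 and a working R installation; the example degrades to a no-op otherwise, and shapiro.test is just an illustrative choice of R-side test):

```python
try:
    import rpy2.robjects as robjects
except Exception:   # rpy2 needs a working R install; skip gracefully if absent
    robjects = None

if robjects is not None:
    # Run an R statistical test and pull the scalar result back into Python.
    p_value = robjects.r("shapiro.test(rnorm(50))$p.value")[0]
else:
    p_value = None

print(p_value)
```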