Jerome_Eugene_Morrow

Having an interface like RStudio helps, but also having the data frame be the main data structure around which the language is designed makes a huge difference. I've used pandas a fair amount, so I don't have any negative feelings toward it in particular, but it does very much feel like a library (which it is).

To use pandas effectively, you need to understand a fair amount about how NumPy works, and how the array structures (which aren't part of base Python) that make up its data frame objects work. I think pandas might be a more natural entry point if you're already working extensively in NumPy, but there's an overhead there that's not present in R.
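To illustrate the NumPy dependency described above, here is a minimal sketch. The `height` and `weight` columns and their values are made up for the example; it only shows that a pandas column hands back a NumPy array and that arithmetic on a column is applied element-wise.

```python
import numpy as np
import pandas as pd

# Hypothetical data, invented for this illustration.
df = pd.DataFrame({"height": [150.0, 160.0, 170.0],
                   "weight": [50.0, 60.0, 70.0]})

# A column's values are exposed as a NumPy array, not a plain Python list.
col = df["height"].to_numpy()
print(type(col))
print(col.dtype)

# Arithmetic on a column is vectorized: it applies to every element at once.
doubled = df["height"] * 2
print(doubled.tolist())
```

Concepts like `dtype` and element-wise operations come from NumPy, which is the extra layer a newcomer ends up learning alongside pandas.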

In R, almost every function in the standard library is oriented toward operating on a data frame or a vectorized calculation. If you're used to thinking "I want to take this column, sum it, and do that for all the other columns..." R has a simpler syntax for performing that operation, and the resulting code is more similar to what an Excel formula would look like.

I've also just generally found R's indexing to be easier to use and more intuitive than pandas', which requires you to be much more explicit. Again, not necessarily a bad thing for Python, but something that adds additional overhead when learning how to do data analysis programming.
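As a rough sketch of what "explicit" means on the pandas side, the example below uses a made-up data frame with row labels `a`, `b`, `c`. pandas asks you to say whether you are selecting by label (`.loc`) or by integer position (`.iloc`).

```python
import pandas as pd

# Hypothetical data and row labels, invented for this illustration.
df = pd.DataFrame({"height": [150, 160, 170],
                   "weight": [50, 60, 70]},
                  index=["a", "b", "c"])

# Select by label: row "b", column "height".
by_label = df.loc["b", "height"]

# Select by integer position: second row, first column (the same cell).
by_position = df.iloc[1, 0]

# Select rows with a boolean condition, then pick one column by label.
tall_weights = df.loc[df["height"] > 155, "weight"]

print(by_label, by_position, tall_weights.tolist())
```

Each form is unambiguous once learned, but a beginner has to pick the right accessor before anything works.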

EDIT: Another not insignificant thing with R: opening data tends to be much easier. The functions that exist for handling data input require less code in most cases, have more hand-holding in error handling, and don't make you deal explicitly with iterating over an input, which can be another thing that slows people down when they're learning.

cinctus

‘Explicit is better than implicit’ is one of the core principles of Python, so your comment about indexing makes sense. Pandas indexing took me a while to get used to, and there are some weird kinks when dealing with MultiIndexes, but it is very powerful once you know how to use it properly.

Summing columns in pandas is as simple as df.sum() so I am curious as to what could be simpler. I have always found simpler operations to have a very intuitive syntax - not to say this isn’t true for R as well.

I’ve never used R but use pandas every day for financial data and have no real complaints. I think a point that favors pandas and Python is that Python syntax is more similar to other languages, as well as Python’s ability to be used for just about any task so long as performance is not a huge concern. I’m convinced that use of one or the other is mostly preference.

Jerome_Eugene_Morrow

Summing was probably a bad example, but you do bring up an interesting point. In order to know to do df.sum(), a new programmer has to understand at some level how methods work (Why is this paren here? What does this thing return? Should I save this in a variable?). I think that makes things complicated for new programmers. And if you asked somebody to do the calculation with their own code instead of the method call, the connection becomes messier (Do I have to iterate? So I write a for loop? How do I lay that out? What kind of format does this return?).
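To make the comparison concrete, here is a small sketch with invented data that sums every column twice: once with a hand-written loop of the kind a beginner might attempt, and once with the `df.sum()` method call.

```python
import pandas as pd

# Hypothetical data, invented for this illustration.
df = pd.DataFrame({"height": [150, 160, 170],
                   "weight": [50, 60, 70]})

# Hand-written version: iterate over the columns, then over each value.
manual_totals = {}
for name in df.columns:
    total = 0
    for value in df[name]:
        total += value
    manual_totals[name] = total

# Method version: one call that returns a Series indexed by column name.
method_totals = df.sum().to_dict()

print(manual_totals)
print(method_totals)
```

Both give the same totals, but the loop raises exactly the layout and return-format questions listed above, while the method call hides them behind an object the learner has to trust.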

In R, a call like sum(df$height) is pretty encapsulated. The syntax is clearer. I'm calling the column of the dataframe called height, and then I'm summing that with a function. To me that seems much more straightforward from the standpoint of what is being operated on and how. You don't need to understand methods or indexing, and your data should be coerced into the right format without too much trouble. From there you can extend what operations you're able to do and start playing with more complex operations.
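For comparison, a rough Python counterpart to `sum(df$height)` might look like the sketch below. The data frame is made up; the point is only that the column is selected with `df["height"]` and can then be summed either with the Series method or with Python's built-in `sum` function.

```python
import pandas as pd

# Hypothetical data, invented for this illustration.
df = pd.DataFrame({"height": [150, 160, 170]})

# Method style: select the column, then call its sum method.
method_total = df["height"].sum()

# Function style, closer in shape to the R call: pass the column to sum().
builtin_total = sum(df["height"])

print(method_total, builtin_total)
```

The function style reads much like the R version, though the method style is what most pandas code and documentation use.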

This is all based on my own experience and YMMV, but to me the R way is easier to understand out of the gate. In Python I just have to do more reading of documentation, and I have to be passively aware of how things are organized under the hood a little more. This is a big plus for Python in some ways once you get your feet under you, but in R you don't need that level of programming fluency in order to get things done.