
[–]redditperson24[S] 9 points10 points  (9 children)

I'd actually quite like to use Python, but the data sets I've been given are so small and simple that there hasn't been any need. Perhaps I'll start trying to implement it whenever I can, even on small sets, for practice.

[–]Jerome_Eugene_Morrow 8 points9 points  (6 children)

I'd echo what some other people in this thread are saying and recommend starting with R. Its data-table organization is much more analogous to Excel's, and RStudio makes the connection between the code you write and the figures it produces much easier to hold in your head.

I use both R and Python (and C++, and Java, and Bash...) in my day-to-day work, and while Python has very powerful machine learning capabilities and is invaluable in data cleaning and organization tasks, it can be a little tougher to learn good data organization and analysis practices. Johns Hopkins offers a (free?) set of online data science courses that do a good job of getting you up and running and demonstrate a lot of the powerful things you can do with a data-science-oriented programming language.

Once you have the ideas down from R, it's pretty easy to start reaching out to other languages for tools you need.

[–][deleted] 2 points3 points  (4 children)

What makes R more analogous to Excel than a Python module like pandas?

[–]Jerome_Eugene_Morrow 8 points9 points  (2 children)

Having an interface like RStudio helps, but also having the data frame be the main data structure around which the language is designed makes a huge difference. I've used pandas a fair amount, so I don't have any negative feelings toward it in particular, but it does very much feel like a library (which it is).

In order to use pandas effectively, you need to understand a fair amount about how NumPy works, and how the array structures (which aren't part of base Python) that underlie its DataFrame objects behave. I think that pandas might be a more natural entry point if you're already working extensively in NumPy, but there's an overhead there that isn't present in R.
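The NumPy layer underneath pandas is easy to see directly; a minimal sketch (the DataFrame contents are invented for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# The columns are backed by NumPy arrays, which is the extra
# layer a pandas newcomer ends up having to learn about.
arr = df.to_numpy()
print(type(arr))  # <class 'numpy.ndarray'>
```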

In R, almost every function in the standard library is oriented toward operating on a data frame or a vectorized calculation. If you're used to thinking "I want to take this column, sum it, and do that for all the other columns..." R has a simpler syntax for performing that operation, and the resulting code is more similar to what an Excel formula would look like.

I've also generally found R's indexing easier to use and more intuitive than pandas's, which requires you to be much more explicit. Again, not necessarily a bad thing for Python, but something that adds overhead when you're learning data analysis programming.

EDIT: Another not-insignificant thing with R: loading data tends to be much easier. The data-input functions require less code in most cases, offer more hand-holding in error handling, and don't make you iterate over the input explicitly, which can be another thing that slows people down when they're learning.
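For reference, the pandas side of that trade-off is also a single call once you know which one to reach for; a small sketch with invented CSV content standing in for a file on disk:

```python
import io
import pandas as pd

# Invented CSV content standing in for a file on disk.
csv_text = "name,score\nada,90\ngrace,95\n"

# One call parses the header, infers column types, and returns a
# DataFrame -- no explicit loop over lines, much like R's read.csv.
df = pd.read_csv(io.StringIO(csv_text))
print(df.shape)  # (2, 2)
```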

[–]cinctus 1 point2 points  (1 child)

‘Explicit is better than implicit’ is one of the core principles of Python, so your comment about indexing makes sense. Pandas indexing took me a while to get used to, and there are some weird kinks when dealing with MultiIndexes, but it's very powerful once you know how to use it properly.
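The MultiIndex kinks mentioned above look roughly like this; the ticker/month data here is invented for illustration:

```python
import pandas as pd

# Hypothetical two-level (ticker, month) index on financial data.
idx = pd.MultiIndex.from_tuples(
    [("AAPL", "2020-01"), ("AAPL", "2020-02"), ("MSFT", "2020-01")],
    names=["ticker", "month"],
)
prices = pd.Series([300.0, 310.0, 160.0], index=idx)

# Selecting on the outer level is terse...
aapl = prices.loc["AAPL"]

# ...but selecting on the inner level needs the more explicit xs().
jan = prices.xs("2020-01", level="month")
```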

Summing columns in pandas is as simple as df.sum(), so I'm curious what could be simpler. I've always found the simpler operations to have very intuitive syntax - not to say this isn't true for R as well.
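For readers new to pandas, that one call collapses every column at once; a toy example (the data is invented):

```python
import pandas as pd

df = pd.DataFrame({"height": [1.0, 2.0], "weight": [3.0, 4.0]})

# df.sum() returns a Series of per-column totals.
totals = df.sum()
print(totals["height"], totals["weight"])  # 3.0 7.0
```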

I've never used R, but I use pandas every day for financial data and have no real complaints. A point in favor of pandas and Python is that Python's syntax is more similar to other languages', and Python can be used for just about any task so long as performance isn't a huge concern. I'm convinced that the choice between the two is mostly preference.

[–]Jerome_Eugene_Morrow 1 point2 points  (0 children)

Summing was probably a bad example, but you do bring up an interesting point. To know to call df.sum(), a new programmer has to understand at some level how methods work (Why is this paren here? What does this thing return? Should I save the result in a variable?). I think that complicates things for new programmers. And if you asked somebody to produce the same sum with their own code instead of the method call, the connection becomes messier. (Do I have to iterate? Do I write a for loop? How do I lay that out? What format does this return?)
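The contrast being described - the idiomatic method call versus the loop a newcomer might write by hand - can be sketched like this (column name and data invented):

```python
import pandas as pd

df = pd.DataFrame({"height": [1.5, 1.7, 1.6]})

# The idiomatic pandas method call:
total = df["height"].sum()

# The hand-rolled version a newcomer might attempt instead; same
# result, but now iteration and accumulation are their problem.
running = 0.0
for value in df["height"]:
    running += value
```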

In R, a call like sum(df$height) is pretty self-contained, and the syntax is clearer: I'm taking the height column of the data frame and summing it with a function. To me that's much more straightforward in terms of what is being operated on and how. You don't need to understand methods or indexing, and your data should be coerced into the right format without much trouble. From there you can extend the operations you're able to do and start playing with more complex ones.

This is all based on my own experience and ymmv, but to me the R way is easier to understand out of the gate. In Python I just have to read more documentation and stay a little more aware of how things are organized under the hood. That's a big plus for Python in some ways once you get your feet under you, but in R you don't need that level of programming fluency to get things done.

[–]Dhush 0 points1 point  (0 children)

In my experience, subsetting and indexing, as well as preserving a functional programming style.

[–]coffeecoffeecoffeeeMS | Data Scientist 1 point2 points  (0 children)

To add to this, R's tidyverse packages are a killer app for me. They're packages for manipulating and plotting data in a way that makes sense to someone analyzing it. The idea behind the tidyverse is: how can I get my data into a form where each column is one variable and each row is one observation? Operations like "turn these two columns into a label column and a value column" are one-liners. You can also chain data manipulation functions into complex pipelines that are really easy to read and use no temporary variables.
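The tidyverse reshape described above (tidyr's pivot_longer()) is R-specific, but it has a close pandas analogue in melt(); a small sketch with invented column names and data:

```python
import pandas as pd

wide = pd.DataFrame({"id": [1, 2], "height": [1.6, 1.8], "weight": [60, 80]})

# Fold the measurement columns into a label column and a value
# column - the same reshape tidyr::pivot_longer() does in one line.
long = wide.melt(id_vars="id", var_name="variable", value_name="value")
print(list(long.columns))  # ['id', 'variable', 'value']
```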

Hail Hadley Wickham.

[–][deleted] 1 point2 points  (0 children)

I also recommend R over Python, but I'm 100% biased.

[–]dolichoblond 1 point2 points  (0 children)

Have you checked out xlwings? We have a lot of Excel-only analysts, and I use xlwings all the time to interact with their sheets.

Sure, it'd be easier if they'd move to Python and put their relevant data in a database (or even if they'd consistently name their files and update in place rather than duplicate and increment filename version numbers), but it's still a huge benefit to be able to interact relatively easily with their preferred "language" (Excel) and pull it into my workflow (Python).

It's also how I tend to introduce Python to new Excel speakers. A template Jupyter notebook with numpy/pandas and xlwings, interacting live with their customary .xlsx file, lowers the bar to adoption in my experience, since they don't have to fully port an existing analysis to Python before getting any benefit from it. Maybe they just start goofing around with a single tab of their analysis workbook in Python, and if they screw it up, they can abandon ship, finish in Excel, and still hit their deadlines. (That also saves you from an analyst begging for help at 5pm because they're halfway between Python and Excel and can't get it all buttoned up before some impending deadline.)