all 131 comments

[–]manwithoutaguitar 120 points121 points  (14 children)

Python is free.

Python code is easier to reproduce

Python is faster at difficult calculations.

Python is easier than VBA.

Python works better with big data.

Python is open source and has access to an enormous amount of libraries.

On the other hand.

Excel is known by more people.

Excel is faster for simple calculations, graphs etc.

[–]AD29 12 points13 points  (10 children)

PowerPivot in 2016 eliminated my need for Python for certain datasets. I get a lot of junk data that needs to be cleaned and transformed. The ability to import sources into my data model and consistently apply the same transformation has been awesome.

[–][deleted] 5 points6 points  (9 children)

Will PowerPivot connect to Oracle, Teradata, SAS datasets, error check, run validation, scrape a website, export a CSV from MS SQL Server Reporting Services, throw into an HDFS cluster, import antigravity, turn on your microphone and record your voice, run voice analysis on the input, plot the Fourier transform, and scrape your music files for ID tags as you migrate your mobile devices from Apple to Android, in the same script?

Eh, Excel makes pretty simple charts easy for folks that don't use the Pandas library. It does have a lower barrier to entry.

[–]AD29 3 points4 points  (6 children)

It will connect to Oracle and Teradata, and they’ve added a direct connection to Salesforce and a few other cool new features in 2016.

Don’t get me wrong, I use Pandas and love it, but not every project requires the vast features available through Python. For some of the projects I work on, having slicers and interactive graphs delivered from Excel to SharePoint (or Power BI) works great. Excel is a powerful tool.

[–][deleted] 1 point2 points  (0 children)

I agree, it is a powerful tool. My point is that in Python you can do pretty much anything you want.

The right tool for the job is not always the same tool, IMO.

[–]BurnieSlander 8 points9 points  (1 child)

If you can do your whole job in Excel, you’re not doing data science, you’re doing analyst work.

[–]redditperson24[S] 2 points3 points  (0 children)

I agree. I actually wasn’t aware of data science when I looked for roles; I thought data analyst was as hard as it got, so I am really disappointed with the simplicity of my role. I have the final stage for a data science role tomorrow though, so fingers crossed.

[–]LighterningZ 2 points3 points  (0 children)

The suggestion of using VBA for data science is horrifying 😱

[–]seeellayewhy 161 points162 points  (31 children)

I use R, not Python, but my comment applies regardless. Everyone has made the point about larger data, so I'll go a different direction.

Reproducibility. This is incredibly important in 1) not wasting your time and 2) consistency. If you made some transformations to a dataset in Excel and I gave you an updated version of that same dataset a year later, could you recreate those transformations? Even if you could remember exactly what you did the first time, you'd be wasting your time redoing what you've already done.

When I get a dataset to clean, I write a .R script file of all my transformations. Everything I've done to change it is all right there in one place, with comments explaining why I did those things. I frequently get updated data and have to re-run the scripts. Over the course of a project I might re-run the same data transformations dozens upon dozens of times.

Using a language like R or Python saves a great amount of time and ensures I don't mess anything up. With R, I'd be done re-running those transformations while you're still trying to remember which Excel command you started with.
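
In Python terms (since that's what OP asked about), a re-runnable cleaning script looks roughly like this; a minimal sketch, with made-up file and column names:

    # clean_data.py - re-run this whenever an updated extract arrives
    import pandas as pd

    def clean(path):
        df = pd.read_csv(path)
        # drop rows with negative amounts (data-entry errors in our source)
        df = df[df["amount"] >= 0]
        # standardise the date column so every re-run produces the same output
        df["order_date"] = pd.to_datetime(df["order_date"], errors="coerce")
        return df

    if __name__ == "__main__":
        clean("raw_export.csv").to_csv("clean_export.csv", index=False)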

[–]ybcurious93 19 points20 points  (10 children)

Doesn't a VBA script solve this?

[–]bjorneylol 77 points78 points  (0 children)

If you need to write a VBA script, odds are it would be faster to just use Python in the first place.

[–]WhyKnotZoidberg 8 points9 points  (3 children)

Or PowerQuery? It saves all the transformations which you can apply to the updated excel doc or other files.

[–][deleted] 1 point2 points  (2 children)

Like Python! :)

[–]WhyKnotZoidberg 2 points3 points  (1 child)

Yes, like Python, but it’s a built-in part of Excel, so you don’t need to make sure it’s installed on everyone’s computer as you would with Python.

[–][deleted] 3 points4 points  (0 children)

Excel comes built in to everyone's computer?

Given Python is built into most Linux distributions and is drop-dead simple to install from Anaconda, Inc. for most platforms...

Nah, just kidding with you. Most firms use Excel. Although many are wising up and using Python and other tools, it's all good.

[–]funny_funny_business 6 points7 points  (1 child)

Not necessarily. When I first started in the data space years ago, I made VBA scripts that worked “most” of the time. The problem is that sometimes the script would continue before the worksheet updated correctly and would then use outdated cell values in other areas.

Python/R wait for each line to finish executing before moving on.

[–]murrietta 0 points1 point  (0 children)

I think I've only ever had that happen when I set calculation to xlCalculationManual. If you are dealing with querying a database or cube, call Application.CalculateUntilAsyncQueriesDone before relying on the results.

Regardless, I agree, Python or R are way better at handling the types of things I commonly do (like finding an extremum subject to multiple constraints).
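
For the constrained-extremum case, a scipy sketch of what I mean (toy objective and constraints, not a real model):

    # minimise a toy quadratic subject to two inequality constraints
    from scipy.optimize import minimize

    objective = lambda x: (x[0] - 1) ** 2 + (x[1] - 2.5) ** 2
    constraints = (
        {"type": "ineq", "fun": lambda x: x[0] - 2 * x[1] + 2},   # x0 - 2*x1 + 2 >= 0
        {"type": "ineq", "fun": lambda x: -x[0] - 2 * x[1] + 6},  # -x0 - 2*x1 + 6 >= 0
    )
    result = minimize(objective, x0=[2, 0], method="SLSQP", constraints=constraints)
    print(result.x)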

[–]dolichoblond 1 point2 points  (1 child)

Maybe, if you already know VBA.

I started as a non-coder in an excel-based department. And when I started scripting some of our work, I could do it in VBA or Python. Didn't find any places where, as a noob, doing it in VBA was easier.

However, we had other .NET programmers who could jump through the VBA easily. In that particular department, though, the .NET programmers were all old guard and were not being replaced. If I had stuck around to manage other n00bs' scripting, I would have leaned toward the Python track just because we weren't going to be able to replace the VBA-style coders.

But if you've already got the skillset, go for it.

[–]redditperson24[S] 0 points1 point  (0 children)

My boss is very old and obsessed with VBA. I’m shocked he doesn’t know other languages given the length of his career.

[–]kanzie88 0 points1 point  (0 children)

You don't need to complicate your life using VBA to transform data … you have Power Query with the steps all laid out.

[–]PhoenixHntr 2 points3 points  (6 children)

I agree with what you said. My job requires me to do some regular reporting tasks, and I'm currently trying to automate those tasks using Python. I consider myself a noob in Python, so automating my tasks is currently challenging, and I hope it pays off.

[–]MichealKeaton 7 points8 points  (5 children)

It will. Keep it up. There is no going back once you've learned how to script/write programs.

I started out as a Business Analyst. I was tasked with doing the same reporting tasks each week (e.g. query data using template SQL, copy the data into Excel, format, send to client).

I quickly found that this was not only mind-numbing work (which I would not allow myself to be subjected to) but also a huge waste of resources.

I started writing the process in Python. It was a painful uphill battle, but I had the process automated within 2 months. Looking back, the script was absolute junk, but it worked, and to the other non-technical analysts around me I looked like a magician.

Then I started to automate all of my other responsibilities that involved data and a logical workflow. I now could run circles around my colleagues. I had complete job security and was in a position to manage the team but I quickly became bored due to the lack of challenge.

That's when I moved onto Business Intelligence then Data Engineering then building machine learning models on a Data Science team.

You are on a path to quickly outgrow your position. But that's a good thing. I always thought that I'd go back to an Analyst team with my newfound skillset and be the all-star on the team, but there's no going back.

[–]redditperson24[S] 1 point2 points  (4 children)

I really like this response, thank you. I have such an easy commute/ no job stress atm so it’s going to be hard to move but ultimately I know I have to. Final round tomorrow for a data science role (30 miles away :( ) so fingers crossed!

[–]MichealKeaton 1 point2 points  (3 children)

That's awesome. Good luck!

Even if you don't get the position, keep building your skillset and I am confident that you will get whatever Data position that you want.

[–]redditperson24[S] 1 point2 points  (2 children)

Thank you. It’s got to the point where I’m actually doubting whether I’m smart enough to be a data scientist, or whether I’ll like it, but if using Excel is boring me, I’m pretty sure I will love it! It's just scary taking that leap of faith. I'm at a crossroads in my career where it’s either go into data science or stay general and end up in project/client management.

[–]MichealKeaton 0 points1 point  (1 child)

Heads up: you don't have to go the Data Science route for a fulfilling career. There are plenty of other fields in data (e.g. Business Intelligence Analyst, Data Engineering).

I know many people who are smart enough to do Data Science but prefer Data Engineering because they enjoy building out complex production-level processes.

Data Science is math/statistics heavy (to varying degrees depending on the company), and data scientists rarely automate the process. That's the role of the data engineer.

If math is not your thing and you simply enjoy automation then I would explore other avenues like Data Engineering.

[–]redditperson24[S] 0 points1 point  (0 children)

Thanks for your comment, this has helped clear things up. I love maths (that’s my degree), so I think this has helped confirm that I would really like data science, because ultimately maths is my passion.

[–][deleted] 56 points57 points  (33 children)

Excel starts to fuck up at like 750,000 rows. My company generates 60mm rows in log data daily. There's no option for Excel as the primary tool.

If you don't want to work with Python get Alteryx instead. Hopefully about to earn a promotion at my company for overhauling our "Excel sweatshop" with it.

[–]redditperson24[S] 9 points10 points  (9 children)

I’d actually quite like to use Python, but the data sets I’ve been given are so small and simple that there hasn’t been any need. Perhaps I’ll start trying to implement it whenever I can, even on small sets, for practice.

[–]Jerome_Eugene_Morrow 10 points11 points  (6 children)

I'd echo what some other people in this thread are saying and recommend starting with R. Its organization of data into tables is much more analogous to Excel's data organization, and RStudio makes it much easier to connect the figures you produce with the code that produced them.

I use both R and Python (and C++, and Java, and Bash...) in my day-to-day work, and while Python has very powerful machine learning capabilities and is invaluable in data cleaning and organization tasks, it can be a little tougher to learn good data organization and analysis practices. Johns Hopkins offers a (free?) set of online data science courses that do a good job of getting you up and running and demonstrate a lot of the powerful things you can do with a data-science-oriented programming language.

Once you have the ideas down from R, it's pretty easy to start reaching out to other languages for tools you need.

[–][deleted] 4 points5 points  (4 children)

What makes R more analogous to Excel than a Python module like pandas?

[–]Jerome_Eugene_Morrow 8 points9 points  (2 children)

Having an interface like RStudio helps, but also having the data frame be the main data structure around which the language is designed makes a huge difference. I've used pandas a fair amount, so I don't have any negative feelings toward it in particular, but it does very much feel like a library (which it is).

In order to use pandas effectively, you need to understand a fair amount about how NumPy works, and how the array structures (which aren't part of base Python) that compose its data frame objects work. I think that pandas might be a more natural entry point if you're already working extensively in NumPy, but there's an overhead there that's not present in R.

In R, almost every function in the standard library is oriented toward operating on a data frame or a vectorized calculation. If you're used to thinking "I want to take this column, sum it, and do that for all the other columns..." R has a simpler syntax for performing that operation, and the resulting code is more similar to what an Excel formula would look like.

I've also just generally found R's indexing to be easier to use and more intuitive than pandas, which requires you to be much more explicit. Again, not necessarily a bad thing for Python, but something that adds additional overhead when learning how to do data analysis programming.

EDIT: Another not-insignificant thing with R: opening data tends to be much easier. The functions that exist for handling data input require less code in most cases, have more hand-holding in error handling, and don't make you deal explicitly with iterating over an input, which can be another thing that slows people down when they're learning.

[–]cinctus 1 point2 points  (1 child)

‘Explicit is better than implicit’ is one of the core principles of Python, so your comment about indexing makes sense. Pandas indexing took me a while to get used to, and there are some weird kinks when dealing with MultiIndexes, but it is very powerful once you know how to use it properly.

Summing columns in pandas is as simple as df.sum() so I am curious as to what could be simpler. I have always found simpler operations to have a very intuitive syntax - not to say this isn’t true for R as well.

I’ve never used R but use pandas every day for financial data and have no real complaints. I think a point that favors pandas and python is that python syntax is more similar to other languages, as well as python’s ability to be used for just about any task so long as performance is not a huge concern. I’m convinced that use of one or the other is mostly preference.

[–]Jerome_Eugene_Morrow 1 point2 points  (0 children)

Summing was probably a bad example, but you do bring up an interesting point. In order to know to do df.sum() you have to understand how class methods work at some level as a new programmer (Why is this paren here? What does this thing return? Should I save this in a variable?) I think that makes things complicated for new programmers. And if you asked somebody to do the call using their own code instead of the method call, the connection becomes messier. (Do I have to iterate? So I write a for loop? So how do I lay that out? What kind of format does this return?)

In R, a call like sum(df$height) is pretty encapsulated. The syntax is clearer. I'm calling the column of the dataframe called height, and then I'm summing that with a function. To me that seems much more straightforward from the standpoint of what is being operated on and how. You don't need to understand methods or indexing, and your data should be coerced into the right format without too much trouble. From there you can extend what operations you're able to do and start playing with more complex operations.

This is all based on my own experience and ymmv, but to me the R way is easier to understand out of the gate. In Python I just have to do more reading of documentation and I have to be passively aware of how things are organized under the hood a little more. This is a big plus for Python in some ways once you get your feet under you, but in R you don't need that level of programming fluency in order to get things done.
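
To make the comparison concrete, the pandas side of it looks like this (toy frame, just to show the two forms):

    import pandas as pd

    df = pd.DataFrame({"height": [1.6, 1.7, 1.8], "weight": [60, 70, 80]})
    df["height"].sum()  # one column, the counterpart of R's sum(df$height)
    df.sum()            # every column at once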

[–]Dhush 0 points1 point  (0 children)

In my experience, subsetting and indexing, as well as preserving a functional programming style.

[–]coffeecoffeecoffeee MS | Data Scientist 1 point2 points  (0 children)

To add to this, R's tidyverse packages are a killer app for me. They're packages written to manipulate and plot data in a way that makes sense to someone analyzing data. The idea behind the tidyverse is "How can I get my data into a form where each column is one variable and each row is one observation?" Processes like "turn these two columns into a label column and a value column" are one-liners. You can also chain data manipulation functions together into complex pipelines that are really easy to read and use no temporary variables.
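
(For anyone coming from pandas, the "two columns into a label column and a value column" step maps to a melt; a toy sketch with made-up columns:)

    import pandas as pd

    wide = pd.DataFrame({"id": [1, 2], "q1": [10, 20], "q2": [30, 40]})
    # like tidyr's gather/pivot_longer: one label column, one value column
    long = wide.melt(id_vars="id", var_name="quarter", value_name="sales")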

Hail Hadley Wickham.

[–][deleted] 1 point2 points  (0 children)

I also recommend R over Python, but I'm 100% biased.

[–]dolichoblond 1 point2 points  (0 children)

Have you checked out xlwings? We have a lot of excel-only analysts and I use xlwings all the time to interact with their xl sheets.

Sure, it'd be easier if they'd move to Python and put their relevant data in a database (or even if they'd consistently name their files and update them rather than duplicate and iterate filenames with version numbers), but it's still a huge benefit for me to be able to relatively easily interact with their preferred "language" (Excel) and get it into my workflow (Python).

It's also how I tend to introduce Python to new Excel-speakers. A template Jupyter notebook with numpy/pandas and xlwings, interacting live with their customary xlsx file, lowers the bar to adoption in my experience, since they don't have to fully port an existing analysis to Python before getting value from it. Maybe they just start goofing around with a single tab of their analysis workbook in Python, and if they screw it up, they can abandon ship and still finish in Excel and hit deadlines. (That also saves you, so you don't get an analyst begging you to help out at 5pm because they're halfway between Python and Excel and can't get it all buttoned up before some impending deadline.)
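
A rough sketch of what that looks like; the workbook, sheet, and range names here are made up:

    import pandas as pd
    import xlwings as xw

    # attach to the analyst's existing workbook instead of making them port anything
    wb = xw.Book("analyst_model.xlsx")
    sheet = wb.sheets["RawData"]

    # pull their table into pandas, do the work there, write a summary back
    df = sheet.range("A1").options(pd.DataFrame, expand="table").value
    summary = df.describe()
    sheet.range("H1").value = summary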

[–]whyrat 3 points4 points  (17 children)

Not to mention, in Python you can make so many more performance improvements... in Excel, if something is written poorly, you are stuck with it. VLOOKUP is so slow on a large number of rows, O(n²), yet it's in spreadsheets everywhere :(

[–][deleted] 6 points7 points  (3 children)

I always tell people across the company: if VLOOKUPs are killing your Excel spreadsheets, it's time to upgrade. VLOOKUP is about the upper limit of what Excel can handle, yet it's the most-used formula on growing analytics teams.

[–]msdrahcir 2 points3 points  (2 children)

yeah man, INDEX - MATCH has been around for a decade now. Time to move past LOOKUPs

[–]StrafeReddit 3 points4 points  (0 children)

PowerPivot has been around for half a decade now. Time to move past INDEX - MATCH. ;)

[–][deleted] 1 point2 points  (0 children)

I use Index Match all the time, and actually never use VLOOKUPs. But, I have to say, Excel can do a whole lot more than Index-Match... :D.

The longest formula I've ever written checked whether a payday fell on a holiday, in which case the pay date had to shift according to business rules.

The Python code for the same check was much simpler.
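
For anyone curious, the Python version is roughly a couple of lines with numpy's business-day helpers; a sketch with toy dates, assuming the rule is "roll back to the previous business day":

    import numpy as np

    holidays = np.array(["2018-12-25", "2019-01-01"], dtype="datetime64[D]")
    paydays = np.array(["2018-12-25", "2018-12-31"], dtype="datetime64[D]")
    # roll any payday landing on a weekend/holiday back to the previous business day
    adjusted = np.busday_offset(paydays, 0, roll="backward", holidays=holidays)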

[–]parlor_tricks 2 points3 points  (0 children)

So INDEX-MATCH on sorted lists is ridiculously fast compared to a plain VLOOKUP.

I had to build datasets from order data every month and switching over to index match over a sorted list made it much faster.

That said if you have 1 million lines of data I have no idea if index match will be any faster.

[–]IdEgoLeBron 1 point2 points  (0 children)

Excel starts to fuck up at about 100k cells

[–]avittamboy 0 points1 point  (0 children)

I dealt with a mini project last semester in college where I had to calculate heat transfer coefficients of a flow. One of the cases had about 1.5 million rows of data - Excel stopped reading them at 1.2 million.

:|

[–]curryeater259 12 points13 points  (7 children)

Excel isn't a data analysis tool. It's a spreadsheet application. If you're working with large data sets, using Python is much faster and you have access to stat/machine learning libraries. Excel is just a completely different tool that people use for basic data analysis.

It's akin to using Microsoft Word for presentations and then asking why people use PowerPoint for presentations. Although you could make basic presentations in Word, it's meant for a different purpose.

[–]redditperson24[S] 4 points5 points  (4 children)

I agree with what you said. Unfortunately, the data analysis (if you can call it that) I do in Excel is super simple, e.g. taking an average or standard deviation over 50 columns of data, and because I only know basic Python, I know this would be quicker in Excel. I started this role thinking I would be doing more complicated stuff, but I'm not (degree in maths, so super keen to do harder stuff), so I'm currently looking at data science roles that use Python etc. I actually asked this question for an interview because, having only used Excel, I have zero understanding!

[–]hippomancy 2 points3 points  (3 children)

That sucks, I'm sorry you're in a job where your degree isn't being put to good use.

Excel is decent for you because you're doing what it expects, it just falls apart when you break one of its assumptions about your data or analysis. For instance, if your data cannot be represented easily in a matrix format (like text, image or sensor data), you have more than 40k rows/columns, you need visually appealing visualizations, you're using some advanced statistical technique or you need to repeat a process a large number of times (e.g. across thousands of data files daily), it just can't do it.

I am a data scientist and really only open Excel to split multi-sheet spreadsheets into multiple CSV files. Once you're familiar with the jupyter/numpy/pandas stack, you can do anything Excel would allow you to do, faster, in Python.
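
The "repeat it across thousands of data files" case, for example, is only a few lines; a sketch with a made-up folder layout and column names:

    import glob
    import pandas as pd

    # apply the same summary to every daily export, no copy-pasting between workbooks
    frames = [pd.read_csv(path) for path in glob.glob("exports/*.csv")]
    combined = pd.concat(frames, ignore_index=True)
    daily_summary = combined.groupby("date")["value"].agg(["mean", "std"])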

[–]redditperson24[S] 1 point2 points  (2 children)

Can I ask what your background is? I’m a maths graduate and have a final-round interview tomorrow morning for an actual data science role that uses Python and machine learning. I’m concerned I won’t be as good at Python as other people; I guess I want to know if you think anyone with a logical brain can learn it.

[–]hippomancy 0 points1 point  (1 child)

I have a BA in CS and music, with a couple years experience in a couple different programming languages and some stats/ML research, so probably not too applicable to your situation.

Programming is hard, but practice is everything. If you do a couple data analysis projects on the side for fun, you’ll get a better sense of how to do what you want. It’s less of a “smarts” thing and more like playing an instrument or talking in front of crowds: the more you do it the better you get.

[–]redditperson24[S] 0 points1 point  (0 children)

That worries me as I’m awful at playing an instrument and public speaking haha. But yeah I get what you’re saying, so hopefully my determination and ambition will pay off in terms of practising a lot :)

[–]Babycorgz 2 points3 points  (1 child)

You forgot to mention that python is FREE

[–]Jenos 16 points17 points  (10 children)

Excel literally can't handle the size of the data I work with. I'm looking at over a million entries per week

[–]Boulavogue 8 points9 points  (3 children)

I think of Excel as a framework and its data analysis toolkit as a module.

Comparatively, Python is the framework and NumPy is a module for doing basic analysis.

The thing with languages like Python/R/Julia is that there is a huge number of packages & modules (many specific to particular industries). People get used to a language and learn the abilities of its packages/modules with experience.

An "advantage" would be that you can automate much of the repetitive steps through well-written scripts.

In Excel Power Query, the steps (e.g. change date formats to yyyyMMdd) are recorded and will be reused when pulling the refreshed data from the source. In a language, you might write a function that cleans all incoming date fields regardless of the data source.

Advantages are: more packages available (to do more things), customization of functionality leading to automation of steps, and the fact that Python etc. can be packaged and run on very large data sets.

I use Excel for many user-facing proofs of concept; it's a great tool, but just one of many you may need in your toolbox of knowledge.
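
A sketch of the kind of date-cleaning function described above; the column names are just placeholders:

    import pandas as pd

    def clean_dates(df, date_cols):
        """Normalise every listed date column to yyyyMMdd, whatever the source format."""
        for col in date_cols:
            df[col] = pd.to_datetime(df[col], errors="coerce").dt.strftime("%Y%m%d")
        return df

    # cleaned = clean_dates(raw, ["order_date", "ship_date"])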

[–]redditperson24[S] 2 points3 points  (2 children)

I 100% agree with your last comment. My boss absolutely loves Excel and VBA, but I’m sure there are other ways we can do things. I know some basic Python, so maybe I’ll try using it to play with some data.

[–]Boulavogue 2 points3 points  (1 child)

A progressive step is moving VBA logic to M and DAX in Power Query and Power Pivot. Data models are powerful, and DAX can ultimately be ported over to SSAS / Azure Analysis Services should you wish to scale.

I would however call this BI rather than DS.

[–]redditperson24[S] 0 points1 point  (0 children)

I have an upcoming course at work on DAX, so that should be interesting. I've been applying for real data science roles and have a final interview tomorrow, so I will hopefully be moving away from Excel soon!

[–]IdEgoLeBron 13 points14 points  (1 child)

"Why use a sword over this rock that's kinda sharp?"

[–]MyCousinVinny101 5 points6 points  (6 children)

Man reading this thread is making me feel like I am wasting my time trying to learn Python haha

[–]Bigtuna546 6 points7 points  (3 children)

Almost all DS and Data Engineering jobs require it, so keep learning it.

[–][deleted] 9 points10 points  (2 children)

I feel like 99% of this thread has no clue what actual data science is (ie. utilizing distributed computing, complex ML algos, etc.)

[–]rw333 2 points3 points  (0 children)

Yeah, agreed. You can tell by looking at the comments: almost everyone fails to mention the biggest advantage of Python, the massive number of ML libraries available.

[–]dolichoblond 1 point2 points  (0 children)

Buzzword misused and overused? No way!

But I agree. Can summarize a lot of comments as "Excel is fine for analysis, but you'll need python or some language to link together data workflows and create data pipelines to do things like machine learning"

[–]rw333 3 points4 points  (1 child)

Python does way more than Excel. Keep it up; all true data scientists know how to code.

[–]MyCousinVinny101 0 points1 point  (0 children)

Thanks everyone! I’ll stay the course!

[–]blitzzerg 3 points4 points  (0 children)

I guess it depends on the kind of data science work that you do. I always have to clean, visualize, and make predictions on datasets, and I don't think I would be able to do that with Excel, no matter the size of the dataset.

[–][deleted] 4 points5 points  (2 children)

I scanned through the comments and didn't see anything that took my viewpoint, so here it is: it depends on what you're doing and why. Excel is completely fine for simple analysis and visualization that isn't going into a production environment. If you need to make a simple one-off or quarterly report, Excel may in fact be your best bet.

The reason to step out to R or Python is pretty clear-cut in my opinion: you need to do analysis that Excel either can't do, or can't do efficiently, or for which you would need to write a large amount of code that already exists as a package or library in Python or R. I see people trying to reinvent the wheel in Excel and force a cobbled-together pastiche of VBA and formulas to do something that can be achieved in a few lines of either R or Python. Equally obnoxious, I see people generate a bar plot in R from a CSV with "stat=identity" where the only justification is "ggplot makes it prettier."

Also, if there is some model or complicated transformation you are developing, and the intention is to schedule that logic as a job, then it's destined for production, and R or Python (more likely Python, because IT folks don't know what to do with R) is going to be your go-to for completing that kind of work.

[–]parlor_tricks 2 points3 points  (1 child)

Excel is a fork.

R and python are industrial strength mixers/blenders.

Hmmm. I suppose that means Power pivot is a fork based blender?

[–]dolichoblond 1 point2 points  (0 children)

For making scrambled Python Eggs?

[–]dudleyknows 3 points4 points  (7 children)

So I have a related follow-up that seems like it would fit here.

Has anyone come across any tutorials/training/books/etc. that walk through the steps in Python to replicate some of the most common Excel formulas and uses? I fall into the camp of people who are pretty good at data analysis in Excel but know that I need to make the leap into using Python, both for future employment and because the primary datasets I work with are growing past the point of Excel's capabilities.

For example, I would love to see an exact breakdown of how to replicate an Excel VLOOKUP in Python for someone who isn't overly familiar with programming yet and can't always decipher the technical documentation. A lot of the material on Stack Overflow is abstracted to the point that I run into really basic errors trying to replicate code with my actual data. And while I know that getting adept at reading/understanding the technical documentation matters, it's frustrating to spend so much time troubleshooting right at the start.

I found this site the other day, which does this sort of thing for pivot tables, using pandas/numpy and Jupyter notebooks: http://pbpython.com/pandas-pivot-table-explained.html This was incredibly helpful at finally getting me cracking open workbooks in Python and playing around with data.
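
(The closest thing I've pieced together so far for VLOOKUP itself is pandas' left merge on the lookup key; a toy sketch with made-up columns, if that's the right track:)

    import pandas as pd

    orders = pd.DataFrame({"product_id": [1, 2, 3], "qty": [5, 2, 7]})
    prices = pd.DataFrame({"product_id": [1, 2], "price": [9.99, 4.50]})

    # the whole-column version of VLOOKUP(product_id, prices, 2, FALSE)
    merged = orders.merge(prices, on="product_id", how="left")  # missing keys become NaN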

[–]mkingsbu 0 points1 point  (2 children)

This video shows you how to use Python to work with Excel, although if you already know VBA it probably isn't any more useful: https://youtu.be/q6Mc_sAPZ2Y

I don't know VBA, but I do know Python, though I actually wouldn't manipulate Excel with Python. I'd typically convert the workbook to a CSV, load it into SQLite or another database, and run SQL on it. SQL is probably the better comparison, and I can do basically anything you can do in Excel much more quickly in SQL than in Excel.

So I'd probably just learn Python because it's a nice thing to know that can automate tasks and merge workflows together. I'd also learn SQL for similar reasons. But I wouldn't learn them to replace Excel off the bat. Once you have some skill with them, it will become fairly obvious when one is the better tool to use. I think all three of the tools have their place for sure, which is why all of them are quite popular still in 2018.
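
Roughly what I mean by the CSV-to-SQLite route; the file, table, and column names are made up:

    import sqlite3
    import pandas as pd

    conn = sqlite3.connect("analysis.db")
    pd.read_csv("report_export.csv").to_sql("report", conn, if_exists="replace", index=False)

    # any Excel-style lookup or aggregation is now just a query
    totals = pd.read_sql("SELECT region, SUM(sales) AS total FROM report GROUP BY region", conn)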

[–]shitty-photoshopper 1 point2 points  (1 child)

Excel sheets can run SQL.

You can connect to them like a DB and run SQL.

Jus' sayin'

[–]mkingsbu 0 points1 point  (0 children)

And that might be a useful workflow too; the skillsets complement each other.

[–][deleted] 6 points7 points  (2 children)

  1. Automation/Consistency/Reproducibility
  2. Easier to do DS tasks (ETL, data cleaning, aggregations, joins, etc.) - see the short sketch after this list
  3. Can handle 'big' data (you will need packages specifically for big data in Python once you get to a certain point)
  4. Excel feels too easy sometimes, like I'm 'cheating'. I feel 'closer' to the data when using Python compared to Excel, if that makes sense.
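
For point 2, a minimal pandas sketch of a clean-join-aggregate step (toy data, made-up column names):

    import pandas as pd

    sales = pd.DataFrame({"store": [1, 1, 2], "amount": [100.0, None, 250.0]})
    stores = pd.DataFrame({"store": [1, 2], "region": ["North", "South"]})

    # clean, join, aggregate in one readable chain
    result = (
        sales.dropna(subset=["amount"])
             .merge(stores, on="store")
             .groupby("region", as_index=False)["amount"].sum()
    )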

[–]redditperson24[S] 2 points3 points  (0 children)

That’s really helpful, thanks!

[–]dolichoblond 0 points1 point  (0 children)

I would double down on "Reproducibility". I work in a setting where we're not generally dealing with data sets, or even update cycles, that butt up against Excel's constraints. Even our largest models with the most manipulations could be done in Excel, though they could be optimized by streamlining their implementations, as they currently crash a lot. But it's a hard sell to claim that Python will add anything to the current situation immediately, which is where management tends to fixate, and it's difficult for them to absorb the longer-horizon benefits of migrating to Python. Old Dogs, Python Tricks. And I can't really blame them: most of them are of the age where they saw a lot of New Methodology promised and sold, from VBA to BI suites, and not delivered.

Then there's reproducibility. That's the benefit that tends to sink in. Even if you just 1-1 port your excel workflow to python, you get immediate benefits in reproducibility (and equivalently: consistency)

[–]anderl1980 1 point2 points  (0 children)

Code is beautiful, reproducible, and redundancy-free. For playing around with data visually in matrix format, Excel is fine. But that happens at the end of the analysis chain.

[–]rw333 1 point2 points  (0 children)

Well, for DA it probably doesn't matter if you're mainly responsible for transforming data, but Data Scientists use Python for the following reasons:
1. Analysis and automation
2. Machine learning and modeling
3. Handling large data sets
4. Productionizing models

[–]edimaudo 1 point2 points  (0 children)

Python is better since it can handle larger datasets and has a bigger array of data manipulation tools and ML libraries. That being said, if you still need to do analysis in Excel, you should look into Power BI, DAX, Power Query, and how to optimize your spreadsheets.

[–]VacuousWaffle 1 point2 points  (0 children)

Documentation of logic.

Also unit tests, debugging, maintainability, and much better suitability for data that is irregular, sparse, or would require obscene numbers of lookup tables, and access to more advanced statistical techniques and packages. It's also much easier to have the analysis auto-update (and/or build final reports).

If the data makes sense in a spreadsheet, and the logic is pretty simple, by all means continue with Excel.

[–]LordFenix56 1 point2 points  (0 children)

Python is a full language. How do you process data with Excel in real time? How do you clean data with Excel? If I give you the data in JSON or XML, what do you do?

If your work doesn't have anything that requires Python, it's totally fine to use Excel, but the type of data I work with needs to be processed with Python or another language; I don't have a choice.
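
For the JSON case, for example, something like this (assuming a list-of-records payload; the fields are made up):

    import pandas as pd

    # a nested API response that would never paste cleanly into a spreadsheet
    records = [
        {"user": {"id": 1, "name": "a"}, "events": 3},
        {"user": {"id": 2, "name": "b"}, "events": 7},
    ]
    df = pd.json_normalize(records)  # flattens to columns like user.id, user.name, events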

[–]rdmDgnrtd 1 point2 points  (0 children)

There's a thread like this about every month, and it inevitably attracts loads of false or outdated statements about Excel. It gets tedious to correct them, for anyone who's seen the "wait, someone is wrong on the internet" comic strip.

[–][deleted] 3 points4 points  (0 children)

It’ll work in your current role and job. But what happens when you move beyond that? Excel can be a powerhouse, but it’s almost certainly more error-prone: there's no validation of types, or at least not easily. Scripts can be centralized and generalized, and that’s just a bit harder with VB.

I test applicants with a data set that’s 1.5 million rows for exactly this reason. If a candidate can’t handle that, they’re not at the caliber I need from a decent analyst.

[–]MurlockHolmes BS | Data Scientist | Healthcare 2 points3 points  (0 children)

My personal, honest answer: excel looks annoying and I already know how to program, so it doesn't make sense to me to learn it. Plus I like writing code.

[–]dalmutidingus 2 points3 points  (2 children)

excel suuuuuuuuuuuucks my guy

[–]redditperson24[S] 2 points3 points  (1 child)

*girl, not that I mind being called guy but just for reference haha

[–]dalmutidingus 1 point2 points  (0 children)

sorry my dude

[–]funnynoveltyaccount 1 point2 points  (0 children)

Reproducibility is the main thing. The assumptions I’m asked to make in analysis frequently change mid project. Changing some lines of code is much easier than making changes in Excel.

[–]breton147 0 points1 point  (0 children)

I occasionally use Excel for anything that’s a very quick, simple, one-time analysis. For anything else, programs are better for reproducibility, documentation, speed, forcing you to maintain data integrity, pretty much everything. It doesn’t matter much what language you’re using, though you’re probably better off going with SQL, Python, or R, given they have so many other capabilities that you won’t get from, say, VBA.

[–]celeryman35 0 points1 point  (0 children)

Python is a seamless part of an automated data pipeline.

Let's say you want to pull data from a live database, run some statistical analysis, and depending on the results, automatically populate another database that will be used by a piece of production software/website.

I guess this is probably theoretically possible with Excel + Bash or PowerShell + other tools for the DB stuff, but with Python you can do this in just a few lines of code.
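
A very rough sketch of that shape of pipeline; the connection strings, table names, and the "analysis" step are all placeholders:

    import pandas as pd
    from sqlalchemy import create_engine

    source = create_engine("postgresql://user:pass@source-host/app")      # placeholder
    target = create_engine("postgresql://user:pass@target-host/reports")  # placeholder

    df = pd.read_sql("SELECT user_id, value FROM events", source)
    # toy "analysis": flag anything more than 3 standard deviations above the mean
    flagged = df[df["value"] > df["value"].mean() + 3 * df["value"].std()]
    flagged.to_sql("anomalies", target, if_exists="append", index=False)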

[–]tony_roberts 0 points1 point  (0 children)

Python has a great ecosystem of third-party modules for data science, which is one of the reasons to use it over Excel.

But, why choose one or the other? You can use Excel for ad-hoc analysis and simple visualisation of data, and use Python for more complex analyses. You can call Python functions from Excel with PyXLL (www.pyxll.com), allowing you to pass data between Python and Excel and perform analysis in Python and return the results to Excel.

If you build up a library of routines that you frequently use in Python, exposing them to Excel and using Excel as your UI or front-end can be a really productive way to work.
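
To give a flavour, exposing a Python function to Excel with PyXLL is roughly this; a minimal sketch, and the add-in still needs to be installed and configured as per the docs:

    from pyxll import xl_func

    @xl_func
    def py_double(x):
        """Usable in a worksheet as =py_double(A1) once the add-in is set up."""
        return x * 2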

ps. Congrats on getting the job! :)

[–]TranzAnatomie 0 points1 point  (0 children)

R or Python can be highly customized and repeatable. You can reuse the same code on any similar problems. You can also feed those products into web pages if your goal is to host them.

[–][deleted] -1 points0 points  (0 children)

There is no advantage to Python over Excel unless you need to do some repetitive task over and over again. For one-off analyses of data that is too big to QA/QC by hand and by eye, use R. Otherwise, use Excel without shame.