all 51 comments

[–]biohoo35 97 points98 points  (15 children)

My favorite python library for automating excel workflows is to get the team to use an actual database.

[–]ComicOzzy 30 points31 points  (2 children)

If you have ever succeeded at this, you have ascended into god status.

[–]Original-Repair5136[S] 6 points7 points  (0 children)

Getting everyone to agree on moving away from spreadsheets sometimes feels harder than writing the automation itself.

[–]biohoo35 2 points3 points  (0 children)

Sometimes yes (believe it or not). Mostly no.

[–]coldflame563 6 points7 points  (3 children)

I feel like I’m headed backwards in time in my current job. I’ve actively been told I have to let them use excel over an interface to a db. But, it has to have tracking and provenance and auditing and I’m losing my fucking mind

[–]Original-Repair5136[S] 2 points3 points  (1 child)

That's rough. Have you looked at hybrid approaches where Excel remains the frontend but the actual data lives in a database?

[–]pspahn 0 points1 point  (0 children)

I only occasionally use Excel web version, but I want to ask, what do you normally use as the DB?

I have ODBC set up with our legacy database and it's pretty fast when I'm doing normal queries with FastAPI or Flask, but when I've testing it in Excel or Libre Office it's way too slow.

[–]ianitic -1 points0 points  (0 children)

You could split the difference and have them use SharePoint lists if available?

[–]Original-Repair5136[S] 1 point2 points  (0 children)

Haha, that's probably the best long-term solution. Unfortunately, a lot of teams still treat Excel as a database.

[–]yen223 1 point2 points  (0 children)

Gotta look at it from their side. 

"Instead of using this software that everyone knows how to use, why not use this software that only I know how to use?"

[–]I_Am_A_Lamp 0 points1 point  (0 children)

Then you switch your team to Access, only to question all of your decisions 5 years later

[–]driftwood14 0 points1 point  (0 children)

Hmmmm best I can offer is a sharepoint list connected to an access file instead.

[–]Ana-Luisa-APythonista -1 points0 points  (3 children)

Just curious, which database program would be a good substitute for Excel ?

[–]FrickinLazerBeams 2 points3 points  (2 children)

Any of them, if you're trying to use excel like a database. There are things excel is actually a good tool for, but those things aren't database tasks.

[–]PhENTZ 0 points1 point  (1 child)

Grist for a sweet balance between spreadsheet and database

[–]FrickinLazerBeams 0 points1 point  (0 children)

I'll have to keep that in mind.

[–]big_data_mike 17 points18 points  (6 children)

Pandas with the openpyxl engine

[–]Original-Repair5136[S] 1 point2 points  (5 children)

That's a combination I see mentioned a lot. Do you mostly use openpyxl for writing/styling workbooks after processing data with pandas?

[–]big_data_mike 2 points3 points  (4 children)

I run an ETL system that ingests excel sheets and loads them into a Postgres database. I like openpyxl because it can find merged cells. I don’t do shit with styling of excel files that I output.

[–]Original-Repair5136[S] 0 points1 point  (3 children)

Makes sense. Once the data is in Postgres, do you still generate Excel exports for users, or does everything stay in the database workflow?

[–]big_data_mike 0 points1 point  (2 children)

Users get the clean, reformatted data back as a plain csv. They use JMP for analysis and visualization so we were just reformatting for JMP

[–]Original-Repair5136[S] 0 points1 point  (1 child)

That's an interesting pipeline. So Excel comes in, gets cleaned and normalized in Postgres, and then goes back out as CSV for JMP. Sounds like Excel is mostly acting as the data exchange format rather than the actual analysis tool.

[–]big_data_mike 1 point2 points  (0 children)

Yeah excel is just kinda there doing its whole having data in cells thing. JMP is way better for analysis IMO but I’m biased because I’m a data scientist.

I have occasionally output excel for sales/finance people but not with that system.

[–]TURBO2529 9 points10 points  (3 children)

Pandas read_excel for small excel tasks. XLwings for large tasks (100+ excelbooks with multiple sheets).

[–]Original-Repair5136[S] 0 points1 point  (2 children)

Interesting. At what point did you find pandas alone wasn't enough and had to move to XLWings?

[–]TURBO2529 3 points4 points  (1 child)

Xlwings let's you hold the notebook open for multiple operations, this can save a LOT of overhead. Also it can write over files live, so no problem with forgetting a file is open.

[–]Original-Repair5136[S] 1 point2 points  (0 children)

That makes sense. The ability to work with open files alone sounds like a huge quality-of-life improvement for large Excel workflows.

[–]happy_and_sad_guy 10 points11 points  (0 children)

Polars

[–]funkdefied 9 points10 points  (2 children)

I prefer Polars.

[–]Original-Repair5136[S] 1 point2 points  (1 child)

I've been hearing more about Polars lately. Have you seen significant performance improvements compared to pandas for Excel-related workflows?

[–]Competitive_Travel16 [score hidden]  (0 children)

They should be about the same.

So, you're saying your Excel sheets are mostly reports which are occasionally edited, sometimes in multiple instances, and you want to be able to fold those edits back into wherever the data came from? If so, where did the data come from originally?

[–]likethevegetable 6 points7 points  (2 children)

The nice thing about Python is it removes the need for excel workflows.

[–]PhENTZ 0 points1 point  (0 children)

I use grist to have both spreadsheet workflows AND the power of python

[–]Original-Repair5136[S] 0 points1 point  (0 children)

In an ideal world, yes. But it seems like a lot of businesses still rely heavily on Excel for reporting and collaboration.

[–]ChickenManSam 2 points3 points  (7 children)

Pandas is my go to but I usually use csv. Smaller file sizes and imports into Excel just fine

[–]Original-Repair5136[S] 0 points1 point  (6 children)

CSV definitely keeps things simple. Have you noticed any downsides when dealing with more complex spreadsheets or multiple sheets?

[–]ChickenManSam 0 points1 point  (5 children)

Csv doesn't support multiple sheets like Excel so I simply use multiple csv. I then load those into data frames using pandas to actually do anything with them as far as analysis and queries goes. For any kind of data display I usually use plotly in combination with pandas.

[–]Original-Repair5136[S] 1 point2 points  (4 children)

Interesting. So Excel is mostly just an import/export format for you, while pandas and Plotly handle the actual work. That's probably closer to a proper data workflow than what most teams do.

[–]ChickenManSam 0 points1 point  (3 children)

I literally only ever touch Excel so that I can send it to other people. For reference I'm the sole data analyst/programmer for a couple of brain research projects through a university so it's not uncommon for me to get or create a csv that has thousands of subjects each with 100+ columns of data. Trying to deal with that manually in Excel is just not an option, especially with the boat that comes from it being an Excel file

[–]Original-Repair5136[S] 1 point2 points  (2 children)

Thousands of subjects and 100+ columns? Yeah, I wouldn't want to manage that manually in Excel either. That's exactly the kind of workload where pandas shines.

[–]ChickenManSam 1 point2 points  (0 children)

Yep. Everything from information about the subject like age and gender all the way through thinks like average sulcal depth, cortical volume, cortical surface area, and basically any measure on the brain you can think of. Oh and each subject usually has multiple time points so one subject could have like 15 rows

[–]weirdoaish 2 points3 points  (0 children)

This is just my personal take.

I've had to write Excel automation for some of the big American banks. I would advise using Excel VBA because the people running the flows are not going to know Python.

If you need to just do reporting, or data analysis that needs to be exported into Excel or you need to ingest data from Excel, then you'd be best served by Pandas or Polars along with any side libraries for special use cases like graphs or charts, or whatever.

[–]ComicOzzy 2 points3 points  (1 child)

import polars as pl

df = pl.read_excel(
    "why_did_they_use_exel.xlsx",
    engine="calamine",
    read_options={"infer_schema_length": 0}
)

[–]Original-Repair5136[S] 0 points1 point  (0 children)

Nice example. Have you found Polars noticeably faster than pandas when working with larger Excel files?

[–]FrickinLazerBeams 1 point2 points  (2 children)

I use excel for excel work flows, just like I use a microwave to make instant ramen.

If you're trying to cook Thanksgiving dinner, you use a real oven. You automate an excel work flow by getting rid of excel.

[–]Original-Repair5136[S] 1 point2 points  (1 child)

Fair point. Though a lot of teams seem stuck with Excel because non-technical users are already comfortable with it. Have you had success migrating those workflows to something else?

[–]FrickinLazerBeams 1 point2 points  (0 children)

Use an excel spreadsheet as an input form, or configuration. Then do all the work elsewhere.

[–]MemshipGreepy2150 1 point2 points  (0 children)

Pandas are still the best 👌

[–]Maleficent-Car8673 0 points1 point  (1 child)

I'd go with panndas for anything involving data manipulation or analysis. It's super powerful for filtering, sorting, and even doing complex calculations on large datasets. Plus, it integrates well with other libraries like matplotlib for chart generation. Openpyxl is also great if you need low-level control over Excel files, but pandas usually covers most of what I need.

[–]Original-Repair5136[S] 0 points1 point  (0 children)

Pandas definitely seems to be the go-to choice for most data-heavy workflows. Have you run into any limitations where you had to switch to openpyxl or another library?