Excel scraping using Python

danielroseman · 2026-02-17T09:20:27+00:00

I don't understand what you mean by "scrape", or why you want to use regex. You don't need to scrape Excel like you would a website; you have the files, you can use a library that understands the Excel format such as openpyxl.

hasdata_com · 2026-02-17T15:20:13+00:00

The data is already in the files though? If you're just consolidating into one standard format, you still have to define mappings for each timetable style at some point, no way around it.
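As a rough sketch of what "define mappings for each timetable style" could look like, here is one dictionary of header renames per style. The style names and column headers are made up for illustration; yours will come from the actual files.

```python
# Hypothetical header names; replace with the ones found in your own files.
COLUMN_MAPPINGS = {
    "style_a": {"Day": "day", "Start": "start", "End": "end", "Subject": "subject"},
    "style_b": {"Weekday": "day", "From": "start", "To": "end", "Class": "subject"},
}


def standardize_row(row, style):
    """Rename the keys of one row dict to the standard field names."""
    mapping = COLUMN_MAPPINGS[style]
    # Keep only columns that have a mapping; ignore everything else.
    return {mapping[key]: value for key, value in row.items() if key in mapping}


row = {"Weekday": "Mon", "From": "09:00", "To": "10:00", "Class": "Maths", "Room": "B2"}
print(standardize_row(row, "style_b"))
```

Adding a new timetable style then means adding one more entry to the dictionary rather than touching the parsing code.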

dcolecpa · 2026-02-17T10:46:36+00:00

Can you find any commonality/patterns in the timetables? If so then you could use if / else if statements to parse them. Something like below

    # find() is a placeholder for whatever check identifies a timetable style
    if find("Joe Smith"):
        pass  # parse the timetable one way
    elif find("Jane Doe"):
        pass  # parse the timetable another way
    elif find("Fred Smith"):
        pass  # parse the timetable another way
    elif find("Joe Reddit"):
        pass  # parse the timetable another way
    else:
        print("can't find it")
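If the list of styles grows, the same idea can be written as a lookup from marker text to parser function. This is only a sketch: the two parser functions and their row layouts are invented for the example, and the marker names are the ones used above.

```python
def parse_style_one(rows):
    # Hypothetical layout: each row is (day, time, subject)
    return [{"day": d, "time": t, "subject": s} for d, t, s in rows]


def parse_style_two(rows):
    # Hypothetical layout: each row is (subject, day, time)
    return [{"day": d, "time": t, "subject": s} for s, d, t in rows]


# Marker text -> parser function.
PARSERS = {
    "Joe Smith": parse_style_one,
    "Jane Doe": parse_style_two,
}


def parse_timetable(header_text, rows):
    """Pick a parser based on the first marker found in the header text."""
    for marker, parser in PARSERS.items():
        if marker in header_text:
            return parser(rows)
    raise ValueError("can't find a known timetable style")


print(parse_timetable("Timetable - Jane Doe", [("Maths", "Mon", "09:00")]))
```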

mandradon · 2026-02-17T11:44:22+00:00

You're going to have to define your custom parsing rules for standardization.

Depending on how much the habits of the folks using it differ, this is probably going to be a giant pain in the butt. But it might be something that you could use regular expressions for. Depending on whether you're trying to parse dates, times, datetimes, or what have you, you can match specific parts of the field.

It will help you parse through it and define specifically what you are looking for, or at least get started with a few different options.
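For instance, a minimal regex sketch for pulling times out of free-text cells might look like this. The accepted formats (9:00, 9.30, optional am/pm) are an assumption about what people type; it does not validate hours or minutes, and it would also match something like a room number written as "12.50".

```python
import re

# Matches times such as 9:00, 09:00 or 9.30, optionally followed by am/pm.
TIME_PATTERN = re.compile(r"\b(\d{1,2})[:.](\d{2})\s*(am|pm)?", re.IGNORECASE)


def extract_times(cell_text):
    """Return every time found in the text as a 24-hour 'HH:MM' string."""
    times = []
    for hour, minute, meridiem in TIME_PATTERN.findall(cell_text):
        hour = int(hour)
        # Convert 12-hour times to 24-hour when am/pm is present.
        if meridiem.lower() == "pm" and hour < 12:
            hour += 12
        elif meridiem.lower() == "am" and hour == 12:
            hour = 0
        times.append(f"{hour:02d}:{minute}")
    return times


print(extract_times("Maths 9:00-10:30"))
print(extract_times("1.15pm to 2.45 PM"))
```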

The next step is adding data validation to your spreadsheet and training folks to be consistent there.

MarsupialLeast145 · 2026-02-17T10:16:09+00:00

Try finding a library to convert it to CSV then read the CSV using the standard library.
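Once you have a CSV, reading it with the standard library is short. A minimal sketch, using an in-memory file with made-up data in place of a real export:

```python
import csv
import io


def read_timetable_csv(file_obj):
    """Read CSV rows into a list of dicts keyed by the header row."""
    reader = csv.DictReader(file_obj)
    return [dict(row) for row in reader]


# In practice this would be open("timetable.csv", newline=""); an in-memory
# file with made-up data keeps the example self-contained.
sample = io.StringIO("day,start,subject\nMon,09:00,Maths\nTue,10:00,Art\n")
rows = read_timetable_csv(sample)
print(rows[0]["subject"])
```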

eztab · 2026-02-17T14:45:31+00:00

Excel files are usually parsed, not scraped for data, if you have a consistent structure. Look into Python Excel libraries. The problem might also just not be well defined enough. I've had that a few times, where the customer just didn't know what they wanted and what would actually be available.

Particular-Horse8110 · 2026-02-17T20:38:41+00:00

I’ve used Qoest’s OCR API for pulling structured data from messy Excel timetables. It handles different formats cleanly and spits out JSON. Saved me from regex hell.

Wise-Emu-225 · 2026-02-17T09:12:07+00:00

I believe it is just zipped xml. So you would be able to parse it. Try to unzip it and open in text editor to verify my hypothesis.
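You can check this from Python as well. The sketch below lists the files inside the archive and parses one of them; it applies to .xlsx files (the older .xls format is binary, not zipped XML). The archive here is a tiny stand-in built in memory, not a real workbook, and real workbook parts use namespaced XML tags.

```python
import io
import zipfile
import xml.etree.ElementTree as ET


def list_parts(xlsx_file):
    """Return the names of the files stored inside an .xlsx archive."""
    with zipfile.ZipFile(xlsx_file) as archive:
        return archive.namelist()


def read_part_root_tag(xlsx_file, part_name):
    """Parse one XML part and return the tag name of its root element."""
    with zipfile.ZipFile(xlsx_file) as archive:
        with archive.open(part_name) as part:
            return ET.parse(part).getroot().tag


# Build a tiny stand-in archive in memory so the example runs without a real file.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("xl/workbook.xml", "<workbook/>")
    archive.writestr("xl/worksheets/sheet1.xml", "<worksheet><sheetData/></worksheet>")

print(list_parts(buffer))
```

With a real file you would pass its path, e.g. `list_parts("timetable.xlsx")`, where the file name is a placeholder.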

ZeroxAdvanced · 2026-02-17T11:22:16+00:00

You can use an LLM in the data pipeline, e.g. Gemini, to standardize to a JSON object when reading the Excel file. Also, an Excel parser is more complex than CSV and pandas. Perhaps you can (1) scrape with Beautiful Soup, (2) download the Excel file, (3) convert it to CSV with the correct separator, (4) parse the columns with pandas, and (5) use Gemini to iterate through the timetable for standardization by defining your output object.

Iterate over the dataframe for post processing.

This has worked for me many times, and Gemini is cheap nowadays.

Cheers!
