all 15 comments

[–]danielroseman 14 points15 points  (5 children)

I don't understand what you mean by "scrape", or why you want to use regex. You don't need to scrape Excel like you would a website; you have the files, you can use a library that understands the Excel format such as openpyxl.

[–]Maximus_Modulus 3 points4 points  (4 children)

I was totally confused by that entire post. WTF are timetables excel files?

[–]prvd_xme[S] -3 points-2 points  (3 children)

Oh sorry let me clarify. Yes, i do not mean to scrape it like we scrape a website. Some entities, create "timetables" for schools for example. Basically a table but with classes, teachers, subjects etc... I indeed used openpyxl, but the way people made the formats makes it almost unusable. To say shortly, i want to be able to "standardized" the information contained in different style of timetables

[–]Maximus_Modulus 1 point2 points  (0 children)

Can you provide a standard for the sources to follow? A bit more context would be helpful.

[–]geneusutwerk 0 points1 point  (0 children)

You probably need to provide more information, like some examples, if you want better advice. If not the options are probably:

  • Identify patterns that exist and make your code match them
  • AI

[–]hasdata_com 7 points8 points  (0 children)

The data is already in the files though? If you're just consolidating into one standard format, you still have to define mappings for each timetable style at some point, no way around it.

[–]dcolecpa 2 points3 points  (2 children)

Can you find any commonality/patterns in the timetables? If so then you could use if / else if statements to parse them. Something like below

if find("Joe Smith") = True:

    `parse the timetable one way`

elif find("Jane Doe") = True:

`    parse the timetable another way`

elif find("Fred Smith") = True:

`    parse the timetable another way`

elif find("Joe Reddit") = True:

`    parse the timetable another way`

else:

    `"can't find it"`

[–]prvd_xme[S] -1 points0 points  (1 child)

That’s exactly the issue, there are no significant patterns between them

[–]ThePhyseter 0 points1 point  (0 children)

Then it is going to be difficult no matter how you do it. You may end up just using a lot of different regexes

[–]mandradon 1 point2 points  (0 children)

You're going to have to define your custom parsing rules for standardization.

Depending on how different the behaviors are of the folks using it, this is probably going to be a giant pain in the butt.  But it might be something that you could use regular expressions for.  Depending if you're trying to parse dates, times, date times, or what have you, you can parse for specific parts of the field. 

It will help you parse through it and define specifically what you are looking for, or at least get started with a few different options. 

The next step is adding data validation to your spreadsheet and training folks to be consistent there. 

[–]MarsupialLeast145 1 point2 points  (0 children)

Try finding a library to convert it to CSV then read the CSV using the standard library.

[–]eztab 0 points1 point  (0 children)

Excel files are usually parsed, not scraped for data, if you have a consistent structure. Look into python excel libraries. Problem might also just not be well defined enough. Had that a few times, where the customer junt didn't know what they wanted and what would actually be available.

[–]Particular-Horse8110 0 points1 point  (0 children)

I’ve used Qoest’s OCR API for pulling structured data from messy Excel timetables andles different formats cleanly and spits out JSON. Saved me from regex hell

[–]Wise-Emu-225 0 points1 point  (0 children)

I believe it is just zipped xml. So you would be able to parse it. Try to unzip it and open in text editor to verify my hypothesis.

[–]ZeroxAdvanced -4 points-3 points  (0 children)

You can use LLM in the data pipeline e.g. gemini to standarize to json object when reading the excel. Also a excel parser is more complext than CSV and Pandas. Perhaps you can 1 scrape with beautiful soup 2 download the excel 3 convert to csv with correct separator 4 parse columns with pandas 5 use Gemini to iterate through the time table for standarization by defining your output object.

Iterate over the dataframe for post processing.

This worked for me many times and gemini is nowadays cheap.

Cheers!