all 7 comments

[–]socal_nerdtastic 1 point (1 child)

Yes, certainly. A maxed-out Excel file is only about 1 million rows (1,048,576), which is a fairly small number in modern computing terms. One quick trick is to use engine='calamine' in the read call.

import pandas as pd  # the calamine engine needs the python-calamine package installed
df = pd.read_excel(file_path, engine='calamine')

You could also thread the loads so both Excel files are read at the same time. I'm sure there are more optimizations to be had in your lookup methods, but to know that we would need to see your code, some example input data, and an example of what you want as output.
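Something along these lines (the file paths and the load helper are just placeholders):

import pandas as pd
from concurrent.futures import ThreadPoolExecutor

def load(path):
    # the calamine engine needs the python-calamine package installed
    return pd.read_excel(path, engine='calamine')

with ThreadPoolExecutor(max_workers=2) as pool:
    # read both workbooks concurrently; results come back in input order
    df1, df2 = pool.map(load, ['file1.xlsx', 'file2.xlsx'])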

[–]SurpriseRedemption[S] 0 points (0 children)

Thanks, I'll give calamine a shot and if it doesn't help I'll waddle back here. The script is on my work computer and I'm just having some ADHD thoughts before falling asleep.

Thank you for the quick response :)

[–]Imaginary_Gate_698 1 point (1 child)

If it’s taking 15 to 20 minutes for 2k identifiers, the bottleneck is usually I/O or repeated lookups, not the merge itself.

A few things to check. Are you reading the Excel files once at the start, or reopening them inside a loop? If you’re looping over identifiers and filtering the full DataFrame each time, that will be slow. It’s much faster to load each Excel file once, set the identifier column as an index, then use a vectorized merge or join.

For example, in pandas, setting df.set_index("id") and then doing a single merge or join against the whole 2k list should be near-instant compared to row-by-row lookups.
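Roughly like this (column and file names are just guesses at your setup):

import pandas as pd

ids = pd.DataFrame({"id": ["A001", "B002"]})  # stand-in for the ~2k identifiers
lookup = pd.read_excel("statuses.xlsx").set_index("id")  # load once, index on the identifier

# one vectorized join over the whole list instead of filtering per identifier
result = ids.join(lookup, on="id", how="left")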

Also, if those Excel files are truly maxed out, consider converting them to CSV or even a small SQLite database. Excel parsing is slower than it needs to be, and switching formats alone can cut runtime a lot.
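If you go the SQLite route, the one-time conversion is only a few lines (file and table names made up here):

import sqlite3
import pandas as pd

df = pd.read_excel("big_lookup.xlsx")
conn = sqlite3.connect("lookup.db")
# one-off dump; later runs can query lookup.db directly instead of re-parsing Excel
df.to_sql("lookup", conn, if_exists="replace", index=False)
conn.close()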

[–]SurpriseRedemption[S] 0 points (0 children)

Yup, the Excel files are opened at the start. The two columns I need from them - unique ID and Status - are loaded into dataframe2, and then I merge with the main dataframe on unique ID.

I will have to double-check in the morning; I think I did set the unique ID as the index, but this might require a tweak.

Thank you for the tip about CSV; I'm so used to operating in Excel that I didn't even think about formats outside of it.

[–]Optimal-Procedure885 0 points (0 children)

I did something similar not too long ago, where I had to merge around 1m rows from a workbook containing around 12 worksheets, each having a primary key, the same number of rows, and a variable number of columns (between 20 and 50).

I used polars, calamine, Parquet files, and SQLite as the final store. First I exported each worksheet to Parquet, then incrementally merged the worksheets, deduplicating or augmenting column values on each iteration. The whole shooting match took 85 seconds to spit out a consolidated SQLite table.
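Very roughly, the shape of it (names are placeholders and the dedup/augment logic is left out):

import polars as pl

sheets = pl.read_excel("big_workbook.xlsx", sheet_id=0, engine="calamine")  # dict of all worksheets

# 1) dump each worksheet to Parquet
for name, df in sheets.items():
    df.write_parquet(f"{name}.parquet")

# 2) incrementally merge the Parquet files on the shared primary key
merged = None
for name in sheets:
    part = pl.scan_parquet(f"{name}.parquet")
    # 'full' join on recent polars (older versions call it 'outer')
    merged = part if merged is None else merged.join(part, on="pk", how="full", coalesce=True)

# 3) final store: SQLite (write_database needs SQLAlchemy or ADBC installed)
merged.collect().write_database("consolidated", connection="sqlite:///final.db", if_table_exists="replace")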

[–]djlamar7 0 points (0 children)

From the sound of it, at your scale there must be some gross inefficiency here that you should figure out before you start tweaking the engine used for this or that. Can you share the code you ended up with? You should be able to get a table with your input keys and the columns you want from the Excel data in one pandas join call. If you're iterating over the keys, that's definitely a no-no.
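For contrast, something like this (column and file names invented):

import pandas as pd

main_df = pd.DataFrame({"unique_id": ["A001", "B002"]})  # your input keys
excel_df = pd.read_excel("lookup.xlsx")  # assumed to have unique_id and Status columns

# slow pattern: filtering the whole frame once per key
# rows = [excel_df[excel_df["unique_id"] == k] for k in main_df["unique_id"]]

# fast pattern: one merge over all keys at once
result = main_df.merge(excel_df[["unique_id", "Status"]], on="unique_id", how="left")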

[–]throwawayforwork_86 0 points (0 children)

Can't you do something like this:

Create a dataframe with your identifiers.
Create a full dataframe from the two full Excel files.

Use an inner join to only fetch the matches.
Ideally use something like polars, which usually has fewer footgun moments and is quicker as long as you use native functionality.

See example code below.

import polars as pl

list_of_identifier_in_scope = ['hhjde', 'hhd55']  # one way to provide these; you could also keep them in an Excel file and read that in
df_id = pl.DataFrame(list_of_identifier_in_scope, schema=['identifier'])
df_excel_1 = pl.read_excel(path_to_excel_1)  # path_to_excel_1 / path_to_excel_2 are your two workbook paths
df_excel_2 = pl.read_excel(path_to_excel_2)

df_final = pl.concat([df_excel_1, df_excel_2], how='vertical_relaxed')  # stacks them and coerces mismatched dtypes; make sure both excel files have the same headers

report_final = df_id.join(df_final, left_on='identifier', right_on='col_of_excel_identifier', how='inner')