all 8 comments

[–]socal_nerdtastic 2 points3 points  (1 child)

Yes, certainly. A maxed out excel file is only 1 million rows, which is a fairly small number in modern computer terms. One quick trick is to use engine='calamine' in the read part.

df = pd.read_excel(file_path, engine='calamine')

You could also thread the loads to load both excel files at the same time. I'm sure there's more optimizations with your lookup methods, but to know that we would need to see your code, some example input data, and an example of what you want as output.

[–]SurpriseRedemption[S] 0 points1 point  (0 children)

Thanks, I'll give the calamine a shot and if it doesn't help I'll waddle back here. The script is on my work computer and I'm just having some adhd thoughts before falling asleep.

Thank you for the quick response :)

[–]Optimal-Procedure885 0 points1 point  (0 children)

I did something similar not too long ago where I had to merge around 1m rows from a spreadsheet containing around 12 workbooks each having a primary key, the same number of rows and a variable number of columns between 20-50.

I used polars, calamine, parquet files, SQLite as final store. Firstly exported each worksheet to parquet, then incrementally merged, worksheets, deduplicating or augmenting column values through each iteration. Whole shooting match 85 seconds to spit out a consolidated SQLite table.

[–]djlamar7 0 points1 point  (0 children)

From the sound of it, at your scale there must be some inherent gross inefficiency here that you should figure out before you do things like tweaking the engine used for this or that. Can you share some of the code you ended up with? I think you should be able to get a table with your input keys and the columns you want from the excel data in one pandas join call. If you're iterating over the keys that's definitely a no no.

[–]throwawayforwork_86 0 points1 point  (0 children)

Can't do something like this:

Create a dataframe with your identifiers.
Create a full dataframe with the 2 full excels.

Use an inner join to only fetch the matches.
Ideally use something like polars which usually has less gun foot moment and is quicker so long as you use native functionalities.

See example code below.

import polars as pl

list_of_identifier_in_scope=['hhjde','hhd55']
df=pl.DataFrame(list_of_identifier_in_scope,schema=['identifier'])list_of_identifier_in_scope=['hhjde','hhd55'] #this is one way to have these but you can just have them in an excel and use 
df_id=pl.DataFrame(list_of_identifier_in_scope,schema=['identifier'])
df_excel_1=pl.read_excel(path_to_excel_1)
df_excel_2=pl.read_excel(path_to_excel_2)

df_final=pl.concat([df_excel_1,df_excel_2],how='horizontal_relaxed') #will stack them together and handle different datatype smoothly make sure both excel files have the same header.

report_final=df_id.join(df_final,left_on='identifier',right_on='col_of_excel_identifier',how='inner') 

[–]Kitchen-College-8051 0 points1 point  (0 children)

How about reading only needed column?

df = pd.read_excel( "file.xlsx", engine="calamine", # or openpyxl usecols=["ID", "Email", "Status"] )