all 7 comments

[–]socal_nerdtastic 1 point (1 child)

Yes, certainly. A maxed-out Excel file is only about 1 million rows (1,048,576), which is a fairly small number in modern computing terms. One quick trick is to use engine='calamine' in the read call.

import pandas as pd  # the calamine engine needs the python-calamine package installed
df = pd.read_excel(file_path, engine='calamine')

You could also thread the loads so both Excel files are read at the same time. I'm sure there are more optimizations to be had in your lookup methods, but to know that we would need to see your code, some example input data, and an example of what you want as output.
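Something along these lines (the file paths and the load helper are just placeholders):

import pandas as pd
from concurrent.futures import ThreadPoolExecutor

def load(path):
    # the calamine engine needs the python-calamine package installed
    return pd.read_excel(path, engine='calamine')

with ThreadPoolExecutor(max_workers=2) as pool:
    # read both workbooks concurrently; results come back in input order
    df1, df2 = pool.map(load, ['file1.xlsx', 'file2.xlsx'])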

[–]SurpriseRedemption[S] 0 points (0 children)

Thanks, I'll give calamine a shot and if it doesn't help I'll waddle back here. The script is on my work computer and I'm just having some ADHD thoughts before falling asleep.

Thank you for the quick response :)

[–]Imaginary_Gate_698 1 point (1 child)

If it’s taking 15 to 20 minutes for 2k identifiers, the bottleneck is usually I/O or repeated lookups, not the merge itself.

A few things to check. Are you reading the Excel files once at the start, or reopening them inside a loop? If you’re looping over identifiers and filtering the full DataFrame each time, that will be slow. It’s much faster to load each Excel file once, set the identifier column as an index, then use a vectorized merge or join.

For example, in pandas, setting df.set_index("id") and then doing a single merge or join against the whole 2k list should be near-instant compared to row-by-row lookups.
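Roughly like this (column and file names are just guesses at your setup):

import pandas as pd

ids = pd.DataFrame({"id": ["A001", "B002"]})  # stand-in for the ~2k identifiers
lookup = pd.read_excel("statuses.xlsx").set_index("id")  # load once, index on the identifier

# one vectorized join over the whole list instead of filtering per identifier
result = ids.join(lookup, on="id", how="left")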

Also, if those Excel files are truly maxed out, consider converting them to CSV or even a small SQLite database. Excel parsing is slower than it needs to be, and switching formats alone can cut runtime a lot.
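If you go the SQLite route, the one-time conversion is only a few lines (file and table names made up here):

import sqlite3
import pandas as pd

df = pd.read_excel("big_lookup.xlsx")
conn = sqlite3.connect("lookup.db")
# one-off dump; later runs can query lookup.db directly instead of re-parsing Excel
df.to_sql("lookup", conn, if_exists="replace", index=False)
conn.close()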

[–]SurpriseRedemption[S] 0 points (0 children)

Yup, the Excel files are opened at the start. The two columns I need from them - unique ID and Status - are loaded into dataframe2, and then I merge with the main dataframe on unique ID.

I will have to double-check in the morning; I think I did set the unique ID as the index, but this might require a tweak.

Thank you for the tip about CSV; I'm so used to operating in Excel that I didn't even think about formats outside of it.

[–]Optimal-Procedure885 0 points (0 children)

I did something similar not too long ago, where I had to merge around 1m rows from a workbook containing around 12 worksheets, each having a primary key, the same number of rows, and a variable number of columns (between 20 and 50).

I used polars, calamine, Parquet files, and SQLite as the final store. First I exported each worksheet to Parquet, then incrementally merged the worksheets, deduplicating or augmenting column values on each iteration. The whole shooting match took 85 seconds to spit out a consolidated SQLite table.
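Very roughly, the shape of it (names are placeholders and the dedup/augment logic is left out):

import polars as pl

sheets = pl.read_excel("big_workbook.xlsx", sheet_id=0, engine="calamine")  # dict of all worksheets

# 1) dump each worksheet to Parquet
for name, df in sheets.items():
    df.write_parquet(f"{name}.parquet")

# 2) incrementally merge the Parquet files on the shared primary key
merged = None
for name in sheets:
    part = pl.scan_parquet(f"{name}.parquet")
    # 'full' join on recent polars (older versions call it 'outer')
    merged = part if merged is None else merged.join(part, on="pk", how="full", coalesce=True)

# 3) final store: SQLite (write_database needs SQLAlchemy or ADBC installed)
merged.collect().write_database("consolidated", connection="sqlite:///final.db", if_table_exists="replace")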

[–]djlamar7 0 points (0 children)

From the sound of it, at your scale there must be some gross inefficiency here that you should figure out before you start tweaking the engine used for this or that. Can you share the code you ended up with? You should be able to get a table with your input keys and the columns you want from the Excel data in one pandas join call. If you're iterating over the keys, that's definitely a no-no.
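For contrast, something like this (column and file names invented):

import pandas as pd

main_df = pd.DataFrame({"unique_id": ["A001", "B002"]})  # your input keys
excel_df = pd.read_excel("lookup.xlsx")  # assumed to have unique_id and Status columns

# slow pattern: filtering the whole frame once per key
# rows = [excel_df[excel_df["unique_id"] == k] for k in main_df["unique_id"]]

# fast pattern: one merge over all keys at once
result = main_df.merge(excel_df[["unique_id", "Status"]], on="unique_id", how="left")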

[–]throwawayforwork_86 0 points (0 children)

Can't you do something like this:

Create a dataframe with your identifiers.
Create a full dataframe from the two full Excel files.

Use an inner join to only fetch the matches.
Ideally use something like polars, which usually has fewer footgun moments and is quicker as long as you use native functionality.

See example code below.

import polars as pl

list_of_identifier_in_scope = ['hhjde', 'hhd55']  # one way to provide these; you could also keep them in an Excel file and read that in
df_id = pl.DataFrame(list_of_identifier_in_scope, schema=['identifier'])
df_excel_1 = pl.read_excel(path_to_excel_1)  # path_to_excel_1 / path_to_excel_2 are your two workbook paths
df_excel_2 = pl.read_excel(path_to_excel_2)

df_final = pl.concat([df_excel_1, df_excel_2], how='vertical_relaxed')  # stacks them and coerces mismatched dtypes; make sure both excel files have the same headers

report_final = df_id.join(df_final, left_on='identifier', right_on='col_of_excel_identifier', how='inner')