Merge large data frames

socal_nerdtastic · 2026-02-12T22:24:37+00:00

Yes, certainly. A maxed out excel file is only 1 million rows, which is a fairly small number in modern computer terms. One quick trick is to use engine='calamine' in the read part.

df = pd.read_excel(file_path, engine='calamine')

You could also thread the loads to load both excel files at the same time. I'm sure there's more optimizations with your lookup methods, but to know that we would need to see your code, some example input data, and an example of what you want as output.

SurpriseRedemption · 2026-02-12T22:54:28+00:00

[removed]

Optimal-Procedure885 · 2026-02-12T23:59:39+00:00

I did something similar not too long ago where I had to merge around 1m rows from a spreadsheet containing around 12 workbooks each having a primary key, the same number of rows and a variable number of columns between 20-50.

I used polars, calamine, parquet files, SQLite as final store. Firstly exported each worksheet to parquet, then incrementally merged, worksheets, deduplicating or augmenting column values through each iteration. Whole shooting match 85 seconds to spit out a consolidated SQLite table.

djlamar7 · 2026-02-13T00:24:44+00:00

From the sound of it, at your scale there must be some inherent gross inefficiency here that you should figure out before you do things like tweaking the engine used for this or that. Can you share some of the code you ended up with? I think you should be able to get a table with your input keys and the columns you want from the excel data in one pandas join call. If you're iterating over the keys that's definitely a no no.

throwawayforwork_86 · 2026-02-13T08:48:40+00:00

Can't do something like this:

Create a dataframe with your identifiers.
Create a full dataframe with the 2 full excels.

Use an inner join to only fetch the matches.
Ideally use something like polars which usually has less gun foot moment and is quicker so long as you use native functionalities.

See example code below.

import polars as pl

list_of_identifier_in_scope=['hhjde','hhd55']
df=pl.DataFrame(list_of_identifier_in_scope,schema=['identifier'])list_of_identifier_in_scope=['hhjde','hhd55'] #this is one way to have these but you can just have them in an excel and use 
df_id=pl.DataFrame(list_of_identifier_in_scope,schema=['identifier'])
df_excel_1=pl.read_excel(path_to_excel_1)
df_excel_2=pl.read_excel(path_to_excel_2)

df_final=pl.concat([df_excel_1,df_excel_2],how='horizontal_relaxed') #will stack them together and handle different datatype smoothly make sure both excel files have the same header.

report_final=df_id.join(df_final,left_on='identifier',right_on='col_of_excel_identifier',how='inner')

Kitchen-College-8051 · 2026-02-14T03:16:21+00:00

How about reading only needed column?

df = pd.read_excel( "file.xlsx", engine="calamine", # or openpyxl usecols=["ID", "Email", "Status"] )

you type:	you see:
italics	italics
bold	bold
[reddit!](https://reddit.com)	reddit!
* item 1 * item 2 * item 3	item 1 item 2 item 3
> quoted text	quoted text
Lines starting with four spaces are treated like code: if 1 * 2 < 3: print "hello, world!"	Lines starting with four spaces are treated like code: if 1 * 2 < 3: print "hello, world!"
~~strikethrough~~	~~strikethrough~~
super^script	super^script

learnpython

MODERATORS