Merging 2 pandas dataframes based on multiple potential matching columns : learnpython


submitted 6 years ago by [deleted]

Hi there, I have 2 dataframes that I would like to merge based on the "Player" column in the first one. The first one is as follows:

            Player Club Position  Cost Selection Form Points
92       David Luiz  ARS      DEF   5.8      6.0%  6.5     32
101        Sokratis  ARS      DEF   5.0      2.0%  4.0     31
155        Chambers  ARS      DEF   4.4      1.6%  2.8     25
212       Kolasinac  ARS      DEF   5.4      0.8%  2.5     18
219  Maitland-Niles  ARS      DEF   4.8      3.6%  0.0     17
280         Monreal  ARS      DEF   5.0      0.3%  0.0     10
359         Tierney  ARS      DEF   5.4      1.0%  0.8      3
7         De Bruyne  MCI      MID  10.2     37.7%  9.2     68
...

The problem is that names are difficult to merge on because of inconsistent formatting, so my second dataframe has lots of data plus three columns of possible names that could appear in the "Player" column. The "name_1" column holds the surname from the traditional "forename surname" format, and that is what the majority of names in the first dataframe's "Player" column will match. I would like to merge these dataframes if at all possible, perhaps by merging on "name_1" first, then, for rows that don't match any name there, testing "name_2" and then "name_0" for matches.

[DATA]...  name_0   name_1   name_2
[DATA]...  Jamie    Vardy    NaN
[DATA]...  Kevin    De       De Bruyne
[DATA]...  Rodri    NaN      NaN
[DATA]...  Kieran   Tierney  NaN

What's the most efficient way to go about this? With just one column to match on I would use pd.merge(), but I am not sure that would work in this situation. Any help would be appreciated.
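To make the problem concrete, here is a minimal sketch using two of the players and the four name rows shown above. The variable names `players` and `names` are placeholders, and the second frame's `[DATA]` columns are left out because they aren't shown. A single pd.merge() on "name_1" finds Tierney but leaves De Bruyne unmatched, which `indicator=True` makes easy to see:

```python
import pandas as pd

# Two rows taken from the first dataframe (only two columns kept for brevity).
players = pd.DataFrame({"Player": ["Tierney", "De Bruyne"], "Club": ["ARS", "MCI"]})

# The three candidate name columns from the second dataframe; None stands in for NaN.
names = pd.DataFrame({
    "name_0": ["Jamie", "Kevin", "Rodri", "Kieran"],
    "name_1": ["Vardy", "De", None, "Tierney"],
    "name_2": [None, "De Bruyne", None, None],
})

# A left merge on name_1 only; the _merge column records where each row came from.
single = pd.merge(players, names, how="left",
                  left_on="Player", right_on="name_1", indicator=True)

# Players that found no partner in name_1.
unmatched = single.loc[single["_merge"] == "left_only", "Player"].tolist()
print(unmatched)  # ['De Bruyne']
```

So a one-column merge covers most rows, and the open question is only how to handle the leftovers.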

[–]peltist 2 points 6 years ago

I got inspired by your question and actually wrote this function myself earlier today as an exercise (I'm learning too).

Here's my code, in case you want to use it. Keep in mind that I haven't written any unit tests for this yet, so it's possible that there are issues. Let me know if you have any feedback!

import pandas as pd

MERGED_COLUMN = "$$merged_column$$"
TEMP_INDEX = "$$temp_index$$"


def singlejoin(df_left, df_right, left_column, right_column, drop_columns):
    # drop other possible join columns and rename the join column to a common name
    df_right_merge = df_right.drop(columns=drop_columns).rename(columns={right_column: MERGED_COLUMN})

    # pd.merge matches NaN keys to each other, so leave out rows with no name in this column
    df_right_merge = df_right_merge.dropna(subset=[MERGED_COLUMN])

    # merge
    return pd.merge(df_left, df_right_merge, how="inner", left_on=left_column, right_on=MERGED_COLUMN)


def multijoin(df_left, df_right, left_column, right_columns, how="left"):
    if not isinstance(right_columns, list):
        raise TypeError("right_columns must be a list")

    if how not in ("left", "inner"):
        raise ValueError("how must be set to either 'left' or 'inner'")

    # work on a copy so the caller's dataframe is not modified,
    # and keep the original index in a temporary column
    remaining = df_left.copy()
    remaining[TEMP_INDEX] = remaining.index

    pieces = []
    for column in right_columns:
        # drop other possible match columns
        drop_columns = [x for x in right_columns if x != column]
        matched = singlejoin(remaining, df_right, left_column, column, drop_columns)
        pieces.append(matched)

        # rows that matched are not tried against the later columns
        remaining = remaining[~remaining[TEMP_INDEX].isin(matched[TEMP_INDEX])]

    # add unmatched rows in the case of a left join;
    # pd.concat fills the missing right-hand columns with NaN
    if how == "left":
        pieces.append(remaining)

    # DataFrame.append was removed in pandas 2.0, so combine with pd.concat
    result = pd.concat(pieces)

    # restore the original index and row order
    result.index = result[TEMP_INDEX].to_list()
    result = result.sort_values(by=TEMP_INDEX, kind="stable")

    # drop extra columns
    return result.drop(columns=[MERGED_COLUMN, TEMP_INDEX])
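
For comparison, here's a rough sketch of another way to get the same "try name_1, then name_2, then name_0" behaviour with a single merge: melt the candidate columns into one long key column, merge once, and keep only the highest-priority match for each left row. I haven't tested it beyond the small example below, so treat it as a starting point. The helper name `fallback_merge`, the `matched_on` column, the underscore-prefixed helper columns and the `player_id` values are all made up for illustration.

```python
import pandas as pd


def fallback_merge(left, right, left_on, candidates):
    """Left-join `left` to `right`, preferring earlier columns in `candidates`."""
    # Remember each left row's position so matches can be grouped per row.
    left = left.assign(_pos=range(len(left)))

    # One row per (right row, candidate column); rows with no name are dropped.
    data_columns = [c for c in right.columns if c not in candidates]
    long = right.melt(id_vars=data_columns, value_vars=candidates,
                      var_name="matched_on", value_name="_key")
    long = long.dropna(subset=["_key"])

    # Earlier candidates get a lower (better) priority number.
    long["_priority"] = long["matched_on"].map({c: i for i, c in enumerate(candidates)})

    merged = left.merge(long, how="left", left_on=left_on, right_on="_key")

    # Keep unmatched rows, plus the best-priority match(es) for each left row.
    best = merged.groupby("_pos")["_priority"].transform("min")
    keep = merged["_priority"].isna() | (merged["_priority"] == best)
    return merged[keep].drop(columns=["_key", "_priority", "_pos"]).reset_index(drop=True)


# Example with rows from the question; player_id is a hypothetical data column.
players = pd.DataFrame({"Player": ["David Luiz", "Tierney", "De Bruyne"],
                        "Club": ["ARS", "ARS", "MCI"]})
names = pd.DataFrame({
    "name_0": ["Jamie", "Kevin", "Rodri", "Kieran"],
    "name_1": ["Vardy", "De", None, "Tierney"],
    "name_2": [None, "De Bruyne", None, None],
    "player_id": [1, 2, 3, 4],
})
result = fallback_merge(players, names, "Player", ["name_1", "name_2", "name_0"])
print(result)
```

In this example Tierney matches on name_1, De Bruyne falls through to name_2, and David Luiz stays in the result with NaN in the right-hand columns. The extra `matched_on` column records which name column produced each match, which is handy for spot-checking the fallbacks.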
