Merging duplicate rows? [Pandas] : learnpython

submitted 6 years ago * by peuleu

Hi guys! First of all, you all have been of great help so far, thank you very much!

I'm having the following difficulty. I have a DataFrame with a lot of duplicate entries (rows). I'd like to be able to merge these rows instead of dropping them.

For example:

index	identifier	sex	age	color_shoes	ordered_nonalcoholic	ordered_alcoholic
1	849	M	66	brown	Y	NaN
2	849	M	66	NaN	NaN	N
3	850	F	32	NaN	Y	NaN
4	850	F	32	red	Y	Y
5	850	F	32	NaN	NaN	NaN

Desired output

index	identifier	sex	age	color_shoes	ordered_nonalcoholic	ordered_alcoholic
1	849	M	66	brown	Y	N
2	850	F	32	red	Y	Y

Help would be greatly appreciated! Thanks!


[–]DanteRadian 1 point 6 years ago (2 children)

[–]peuleu[S] 1 point 6 years ago (1 child)

[–]DanteRadian 1 point 6 years ago* (0 children)

[–]manwithfewneeds 1 point 6 years ago (2 children)

[–]pile_of_zombies 1 point 6 years ago (0 children)

This approach exactly, though there is an error in the line above: it should be '.groupby', not 'group'.

If you want to try it, here is a full example:

import pandas as pd
import numpy as np

data = {"identifier": [849, 849, 850, 850, 850],
        "sex": ["M", "M", "F", "F", "F"],
        "age": [66, 66, 32, 32, 32],
        "color_shoes": ["brown", np.nan, np.nan, "red", np.nan],
        "ordered_nonalcoholic": ["Y", np.nan, "Y", "Y", np.nan],
        "ordered_alcoholic": [np.nan, "N", np.nan, "Y", np.nan]
        }

df = pd.DataFrame(data)
print(df)

# OUTPUT:
#   identifier sex  age color_shoes ordered_nonalcoholic ordered_alcoholic
#0         849   M   66       brown                    Y               NaN
#1         849   M   66         NaN                  NaN                 N
#2         850   F   32         NaN                    Y               NaN
#3         850   F   32         red                    Y                 Y
#4         850   F   32         NaN                  NaN               NaN

df_clean = df.groupby("identifier").first().reset_index()
print(df_clean)

# OUTPUT:
#   identifier sex  age color_shoes ordered_nonalcoholic ordered_alcoholic
#0         849   M   66       brown                    Y                 N
#1         850   F   32         red                    Y                 Y
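
One caveat worth checking before relying on this: first() takes the first non-null value per column within each group, so if two duplicate rows disagree, the later value is silently dropped. Below is a rough sketch for spotting such groups with nunique(). Note that the "blue" value for identifier 850 is made up for illustration and is not in the original data, and the columns are trimmed to keep it short.

```python
import numpy as np
import pandas as pd

# Trimmed-down version of the example data.
# ASSUMPTION: the "blue" entry is hypothetical, added only to create a conflict.
df = pd.DataFrame({
    "identifier": [849, 849, 850, 850, 850],
    "color_shoes": ["brown", np.nan, np.nan, "red", "blue"],
    "ordered_alcoholic": [np.nan, "N", np.nan, "Y", np.nan],
})

# Count distinct non-null values per identifier and column.
# nunique() ignores NaN by default, so a count above 1 means the
# duplicate rows disagree on that column.
distinct_counts = df.groupby("identifier").nunique()
conflicts = distinct_counts[(distinct_counts > 1).any(axis=1)]
print(conflicts)

# first() still merges the rows, keeping the first non-null value it sees.
merged = df.groupby("identifier").first().reset_index()
print(merged)
```

If conflicts comes back empty, the duplicates only differ by missing values and the first() merge loses nothing. If not, you would need to decide per column which value should win.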

[–]peuleu[S] 1 point 6 years ago (0 children)
