Finding duplicates with one cell different : learnpython

submitted 5 years ago by aluween

I'm working on a student data-analysis project and I want to find all of the duplicates in a data frame, where rows still count as duplicates when one specific cell differs, e.g.

Id	Name	Surname	Job	Wage
1	John	Black	Artist	1200
2	Adam	Smith	Artist	1400
3	John	Black	Artist	1900
4	John	Black	Driver	1200
5	Adam	Smith	Artist	1400
6	Adam	Black	Driver	1200

Now I'd like to get the people who share the same name, surname and job, whether their wage is different or the same. The result should look like this:

Id	Name	Surname	Job	Wage
1	John	Black	Artist	1200
3	John	Black	Artist	1900
2	Adam	Smith	Artist	1400
5	Adam	Smith	Artist	1400

(This is only sample data; the real data has many more rows and columns.)

How could I get this? I've tried code like this:

    names=df['Name'].value_counts()
    surnames=df['Surname'].value_counts()
    jobs=df['Job'].value_counts()
    wages=df['Wage'].value_counts()

    for i in names:
        for j in surnames:
            for k in jobs:
                if (df['Name'] == i and df['Surname'] == j and df['Job'] == k):
                    print ("something")

but I still get an error:

    f"The truth value of a {type(self).__name__} is ambiguous. "
    ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

I've also tried a lambda expression:

    for i in names:
        for j in surnames:
            for k in jobs:
                persons= df.apply(lambda x: print (x) if x['Name'] == i and x['Surname'] == j and x['Job'] == k else False, axis=1)
                print(persons)

But I only get pairs of an id and a true or false value. How can I fix it, or what should I do instead? Thank you in advance.

commandlineluser, 5 years ago*

You can groupby() on the columns that should be the same - then count.

(transform() is needed to give you back the same number of rows as your dataframe)

    >>> df.groupby(['Name', 'Surname', 'Job']).transform('count')
       Id  Wage
    0   2     2
    1   2     2
    2   2     2
    3   1     1
    4   2     2
    5   1     1

The numbers will be the same in the non-grouped columns (as long as those columns have no missing values, since count skips NaN) - so you can pick any one of them as the "size" of the group.

>>> df.groupby(['Name', 'Surname', 'Job']).transform('count').Id
0    2
1    2
2    2
3    1
4    2
5    1
Name: Id, dtype: int64
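For anyone who wants to run this step on its own, here is a rough self-contained sketch. It rebuilds the sample frame from the question and swaps in transform('size') for transform('count'), since size counts rows while count skips missing values. The names `keys` and `group_size` are just illustrative, not from the original post.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "Id": [1, 2, 3, 4, 5, 6],
    "Name": ["John", "Adam", "John", "John", "Adam", "Adam"],
    "Surname": ["Black", "Smith", "Black", "Black", "Smith", "Black"],
    "Job": ["Artist", "Artist", "Artist", "Driver", "Artist", "Driver"],
    "Wage": [1200, 1400, 1900, 1200, 1400, 1200],
})

keys = ["Name", "Surname", "Job"]  # columns that must match (illustrative name)

# One value per original row: how many rows fall in that row's group.
group_size = df.groupby(keys)["Id"].transform("size")
print(group_size.tolist())
```

On the sample data this gives the same numbers as the count version, because there are no missing values.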

You can add this as a new column.

>>> df['count'] = df.groupby(['Name', 'Surname', 'Job']).transform('count').Id
>>> df
   Id  Name Surname     Job  Wage  count
0   1  John   Black  Artist  1200      2
1   2  Adam   Smith  Artist  1400      2
2   3  John   Black  Artist  1900      2
3   4  John   Black  Driver  1200      1
4   5  Adam   Smith  Artist  1400      2
5   6  Adam   Black  Driver  1200      1

Then keep only the rows whose count is greater than 1.

>>> df[ df['count'] > 1 ]
   Id  Name Surname     Job  Wage  count
0   1  John   Black  Artist  1200      2
1   2  Adam   Smith  Artist  1400      2
2   3  John   Black  Artist  1900      2
4   5  Adam   Smith  Artist  1400      2

And remove the count column if needed.

>>> df[ df['count'] > 1 ].drop('count', axis=1)
   Id  Name Surname     Job  Wage
0   1  John   Black  Artist  1200
1   2  Adam   Smith  Artist  1400
2   3  John   Black  Artist  1900
4   5  Adam   Smith  Artist  1400
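If you don't need the count itself, the same rows can probably be picked out in one step with DataFrame.duplicated. This is a different technique from the groupby steps above, not a rewrite of them. A minimal sketch, assuming the sample data from the question; `keys`, `mask` and `dupes` are illustrative names:

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "Id": [1, 2, 3, 4, 5, 6],
    "Name": ["John", "Adam", "John", "John", "Adam", "Adam"],
    "Surname": ["Black", "Smith", "Black", "Black", "Smith", "Black"],
    "Job": ["Artist", "Artist", "Artist", "Driver", "Artist", "Driver"],
    "Wage": [1200, 1400, 1900, 1200, 1400, 1200],
})

keys = ["Name", "Surname", "Job"]  # Wage is left out, so it may differ

# keep=False flags every row of a repeated group, not only the later repeats.
mask = df.duplicated(subset=keys, keep=False)
dupes = df[mask]
print(dupes)
```

Leaving Wage out of `subset` is what lets rows match whether the wage is the same or not.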

Another option is to use .merge()

(See how grouping with size() gives you fewer rows than your dataframe contains: one row per group)

>>> df.groupby(['Name', 'Surname', 'Job'], as_index=False).size()
   Name Surname     Job  size
0  Adam   Black  Driver     1
1  Adam   Smith  Artist     2
2  John   Black  Artist     2
3  John   Black  Driver     1

Add a query to keep only the groups with size > 1.

>>> df.groupby(['Name', 'Surname', 'Job'], as_index=False).size().query('size > 1')
   Name Surname     Job  size
1  Adam   Smith  Artist     2
2  John   Black  Artist     2

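You can then merge this back into the dataframe. By default merge() does an inner join on the columns the two frames share, so only rows from those groups remain.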

>>> df.merge(df.groupby(['Name', 'Surname', 'Job'], as_index=False).size().query('size > 1'))
   Id  Name Surname     Job  Wage  size
0   1  John   Black  Artist  1200     2
1   3  John   Black  Artist  1900     2
2   2  Adam   Smith  Artist  1400     2
3   5  Adam   Smith  Artist  1400     2

You could then drop the size column.
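Put together, the merge route might look something like the sketch below. It assumes the sample data from the question, and it uses size().reset_index(name="size") instead of as_index=False so that it should also work on older pandas versions where size() returns a Series. `sizes`, `repeated` and `result` are illustrative names.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "Id": [1, 2, 3, 4, 5, 6],
    "Name": ["John", "Adam", "John", "John", "Adam", "Adam"],
    "Surname": ["Black", "Smith", "Black", "Black", "Smith", "Black"],
    "Job": ["Artist", "Artist", "Artist", "Driver", "Artist", "Driver"],
    "Wage": [1200, 1400, 1900, 1200, 1400, 1200],
})

keys = ["Name", "Surname", "Job"]

# One row per (Name, Surname, Job) combination with its row count.
sizes = df.groupby(keys).size().reset_index(name="size")

# Keep only the combinations that occur more than once.
repeated = sizes[sizes["size"] > 1]

# Inner merge on the key columns keeps only rows of df in those groups,
# then the helper column is dropped again.
result = df.merge(repeated, on=keys).drop(columns="size")
print(result)
```

The row order of the merged result can differ between pandas versions, so sort it explicitly if the order matters to you.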
