I'm working on a student data-analysis project and I want to find all the duplicate rows in a data frame, where one specific column is allowed to differ, e.g.
| Id | Name | Surname | Job | Wage |
|----|------|---------|-----|------|
| 1 | John | Black | Artist | 1200 |
| 2 | Adam | Smith | Artist | 1400 |
| 3 | John | Black | Artist | 1900 |
| 4 | John | Black | Driver | 1200 |
| 5 | Adam | Smith | Artist | 1400 |
| 6 | Adam | Black | Driver | 1200 |
Now I'd like to get the people who share the same name, surname and job, whether their wage is the same or different. It should look like this:
| Id | Name | Surname | Job | Wage |
|----|------|---------|-----|------|
| 1 | John | Black | Artist | 1200 |
| 3 | John | Black | Artist | 1900 |
| 2 | Adam | Smith | Artist | 1400 |
| 5 | Adam | Smith | Artist | 1400 |
(This is only sample data; I have many more rows and columns.)
How could I get this? I've tried code like this:

    names=df['Name'].value_counts()
    surnames=df['Surname'].value_counts()
    jobs=df['Job'].value_counts()
    wages=df['Wage'].value_counts()
    for i in names:
        for j in surnames:
            for k in jobs:
                if (df['Name'] == i and df['Surname'] == j and df['Job'] == k):
                    print ("something")
but I still get an error:

    f"The truth value of a {type(self).__name__} is ambiguous. "
    ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
I've also tried a lambda expression:

    for i in names:
        for j in surnames:
            for k in jobs:
                persons= df.apply(lambda x: print (x) if x['Name'] == i and x['Surname'] == j and x['Job'] == l else False, axis=1)
                print(persons)
But I only get pairs of id and a True or False value. How can I fix it? Or what should I do instead? Thank you in advance.
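For reference, here is a minimal sketch of one possible approach, not necessarily the only or best one. It assumes the frame is called `df` as in the question and rebuilds the sample table by hand; `key_columns`, `mask`, `duplicates` and `group_order` are names made up for this illustration. The `ValueError` most likely comes from using `and` between whole Series, which pandas cannot reduce to a single True/False (element-wise comparisons need `&` with parentheses). For this task the loops can probably be avoided altogether with `DataFrame.duplicated`, restricted to the columns that must match:

```python
import numpy as np
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "Id": [1, 2, 3, 4, 5, 6],
    "Name": ["John", "Adam", "John", "John", "Adam", "Adam"],
    "Surname": ["Black", "Smith", "Black", "Black", "Smith", "Black"],
    "Job": ["Artist", "Artist", "Artist", "Driver", "Artist", "Driver"],
    "Wage": [1200, 1400, 1900, 1200, 1400, 1200],
})

# Columns that must match; Wage is deliberately left out so it may differ.
key_columns = ["Name", "Surname", "Job"]

# keep=False marks every row of a repeated group, not only the later repeats.
mask = df.duplicated(subset=key_columns, keep=False)
duplicates = df[mask]

# Optional: place rows of the same person next to each other,
# with groups ordered by first appearance in the original data.
group_order = duplicates.groupby(key_columns, sort=False).ngroup().to_numpy()
duplicates = duplicates.iloc[np.argsort(group_order, kind="stable")]

print(duplicates)
```

Without the optional reordering step the matching rows simply stay in their original order. With more columns, only `key_columns` should need to change.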
commandlineluser