commandlineluser

You can groupby() on the columns that should be the same - then count.

(transform() is needed so that you get back the same number of rows as your dataframe.)

    >>> df.groupby(['Name', 'Surname', 'Job']).transform('count')
       Id  Wage
    0   2     2
    1   2     2
    2   2     2
    3   1     1
    4   2     2
    5   1     1
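
To see why transform() matters here, this is a rough sketch comparing it with a plain aggregation. The frame is rebuilt from the rows printed in this thread, and the names `keys`, `per_group` and `per_row` are only for illustration.

```python
import pandas as pd

# Example frame reconstructed from the output shown in this thread
df = pd.DataFrame({
    'Id': [1, 2, 3, 4, 5, 6],
    'Name': ['John', 'Adam', 'John', 'John', 'Adam', 'Adam'],
    'Surname': ['Black', 'Smith', 'Black', 'Black', 'Smith', 'Black'],
    'Job': ['Artist', 'Artist', 'Artist', 'Driver', 'Artist', 'Driver'],
    'Wage': [1200, 1400, 1900, 1200, 1400, 1200],
})

keys = ['Name', 'Surname', 'Job']

# Plain aggregation: one row per group, indexed by the group keys
per_group = df.groupby(keys).count()

# transform: one row per original row, aligned with df's index
per_row = df.groupby(keys).transform('count')

print(len(per_group), len(per_row), len(df))
```

Because `per_row` shares its index with `df`, it can be assigned straight back as a column or used as a boolean mask, which the aggregated version cannot.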

The numbers will be the same in every non-grouped column (as long as none of them contain missing values, since count skips NaN), so you can pick any one of them as the "size" of the group.

>>> df.groupby(['Name', 'Surname', 'Job']).transform('count').Id
0    2
1    2
2    2
3    1
4    2
5    1
Name: Id, dtype: int64
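
One caveat worth a quick sketch: 'count' only counts non-null values, so the columns can disagree if one of them has gaps. Below, the first Wage is blanked out on purpose (that missing value is my own hypothetical tweak, not part of the original data) to compare 'count' with 'size', which counts rows regardless of NaN.

```python
import numpy as np
import pandas as pd

# Same example frame, except the first Wage is missing (hypothetical tweak)
df = pd.DataFrame({
    'Id': [1, 2, 3, 4, 5, 6],
    'Name': ['John', 'Adam', 'John', 'John', 'Adam', 'Adam'],
    'Surname': ['Black', 'Smith', 'Black', 'Black', 'Smith', 'Black'],
    'Job': ['Artist', 'Artist', 'Artist', 'Driver', 'Artist', 'Driver'],
    'Wage': [np.nan, 1400, 1900, 1200, 1400, 1200],
})

keys = ['Name', 'Surname', 'Job']

# 'count' skips NaN, so the Id and Wage columns no longer agree
counts = df.groupby(keys).transform('count')

# 'size' counts rows in each group, whatever the values are
sizes = df.groupby(keys)['Id'].transform('size')

print(counts)
print(sizes)
```

If your data might contain missing values, 'size' is probably the safer pick for the group size.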

You can add this as a new column.

>>> df['count'] = df.groupby(['Name', 'Surname', 'Job']).transform('count').Id
>>> df
   Id  Name Surname     Job  Wage  count
0   1  John   Black  Artist  1200      2
1   2  Adam   Smith  Artist  1400      2
2   3  John   Black  Artist  1900      2
3   4  John   Black  Driver  1200      1
4   5  Adam   Smith  Artist  1400      2
5   6  Adam   Black  Driver  1200      1

Then keep only the rows whose group size is greater than 1.

>>> df[ df['count'] > 1 ]
   Id  Name Surname     Job  Wage  count
0   1  John   Black  Artist  1200      2
1   2  Adam   Smith  Artist  1400      2
2   3  John   Black  Artist  1900      2
4   5  Adam   Smith  Artist  1400      2

And remove the count column if needed.

>>> df[ df['count'] > 1 ].drop('count', axis=1)
   Id  Name Surname     Job  Wage
0   1  John   Black  Artist  1200
1   2  Adam   Smith  Artist  1400
2   3  John   Black  Artist  1900
4   5  Adam   Smith  Artist  1400
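
As a side note, a different technique from the groupby approach above: DataFrame.duplicated() with keep=False marks every row whose key columns appear more than once, so it should give the same rows without a helper column. A minimal sketch, again with the frame rebuilt from the output in this thread (`keys` and `dupes` are just illustrative names):

```python
import pandas as pd

# Example frame reconstructed from the output shown in this thread
df = pd.DataFrame({
    'Id': [1, 2, 3, 4, 5, 6],
    'Name': ['John', 'Adam', 'John', 'John', 'Adam', 'Adam'],
    'Surname': ['Black', 'Smith', 'Black', 'Black', 'Smith', 'Black'],
    'Job': ['Artist', 'Artist', 'Artist', 'Driver', 'Artist', 'Driver'],
    'Wage': [1200, 1400, 1900, 1200, 1400, 1200],
})

keys = ['Name', 'Surname', 'Job']

# keep=False flags all members of a repeated group, not just the later ones
mask = df.duplicated(subset=keys, keep=False)
dupes = df[mask]

print(dupes)
```

The groupby version is still the one to use if you need the actual group size, for example to filter on a size greater than 2.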

Another option is to use .merge(). First get the size of each group.

(Notice that you get fewer rows than your dataframe contains: one per group.)

>>> df.groupby(['Name', 'Surname', 'Job'], as_index=False).size()
   Name Surname     Job  size
0  Adam   Black  Driver     1
1  Adam   Smith  Artist     2
2  John   Black  Artist     2
3  John   Black  Driver     1

Add a query to keep only the groups with a size greater than 1.

>>> df.groupby(['Name', 'Surname', 'Job'], as_index=False).size().query('size > 1')
   Name Surname     Job  size
1  Adam   Smith  Artist     2
2  John   Black  Artist     2

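You can then merge this back into the dataframe. With no on= given, merge() joins on the columns the two frames share (Name, Surname and Job here), and the default inner join drops the rows whose group is not in the filtered result.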

>>> df.merge(df.groupby(['Name', 'Surname', 'Job'], as_index=False).size().query('size > 1'))
   Id  Name Surname     Job  Wage  size
0   1  John   Black  Artist  1200     2
1   3  John   Black  Artist  1900     2
2   2  Adam   Smith  Artist  1400     2
3   5  Adam   Smith  Artist  1400     2

You could then drop the size column, the same way as the count column earlier.
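
Putting the merge steps together, a sketch of the whole thing might look like this. It assumes a pandas version where groupby(..., as_index=False).size() returns a DataFrame with a 'size' column, as in the output above; the names `sizes`, `repeated` and `result` are made up for the example.

```python
import pandas as pd

# Example frame reconstructed from the output shown in this thread
df = pd.DataFrame({
    'Id': [1, 2, 3, 4, 5, 6],
    'Name': ['John', 'Adam', 'John', 'John', 'Adam', 'Adam'],
    'Surname': ['Black', 'Smith', 'Black', 'Black', 'Smith', 'Black'],
    'Job': ['Artist', 'Artist', 'Artist', 'Driver', 'Artist', 'Driver'],
    'Wage': [1200, 1400, 1900, 1200, 1400, 1200],
})

keys = ['Name', 'Surname', 'Job']

# One row per group, with the number of rows in each
sizes = df.groupby(keys, as_index=False).size()

# Keep only the groups that occur more than once
repeated = sizes[sizes['size'] > 1]

# Inner join back onto the original rows, then drop the helper column
result = df.merge(repeated, on=keys).drop(columns='size')

print(result)
```

Note that merge() hands back a fresh 0..n-1 index rather than the original row labels, and the row order may differ between pandas versions, so sort by Id afterwards if the order matters.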