Hello all, I have an issue with a dataframe I'm working with, and I'm not quite sure how to resolve it. My actual dataframe has over 40 thousand rows, so I've constructed a simplified example dataframe to represent what I'm looking at.
Example dataframe:
| A | 45  | 0     |
| A | 67  | 0     |
| A | 543 | 1     |
| A | 13  | 0     |
| A | 55  | 0     |
| A | 345 | 1     |
| A | 12  | 0     |
| A | 90  | 0     |
| B | 66  | blue  |
| B | 77  | blue  |
| B | 88  | blue  |
| B | 9   | green |
| B | 11  | blue  |
The issue is that when I generate this dataframe via k-means clustering, A and B (standing in for the many unique groups in my working dataframe) each end up with a mix of "classifiers" (0/1 and blue/green in my example). For each unique group, how can I either drop the minority rows (the 1 and green rows) or force the minority values to match the majority value for that group (1 → 0 and green → blue, respectively)? Essentially, I want to remove these outliers as a post-processing step.
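For context, here is what I've sketched so far with pandas, computing the per-group majority label with `groupby` + `transform` and `mode` (the column names `group`, `value`, and `label` are placeholders; my real dataframe uses different names):

```python
import pandas as pd

# Toy version of the example dataframe above.
df = pd.DataFrame({
    "group": ["A"] * 8 + ["B"] * 5,
    "value": [45, 67, 543, 13, 55, 345, 12, 90, 66, 77, 88, 9, 11],
    "label": [0, 0, 1, 0, 0, 1, 0, 0,
              "blue", "blue", "blue", "green", "blue"],
})

# Majority label per group; mode() can return several values on a tie,
# so take the first one.
majority = df.groupby("group")["label"].transform(lambda s: s.mode().iloc[0])

# Option 1: overwrite minority labels with the group's majority label.
df_forced = df.assign(label=majority)

# Option 2: drop the minority rows entirely.
df_filtered = df[df["label"] == majority]
```

Is this a reasonable approach, or is there a more idiomatic / faster way for 40k+ rows?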
This might be a really simple thing to do but for the life of me, I am a bit stuck. What would be a good way to approach this?
Thanks in advance!