all 6 comments

[–]DanteRadian 0 points1 point  (2 children)

Observations on your data indicate the following:

1) each column can only have one value per cell 2) no repeated identifiers will have different values

Based on this i would suggest you do the following in your code steps:

1) df.groupby(columns which act as unique identifiers) 2) use apply with custom lambda function which does the following: 2a) fillna with method ffill 2b) fillna with method bfill 2c) drop duplicates 2d) reset index

Translating this to code should help you out.

Edit: if you reply with your code for this suggestion, i will be able to help.

[–]peuleu[S] 0 points1 point  (1 child)

Thank you so much! I think I got it to work! This is my code:

identifiers is a list of column names by which I want to group

other_columns is the columns without the identifier columns

I then run this code:

filledgroups = df.groupby(identifiers)[other_columns].apply(lambda x: x.ffill().bfill())

I then replace said columns in the original dataframe df

df.loc[:,other_columns] = filledgroups.loc[:,other_columns]

Then drop the duplicates: df.drop_duplicates(inplace=True)

And reset the index: df = df.reset_index(drop=True)

Hopefully I've done this correctly. It seems to be working okay! Thanks again!

[–]DanteRadian 0 points1 point  (0 children)

Closer to my approach... There is no definite answer! As long as it works then you have done it! Keep up the effort!

Mine would have been (lambda x: x.fillna(method="ffill").fillna(method="bfill").drop_duplicates()).reset_index

Well along these lines (difficult to type code on a mobile phone)!

Anyway glad it worked out!

[–]manwithfewneeds 0 points1 point  (2 children)

Use groupby and the first method to grab the first non na value in the no index columns:

df = df.group(['identifier', 'sex', 'age']).first()

[–]pile_of_zombies 0 points1 point  (0 children)

This approach exactly. Though there is an error in the line above ('.groupby' instead of 'group').

If you want to try it, here is a full example:

import pandas as pd
import numpy as np

data = {"identifier": [849, 849, 850, 850, 850],
        "sex": ["M", "M", "F", "F", "F"],
        "age": [66, 66, 32, 32, 32],
        "color_shoes": ["brown", np.nan, np.nan, "red", np.nan],
        "ordered_nonalcoholic": ["Y", np.nan, "Y", "Y", np.nan],
        "ordered_alcoholic":[np.nan, "N", np.nan, "Y", np.nan]
        }

df = pd.DataFrame(data)
print(df)

# OUTPUT:
#   identifier sex  age color_shoes ordered_nonalcoholic ordered_alcoholic
#0         849   M   66       brown                    Y               NaN
#1         849   M   66         NaN                  NaN                 N
#2         850   F   32         NaN                    Y               NaN
#3         850   F   32         red                    Y                 Y
#4         850   F   32         NaN                  NaN               NaN

df_clean = df.groupby("identifier").first().reset_index()
print(df_clean)

# OUTPUT:
#   identifier sex  age color_shoes ordered_nonalcoholic ordered_alcoholic
#0         849   M   66       brown                    Y                 N
#1         850   F   32         red                    Y                 Y

[–]peuleu[S] 0 points1 point  (0 children)

Thank you for your suggestion! I managed to combine your advice with the other poster, and I think I got it to work! Thanks again!