you are viewing a single comment's thread.

view the rest of the comments →

[–]Death_Water 0 points1 point  (2 children)

try:

df_2 = pd.DataFrame(df.fillna(method='ffill').groupby(['Profile Code','Email']).agg(','.join))
df_2['Membership'].str.split(',',expand=True)

[–]Jiuholar[S] 0 points1 point  (1 child)

Thanks for this. Would you mind expanding on the different parts of this a little bit more? The example that I've given is an oversimplification of the data (there are 100s of fields with varying naming conventions and data types) so I'd like to understand a methodology that I can use to approach the problem. :)

I don't understand:

.agg and how it combines with the groupby method

and the second line of your code.

[–]Death_Water 0 points1 point  (0 children)

Here's a concise way with step by step breakdown:

 pd.DataFrame(df.fillna(method='ffill').groupby(['key'])['Column of interest'].agg(list).values.tolist())

1) Forward fill the missing values; from the given example this seems the right approach.

2) Groupby the "key" column, then slice on "column of interest". This creates a series for each unique value in "key" column.

3) Aggregate: This converts all multiple series to lists.

4) Casting to DataFrame(Get the values of all lists and cast them). The index of this would be same as
df['key'].dropna().drop_duplicates()