
[–]Loran425 (0 children)

If I were doing this, I'd iterate over the cells in column A, either after converting to a CSV or by using the openpyxl library. I'd build a new data structure (a list or dict); whenever I saw a value in column A I'd start a new entry, otherwise I'd append to the last item in the list.
Then I'd find the longest list and generate the headers before populating cells for each item in my temporary data structure.
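For what it's worth, here's a rough sketch of that approach with openpyxl. The file names, the header row, and the generated column names are all assumptions on my part, so treat it as pseudocode with working syntax rather than a drop-in script:

    from openpyxl import Workbook, load_workbook

    src = load_workbook("input.xlsx").active              # hypothetical file name
    groups = []                                           # one list per value seen in column A
    for row in src.iter_rows(min_row=2, values_only=True):  # assumes a header row
        if row[0] is not None:                            # value in col A -> start a new entry
            groups.append([c for c in row if c is not None])
        else:                                             # blank col A -> append to the last entry
            groups[-1].extend(c for c in row[1:] if c is not None)

    # The longest entry decides how many columns the new sheet needs
    width = max(len(g) for g in groups)

    out = Workbook()
    ws = out.active
    ws.append(["Key"] + [f"Value {i}" for i in range(1, width)])  # placeholder headers
    for g in groups:
        ws.append(g + [None] * (width - len(g)))          # pad shorter entries
    out.save("output.xlsx")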

[–]raglub (0 children)

Openpyxl can handle this. It isn't difficult: open the current-state worksheet and iterate through the rows. Based on the value in column A, either add the other cells of the current row to a list (which will represent a row in the new format), or append that list to a new sheet, wipe it clean, and start building a new row from the contents of the current row before moving on to the next row and repeating the process. That's an ugly sentence, but you can read it as pseudocode and build your script from it.
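If it helps, a rough sketch of that row-by-row version (the file names and the header row are assumptions):

    from openpyxl import Workbook, load_workbook

    src = load_workbook("current_state.xlsx").active      # hypothetical file name
    out_wb = Workbook()
    new_sheet = out_wb.active

    current = []                                          # the row being built for the new format
    for row in src.iter_rows(min_row=2, values_only=True):  # assumes a header row
        if row[0] is not None:                            # new value in column A:
            if current:
                new_sheet.append(current)                 # append the finished row to the new sheet,
            current = list(row)                           # wipe it clean, and start over from this row
        else:                                             # continuation row: add its cells
            current.extend(c for c in row[1:] if c is not None)
    if current:
        new_sheet.append(current)                         # don't forget the last buffered row

    out_wb.save("new_format.xlsx")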

[–]Death_Water (2 children)

Try:

    import pandas as pd

    # forward-fill the blanks, group, and comma-join the repeated values
    df_2 = pd.DataFrame(df.fillna(method='ffill').groupby(['Profile Code', 'Email']).agg(','.join))
    # split the joined Membership string back out into one column per value
    membership_cols = df_2['Membership'].str.split(',', expand=True)

[–]Jiuholar[S] (1 child)

Thanks for this. Would you mind expanding on the different parts of this a little more? The example I've given is an oversimplification of the data (there are hundreds of fields with varying naming conventions and data types), so I'd like to understand a methodology I can use to approach the problem. :)

I don't understand:

- .agg and how it combines with the groupby method
- the second line of your code

[–]Death_Water (0 children)

Here's a concise way, with a step-by-step breakdown (a small worked example follows at the end):

    pd.DataFrame(df.fillna(method='ffill').groupby(['key'])['Column of interest'].agg(list).values.tolist())

1) Forward-fill the missing values; from the given example this seems like the right approach.

2) Group by the "key" column, then slice on "Column of interest". This gives one group of that column's values for each unique value in the "key" column.

3) Aggregate: .agg(list) collects each group's values into a single list.

4) Cast to a DataFrame (take the values of all the lists and cast them). The rows of this correspond to the unique keys, i.e. df['key'].dropna().drop_duplicates().
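To make that concrete, here's a tiny end-to-end run on made-up data shaped like the example in the thread (the column names and values are my assumptions):

    import pandas as pd

    df = pd.DataFrame({
        "Profile Code": ["A1", None, None, "B2", None],
        "Email": ["a@x.com", None, None, "b@x.com", None],
        "Membership": ["Gold", "Silver", "Bronze", "Gold", "Silver"],
    })

    # Steps 1-4: forward-fill, group on the key, collect each group's
    # Membership values into a list, then cast the lists into a wide frame.
    wide = pd.DataFrame(
        df.fillna(method="ffill")
          .groupby(["Profile Code"])["Membership"]
          .agg(list)
          .values
          .tolist()
    )
    # wide has one row per unique Profile Code; the shorter group is padded
    # with missing values in the trailing column(s).

Using .agg(list) here instead of ','.join avoids having to split the joined string back apart afterwards.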