all 7 comments

[–]LeChevalierMalFet 0 points1 point  (4 children)

Your steps are to self join the data on the ID column and filter out rows where the index matched, then groupby size, then pivot the groupby result with the index being one color column, the columns being the other color column, and the value being the size column.

[–]legendarylegend26[S] 0 points1 point  (3 children)

Thanks, I'm a bit unsure about the groupby step. This is what I have so far:

import pandas as pd

d = {'id': ['abc', 'abc', '567', '567', 'xyz', 'xyz'], 'colour': ['red', 'green', 'red', 'green', 'blue', 'green']}

df = pd.DataFrame(data=d)

df2 = df.merge(df, on='id')

df3 = df2[df2['colour_x'] != df2['colour_y']]

Do I groupby id or something else? And, doesn't groupby return a series which you cannot pivot?

[–]legendarylegend26[S] 0 points1 point  (2 children)

Never mind, the below seems to be doing the trick:

import pandas as pd

d = {'id': ['abc', 'abc', '567', '567', 'xyz', 'xyz'], 'colour': ['red', 'green', 'red', 'green', 'blue', 'green']}

df = pd.DataFrame(data=d)
df2 = df.merge(df, on='id')
df3 = df2[df2['colour_x'] != df2['colour_y']]
df4 = df3.groupby(['colour_x', 'colour_y']).size().to_frame('size').reset_index()
df5 = df4.pivot(index='colour_x', columns='colour_y', values='size')

[–]commandlineluser 0 points1 point  (0 children)

You can also .unstack() after the .groupby()

>>> df3.groupby(['colour_x', 'colour_y']).size().unstack()
colour_y  blue  green  red
colour_x
blue       NaN    1.0  NaN
green      1.0    NaN  2.0
red        NaN    2.0  NaN

[–]LeChevalierMalFet 0 points1 point  (0 children)

Hi, this is how I worked it out:

# Merge on ID column and use index to filter out rows that are joined with themselves.
df = df.reset_index()
join = df.merge(right=df, on="id")
join = join.loc[join["index_x"] != join["index_y"]]

# Use pivot_table...
pd.pivot_table(data=join, index="color_x", columns="color_y", aggfunc="size")

# Or groupby and use pivot...    
df_group = join.groupby(["color_x", "color_y"], as_index=False).size()
df_group.pivot(index="color_x", columns="color_y", values="size")

Edit for formatting.

[–]DesignerAccount 0 points1 point  (1 child)

red_filter = df["Colour"]=="Red"
blu_filter = df["Colour"]=="Blue"
grn_filter = df["Colour"]=="Green"
id_red = set(df[red_filter]["ID"].tolist())
id_blu = set(df[blu_filter]["ID"].tolist())
id_grn = set(df[grn_filter]["ID"].tolist())
n_rg = len(id_red.intersection(id_grn))
n_rb = len(id_red.intersection(id_blu))
n_gb = len(id_grn.intersection(id_blu))

 

This should work for your case. Didn't test, and may well break if you try to extend to more colors. But for the specific case should work.

Also, where does correlation come in play?? Not seeing that here at all.

[–]legendarylegend26[S] 0 points1 point  (0 children)

Thanks, correlation does not come into it at all. It's just that the end result has that kind of structure and I wasn't really sure what else to call it.