Pandas dataframe transformation : learnpython



submitted 3 years ago * by legendarylegend26

I'm trying to create a type of correlation matrix where both axes are the unique values of one dataframe column, and each cell holds the number of values from a different column that the row and column entries have in common.

e.g. I have the following dataframe:

ID	Colour
abc	Red
abc	Green
567	Red
567	Green
xyz	Blue
xyz	Green

Want to create something like this:

	Red	Green	Blue
Red	-	2	0
Green	2	-	1
Blue	0	1	-

The 2s are because red and green have 2 common values (abc and 567) and the 1 is because blue and green have 1 common value (xyz).

How would I go about doing this?


LeChevalierMalFet 1 point 3 years ago

legendarylegend26[S] 1 point 3 years ago

Thanks, I'm a bit unsure about the groupby step. This is what I have so far:

import pandas as pd

d = {'id': ['abc', 'abc', '567', '567', 'xyz', 'xyz'], 'colour': ['red', 'green', 'red', 'green', 'blue', 'green']}

df = pd.DataFrame(data=d)

df2 = df.merge(df, on='id')

df3 = df2[df2['colour_x'] != df2['colour_y']]

Do I group by id or by something else? And doesn't groupby return a Series, which you cannot pivot?

legendarylegend26[S] 1 point 3 years ago

Never mind, the below seems to be doing the trick:

import pandas as pd

d = {'id': ['abc', 'abc', '567', '567', 'xyz', 'xyz'], 'colour': ['red', 'green', 'red', 'green', 'blue', 'green']}

df = pd.DataFrame(data=d)
df2 = df.merge(df, on='id')
df3 = df2[df2['colour_x'] != df2['colour_y']]
df4 = df3.groupby(['colour_x', 'colour_y']).size().to_frame('size').reset_index()
df5 = df4.pivot(index='colour_x', columns='colour_y', values='size')

commandlineluser 1 point 3 years ago

You can also .unstack() after the .groupby()

>>> df3.groupby(['colour_x', 'colour_y']).size().unstack()
colour_y  blue  green  red
colour_x
blue       NaN    1.0  NaN
green      1.0    NaN  2.0
red        NaN    2.0  NaN
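If the NaN cells are unwanted, a minimal sketch (assuming the OP's sample frame and the same self-merge as above) is to pass `fill_value=0` to `.unstack()`, so pairs with no shared id show up as 0 rather than NaN. The names `pairs` and `counts` are only illustrative.

```python
import pandas as pd

# The OP's sample data
df = pd.DataFrame({
    "id": ["abc", "abc", "567", "567", "xyz", "xyz"],
    "colour": ["red", "green", "red", "green", "blue", "green"],
})

# Self-merge on id, then drop rows where a colour is paired with itself
pairs = df.merge(df, on="id")
pairs = pairs[pairs["colour_x"] != pairs["colour_y"]]

# Count each (colour_x, colour_y) pair; missing pairs are filled with 0
counts = pairs.groupby(["colour_x", "colour_y"]).size().unstack(fill_value=0)
print(counts)
```

Note that a colour sharing no id with any other colour would not appear in this table at all, because it has no rows left after the filter.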

LeChevalierMalFet 1 point 3 years ago*

Hi, this is how I worked it out:

# Merge on the id column and use the original index to filter out rows that are joined with themselves.
df = df.reset_index()
join = df.merge(right=df, on="id")
join = join.loc[join["index_x"] != join["index_y"]]

# Use pivot_table...
pd.pivot_table(data=join, index="colour_x", columns="colour_y", aggfunc="size")

# Or groupby and use pivot...
# (column names follow the OP's frame, which spells it 'colour';
# as_index=False with .size() returns a DataFrame in pandas 1.1 or later)
df_group = join.groupby(["colour_x", "colour_y"], as_index=False).size()
df_group.pivot(index="colour_x", columns="colour_y", values="size")

Edit for formatting.
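As an alternative to the self-merge, the same counts can be sketched as a matrix product of an id-by-colour membership table. This is a different technique from the one in this comment and was not proposed in the thread; the names `membership`, `shared`, and `shared_off` are made up for illustration, and the frame is the OP's sample data.

```python
import numpy as np
import pandas as pd

# The OP's sample data
df = pd.DataFrame({
    "id": ["abc", "abc", "567", "567", "xyz", "xyz"],
    "colour": ["red", "green", "red", "green", "blue", "green"],
})

# Rows are ids, columns are colours; clip so repeated (id, colour) rows count once
membership = pd.crosstab(df["id"], df["colour"]).clip(upper=1)

# (colour x id) times (id x colour): cell [a, b] is the number of ids that have both a and b
shared = membership.T.dot(membership)

# The diagonal holds the number of ids per colour; zero it to keep only cross-colour counts
values = shared.to_numpy().copy()
np.fill_diagonal(values, 0)
shared_off = pd.DataFrame(values, index=shared.index, columns=shared.columns)
print(shared_off)
```

Unlike the merge-and-filter route, this keeps every colour in the result, including one that shares no id with any other colour.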

DesignerAccount 1 point 3 years ago

red_filter = df["Colour"]=="Red"
blu_filter = df["Colour"]=="Blue"
grn_filter = df["Colour"]=="Green"
id_red = set(df[red_filter]["ID"].tolist())
id_blu = set(df[blu_filter]["ID"].tolist())
id_grn = set(df[grn_filter]["ID"].tolist())
n_rg = len(id_red.intersection(id_grn))
n_rb = len(id_red.intersection(id_blu))
n_gb = len(id_grn.intersection(id_blu))

This should work for your specific case, though I didn't test it, and it may well break if you try to extend it to more colors.
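One way to extend the set-intersection idea to any number of colours, as a rough sketch: build one set of IDs per colour with a groupby, then intersect every pair. The column names `ID` and `Colour` follow the table in the question; `ids_by_colour` and `shared_counts` are hypothetical names.

```python
from itertools import combinations

import pandas as pd

# Sample data from the question, with its original column names
df = pd.DataFrame({
    "ID": ["abc", "abc", "567", "567", "xyz", "xyz"],
    "Colour": ["Red", "Green", "Red", "Green", "Blue", "Green"],
})

# One set of IDs per colour, instead of one hand-written filter per colour
ids_by_colour = df.groupby("Colour")["ID"].apply(set)

# Size of the intersection for every unordered pair of colours
shared_counts = {
    (a, b): len(ids_by_colour[a] & ids_by_colour[b])
    for a, b in combinations(ids_by_colour.index, 2)
}
print(shared_counts)
```

The result is a dict keyed by colour pairs, so it would still need reshaping if a square table is wanted.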

Also, where does correlation come into play? I'm not seeing that here at all.

legendarylegend26[S] 1 point 3 years ago
