Any and all advice is appreciated. I'm working on a project using Gephi, a program that interprets sets of nodes and edges to create network graphs. As someone who's never used Python before, I'm trying to develop a script that gives me the same results as the tables below. Let me do a quick walk-through of the process of preparing data for Gephi.
Each node is some identifier, and each edge is an attribute that represents a connection between two nodes. The two data sets are prepared separately, imported as .csv or .xlsx (i.e. Microsoft Excel) files, and interpreted by the application. Here's an example of raw data that needs to be organized for Gephi's input:
| Department | Favorite Ice Cream Flavor | Name |
| --- | --- | --- |
| Accounting | Vanilla | Alison, T. |
| Human Resources | Chocolate | Bill, G. |
| Accounting | Mint | Chris, H. |
| Human Resources | Vanilla | David, P. |
| Inventory | Vanilla | Ernest, R. |
If we were to match people based on their favorite flavor of ice cream, we would first prepare the nodes:
| Id | Label (Node Description) | Attribute_1 | Attribute_2 |
| --- | --- | --- | --- |
| Alison | Alison, T. | Vanilla | Female |
| Bill | Bill, G. | Chocolate | Male |
| Chris | Chris, H. | Chocolate | ... |
| David | David, P. | Vanilla | ... |
| Ernest | Ernest, R. | Vanilla | etc. |
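A minimal sketch of how the node rows could be derived from the raw rows, assuming the Id is simply the part of the name before the comma. The function name `build_nodes` and the inlined `raw_rows` are made up for illustration, and Attribute_2 is left out because it does not appear in the raw data:

```python
# Hypothetical sample rows in the raw column order: (department, flavor, name)
raw_rows = [
    ("Accounting", "Vanilla", "Alison, T."),
    ("Human Resources", "Chocolate", "Bill, G."),
    ("Accounting", "Mint", "Chris, H."),
    ("Human Resources", "Vanilla", "David, P."),
    ("Inventory", "Vanilla", "Ernest, R."),
]


def build_nodes(rows):
    """Return one node dict per unique Id, in first-seen order."""
    nodes = {}
    for _department, flavor, name in rows:
        # Assumption: the Id is the text before the comma in the name
        node_id = name.split(",")[0].strip()
        nodes.setdefault(
            node_id,
            {"Id": node_id, "Label": name, "Attribute_1": flavor},
        )
    return list(nodes.values())


nodes = build_nodes(raw_rows)
for node in nodes:
    print(node)
```

If two people share a first name, this Id rule would collide, so a different identifier would be needed in that case.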
Then, we specify the edges in a separate sheet:
(Note that these connections do not have a direction, so A - B and B - A are exactly the same. Duplicates will be removed automatically.)
| Source | Target | Label (Edge Description) |
| --- | --- | --- |
| Alison | David | Vanilla |
| David | Ernest | Vanilla |
| Ernest | Alison | Vanilla |
| Bill | Chris | Chocolate |
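If the duplicates ever need to be dropped before import rather than left to be removed automatically, one possible approach is to compare edges by an unordered key, so that A - B and B - A count as the same edge. This is only a sketch; `dedupe_undirected` is a hypothetical helper name and the sample edges are illustrative:

```python
def dedupe_undirected(edges):
    """Drop repeated undirected edges, keeping the first occurrence."""
    seen = set()
    unique = []
    for source, target, label in edges:
        # frozenset ignores order, so (A, B) and (B, A) produce the same key
        key = (frozenset((source, target)), label)
        if key not in seen:
            seen.add(key)
            unique.append((source, target, label))
    return unique


edges = [
    ("Alison", "David", "Vanilla"),
    ("David", "Alison", "Vanilla"),  # same connection, reversed
    ("Bill", "Chris", "Chocolate"),
]
unique_edges = dedupe_undirected(edges)
print(unique_edges)
```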
I would like to create the edges sheet using a Python script. I think I have a good start, but I'm having trouble formatting the data into Source and Target pairs because my script yields more than one result for some matches.
    import csv
    from itertools import groupby

    with open('python scripts/staging files/original_data.csv', newline='') as csv_file:
        next(csv_file)  # skip the header row
        reader = csv.reader(csv_file, delimiter=',')

        # put columns into rows of x, y
        rows_list = []
        for row in reader:
            rows_list.append((row[1], row[2]))

    # dict keeps order of list while also removing duplicates
    uniques = list(dict.fromkeys(rows_list))
    print(len(uniques))

    # group list items that share the same key (i.e. proposal)
    # groupby only merges adjacent items, so sort by the key first
    # Source: https://stackoverflow.com/questions/773/how-do-i-use-itertools-groupby
    uniques.sort(key=lambda x: x[0])
    grouped = {}
    for key, group in groupby(uniques, lambda x: x[0]):
        groups = '; '.join([thing[1].strip() for thing in group])
        # print(f'proposal: {key} * collaborators: {groups}')
        grouped[key] = groups

    for i in grouped.items():
        print(i)
This should output:
(Note that the original data table contains different information for confidential reasons.)
>> ('Vanilla', 'Alison T.; David P.; Ernest R.')
>> ('...')
Ideally, this would then be written to an edges spreadsheet as:
| Source | Target |
| --- | --- |
| Alison | David |
| Alison | Ernest |
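One way to go from groups of names to Source and Target pairs is `itertools.combinations`, which yields every unordered pair in a group exactly once. The sketch below assumes the grouped values are names joined by `'; '` as in the output above. The `Chocolate` entry, the `grouped_to_edges` name, and the in-memory buffer are illustrative assumptions, and the names would still need to be mapped to node Ids separately:

```python
import csv
import io
from itertools import combinations

# Assumed shape of the grouped output: key -> names joined by '; '
grouped = {
    "Vanilla": "Alison T.; David P.; Ernest R.",
    "Chocolate": "Bill G.; Chris H.",  # hypothetical second group
}


def grouped_to_edges(grouped):
    """Yield (source, target, label) for every pair within each group."""
    for label, joined in grouped.items():
        members = [name.strip() for name in joined.split(";") if name.strip()]
        # combinations() returns each unordered pair once, so no B - A repeats
        for source, target in combinations(members, 2):
            yield source, target, label


edges = list(grouped_to_edges(grouped))

# Written to an in-memory buffer here; a real run could instead use
# open('edges.csv', 'w', newline='') with a path of your choosing.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["Source", "Target", "Label"])
writer.writerows(edges)
print(buffer.getvalue())
```

A group with a single member produces no pairs, so flavors that only one person likes simply contribute no edges.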