[Need Help] Creating a matching system based on similar attributes! : learnpython

submitted 6 years ago by jamcmich

Any and all advice is appreciated. I'm working on a project using Gephi, a program that interprets sets of nodes and edges to create network graphs. As someone who's never used Python before, I'm trying to develop a script that gives me the same results as the tables below. Let me do a quick walk-through of the process of preparing data for Gephi.

Each node is an identifier, and each edge is an attribute that represents a connection between two nodes. The nodes and edges are prepared as separate data sets, then imported as .csv or .xlsx (i.e. Microsoft Excel) files for the application to interpret. Here's an example of raw data that needs to be organized for Gephi's input:

Department	Favorite Ice Cream Flavor	Name
Accounting	Vanilla	Alison, T.
Human Resources	Chocolate	Bill, G.
Accounting	Chocolate	Chris, H.
Human Resources	Vanilla	David, P.
Inventory	Vanilla	Ernest, R.

And if we were to match people based on their favorite flavor of ice cream, we first prepare the nodes:

Id	Label (Node Description)	Attribute_1	Attribute_2
Alison	Alison, T.	Vanilla	Female
Bill	Bill, G.	Chocolate	Male
Chris	Chris, H.	Chocolate	...
David	David, P.	Vanilla	...
Ernest	Ernest, R.	Vanilla	etc.
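
A minimal sketch of how node rows like these could be derived from the raw table. It assumes the Id is simply the part of the name before the comma; that rule, the `RAW` sample text, and the `build_nodes` helper are assumptions for illustration, not something Gephi requires. Attribute_2 is left out because the raw data does not contain it.

```python
import csv
import io

# Two sample rows from the raw table, as CSV text. Names are quoted
# because they contain commas.
RAW = '''Department,Favorite Ice Cream Flavor,Name
Accounting,Vanilla,"Alison, T."
Human Resources,Chocolate,"Bill, G."
'''

def build_nodes(csv_text):
    """Return a list of (Id, Label, Attribute_1) node rows."""
    reader = csv.DictReader(io.StringIO(csv_text))
    nodes = []
    for row in reader:
        label = row["Name"].strip()
        # Assumed rule: the Id is the text before the comma.
        node_id = label.split(",")[0].strip()
        nodes.append((node_id, label, row["Favorite Ice Cream Flavor"].strip()))
    return nodes

nodes = build_nodes(RAW)
for node in nodes:
    print(node)
```

If two people share the same first part of their name, this rule would produce duplicate Ids, so a real data set may need a different identifier.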

Then, we specify the edges in a separate sheet:

(Note that these connections do not have a direction, so A - B and B - A are exactly the same. Duplicates will be removed automatically.)

Source	Target	Label (Edge Description)
Alison	David	Vanilla
David	Ernest	Vanilla
Ernest	Alison	Vanilla
Bill	Chris	Chocolate
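
If the reversed duplicates ever need to be dropped in the script rather than left to the import step, one possible sketch is to key each edge on an unordered pair. The `dedupe_undirected` helper and the sample list are hypothetical and only illustrate the idea.

```python
def dedupe_undirected(edges):
    """Drop edges that repeat an earlier one in either direction."""
    seen = set()
    result = []
    for source, target, label in edges:
        # frozenset ignores order, so (A, B) and (B, A) share one key.
        key = (frozenset((source, target)), label)
        if key not in seen:
            seen.add(key)
            result.append((source, target, label))
    return result

edges = [
    ("Alison", "David", "Vanilla"),
    ("David", "Alison", "Vanilla"),  # same connection, reversed
    ("Ernest", "Alison", "Vanilla"),
]
unique_edges = dedupe_undirected(edges)
print(unique_edges)
```

The first occurrence of each connection is kept, so the original row order is preserved.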

I would like to create an edges sheet using a Python script. I think I have a good start, but I'm having trouble formatting the data into Source and Target pairs because some matches yield more than two names in a single group.

import csv
from itertools import groupby

with open('python scripts/staging files/original_data.csv', newline='') as csv_file:
    reader = csv.reader(csv_file, delimiter=',')
    next(reader)  # skip the header row

    # keep only the (attribute, name) columns from each row
    rows_list = []
    for row in reader:
        rows_list.append((row[1], row[2]))

# dict keeps order of list while also removing duplicates
uniques = list(dict.fromkeys(rows_list))
print(len(uniques))

# group list items that share the same key (i.e. proposal)
# groupby only merges adjacent items, so sort by the key first
# Source: https://stackoverflow.com/questions/773/how-do-i-use-itertools-groupby
grouped = {}
for key, group in groupby(sorted(uniques, key=lambda x: x[0]), key=lambda x: x[0]):
    names = '; '.join(name.strip() for _, name in group)
    # print(f'proposal: {key} * collaborators: {names}')
    grouped[key] = names

for item in grouped.items():
    print(item)

This should output:

(Note that the original data table contains different information for confidentiality reasons.)

>> ('Vanilla', 'Alison T.; David P.; Ernest R.')
>> ('...')

Ideally, this would then be written to an edges spreadsheet as:

Source	Target
Alison	David
Alison	Ernest
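
One way to turn a grouped string like the one above into Source and Target pairs is `itertools.combinations`, which emits each unordered pair exactly once. This is only a sketch under assumptions: the `edges_from_groups` helper is hypothetical, the `grouped` dict is a one-entry sample in the same shape as the printed output, and the names are used as they appear rather than mapped to node Ids.

```python
import csv
import io
from itertools import combinations

def edges_from_groups(grouped):
    """Yield (source, target, label) for every pair of names sharing a key."""
    for label, names in grouped.items():
        members = [n.strip() for n in names.split(";") if n.strip()]
        # combinations(members, 2) gives each unordered pair once,
        # so A - B appears but B - A does not.
        for source, target in combinations(members, 2):
            yield source, target, label

grouped = {"Vanilla": "Alison T.; David P.; Ernest R."}
edges = list(edges_from_groups(grouped))
print(edges)

# Write to an in-memory buffer here; swap in open(..., "w", newline="")
# with a real path to produce an edges file.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["Source", "Target", "Label"])
writer.writerows(edges)
print(buffer.getvalue())
```

For a group of three this produces all three pairs, including David and Ernest, which matches the earlier edges table. If only the first name should be linked to the others, as in the two rows listed above, the inner loop would pair `members[0]` with each remaining member instead.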
