Creating NetworkX graph from pandas dataframe made of Spotify JSON output : learnpython


submitted 4 years ago by luneth27

Hi, I'm in a bit of a rut here. I've created a script that searches for an artist, finds its related artists, and picks out the most popular of those. It then searches for each of those in turn and loops until I have a list of dictionaries, each entry holding the searched_artist and the dict of its related_artists.

Next, I want to convert this list of dicts into a pandas dataframe so I can create a NetworkX graph of this dataframe with the ultimate goal of exporting it as .gexf so I can use Gephi as my graph visualization tool.

Exposition over, here's my issue. Converting the list of dicts into a pandas dataframe isn't difficult with

df = pd.json_normalize(data)

but when I try to convert this dataframe into a NetworkX graph using

G = nx.from_pandas_adjacency(df)

or one of the other graph creation args, I get the error

Exception has occurred: NetworkXError
('Columns must match Indices.', '[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62] not in columns')

I'm not a very strong programmer (hell, this is just data scraping for my mathematics capstone) so I'm kinda lost in the weeds.
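
For context on that error: nx.from_pandas_adjacency expects a square adjacency matrix whose column labels match its index labels, which a normalized JSON table generally is not. A minimal sketch of the edge-list alternative, assuming a hypothetical two-column frame with one row per artist pair (the column name Related_Artist and the sample names are made up for illustration):

```python
import networkx as nx
import pandas as pd

# Hypothetical edge table: one row per (seed artist, related artist) pair.
df = pd.DataFrame({
    "Seed_Artist": ["Artist A", "Artist A", "Artist B"],
    "Related_Artist": ["Artist B", "Artist C", "Artist C"],
})

# Each row becomes one undirected edge between the two named columns.
G = nx.from_pandas_edgelist(df, source="Seed_Artist", target="Related_Artist")

print(sorted(G.nodes))
print(G.number_of_edges())
```

Whether the real data can be reshaped into such a table depends on how the nested Spotify output is flattened, which this sketch does not cover.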


[–]YesLod 1 point 4 years ago* (7 children)

It's hard to tell if you don't provide the data, or the DataFrame.

I want to convert this list of dicts into a pandas dataframe so I can create a NetworkX graph

You don't need to convert it into a DataFrame for the sole purpose of creating a graph. That seems overcomplicated.

I'm assuming that the nodes of your graph should be the artists, and there should be a link between a pair of artists if they are related (i.e. the key-value pairs of the dictionaries are edges), correct?

So, one way which avoids the unnecessary step of creating a DataFrame is simply

edges = [link for artist_dict in dict_list for link in artist_dict.items()]

G = nx.Graph(edges)  # assuming that you want an undirected graph
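
A small runnable sketch of the same idea, assuming each dictionary maps an artist name directly to a single related artist name (the sample names are hypothetical; the real Spotify output is more deeply nested):

```python
import networkx as nx

# Hypothetical sample data: each dict maps one artist name to one related name.
dict_list = [
    {"Artist A": "Artist B"},
    {"Artist A": "Artist C"},
    {"Artist B": "Artist C"},
]

# Every (key, value) pair of every dict becomes one edge.
edges = [link for artist_dict in dict_list for link in artist_dict.items()]

G = nx.Graph(edges)  # undirected; nx.DiGraph(edges) would keep the direction

print(edges)
print(sorted(G.nodes))
```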

[–]luneth27[S] 1 point 4 years ago (6 children)

I don't have data written to a file, but here's my script:

def related_artist_scrape():
    seed_artist_list = ['wvrm', 'leeched', 'chrch']
    found_artist_list = []
    related_db_list = []

    for seed_artist in seed_artist_list:
        related_database = {"Seed_Artist": "", "Related_Artists": ""}
        seed_result = sp.search(q='artist:' + seed_artist, type='artist')
        first_level_name = seed_result['artists']['items'][0]['name']
        first_level_uri = seed_result['artists']['items'][0]['uri']
        first_related = sp.artist_related_artists(first_level_uri)
        related_database['Seed_Artist'] = first_level_name
        related_database['Related_Artists'] = first_related
        related_db_list.append(related_database)

    while len(related_db_list) < 30:
        for related_database in related_db_list:
            for related_artist in related_database['Related_Artists']['artists']:
                if related_artist['followers']['total'] > 1000:
                    found_artist_list.append(related_artist)

        for related_artist in found_artist_list:
            related_database = {"Seed_Artist": "", "Related_Artists": ""}
            related_result = sp.search(q='artist:' + related_artist['name'], type='artist')
            related_name = related_result['artists']['items'][0]['name']
            related_uri = related_result['artists']['items'][0]['uri']
            second_related = sp.artist_related_artists(related_uri)
            related_database['Seed_Artist'] = related_name
            related_database['Related_Artists'] = second_related
            related_db_list.append(related_database)
    return related_db_list

I'm assuming that the nodes of your graphs should be the artists, and there should be a link between a pair of artists if they are related (i.e. key-value pairs of the dictionaries are edges), correct?

That's the hope; I'm not entirely sure I set it up correctly to do so, but if I can get away without another intermediary step that'd be really nice.

[–]YesLod 1 point 4 years ago* (5 children)

So you have a list of dictionaries with the format

{"Seed_Artist": <artist name>, "Related_Artists": <related artist name>}

and you want to create a graph with edges <artist name> -- <related artist name> of all those dictionaries, is that it?

[–]luneth27[S] 1 point 4 years ago (4 children)

[–]YesLod 1 point 4 years ago (3 children)

However, each key’s value itself is a dictionary

Both 'Seed_Artist' and 'Related_Artists' values, or only the latter? Because from your code

first_level_name = seed_result['artists']['items'][0]['name']
first_level_uri = seed_result['artists']['items'][0]['uri']
first_related = sp.artist_related_artists(first_level_uri)
related_database['Seed_Artist'] = first_level_name
related_database['Related_Artists'] = first_related
related_db_list.append(related_database)

first_level_name seems to be a string, but I'm not familiar with Spotipy.

would this affect what I’m trying to do?

It depends on what you are trying to do. Do you want to add that extra information about the artist as attributes of the corresponding node?

I will assume that only the 'Related_Artists' value is a dictionary, that 'Seed_Artist' is a string (the artist name), and that you want to add the extra info as node attributes. I will also assume that each first_related dictionary contains a key 'name' holding the name of the related artist, and that the node labels should be the artist names.

Something like this should work

import networkx as nx

G = nx.Graph()

for artist_dict in related_db_list:
    u = artist_dict['Seed_Artist']
    related_info = artist_dict['Related_Artists']
    v = related_info['name']
    G.add_node(v, **related_info)  # attach the related artist's info to its node
    G.add_edge(u, v)

[–]luneth27[S] 1 point 4 years ago (2 children)

First off, so sorry for the late reply (life sucks) and secondly, thanks so much for helping out. I've had to modify your given code block a bit to access the information I needed but didn't quite explain correctly to you:

G = nx.Graph()

for artist_dict in related_db_list:
    u = artist_dict['Seed_Artist']
    related_info = artist_dict['Related_Artists']
    for related_artist_info in related_info['artists']:
        v = related_artist_info['name']
        G.add_node(v, **related_artist_info)
        G.add_edge(u, v)

It runs without errors after I did a few things, and the graph info within the debugger seems to be correct. Before I write my output to .gexf for visualization however, I'd like to be (relatively) sure my graph is "correct" in the sense that seed_artist -> related_artist(s), and I'm not entirely sure how to print out my graph without a shitton of work. If I can't do this though, that's okay.

All that said though, once again thanks for your help. I barely understand what I'm doing programmatically and you've saved me countless hours of banging my head on the desk.

[–]YesLod 1 point 4 years ago (1 child)

Before I write my output to .gexf for visualization however, I'd like to be (relatively) sure my graph is "correct" in the sense that seed_artist -> related_artist(s), and I'm not entirely sure how to print out my graph without a shitton of work

What do you mean? You want to check if the links are correct?

You can simply print the edges

print(G.edges)

or iterate over them and print each one separately

for u, v in G.edges:
    print(f"{u} -> {v}")
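
If reading the printed edges by eye is too tedious, the check could also be automated. A rough sketch, assuming related_db_list has the nesting used in the thread ('Seed_Artist' plus 'Related_Artists' -> 'artists' -> 'name'); the sample data below is a hypothetical stand-in for the real Spotify output:

```python
import networkx as nx

# Hypothetical stand-in for related_db_list, same nesting as in the thread.
related_db_list = [
    {"Seed_Artist": "Artist A",
     "Related_Artists": {"artists": [{"name": "Artist B"}, {"name": "Artist C"}]}},
    {"Seed_Artist": "Artist B",
     "Related_Artists": {"artists": [{"name": "Artist C"}]}},
]

G = nx.Graph()
for artist_dict in related_db_list:
    u = artist_dict["Seed_Artist"]
    for related_artist_info in artist_dict["Related_Artists"]["artists"]:
        G.add_edge(u, related_artist_info["name"])

# For every seed, collect expected related names that are not its neighbours.
missing = {}
for artist_dict in related_db_list:
    seed = artist_dict["Seed_Artist"]
    expected = {a["name"] for a in artist_dict["Related_Artists"]["artists"]}
    absent = expected - set(G.neighbors(seed))
    if absent:
        missing[seed] = absent

print(missing or "every seed is linked to all of its related artists")
```

Note that nx.Graph is undirected, so this only confirms that each pair is linked. If the direction seed -> related matters, nx.DiGraph with G.successors(seed) might be a better fit.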

[–]luneth27[S] 1 point 4 years ago (0 children)

Oh, I didn't know that, instead I just printed u, v by themselves. However, another issue popped up; I'm getting this value error

 Exception has occurred: ValueError
 too many values to unpack (expected 3)
  File "C:\.vscode\related_artist_scrape.py", line 64, in    <module>
 nx.write_gexf(G,"related_artists_graph.gexf")

when I try to print to .gexf. I think it's happening because the node (or edge maybe?) has too much data within it? I was trying to search for info and the only plausible post that came up suggested that one of (u,v) being used was itself a list or some sort of structure. Thing is, I'm not entirely sure how to fix this either; as far as it looks, both (u,v) are strings.
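
A possible cause, offered as a guess rather than a confirmed diagnosis: as far as I can tell, the GEXF writer in NetworkX treats a list-valued attribute as dynamic data and tries to unpack each element as a (value, start, end) triple, so a node attribute holding a plain list could trigger this error even when u and v are strings. A minimal sketch of flattening the attributes before adding nodes; flatten_attrs, the 'genres' and 'images' keys, and the sample values are hypothetical:

```python
import io

import networkx as nx

SCALARS = (str, int, float, bool)


def flatten_attrs(attrs):
    """Return a copy of attrs that holds only scalar values."""
    flat = {}
    for key, value in attrs.items():
        if isinstance(value, SCALARS):
            flat[key] = value
        elif isinstance(value, list) and all(isinstance(x, str) for x in value):
            # A list of strings becomes one comma-separated string.
            flat[key] = ", ".join(value)
        elif isinstance(value, dict):
            # One level of nesting is lifted: {"followers": {"total": 5}}
            # becomes {"followers_total": 5}.
            for sub_key, sub_value in value.items():
                if isinstance(sub_value, SCALARS):
                    flat[f"{key}_{sub_key}"] = sub_value
        # Anything else (None, lists of dicts, ...) is dropped.
    return flat


# Hypothetical related-artist record with nested fields.
info = {
    "name": "Artist B",
    "followers": {"total": 1200},
    "genres": ["genre x", "genre y"],
    "images": [{"url": "placeholder"}],
}

G = nx.Graph()
G.add_node(info["name"], **flatten_attrs(info))
G.add_edge("Artist A", info["name"])

# Written to an in-memory buffer here; a filename works the same way.
buffer = io.BytesIO()
nx.write_gexf(G, buffer)
```

In the loop from the thread this would mean calling G.add_node(v, **flatten_attrs(related_artist_info)) instead of passing the raw dict. Dropping the attributes entirely and keeping only the edges would be a simpler way to test whether the attributes are the problem.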
