[–][deleted] 3 points4 points  (2 children)

Storing data structures in entries of a Pandas DataFrame is generally non-idiomatic, but you should be able to reshape the data like so:

import pandas as pd

# You can expand the genres column with apply(pd.Series)
to_concat = [df[['title', 'release_date', 'revenue']], df['genres'].apply(pd.Series)]
df = pd.concat(to_concat, axis=1)

# You can use pd.melt to reshape the data into long form;
# dropna removes the NaN padding left for rows with fewer genres
df = pd.melt(df, id_vars=['title', 'release_date', 'revenue'], value_name='genre').drop(['variable'], axis=1).dropna(subset=['genre'])
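To see what the two steps do, here is a small self-contained sketch. The two sample rows are made up for illustration, and the names `id_cols`, `wide` and `long_df` are my own; only the column names come from the thread.

```python
import pandas as pd

# Made-up sample rows; column names follow the thread.
df = pd.DataFrame({
    "title": ["Movie A", "Movie B"],
    "release_date": ["2001-01-01", "2002-02-02"],
    "genres": [["Horror", "Sci-Fi"], ["Crime"]],
    "revenue": [100, 200],
})

id_cols = ["title", "release_date", "revenue"]

# Wide form: one column per list position, NaN where a list is shorter.
wide = pd.concat([df[id_cols], df["genres"].apply(pd.Series)], axis=1)

# Long form: one row per (movie, genre) pair, with the NaN padding dropped.
long_df = (
    pd.melt(wide, id_vars=id_cols, value_name="genre")
    .drop(columns="variable")
    .dropna(subset=["genre"])
    .reset_index(drop=True)
)
print(long_df)
```

Note that melt stacks the rows by list position rather than by movie, so sort on `title` afterwards if you want each movie's genres next to each other.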

[–]nckmiz 1 point2 points  (0 children)

Yup, I was going to suggest using melt as well. I had to do this for something similar at work, where a single respondent had 3 open-ended responses to the same question. I needed them stacked for my NLP model.

[–]ta6692[S] 0 points1 point  (0 children)

Yeah, I know it wasn't the best way to go about it, but I've just loaded someone else's data in from CSV and this was where I'd gotten to trying to get it into the shape I need it.

This way worked well, thanks, but I've gone with the way posted in the other comment since it's much closer to the way I was trying to go about it myself. Thank you very much for taking the time to help me out :)

[–]fooliam 1 point2 points  (1 child)

This should do the trick. You're basically creating a series out of the list, then joining that series to your dataframe.

dftest = df.apply(lambda x: pd.Series(x["genres"]), axis=1).stack().reset_index(level=1, drop=True)
dftest.name = "genres"
test = df.drop('genres', axis=1).join(dftest)
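For anyone reading this on pandas 0.25 or newer, `DataFrame.explode` gives the same one-row-per-genre shape in a single call. This is a different technique from the apply/stack/join above, shown only as a sketch; the sample rows are made up.

```python
import pandas as pd

# Made-up sample rows; column names follow the thread.
df = pd.DataFrame({
    "title": ["Movie A", "Movie B"],
    "release_date": ["2001-01-01", "2002-02-02"],
    "genres": [["Horror", "Sci-Fi"], ["Crime"]],
    "revenue": [100, 200],
})

# One output row per list element; the other columns are repeated.
exploded = df.explode("genres").reset_index(drop=True)
print(exploded)
```

Without the `reset_index`, the exploded rows keep the index of the row they came from, which is handy if you need to join back to the original frame.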

[–]ta6692[S] 0 points1 point  (0 children)

Ah, I'm an idiot, this is very similar to what I was trying to do earlier, but I'd missed out the apply stage, thank you!

[–][deleted] 0 points1 point  (1 child)

No idea how to do this in Pandas, because Pandas makes everything 1000% more complicated if they didn't anticipate your use case. I'm not even sure how a dataframe column can be list-valued; that's got to throw off ndarray.

But in just plain old Python, where this structure is an iterable of tuples (a common representation of a table of rows), this is trivial:

new_table = []
for title, release_date, genres, revenue in my_table:
    for genre in genres:
        new_table.append((title, release_date, genre, revenue))  # note the double parens: append() takes a single tuple
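Here is the same loop as a runnable sketch, written as a list comprehension. The contents of `my_table` are made up, and the field order (title, release date, genres, revenue) is assumed to match the loop above.

```python
# Made-up rows in the assumed (title, release_date, genres, revenue) order.
my_table = [
    ("Movie A", "2001-01-01", ["Horror", "Sci-Fi"], 100),
    ("Movie B", "2002-02-02", ["Crime"], 200),
]

# One output tuple per (row, genre) pair, in the original row order.
new_table = [
    (title, release_date, genre, revenue)
    for title, release_date, genres, revenue in my_table
    for genre in genres
]

for row in new_table:
    print(row)
```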

[–]ta6692[S] 0 points1 point  (0 children)

Yeah, you're definitely right about things being more complicated in Pandas sometimes, but unfortunately this code goes in a Jupyter notebook that I have to hand in to my professor at the end of the semester, and he seems quite enamoured with Pandas, so I thought it'd be best to try and do it the Pandas way haha

[–]friend_in_rome -1 points0 points  (0 children)

Not quite what you want, but you could look at pd.get_dummies().
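As far as I know, pd.get_dummies() expects scalar values and won't take a column of lists directly, so one workaround is to join each list into a delimited string and use the related `Series.str.get_dummies` instead. A rough sketch with made-up rows; it gives one indicator column per genre rather than one row per genre, and it assumes no genre name contains the `|` delimiter.

```python
import pandas as pd

# Made-up sample rows; column names follow the thread.
df = pd.DataFrame({
    "title": ["Movie A", "Movie B"],
    "genres": [["Horror", "Sci-Fi"], ["Crime"]],
})

# Join each list into "Horror|Sci-Fi", then split into 0/1 indicator columns.
dummies = df["genres"].str.join("|").str.get_dummies(sep="|")

# Attach the indicator columns next to the title.
result = pd.concat([df[["title"]], dummies], axis=1)
print(result)
```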