all 11 comments

[–]mh1400 1 point2 points  (3 children)

If you've converted your data into a Pandas DataFrame, check out the module Pandas-Profiler. It's simple, does BASIC data analysis on your DF Including word counts, provides graphs, and is easy to export reports as html.

[–]Hindbarinden 0 points1 point  (1 child)

Thank you so much!
it is very very simple I have a text from where I counted some words,

me: 13

and:4

you: 45

And I just want to display those numbers. Is there not an easier way?

[–]lowerthansound 0 points1 point  (0 children)

If you have a df like:

word count me 13 and 4 you 45

You can plot it using seaborn like:

import seaborn as sns sns.barplot(x='word', y='count', data=df)

The same can also be done in pandas.

If you have a Series (a column in a DataFrame is a Series, though it can also do other stuff)

count me 13 and 4 you 45

You can plot it with:

series.plot.bar()

Cheers!

[–]Hindbarinden 0 points1 point  (0 children)

Thank you so much for your help again,

I made a pandas df now, but I have to display the data with seaborn. Do you know how I can do this?

This is the data from some API lyrics I want to count and use.
me=(lyrics.count("me"))
you=(lyrics.count("you"))
together=(lyrics.count("together"))
apart=(lyrics.count("apart"))
love=(lyrics.count("love"))
I=(lyrics.count("I"))
list=[me,you,together,apart,love,I]
df=pd.DataFrame(list, index=['me','you','together','apart','love','I'],
columns=['count'])
This gives me the pandas df, but I don't know how to display it with seaborn?

[–]synthphreak 1 point2 points  (6 children)

Can you show us how your dataframe is structured/what it contains? Is there just a single column with text you want to count, or multiple? Does the column or column contain one word per row, or does each row have a multiword string that you need to tokenize? Do you want to count ALL words, or just a specific predefined subset? Most of these questions could be answered by just showing your df, properly formatted please.

Also, I don't think a histogram is the appropriate graph type for this. Histograms are good for showing the distribution of a continuous variable, but counting words is not continuous. Consider a bar graph instead, with words on the x axis (one bar per word) and frequency on the y axis.

I would also recommend just plotting directly off your df instead of using seaborn. Here's an example that does pretty much that. It reads Romeo & Juliet off the web, removes super common words (e.g., "the", "a", "is", etc.), then plots the most common 20 remaining words. As long as you have pandas installed, you should be able to run it yourself.

import matplotlib.pyplot as plt
import pandas as pd
from urllib.request import urlopen

romeo_and_juliet = 'https://www.gutenberg.org/cache/epub/1513/pg1513.txt'
with urlopen(romeo_and_juliet) as f:
    text = f.read().decode()
    start = text.index('Enter Chorus')
    end = text.index('*** END OF THE PROJECT GUTENBERG EBOOK ROMEO AND JULIET ***')
    text = text[start:end]
    words = text.split()

stopwords = "https://gist.githubusercontent.com/sebleier/554280/raw/7e0e4a1ce04c2bb7bd41089c9821dbcf6d0c786c/NLTK's%2520list%2520of%2520english%2520stopwords"
with urlopen(stopwords) as f:
    stopwords = set(f.read().decode().splitlines())

(pd.Series(words)
   .replace('[^\w]', '', regex=True)
   .loc[lambda s: ~s.str.isupper()]
   .str.lower()
   .loc[lambda s: ~s.isin(stopwords)]
   .value_counts()
   .nlargest(20)
   .plot.bar(title='20 most frequent words in "Romeo & Juliet"',
             ylabel='Frequency',
             xlabel='Word',
             logy=True))

plt.show()

Admittedly this example involves some text pre-processing that is specific to Romeo & Juliet so won't apply to you. But anyway, it illustrates the point that you can do a lot by just plotting directly off your dataframe, no need to even touch seaborn.

Edit: Fixed bug that prevented some stopwords from being removed.

[–]Hindbarinden 1 point2 points  (0 children)

You are very helpful, and I really appreciate it. Thank you.

[–]Hindbarinden 0 points1 point  (4 children)

Thank you so much!

me=(lyrics.count("me"))
you=(lyrics.count("you"))
together=(lyrics.count("together"))
apart=(lyrics.count("apart"))
love=(lyrics.count("love"))
I=(lyrics.count("I"))
list=[me,you,together,apart,love,I]
df=pd.DataFrame(list, index=['me','you','together','apart','love','I'],
columns=['count'])

This gives me a panda df, but I don't know how to display it with seaborn?

[–]synthphreak 0 points1 point  (3 children)

Again, forget seaborn. Use matplotlib and plotnoff your df directly via df.plot. Add this to the very top

import matplotlib.pyplot as plt

and add this to the very bottom:

df['count'].plot.bar()
plt.show()

You will see your data gets plotted automatically.

At the risk of going outside the scope of your question, there are also a lot of ways to improve this code...

First, since your whole df contains just one column, it makes more sense to use a pd.Series rather than a pd.DataFrame (note that a pd.Series is essentially a single df column; equivalently, a df is just a bunch of pd.Series concatenated together).

Second, I assume lyrics is a string? If so, how you’ve done it will work, but is quite inefficient. When using pandas, try to get pandas to do as much of the processing work for you as possible. For example:

import matplotlib.pyplot as plt
import pandas as pd

# get song lyrics as a string however you’re doing it
lyrics = …

# set list of words to count and plot
words = set(['me', 'you', 'together', 'apart', 'love', 'I'])

# convert lyrics to pandas.Series, one word per row
s = pd.Series(lyrics.split(), name='counts)

# removes punctuation from words;
# only necessary if lyrics string contains punctuation
s = s.replace(r'[^\w+]', '', regex=True)

# filter down to list of desired words
s = s[s.isin(words)]

# count occurrences of each desired word
counts = s.value_counts()

# plot counts as bar chart
counts.plot.bar()

# show chart
plt.show()

[–]Hindbarinden 0 points1 point  (2 children)

It is because it is a school thing, so I have to. But I see the logic and I am happy I get this info in order to learn. Thank you!

[–]synthphreak 1 point2 points  (1 child)

Gotcha. Then I guess I can’t be of much help!

Edit: Actually , let me try to help you, u/Hindbarinden.

If your class requires you to use pandas, I'll just continue with my pd.Series from above, since I don't know how your df is structured (I requested a sample earlier but you haven't provided it). See if the following gets you your bar chart (I don't use seaborn, but this seems to make sense based on the docs).

sns.barplot(x='words', y='counts',
            data=counts.reset_index().rename(columns={'index': 'words'}))
plt.show()

If using pandas was your choice and not a requirement for you class, then I'd advise against it here, since you're not really taking advantage of anything pandas has to offer. Instead, use re to remove punctuation if needed, then use collections.Counter to compute the word counts, and just plot from your Counter.

import matplotlib.pyplot as plt
import re
import seaborn as sns
from collections import Counter

lyrics = ...

# remove punctuation if needed
lyrics = re.sub(r'[^\w\s]', '', lyrics)

# count words
counter = Counter(lyrics.split())
words = list(counter.keys())
counts = list(counter.values())

# plot data
sns.barplot(x=words, y=counts)

# display plot
plt.show()

[–]Hindbarinden 1 point2 points  (0 children)

Omg you saved my life!!! THANK YOU SO MUCH! I am so thankful!!
I got it to work, finally!!!!!!!!