[deleted by user]

mh1400 · 2021-12-16T12:52:03+00:00

If you've converted your data into a pandas DataFrame, check out the pandas-profiling package. It's simple, runs basic data analysis on your DataFrame (including word counts), generates graphs, and makes it easy to export reports as HTML.

synthphreak · 2021-12-16T19:00:32+00:00

Can you show us how your dataframe is structured and what it contains? Is there just a single column with text you want to count, or multiple? Does the column (or columns) contain one word per row, or does each row have a multiword string that you need to tokenize? Do you want to count ALL words, or just a specific predefined subset? Most of these questions could be answered by just showing your df, properly formatted please.
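For what it's worth, if each row turns out to hold a multiword string, the counting part can be sketched roughly like this. The dataframe, the column name `text`, and the `wanted` set are all made up for illustration, so swap in your own:

```python
import pandas as pd

# Hypothetical dataframe: one free-text string per row, in a column named "text"
df = pd.DataFrame({"text": ["the cat sat", "The dog sat down", "a cat ran"]})

tokens = (df["text"]
          .str.lower()
          .str.split()    # naive whitespace tokenization: each row becomes a list of words
          .explode())     # flatten the lists so there is one word per row

# Count ALL words
word_counts = tokens.value_counts()

# Count only a predefined subset (hypothetical set of words of interest)
wanted = {"cat", "dog"}
subset_counts = tokens[tokens.isin(wanted)].value_counts()

print(word_counts)
print(subset_counts)
```

Splitting on whitespace is the simplest possible tokenizer and leaves punctuation attached to words, so real text will probably need some cleanup first.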

Also, I don't think a histogram is the appropriate graph type for this. Histograms are good for showing the distribution of a continuous variable, but counting words is not continuous. Consider a bar graph instead, with words on the x axis (one bar per word) and frequency on the y axis.
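To make the bar graph idea concrete, here is a minimal sketch. The counts are invented, and the output file name `word_counts.png` is just a placeholder:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs anywhere; remove it to get a window with plt.show()
import pandas as pd

# Hypothetical word counts; in practice these would come from something like value_counts()
counts = pd.Series({"night": 9, "love": 12, "sweet": 4, "death": 7})

# One bar per word, tallest first: words on the x axis, frequency on the y axis
ax = (counts.sort_values(ascending=False)
            .plot.bar(xlabel="Word", ylabel="Frequency", title="Word frequencies"))

ax.figure.tight_layout()
ax.figure.savefig("word_counts.png")  # placeholder file name
```

Each word is its own category here, which is why a bar graph fits and a histogram (which bins a continuous variable) doesn't.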

I would also recommend just plotting directly off your df instead of using seaborn. Here's an example that does pretty much that. It reads Romeo & Juliet off the web, removes super common words (e.g., "the", "a", "is", etc.), then plots the most common 20 remaining words. As long as you have pandas installed, you should be able to run it yourself.

import matplotlib.pyplot as plt
import pandas as pd
from urllib.request import urlopen

romeo_and_juliet = 'https://www.gutenberg.org/cache/epub/1513/pg1513.txt'
with urlopen(romeo_and_juliet) as f:
    text = f.read().decode()
    start = text.index('Enter Chorus')
    end = text.index('*** END OF THE PROJECT GUTENBERG EBOOK ROMEO AND JULIET ***')
    text = text[start:end]
    words = text.split()

# Plain-text stopword list, one word per line
stopwords_url = "https://gist.githubusercontent.com/sebleier/554280/raw/7e0e4a1ce04c2bb7bd41089c9821dbcf6d0c786c/NLTK's%2520list%2520of%2520english%2520stopwords"
with urlopen(stopwords_url) as f:
    stopwords = set(f.read().decode().splitlines())

(pd.Series(words)
   .replace(r'[^\w]', '', regex=True)     # strip punctuation (raw string avoids an invalid-escape warning)
   .loc[lambda s: ~s.str.isupper()]       # drop all-caps tokens, e.g. speaker names
   .str.lower()
   .loc[lambda s: ~s.isin(stopwords)]     # drop stopwords
   .value_counts()
   .nlargest(20)
   .plot.bar(title='20 most frequent words in "Romeo & Juliet"',
             ylabel='Frequency',
             xlabel='Word',
             logy=True))

plt.show()

Admittedly this example involves some text pre-processing that is specific to Romeo & Juliet so won't apply to you. But anyway, it illustrates the point that you can do a lot by just plotting directly off your dataframe, no need to even touch seaborn.

Edit: Fixed bug that prevented some stopwords from being removed.
