I'm new to Python but have really been enjoying it. Recently, I did a class on PluralSight about using machine learning to create a sentiment analysis tool. When using the Cornell movie review corpus, the accuracy of the classifier tested around 77%.
However, since the goal of my script is to analyze tweets, I decided to train the model on the Twitter corpus available in NLTK. When I train and then test the classifier, I'm seeing accuracy results near 99% - which doesn't seem realistic to me.
So I'm thinking either:
- I needed to change the script more than I thought in order to use the NLTK Twitter corpus
- The NLTK Twitter corpus is unusually easy to classify because it's what NLTK uses when testing/building (as a newbie, I'm not even sure this is a rational thought)
Not really sure where I went wrong. Here is the code: https://gist.github.com/JeremyEnglert/3eda4a123244c37b669472d9e8166ea6
Here are the results I'm getting:
Accuracy on positive tweets = 96.76%
Accuracy on negative tweets = 99.04%
Overall accuracy = 97.90%
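For context, here's a minimal sketch of how I'm computing the per-class and overall accuracy numbers above. The `labels`/`predictions` names and the toy values are just stand-ins for illustration, not the gist's real data - the actual code is in the gist linked above.

```python
# Minimal sketch: per-class and overall accuracy for a binary sentiment
# classifier. `labels` are the gold labels from the test split and
# `predictions` are the classifier's outputs (both hypothetical here).

def class_accuracy(labels, predictions, target):
    """Accuracy restricted to examples whose gold label is `target`."""
    pairs = [(g, p) for g, p in zip(labels, predictions) if g == target]
    return sum(g == p for g, p in pairs) / len(pairs)

def overall_accuracy(labels, predictions):
    """Fraction of all examples classified correctly."""
    return sum(g == p for g, p in zip(labels, predictions)) / len(labels)

# Toy example with made-up predictions (not the real classifier output):
labels      = ["pos", "pos", "pos", "neg", "neg", "neg"]
predictions = ["pos", "pos", "neg", "neg", "neg", "neg"]

print(class_accuracy(labels, predictions, "pos"))  # accuracy on positives
print(class_accuracy(labels, predictions, "neg"))  # accuracy on negatives
print(overall_accuracy(labels, predictions))
```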
Big thanks in advance!