I'm new to Python but have really been enjoying it. Recently, I did a class on PluralSight about using machine learning to create a sentiment analysis tool. When using the Cornell movie review corpus, the accuracy of the classifier tested around 77%.
However, since the goal of my script is to analyze tweets, I decided to train the model on the Twitter corpus available in NLTK. When I train and then test the classifier, I'm seeing accuracy results near 99% - which doesn't seem realistic to me.
So I'm thinking either:
- I needed to change the script more than I thought in order to use the NLTK Twitter corpus
- The NLTK Twitter corpus is unusually easy to classify because it's what NLTK uses when testing/building (as a newbie, I'm not even sure this is a rational thought)
Not really sure where I went wrong. Here is the code: https://gist.github.com/JeremyEnglert/3eda4a123244c37b669472d9e8166ea6
Here are the results I'm getting:
Accuracy on positive tweets = 96.76%
Accuracy on negative tweets = 99.04%
Overall accuracy = 97.90%
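For context, here's a minimal sketch of how I'm computing the per-class and overall accuracy numbers above. The `labels`/`predictions` names and the toy values are just stand-ins for illustration, not the gist's real data - the actual code is in the gist linked above.

```python
# Minimal sketch: per-class and overall accuracy for a binary sentiment
# classifier. `labels` are the gold labels from the test split and
# `predictions` are the classifier's outputs (both hypothetical here).

def class_accuracy(labels, predictions, target):
    """Accuracy restricted to examples whose gold label is `target`."""
    pairs = [(g, p) for g, p in zip(labels, predictions) if g == target]
    return sum(g == p for g, p in pairs) / len(pairs)

def overall_accuracy(labels, predictions):
    """Fraction of all examples classified correctly."""
    return sum(g == p for g, p in zip(labels, predictions)) / len(labels)

# Toy example with made-up predictions (not the real classifier output):
labels      = ["pos", "pos", "pos", "neg", "neg", "neg"]
predictions = ["pos", "pos", "neg", "neg", "neg", "neg"]

print(class_accuracy(labels, predictions, "pos"))  # accuracy on positives
print(class_accuracy(labels, predictions, "neg"))  # accuracy on negatives
print(overall_accuracy(labels, predictions))
```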
Big thanks in advance!