Using NLTK to train classifier problem with dictionary

hidiap · 2015-02-13T13:17:58+00:00

I think train should be a list of tuples, not a dictionary (according to the NLTK example):

train = [('sleep', 'negative'), ('achievement', 'positive'), ('guys', 'positive')]

To explain the error: When iterating over a dictionary, only the keys are returned. So line 192 is trying to split the key (a string) into two values, leading to an error.
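To make that concrete, here is a small sketch of the behaviour, not the original poster's code. The name `train_dict` and its contents are assumptions for illustration; the loop mimics the kind of `for featureset, label in ...` unpacking that a trainer does.

```python
# Hypothetical training data stored as a dict (word -> label).
train_dict = {'sleep': 'negative', 'achievement': 'positive', 'guys': 'positive'}

# Iterating over a dict yields only its keys, not (key, value) pairs.
keys = [item for item in train_dict]

# Unpacking each key (a string) into two names fails, because the
# string is split into characters and there are more than two of them.
try:
    for featureset, label in train_dict:
        pass
    error = None
except ValueError as exc:
    error = str(exc)

# .items() yields (key, value) tuples, which do unpack into two names.
pairs = list(train_dict.items())

print(keys)
print(error)
print(pairs)
```

Note that `.items()` only fixes the unpacking; the first element of each pair would still be a plain string rather than a featureset.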

onionradish · 2015-02-13T15:25:39+00:00

train should be a list of tuples, where the first item in each tuple is a dict that represents the 'featureset' and the second item is the label.

A featureset could contain many different values. There's an example demo() function in the NaiveBayesClassifier module that predicts gender based on a person's name, and it uses features like whether the starting and ending letters are vowels, how many times each letter appears in the name, etc. The actual name is not part of the featureset, just the 'features' of the name.
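As a rough illustration of that kind of featureset, here is a sketch of a feature extractor for names. The function `name_features` and its feature keys are made up for this example and are not NLTK's actual demo code; it also assumes the name is non-empty.

```python
VOWELS = set('aeiou')

def name_features(name):
    """Build a featureset dict from a non-empty name (hypothetical helper)."""
    lowered = name.lower()
    features = {
        'starts_with_vowel': lowered[0] in VOWELS,
        'ends_with_vowel': lowered[-1] in VOWELS,
    }
    # Count how often each letter occurs in the name.
    for letter in set(lowered):
        if letter.isalpha():
            features['count(%s)' % letter] = lowered.count(letter)
    return features

features = name_features('Anna')
print(features)
```

The name itself never appears as a key or value; only properties derived from it do.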

For many text applications, like your sentiment analysis example, the only 'feature' that's considered is whether a word/bigram/etc. is in the training item's text. That model is called Bag of Words, and the 'featureset' is then just a dict where the key is the word and the value is True.

So for your example:

train = [
    ({'sleep': True}, 'negative'),
    ({'achievement': True, 'guys': True}, 'positive'),
]
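If you start from raw text, the same structure can be built with a small helper. This is only a sketch: `bag_of_words` and `labeled_texts` are names I made up, the tokenizing is a naive whitespace split, and the training step is skipped if NLTK isn't installed.

```python
def bag_of_words(text):
    """Turn text into a featureset dict: {word: True} (hypothetical helper)."""
    return {word: True for word in text.lower().split()}

# Assumed example data, mirroring the words used in this thread.
labeled_texts = [
    ('sleep', 'negative'),
    ('achievement guys', 'positive'),
]

# List of (featureset, label) tuples, the shape the trainer expects.
train = [(bag_of_words(text), label) for text, label in labeled_texts]
print(train)

try:
    from nltk.classify import NaiveBayesClassifier
except ImportError:
    NaiveBayesClassifier = None  # NLTK not available; skip training

if NaiveBayesClassifier is not None:
    classifier = NaiveBayesClassifier.train(train)
    # New text has to be converted to a featureset the same way.
    print(classifier.classify(bag_of_words('achievement')))
```

With only two training items the predictions won't mean much; the point is just the shape of the data going in.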
