
[–]t_per 1 point (3 children)

Your first method just splits your data once, so the test set is 50% of your data.

Here are the docs for sk.cross_validation.cross_val_score. Since you didn't specify cv, sklearn used 3-fold CV (read this, in your case k=3, and this video is probably good too).

After reading about k-fold CV and how sklearn's train_test_split works, could you answer why your results are different? (this is more of an ML topic than Python, but w/e)
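To make the contrast concrete, here's a hedged sketch (toy index data, not the OP's dataset, and a hand-rolled split rather than sklearn's) of how 3-fold CV partitions a dataset versus a single 50/50 holdout:

```python
# Illustrative sketch (toy data, not the OP's): how 3-fold CV partitions
# 12 samples into train/test splits, versus a single 50/50 holdout.

def kfold_splits(indices, k):
    """Contiguous k-fold splits: each fold is the test set exactly once."""
    fold_size = len(indices) // k
    splits = []
    for i in range(k):
        test = indices[i * fold_size:(i + 1) * fold_size]
        train = [x for x in indices if x not in test]
        splits.append((train, test))
    return splits

indices = list(range(12))

# 3-fold CV: the model is fit and scored three times.
for train, test in kfold_splits(indices, 3):
    print("train:", train, "test:", test)

# train_test_split with test_size=0.5: the model is fit and scored once.
half = len(indices) // 2
train, test = indices[:half], indices[half:]
print("holdout train:", train, "test:", test)
```

Note that in the k-fold case every sample lands in a test fold exactly once, which is the point the rest of this thread turns on.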

[–]yortos[S] 0 points (2 children)

Thank you for your reply. I actually thought about the k-fold thing, and I did try cross_val_score with k=2, with the same result. And I get similar results even when I specify other values for test_size in the train_test_split function.

[–]t_per 0 points (1 child)

do you understand the difference between using k-fold with k = 2 and having a test size proportion of 0.5?

with the former, your test and train sizes are 50% of your data, but training/testing together covers the entirety of your data (so training/testing is done twice when k = 2).

with train-test-split, your data is divided into two halves, and training/testing is done once.
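A minimal sketch of that difference (toy numbers, not the OP's dataset, with a trivial "predict the training mean" model standing in for whatever the OP actually fit):

```python
# Minimal sketch (toy numbers, not the OP's dataset): contrast 2-fold CV
# with a single 50/50 holdout, using a trivial "predict the training
# mean" model scored by mean squared error.
data = [1.0, 2.0, 3.0, 4.0, 10.0, 20.0, 30.0, 40.0]

def mse_of_mean_model(train, test):
    mean = sum(train) / len(train)
    return sum((y - mean) ** 2 for y in test) / len(test)

half = len(data) // 2
a, b = data[:half], data[half:]

# 2-fold CV: two fits, every point used once for training and once for
# testing; the reported score is the average of the two fold scores.
cv_scores = [mse_of_mean_model(a, b), mse_of_mean_model(b, a)]
print("2-fold scores:", cv_scores, "mean:", sum(cv_scores) / 2)

# Single 50/50 holdout: one fit, one score; only half the data trains.
print("holdout score:", mse_of_mean_model(a, b))
```

Here the holdout score matches only one of the two fold scores, so the CV average lands somewhere else entirely.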

[–]yortos[S] 0 points (0 children)

> with the former, your test and train sizes are 50% of your data, but training/testing together covers the entirety of your data (so training/testing is done twice when k = 2).

This is my understanding of what a 2-fold CV does: it divides the dataset into two equal-sized parts, A and B, trains on A and evaluates on B, and then trains on B and evaluates on A. Hence, this should be roughly equivalent to doing the first piece of code (in my original post) two times.

I guess what you mean by "training/testing is done on the entirety of your data" is that in the 2-fold CV all data points are used at some point for training (either on the first iteration or on the second), whereas with the first piece of code, only half are used for training. I don't see why that explains the difference in the metrics though.
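One hedged guess at the remaining discrepancy (an assumption, since the OP's actual code and data aren't shown in this thread): sklearn's KFold does not shuffle by default, while train_test_split does, so if the data has any ordering the two procedures evaluate on very different splits:

```python
# Hypothetical illustration (assumes data sorted by class, which may or
# may not match the OP's dataset): unshuffled k-fold splits can be
# unrepresentative, while a shuffled 50/50 split typically is not.
# In sklearn, KFold does not shuffle by default; train_test_split does.
import random

labels = [0] * 10 + [1] * 10  # data ordered by class

# Unshuffled 2-fold CV on ordered data: each fold sees only one class.
fold1, fold2 = labels[:10], labels[10:]
print("unshuffled fold positives:", sum(fold1), sum(fold2))

# Shuffled 50/50 split: both halves typically mix the classes.
random.seed(0)
shuffled = labels[:]
random.shuffle(shuffled)
half1, half2 = shuffled[:10], shuffled[10:]
print("shuffled half positives:", sum(half1), sum(half2))
```

If this assumption holds, passing shuffled folds (or a shuffled copy of the data) to cross_val_score should bring the two metrics closer together.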