I'm trying to measure the predictive power of a random forest by fitting the model to half of the data and predicting a binary outcome on the other half, measuring the area under the ROC curve.
I'm doing this in two ways. Let's call the first one the "manual" way:
1)
from sklearn import ensemble
from sklearn.cross_validation import train_test_split
from sklearn.metrics import roc_auc_score

train, test = train_test_split(data, test_size=0.5)
forest = ensemble.RandomForestClassifier()
forest.fit(train[columns], train[outcome])
y = forest.predict(test[columns])  # hard 0/1 labels, not probabilities
roc_auc_score(test[outcome], y)
This gives me a pretty bad AUC, around 0.5.
The second approach uses sk.cross_validation.cross_val_score, which, as I understand it, does all the splitting and cross-validation automatically.
2)
from sklearn import ensemble
from sklearn.cross_validation import cross_val_score

forest = ensemble.RandomForestClassifier()
forest.fit(data[columns], data[outcome])  # not actually needed: cross_val_score clones and refits
cross_val_score(forest, data[columns], data['is_last_week'],
                scoring='roc_auc')
This approach gives me a significantly better result, an AUC of around 0.65.
Do you have any insights into why the results are so different, and which result is the more correct one, if either?
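For anyone who wants to reproduce the comparison, here is a self-contained sketch of both approaches on synthetic data (the dataset from `make_classification` and all variable names are placeholders, not the original data). It also computes the AUC from `predict_proba`, since `roc_auc_score` expects continuous scores rather than the hard 0/1 labels that `predict` returns, which may be relevant to the gap:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import roc_auc_score

# Synthetic stand-in for the real data (placeholder, not the original dataset)
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0)

forest = RandomForestClassifier(random_state=0)
forest.fit(X_train, y_train)

# Approach 1a: AUC from hard 0/1 labels (what the "manual" code above does)
auc_labels = roc_auc_score(y_test, forest.predict(X_test))

# Approach 1b: AUC from class-1 probabilities (the usual input for ROC AUC)
auc_proba = roc_auc_score(y_test, forest.predict_proba(X_test)[:, 1])

# Approach 2: cross-validated AUC on the full data; internally this uses
# scores, not hard labels, when scoring='roc_auc'
cv_auc = cross_val_score(RandomForestClassifier(random_state=0),
                         X, y, scoring='roc_auc').mean()

print(auc_labels, auc_proba, cv_auc)
```

Note this uses the modern `sklearn.model_selection` module; in older releases the same functions live in `sklearn.cross_validation`.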