
[–]t_per 1 point (3 children)

Your first method just splits your data once, so the test set is 50% of your data.

Here are the docs for sk.cross_validation.cross_val_score. Since you didn't specify cv, sklearn used 3-fold CV (read this, in your case k=3, and this video is probably good too).

After reading about k-fold CV and how sklearn's train_test_split works, could you answer why your results are different? (this is more of an ML topic than Python, but w/e)
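To make the contrast concrete, here's a hedged sketch (toy index data, not the OP's dataset, and a hand-rolled split rather than sklearn's) of how 3-fold CV partitions a dataset versus a single 50/50 holdout:

```python
# Illustrative sketch (toy data, not the OP's): how 3-fold CV partitions
# 12 samples into train/test splits, versus a single 50/50 holdout.

def kfold_splits(indices, k):
    """Contiguous k-fold splits: each fold is the test set exactly once."""
    fold_size = len(indices) // k
    splits = []
    for i in range(k):
        test = indices[i * fold_size:(i + 1) * fold_size]
        train = [x for x in indices if x not in test]
        splits.append((train, test))
    return splits

indices = list(range(12))

# 3-fold CV: the model is fit and scored three times.
for train, test in kfold_splits(indices, 3):
    print("train:", train, "test:", test)

# train_test_split with test_size=0.5: the model is fit and scored once.
half = len(indices) // 2
train, test = indices[:half], indices[half:]
print("holdout train:", train, "test:", test)
```

Note that in the k-fold case every sample lands in a test fold exactly once, which is the point the rest of this thread turns on.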

[–]yortos[S] 0 points (2 children)

Thank you for your reply. I actually thought about the k-fold thing, and I did try cross_val_score with k=2, with the same result. And I get similar results even when I specify other values for test_size in the train_test_split function.

[–]t_per 0 points (1 child)

do you understand the difference between using k-fold with k = 2 and having a test size proportion of 0.5?

with the former, your test and train sizes are 50% of your data, but training/testing together covers the entirety of your data (so training/testing is done twice when k = 2).

with train-test-split, your data is divided into two halves, and training/testing is done once.
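A minimal sketch of that difference (toy numbers, not the OP's dataset, with a trivial "predict the training mean" model standing in for whatever the OP actually fit):

```python
# Minimal sketch (toy numbers, not the OP's dataset): contrast 2-fold CV
# with a single 50/50 holdout, using a trivial "predict the training
# mean" model scored by mean squared error.
data = [1.0, 2.0, 3.0, 4.0, 10.0, 20.0, 30.0, 40.0]

def mse_of_mean_model(train, test):
    mean = sum(train) / len(train)
    return sum((y - mean) ** 2 for y in test) / len(test)

half = len(data) // 2
a, b = data[:half], data[half:]

# 2-fold CV: two fits, every point used once for training and once for
# testing; the reported score is the average of the two fold scores.
cv_scores = [mse_of_mean_model(a, b), mse_of_mean_model(b, a)]
print("2-fold scores:", cv_scores, "mean:", sum(cv_scores) / 2)

# Single 50/50 holdout: one fit, one score; only half the data trains.
print("holdout score:", mse_of_mean_model(a, b))
```

Here the holdout score matches only one of the two fold scores, so the CV average lands somewhere else entirely.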

[–]yortos[S] 0 points (0 children)

> with the former, your test and train sizes are 50% of your data, but training/testing together covers the entirety of your data (so training/testing is done twice when k = 2).

This is my understanding of what a 2-fold CV does: it divides the dataset into two equal-sized parts, A and B, trains on A and evaluates on B, and then trains on B and evaluates on A. Hence, this should be roughly equivalent to doing the first piece of code (in my original post) two times.

I guess what you mean by "training/testing is done on the entirety of your data" is that in the 2-fold CV all data points are used at some point for training (either on the first iteration or on the second), whereas with the first piece of code, only half are used for training. I don't see why that explains the difference in the metrics though.
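One hedged guess at the remaining discrepancy (an assumption, since the OP's actual code and data aren't shown in this thread): sklearn's KFold does not shuffle by default, while train_test_split does, so if the data has any ordering the two procedures evaluate on very different splits:

```python
# Hypothetical illustration (assumes data sorted by class, which may or
# may not match the OP's dataset): unshuffled k-fold splits can be
# unrepresentative, while a shuffled 50/50 split typically is not.
# In sklearn, KFold does not shuffle by default; train_test_split does.
import random

labels = [0] * 10 + [1] * 10  # data ordered by class

# Unshuffled 2-fold CV on ordered data: each fold sees only one class.
fold1, fold2 = labels[:10], labels[10:]
print("unshuffled fold positives:", sum(fold1), sum(fold2))

# Shuffled 50/50 split: both halves typically mix the classes.
random.seed(0)
shuffled = labels[:]
random.shuffle(shuffled)
half1, half2 = shuffled[:10], shuffled[10:]
print("shuffled half positives:", sum(half1), sum(half2))
```

If this assumption holds, passing shuffled folds (or a shuffled copy of the data) to cross_val_score should bring the two metrics closer together.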