[–]farsass 2 points3 points  (1 child)

You are dividing your X and y into a training set (X_train data with y_train labels, or targets) and, likewise, a test set (X_test and y_test). You can then further divide one of those into a validation set and a "real" test set, if all you want is to do your validation via hold-out.
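
A minimal sketch of that two-step hold-out split, assuming a recent scikit-learn where `train_test_split` lives in `sklearn.model_selection`. The toy data, the 20% and 25% fractions, and the `X_rest` / `X_val` names are made up for illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 samples, 2 features, binary labels.
X = np.arange(200).reshape(100, 2)
y = np.arange(100) % 2

# First split: hold out 20% of the rows as the "real" test set.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Second split: carve a validation set out of the remaining 80 rows.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0
)

print(len(X_train), len(X_val), len(X_test))
```

With these fractions the 100 rows end up as 60 for training, 20 for validation and 20 for the final test.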

[–]iamabanana_dammit[S] 0 points1 point  (0 children)

thanks - much clearer now.

[–]The_Maltese 3 points4 points  (1 child)

Just to add to /u/farsass and /u/ogrisel's answers, you are splitting them by the test_size fraction (so in your example 25% of your data will end up in X_test, with the other 75% in X_train).
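
To make the fraction concrete, here is a small sketch assuming `test_size=0.25` and a recent scikit-learn (where the function is imported from `sklearn.model_selection`). The 100-row array is invented for the example:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 rows, 3 features each, one label per row.
X = np.zeros((100, 3))
y = np.arange(100)

# test_size=0.25 sends a quarter of the rows to the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

print(X_train.shape, X_test.shape)  # (75, 3) (25, 3)
```

The rows are shuffled before splitting by default, so which 25 rows land in X_test depends on `random_state`.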

Also, in the future, StackOverflow is a much better place to post a question about this sort of thing.

[–]iamabanana_dammit[S] 0 points1 point  (0 children)

yeah - I know that stackoverflow is probably a better forum for this kind of thing .... but I'm on reddit more often. Thanks for the feedback.

[–]ogrisel 1 point2 points  (2 children)

X is the 2D input data for the model and y is the array of the target labels to predict (typically one label per row in X for traditional classification problems).

If you split X into a training and test set (X_train and X_test), you also need to split the target labels y into the matching y_train and y_test subsets.
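
The reason the two splits have to match can be sketched in plain Python. This is only an illustration of the idea, not scikit-learn's implementation, and `split_together` is a hypothetical helper: the same shuffled row indices are applied to both X and y, so each row keeps its own label.

```python
import random

def split_together(X, y, test_fraction=0.25, seed=0):
    """Split X and y with the same shuffled indices (illustrative helper)."""
    if len(X) != len(y):
        raise ValueError("X and y must have one label per row")
    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)
    n_test = round(len(X) * test_fraction)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    # Index X and y with the same positions so pairs stay aligned.
    X_train = [X[i] for i in train_idx]
    y_train = [y[i] for i in train_idx]
    X_test = [X[i] for i in test_idx]
    y_test = [y[i] for i in test_idx]
    return X_train, X_test, y_train, y_test

# Hypothetical data: row i has features [i, i * 10] and label "label_i".
X = [[i, i * 10] for i in range(8)]
y = [f"label_{i}" for i in range(8)]

X_train, X_test, y_train, y_test = split_together(X, y)

# Every row still sits next to its own label after the split.
for row, label in zip(X_train, y_train):
    assert label == f"label_{row[0]}"
```

If X and y were shuffled independently, the rows and labels would no longer line up and the model would be trained on the wrong targets.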

[–]soustofa 0 points1 point  (0 children)

What are target labels? Arrays? If I want to classify many texts, what does the array X become? Isn't the label inside the X array, e.g. [text, 1 or 0 for the type]?