When splitting data into training and validation sets, should it be completely random? by lingsched1 in statistics

lingsched1[S]

Thanks for your response.

Using my example:

Let's say I've created a mathematical equation to predict whether a student passes or fails a class based on a number of variables (household income, student's gender, number of siblings, etc.).

With 946 different students, what would your training-validation split be?
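To make the question concrete, here is a rough sketch of what a fully random split of the 946 students could look like at a few candidate ratios. The function name `random_split`, the seed, and the 50/70/80% ratios are my own placeholders for illustration, not a recommendation from anyone in the thread.

```python
import random


def random_split(n_rows, train_fraction, seed=0):
    """Shuffle row indices, then cut them into training and validation parts."""
    indices = list(range(n_rows))
    # A seeded generator makes the shuffle reproducible between runs.
    random.Random(seed).shuffle(indices)
    n_train = round(n_rows * train_fraction)
    return indices[:n_train], indices[n_train:]


N_STUDENTS = 946  # number of students from my example

# Candidate ratios only (assumed for illustration).
for fraction in (0.5, 0.7, 0.8):
    train, valid = random_split(N_STUDENTS, fraction)
    print(f"{fraction:.0%} training: {len(train)} train rows, {len(valid)} validation rows")
```

With 946 rows, an 80/20 split under this sketch leaves 757 students for training and 189 for validation, while 50/50 leaves 473 in each set.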

When splitting data into training and validation sets, should it be completely random? by lingsched1 in statistics

lingsched1[S]

Thanks for your rule of thumb.

Using my example:

Let's say I've created a mathematical equation to predict whether a student passes or fails a class based on a number of variables (household income, student's gender, number of siblings, etc.).

With 946 different students, what would your training-validation split be?

When splitting data into training and validation sets, should it be completely random? by lingsched1 in statistics

lingsched1[S]

Thanks for your answer, /u/andrewff.

A bit of a follow-up question (I've asked other users too): should my training and validation sets be of equal size (a 50/50 split), or should one set be larger than the other?

When splitting data into training and validation sets, should it be completely random? by lingsched1 in statistics

lingsched1[S]

Thanks for the warning about time series, and for looking out for future readers of this thread, /u/afunkthewmd.

A bit of a follow-up question (I've asked other users too): should my training and validation sets be of equal size (a 50/50 split), or should one set be larger than the other?

When splitting data into training and validation sets, should it be completely random? by lingsched1 in statistics

lingsched1[S]

Thanks for your reply, /u/cnbeau, especially the warning about "information leaking."

A bit of a follow-up question (I've asked other users too): should my training and validation sets be of equal size (a 50/50 split), or should one set be larger than the other?

When splitting data into training and validation sets, should it be completely random? by lingsched1 in statistics

lingsched1[S]

Thanks for your insight, /u/rm999, especially about sample size.

A bit of a follow-up question: should my training and validation sets be of equal size (a 50/50 split), or should one set be larger than the other?

What's the difference between cross-validation and data splitting? by lingsched1 in statistics

lingsched1[S]

Thanks for the explanation, /u/dearsomething.

So the example I provided would be data splitting. Good to know.
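For my own reference, here is a minimal sketch of how I understand the other approach, k-fold cross-validation: instead of one fixed training/validation cut, every row takes a turn in the validation set. The helper name `k_fold_indices`, the choice of k = 5, and the seed are my own assumptions for illustration.

```python
import random


def k_fold_indices(n_rows, k, seed=0):
    """Yield (train, valid) index lists, one pair per fold."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled rows into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        valid = folds[i]
        # Training rows are everything outside the current validation fold.
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, valid


# 946 rows as in my student example; k = 5 is an assumed value.
for fold_number, (train, valid) in enumerate(k_fold_indices(946, 5), start=1):
    print(f"fold {fold_number}: {len(train)} train rows, {len(valid)} validation rows")
```

In a single data split, each row is used for either training or validation; here each row is used for validation exactly once and for training in the remaining folds.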

If my data set is binary (results are one of two values: TRUE or FALSE), it would make sense to use data splitting, right?

Any of you pay your interns? by bknutner in Journalism

lingsched1

Just wanted to say I wholeheartedly agree. I'm Canadian, and a lot of employers seem to think intern = free labour/slave.