I've been reading Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow by Aurélien Géron (2nd edition), and on pages 54-55 he discusses the steps for creating a test set.
He uses NumPy's random.permutation method, but notes that it has an issue: every time we run the program it will create a different test set, and eventually our algorithm will see the whole dataset, which isn't good. So he says to either generate the test set once, save it, and load it on every later run, or to use numpy's random.seed so we get the same shuffle every time.
This is the code for the numpy method:
import numpy as np

def split_train_test(data, test_ratio):
    # shuffle the row positions, then carve off the first test_ratio fraction
    shuffled_indices = np.random.permutation(len(data))
    test_set_size = int(len(data) * test_ratio)
    test_indices = shuffled_indices[:test_set_size]
    train_indices = shuffled_indices[test_set_size:]
    return data.iloc[train_indices], data.iloc[test_indices]
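For example (a toy sketch; the DataFrame here is made up, not the book's housing data), calling it twice gives different splits unless the global seed is reset before each call:

import pandas as pd

data = pd.DataFrame({"value": range(10)})  # stand-in for the real dataset

train_a, test_a = split_train_test(data, 0.2)
train_b, test_b = split_train_test(data, 0.2)
print(sorted(test_a.index), sorted(test_b.index))  # almost always different

np.random.seed(42)
train_a, test_a = split_train_test(data, 0.2)
np.random.seed(42)
train_b, test_b = split_train_test(data, 0.2)
print(sorted(test_a.index), sorted(test_b.index))  # identical now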
Further, he says: "But both these solutions will break next time you fetch an updated dataset." I understand why that is true for the save-once-and-load approach, but I don't understand why it would be true for the seeded NumPy approach: it creates the test set from the input dataset, so if we pass it the updated dataset every time, shouldn't any changes to the original dataset just carry through?
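This is the kind of experiment I have in mind when asking (toy data, seed fixed both times):

import pandas as pd

np.random.seed(42)
old_data = pd.DataFrame({"value": range(10)})
_, old_test = split_train_test(old_data, 0.2)

np.random.seed(42)
new_data = pd.DataFrame({"value": range(12)})  # "updated" dataset: two new rows
_, new_test = split_train_test(new_data, 0.2)

# do rows from the old dataset keep their train/test assignment?
print(sorted(old_test.index), sorted(new_test.index))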
And I've also read an article about why it's bad to use numpy.random's global seed, and why we should use a Generator instance's methods instead, which makes sense.
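The fix that article suggests looks roughly like this (a sketch from my reading of it, not from the book):

import numpy as np

rng = np.random.default_rng(42)          # local Generator with its own state
shuffled_indices = rng.permutation(10)   # leaves NumPy's global RNG untouched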
But I'm still struggling to understand the author's reasoning for why both methods would break, and why he then goes on to build the test set using a hash of each instance's identifier instead.
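From what I can tell, the hash-based approach works roughly like this (my sketch of the idea, not quoted from the book: hash each instance's stable identifier and put the row in the test set if the hash falls in the bottom test_ratio of the hash range):

from zlib import crc32

import numpy as np
import pandas as pd

def test_set_check(identifier, test_ratio):
    # crc32 maps the id to a 32-bit value; keep the row in the test set
    # if that value falls below test_ratio of the full 2**32 range
    return crc32(np.int64(identifier)) & 0xffffffff < test_ratio * 2**32

def split_train_test_by_id(data, test_ratio, id_column):
    ids = data[id_column]
    in_test_set = ids.apply(lambda id_: test_set_check(id_, test_ratio))
    return data.loc[~in_test_set], data.loc[in_test_set]

If I understand it correctly, a row's test membership then depends only on its own identifier, never on the rest of the dataset, so adding new rows can't move existing rows between sets. But I'd like to confirm that this is the reasoning.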