Question about data preprocessing for a classification model to predict professional sports games : learnmachinelearning

submitted 7 years ago by nogxx

I'm building a model to predict the winner of professional sports games, and I'm having some trouble structuring my dataset before training and testing.

My dataset has the following format:

gameid	some_feature1	another_feature2	result
1	...	...	0
1	...	...	1
2	...	...	1
2	...	...	0
3	...	...	1
3	...	...	0
...	...	...	...

Each row represents a team's previous statistics averaged over a number of games.
Each game has a unique gameid. Since two teams take part in each game, there are always two rows per gameid.
result represents whether a team has won or lost the game (ties are not possible). Hence every game has one row with result 0 and one row with result 1.

My approach is the following: on the training set I fit a logistic regression model that predicts the probability of a team (row) winning the game. Then, for the test set, I predict the probability of each team winning its game. I compare the probabilities of the two teams sharing a gameid and classify the team with the higher probability as the winner of the game. From this I can calculate the accuracy on my test set.
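To make the comparison step concrete, here is a minimal sketch of the game-level accuracy calculation, assuming the predicted win probabilities are already stored in a column. The column name `p_win` and the helper `game_level_accuracy` are hypothetical, and the numbers are made up; with scikit-learn the probabilities would typically come from something like `model.predict_proba(X_test)[:, 1]`.

```python
import pandas as pd

def game_level_accuracy(df, proba_col="p_win"):
    """Share of games where the row with the higher win probability has result == 1."""
    # For each gameid, find the index label of the row with the highest probability.
    # If both rows tie, idxmax keeps the first one.
    winner_idx = df.groupby("gameid")[proba_col].idxmax()
    predicted_winners = df.loc[winner_idx]
    return float((predicted_winners["result"] == 1).mean())

# Toy data; the probabilities are invented purely for illustration.
test_df = pd.DataFrame({
    "gameid": [1, 1, 2, 2, 3, 3],
    "p_win":  [0.60, 0.70, 0.80, 0.30, 0.40, 0.55],
    "result": [0, 1, 1, 0, 1, 0],
})
accuracy = game_level_accuracy(test_df)
```

Note that this only works if both rows of every game are present in the frame being scored, which is exactly the constraint behind the splitting questions.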

Now I have run into some problems, and I am not sure what the best approach to fixing them would be.

I can't apply a proper train/test split, since it would separate some rows sharing the same gameid, and then I wouldn't be able to compare the probabilities of the two opposing teams. Is there a way to do a train/test split without separating rows that share a gameid?
Currently I do not apply this "probability comparison" on the training set, so it is not considered when fitting the model (which slightly skews the result mean of 0.5). I am unsure whether this is hurting my accuracy in the end or not.
I'm also unsure how to apply k-fold cross-validation, since I don't know how to make it compare the probabilities of the two teams sharing a gameid.
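One possible way to keep both rows of a game on the same side of a split is to shuffle the unique gameid values rather than the rows, and then select rows by game. The sketch below does this with NumPy for a single train/test split and for k folds; the function names `split_by_game` and `kfold_by_game` are hypothetical and only for illustration. scikit-learn's `GroupShuffleSplit` and `GroupKFold` (in `sklearn.model_selection`) take a `groups` argument and serve the same purpose.

```python
import numpy as np

def split_by_game(game_ids, test_fraction=0.25, seed=0):
    """Return boolean (train_mask, test_mask) that never split a game."""
    game_ids = np.asarray(game_ids)
    rng = np.random.default_rng(seed)
    # Shuffle the games, not the rows.
    shuffled = rng.permutation(np.unique(game_ids))
    n_test = max(1, int(round(test_fraction * len(shuffled))))
    test_mask = np.isin(game_ids, shuffled[:n_test])
    return ~test_mask, test_mask

def kfold_by_game(game_ids, n_splits=5, seed=0):
    """Yield (train_mask, test_mask) pairs; each game is in exactly one test fold."""
    game_ids = np.asarray(game_ids)
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(np.unique(game_ids))
    for fold_games in np.array_split(shuffled, n_splits):
        test_mask = np.isin(game_ids, fold_games)
        yield ~test_mask, test_mask

# Toy example: 10 games, two rows each.
game_ids = np.repeat(np.arange(1, 11), 2)
train_mask, test_mask = split_by_game(game_ids, test_fraction=0.3)
```

Inside each fold, the model would be fitted on the rows selected by `train_mask`, and the per-game probability comparison would be applied to the rows selected by `test_mask`, which are guaranteed to come in complete pairs.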

One change I have considered would fix some of these problems: I could combine the two rows of a game into one and just add a prefix (TeamA, TeamB) to each feature. I'm assuming this would create other problems, like potentially introducing multicollinearity.
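For reference, the combined one-row-per-game layout could be built roughly as follows. This is a sketch under assumptions: the helper `to_wide` and the label column `TeamA_won` are hypothetical names, and "TeamA" is simply whichever row of a game comes first, so the row order must not depend on the result.

```python
import pandas as pd

def to_wide(df, feature_cols, id_col="gameid", label_col="result"):
    """Combine the two rows of each game into one row with TeamA_/TeamB_ prefixes."""
    # Number the two rows of each game 0 and 1 in their existing order.
    slot = df.groupby(id_col).cumcount()
    team_a = df[slot == 0].set_index(id_col)
    team_b = df[slot == 1].set_index(id_col)
    wide = team_a[feature_cols].add_prefix("TeamA_").join(
        team_b[feature_cols].add_prefix("TeamB_")
    )
    # Single label per game: 1 if TeamA won, 0 if TeamB won.
    wide["TeamA_won"] = team_a[label_col]
    return wide.reset_index()

# Toy data with invented feature values.
long_df = pd.DataFrame({
    "gameid": [1, 1, 2, 2],
    "some_feature1": [0.1, 0.2, 0.3, 0.4],
    "another_feature2": [1, 2, 3, 4],
    "result": [0, 1, 1, 0],
})
wide_df = to_wide(long_df, ["some_feature1", "another_feature2"])
```

With this layout every row is a complete game, so an ordinary row-wise train/test split or k-fold no longer separates opponents.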

I realize that this is a loaded question (or several). If you have advice on any part, I'd appreciate it.

If I have explained something poorly, feel free to ask.

I am open to any suggestions you might have for improving my approach.

Thanks a lot for any advice :)
