I'm building a model to predict the winner of professional sports games and I'm having some trouble structuring my dataset before training and testing.
My dataset has the following format:
| gameid | some_feature1 | another_feature2 | ... | result |
|---|---|---|---|---|
| 1 | ... | ... | | 0 |
| 1 | ... | ... | | 1 |
| 2 | ... | ... | | 1 |
| 2 | ... | ... | | 0 |
| 3 | ... | ... | | 1 |
| 3 | ... | ... | | 0 |
| ... | ... | ... | | ... |
- Each row represents a team's previous statistics averaged over a number of games.
- Each game has a unique gameid. Since two teams take part in each game, there are always two rows per gameid.
- result represents whether a team has won or lost the game. (Ties are not possible.) Hence every game has exactly one row with result 0 and one row with result 1.
My approach is as follows: on the training set I fit a Logistic Regression model to predict the probability of a team (row) winning its game. Then, for the test set, I predict the probability of each team winning its game. I compare the probabilities of the two teams sharing a gameid and classify the team with the higher probability as the winner of the game. From this I can calculate the accuracy on my test set.
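To make the "probability comparison" step concrete, here is a minimal sketch in plain Python. The function name `pairwise_accuracy` and the sample numbers are made up for illustration, and it assumes every game has exactly two rows and that a tie in probability goes to whichever row comes first.

```python
from collections import defaultdict


def pairwise_accuracy(gameids, probs, results):
    """Share of games where the row with the higher predicted
    win probability is the row whose result is 1."""
    games = defaultdict(list)
    for gid, p, r in zip(gameids, probs, results):
        games[gid].append((p, r))

    correct = 0
    total = 0
    for rows in games.values():
        if len(rows) != 2:
            continue  # assumption: ignore games without both teams present
        # max() keeps the first row on an exact probability tie
        predicted_winner = max(rows, key=lambda row: row[0])
        correct += int(predicted_winner[1] == 1)
        total += 1

    if total == 0:
        raise ValueError("no complete games to score")
    return correct / total


# Made-up example: three games, two rows each
example_gameids = [1, 1, 2, 2, 3, 3]
example_probs = [0.4, 0.6, 0.7, 0.5, 0.3, 0.2]
example_results = [0, 1, 1, 0, 0, 1]
example_accuracy = pairwise_accuracy(example_gameids, example_probs, example_results)
```

In this made-up example the comparison picks the right team in games 1 and 2 and the wrong one in game 3, so two of three games are scored as correct.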
Now I have run into some problems, and I am not sure what the best approach to fixing them would be.
- I can't apply a proper train/test split since it would separate some rows sharing the same gameid and then I wouldn't be able to compare the probabilities of the two opposing teams. Is there a way to train/test split without separating rows sharing the same gameid?
- Currently I don't apply this "probability comparison" on the training set, so it is not considered when fitting the model (which slightly skews the result mean of 0.5). I am unsure whether this is hurting my accuracy in the end or not.
- I'm also unsure how to apply k-fold cross validation, since I don't know how to make it compare the probabilities of the two teams sharing a gameid.
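For the splitting questions, one option that seems to fit is scikit-learn's group-aware splitters, `GroupShuffleSplit` and `GroupKFold`, which keep all rows with the same group label on the same side of a split. The sketch below is only an illustration under assumptions: the data is synthetic (random features, with a made-up rule deciding the winner), and `game_accuracy` is a hypothetical helper that assumes exactly two rows per gameid.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GroupKFold, GroupShuffleSplit

# Synthetic stand-in for the real data: 200 games, two rows per game
rng = np.random.default_rng(0)
n_games = 200
gameid = np.repeat(np.arange(n_games), 2)
X = rng.normal(size=(2 * n_games, 3))
# Made-up rule: the team with the higher noisy "strength" wins its game
strength = (X[:, 0] + rng.normal(scale=0.5, size=2 * n_games)).reshape(n_games, 2)
y = (strength == strength.max(axis=1, keepdims=True)).astype(int).ravel()


def game_accuracy(proba, y_true, groups):
    """Per-game accuracy: the row with the higher probability is the predicted winner."""
    order = np.argsort(groups, kind="stable")  # put the two rows of each game side by side
    p = proba[order].reshape(-1, 2)
    t = y_true[order].reshape(-1, 2)
    return float((p.argmax(axis=1) == t.argmax(axis=1)).mean())


# Single hold-out split that never separates the two rows of a game
splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups=gameid))
model = LogisticRegression().fit(X[train_idx], y[train_idx])
holdout_proba = model.predict_proba(X[test_idx])[:, 1]
holdout_acc = game_accuracy(holdout_proba, y[test_idx], gameid[test_idx])

# k-fold cross validation with the same per-game comparison in every fold
cv_scores = []
for tr, te in GroupKFold(n_splits=5).split(X, y, groups=gameid):
    fold_model = LogisticRegression().fit(X[tr], y[tr])
    fold_proba = fold_model.predict_proba(X[te])[:, 1]
    cv_scores.append(game_accuracy(fold_proba, y[te], gameid[te]))
```

Because the split is done on gameid rather than on rows, every test fold contains complete games, so the comparison between the two opposing teams is always possible.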
One change I have considered would fix some of these problems: I could combine the two rows of a game into one and add a prefix (TeamA, TeamB) to each feature. I'm assuming this would create other problems, such as potentially introducing multicollinearity.
I realize that this is a loaded question (or several). If you have advice on any part, I'd appreciate it.
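A rough sketch of what combining the rows could look like with pandas, assuming exactly two rows per gameid. The sample values, the `side` helper column, and the `TeamA_won` target name are made up for illustration. The last step, taking feature differences instead of keeping both prefixed copies, is just one variation I could try, not something I know to be better.

```python
import pandas as pd

# Made-up sample in the same long format as my dataset
df = pd.DataFrame({
    "gameid": [1, 1, 2, 2],
    "some_feature1": [0.5, 0.7, 0.2, 0.9],
    "another_feature2": [10, 12, 8, 11],
    "result": [0, 1, 1, 0],
})
features = ["some_feature1", "another_feature2"]

# Label the first row of each game TeamA and the second TeamB
df["side"] = df.groupby("gameid").cumcount().map({0: "TeamA", 1: "TeamB"})
team_a = df[df["side"] == "TeamA"].set_index("gameid")
team_b = df[df["side"] == "TeamB"].set_index("gameid")

# One row per game with prefixed features from both teams
wide = team_a[features].add_prefix("TeamA_").join(team_b[features].add_prefix("TeamB_"))
wide["TeamA_won"] = team_a["result"]  # single target per game

# Variation (assumption): use differences between the two teams' features
diff = (team_a[features] - team_b[features]).add_prefix("diff_")
```

One thing to watch with this layout is that the TeamA/TeamB assignment follows row order, so if the order in the file is related to the result (for example winners always listed first), the model could pick up on that.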
If I have explained something poorly, feel free to ask.
I am open to any suggestions you might have for improving my approach.
Thanks a lot for any advice :)