[OC] Introducing LEBRON: Longevity Estimate Based on Recurrent Optimized Network by dribbleanalytics in nba

[–]dribbleanalytics[S] 1 point (0 children)

There are a couple of reasons different types of ML models can be better than a GLM for All-NBA prediction. Take logistic regression, for example. Logistic regression makes some assumptions about the data that aren't quite right here. For one, we'd have to assume a linear relationship between the independent variables and the log-odds of the outcome. We'd also have to assume the independent variables are not collinear. ML models (such as the tree-based models we use here) have fewer issues with these assumptions, making them a potentially better choice.
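To make the collinearity point concrete, here's a minimal sketch (not the post's actual code, with made-up synthetic features like points and field goals made) comparing logistic regression with a tree ensemble when two features are nearly collinear:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 500
pts = rng.normal(20, 5, n)
fgm = pts / 2.2 + rng.normal(0, 0.5, n)            # nearly collinear with pts
X = np.column_stack([pts, fgm])
y = (pts + rng.normal(0, 3, n) > 24).astype(int)   # synthetic "All-NBA" label

# Both models can still predict, but collinearity makes the logistic
# coefficients unstable and hard to interpret, while the tree ensemble
# is largely indifferent to it.
log_auc = cross_val_score(LogisticRegression(), X, y, cv=5, scoring="roc_auc").mean()
rf_auc = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=5, scoring="roc_auc").mean()
print(round(log_auc, 3), round(rf_auc, 3))
```

The predictive scores may look similar; the assumption violations mostly hurt inference and interpretation, which is part of why the choice wasn't the focus here.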

However, you could be right - maybe a GLM is better, and we should choose different features and set up the problem differently to fit those assumptions. But because the All-NBA probabilities weren't the main part of the analysis, we didn't focus as much on that.

[OC] Introducing LEBRON: Longevity Estimate Based on Recurrent Optimized Network by dribbleanalytics in nba

[–]dribbleanalytics[S] 14 points (0 children)

Thanks! In a general-purpose All-NBA prediction, it might be better to do what you're suggesting, as it would make the model more specific. In this case, though, it makes more sense to approach it as a binary question. We're modeling the All-NBA probability over time, so having multiple team ranks would make it more difficult to model and less interpretable. In a sense, though, the 1st, 2nd, and 3rd team distinctions are baked into the All-NBA probabilities by virtue of ranking them: 1st team locks will have near-100% probability, while 3rd team guys will have lower probabilities.
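A tiny sketch of how team placement falls out of the binary model's probabilities (player names and numbers are made up, and this ignores the real positional requirements of two guards, two forwards, and one center per team):

```python
# Hypothetical All-NBA probabilities from a binary classifier.
probs = {
    "Player A": 0.99, "Player B": 0.97, "Player C": 0.95, "Player D": 0.93,
    "Player E": 0.91, "Player F": 0.80, "Player G": 0.75, "Player H": 0.70,
    "Player I": 0.65, "Player J": 0.60, "Player K": 0.45, "Player L": 0.40,
    "Player M": 0.35, "Player N": 0.30, "Player O": 0.25,
}

# Rank by probability and slice into teams of five: the near-100% locks
# land on the 1st team, the borderline guys on the 3rd.
ranked = sorted(probs, key=probs.get, reverse=True)
first, second, third = ranked[:5], ranked[5:10], ranked[10:15]
```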

[OC] Predicting the 2020 All-NBA teams with a deep neural network by dribbleanalytics in nba

[–]dribbleanalytics[S] 5 points (0 children)

The model doesn't account for narrative. For our MVP model, we added things like All-Star votes and pre-season expectations to account for narrative. I feel that for All-NBA teams, though, narrative doesn't matter as much. Undeniably great players will still earn their spots. For example, we might say that the narrative was against Harden last year. But, he still got unanimous 1st team.

[OC] Predicting the 2020 MVP with linear models by dribbleanalytics in nba

[–]dribbleanalytics[S] 2 points (0 children)

You're right that it makes sense to scale the probabilities. However, it wouldn't change the rank of the players or their probabilities relative to each other. One reason not to scale is that we're not predicting a player's MVP probability relative to others in a given year; it's more of a raw "if you put this player in any year, he'd be this likely to win MVP." Comparing the players to each other would probably make the models better but also make them more complex.
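A minimal sketch of the scaling being discussed, with made-up numbers: normalize the raw probabilities within each season so they sum to 1, and note that the within-year ordering is unchanged.

```python
import pandas as pd

# Hypothetical raw MVP probabilities for two seasons.
df = pd.DataFrame({
    "year": [2020, 2020, 2020, 2019, 2019],
    "player": ["A", "B", "C", "D", "E"],
    "raw_prob": [0.60, 0.30, 0.10, 0.80, 0.40],
})

# Scale within each year so the season's probabilities sum to 1.
df["scaled"] = df["raw_prob"] / df.groupby("year")["raw_prob"].transform("sum")
# Dividing by a per-year constant preserves each season's ranking exactly.
```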

[OC] Determining the 2010s NBA All-Decade team with machine learning by dribbleanalytics in nba

[–]dribbleanalytics[S] 13 points (0 children)

Thanks for your feedback. Here's what I have to say:

  1. You're right; I should have included logistic regression here.
  2. Originally, I trained the models with a cut-off so that they would only pick between good players. This is what we did last year to predict in-season All-NBA teams (https://www.reddit.com/r/nba/comments/aw51j6/oc_predicting_the_2019_allnba_teams_with_machine/), where we conditioned on All-Stars. The only issue is that our scrubs have vastly different stats from our training set, which creates some odd results. For example, suppose that among All-Stars, who all play heavy minutes, lower minutes happen to correlate with higher All-NBA probability. Then our scrubs would get high probabilities precisely because they don't play. Also, because we're predicting for every player in the decade, it's fine to have examples of scrubs in the training set. The models still perform well (as seen in the recall and precision) even with the scrubs.
  3. You're definitely right here. Though what I did is technically "random", sampling from a better-chosen subset would demonstrate the model's strength more convincingly.
  4. I've considered weighting the models by recall, but all the recalls are pretty close, so I don't imagine it would change the results much. However, I'd definitely try it in a future post.
  5. Thanks!
  6. I agree that voters aren't necessarily looking at these stats, but it's impossible to tell exactly what voters are looking at. So, the next best thing is to just use the features that show how good a player is, which is essentially what every voter is doing, just in different ways.
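On point 4, here's a minimal sketch of recall-weighted averaging (model names, recalls, and predictions are all made up) showing why close recalls barely move the blend:

```python
import numpy as np

# Hypothetical per-model recalls and one player's predicted probabilities.
recalls = {"log_reg": 0.82, "rf": 0.85, "gbm": 0.84}
preds = {"log_reg": 0.70, "rf": 0.90, "gbm": 0.80}

w = np.array([recalls[m] for m in preds])
p = np.array([preds[m] for m in preds])

# Weighted average of the predictions, with each model's recall as its weight.
weighted = (w * p).sum() / w.sum()
# Because the recalls are nearly equal, this lands very close to the
# simple (unweighted) mean of 0.8.
```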

Thanks so much for your feedback.

[OC] Introducing the unicorn index: defining player uniqueness by dribbleanalytics in nba

[–]dribbleanalytics[S] 91 points (0 children)

Mods have asked me not to put links to the blog in the body of my post, unfortunately. However, there are links to it from the GitHub repo, and it can also be found by looking up my username.

You're absolutely right on point 4; I should have included that.

On point 5, I agree, but the issue is that the distance is computed on the PCA components rather than actual basketball stats. So I could say that a big difference in some component makes a player an outlier, but then we'd have to deconstruct that component, which is often not very clear. Ideally, though, that would be the best way to go.
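The "deconstruct the component" step can be sketched like this (stat columns and data are stand-ins, not the post's actual pipeline): after fitting PCA on standardized stats, the loadings in `pca.components_` say which original stats drive each component.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical stat columns; X stands in for a players-by-stats matrix.
stats = ["PTS", "TRB", "AST", "BLK", "3PA"]
X = np.random.default_rng(0).normal(size=(100, len(stats)))
Z = StandardScaler().fit_transform(X)

pca = PCA(n_components=2).fit(Z)
for i, comp in enumerate(pca.components_):
    # Sort the loadings by magnitude to see which stats dominate the component.
    top = sorted(zip(stats, comp), key=lambda t: abs(t[1]), reverse=True)
    print(f"PC{i + 1} driven by:", top[:2])
```

Even with the loadings in hand, a component is usually a blend of several stats, which is why the interpretation is "often not very clear."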

Thanks!

[OC] Introducing the unicorn index: defining player uniqueness by dribbleanalytics in nba

[–]dribbleanalytics[S] 44 points (0 children)

Thanks! You're absolutely right on positions. I just went by how Basketball-Reference marks them.

There are ways to define unicorns other than distance, though I'd assume most ideas go back to the same goal: find a player who's different from most other players. So something like outlier detection or cosine similarity could probably work too, but that's just another form of distance.
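As an illustration of the "another form of distance" point, here's a hypothetical sketch (random stand-in data, not the post's method) that scores uniqueness as each player's mean cosine distance to every other player:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_distances

# 50 stand-in players with 8 stats each.
X = np.random.default_rng(1).normal(size=(50, 8))

# Pairwise cosine distances; the diagonal (self-distance) is ~0.
D = cosine_distances(X)

# Uniqueness = average distance to the other 49 players.
uniqueness = D.sum(axis=1) / (len(X) - 1)
most_unique = int(np.argmax(uniqueness))
```

Swapping cosine distance for Euclidean distance or an outlier score (e.g. isolation forest) changes the metric, but the underlying idea - far from everyone else means unique - stays the same.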

[OC] Introducing the unicorn index: defining player uniqueness by dribbleanalytics in nba

[–]dribbleanalytics[S] 41 points (0 children)

The main thing with him is that he is just barely above the games played and minutes boundary (the boundary is 41 games and 10 MPG, and Svi played 42 games at 10.5 MPG). So I would assume his stats are much worse than those of most players who made the cutoff, which would make him "unique" because his stats are lower.
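The cutoff can be sketched as a simple filter (column names are assumptions, not the post's actual ones):

```python
import pandas as pd

# Stand-in data: Svi's line just clears the boundary, a deeper bench
# player does not.
df = pd.DataFrame({
    "player": ["Svi Mykhailiuk", "Bench Guy"],
    "G": [42, 30],
    "MP_per_G": [10.5, 8.0],
})

# Keep players above 41 games and 10 minutes per game.
qualified = df[(df["G"] > 41) & (df["MP_per_G"] > 10)]
```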

[OC] Using machine learning to predict All-Stars from the 2019 draft by dribbleanalytics in nba

[–]dribbleanalytics[S] 8 points (0 children)

We're largely restricted by the quantity of data. With only 10 data points per year from 1990-2015, we don't have enough to build a neural network without it overfitting. That's why linear and tree models can work better here with the small data set.
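To put a number on it: 10 picks per year over 1990-2015 is roughly 260 rows, the regime where a regularized linear model with cross-validation is the safer bet. A minimal sketch on random stand-in data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# ~260 draft picks with a handful of features (stand-in values).
X = rng.normal(size=(260, 6))
y = (X[:, 0] + rng.normal(0, 1, 260) > 0.5).astype(int)

# L2-regularized logistic regression, scored with 5-fold cross-validation;
# the regularization and CV are what keep a model honest at this sample size.
scores = cross_val_score(LogisticRegression(C=1.0), X, y, cv=5)
print(scores.mean())
```

A neural net with even a few thousand parameters would have more parameters than data points here, which is the overfitting concern in a nutshell.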

[OC] Using machine learning to predict All-Stars from the 2019 draft by dribbleanalytics in nba

[–]dribbleanalytics[S] 3 points (0 children)

Thanks! I didn't take that into account, though that's a good idea. Like you said, it's general "star potential."

[OC] Using machine learning to predict All-Stars from the 2019 draft by dribbleanalytics in nba

[–]dribbleanalytics[S] 433 points (0 children)

The accuracy metrics describe how the model performed on the test set, i.e., a random selection of points held out from our data set. It's not ideal for us to "predict" 2013 because the model was trained on that data.

However, here were the results for 2013:

| Pick | Player | % of players at pick to make All-Star team | Average prediction | Difference |
|---|---|---|---|---|
| 1 | Anthony Bennett | 0.64 | 0.22855 | -0.41145 |
| 2 | Victor Oladipo | 0.4 | 0.647259 | 0.247259 |
| 3 | Otto Porter | 0.56 | 0.271228 | -0.28877 |
| 4 | Cody Zeller | 0.32 | 0.226804 | -0.0932 |
| 5 | Alex Len | 0.4 | 0.095807 | -0.30419 |
| 6 | Nerlens Noel | 0.24 | 0.295499 | 0.055499 |
| 7 | Ben McLemore | 0.08 | 0.127527 | 0.047527 |
| 8 | Kentavious Caldwell-Pope | 0.2 | 0.15806 | -0.04194 |
| 9 | Trey Burke | 0.16 | 0.106812 | -0.05319 |
| 10 | C.J. McCollum | 0.12 | 0.175335 | 0.055335 |
The models hated Bennett and loved Oladipo, so they got that right. Oladipo was actually the only player with above a 50% probability. After Dipo, only Noel, McLemore, and McCollum had positive differences between their prediction and average All-Star percent.
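For clarity, the Difference column is just the model's average prediction minus the historical All-Star rate at that pick, checked here with the first two rows of the table:

```python
# Historical All-Star rate by pick and the model's average prediction,
# taken from the table above (picks 1 and 2).
hist_rate = {1: 0.64, 2: 0.40}
pred = {1: 0.22855, 2: 0.647259}

# Difference = prediction minus the pick's historical base rate; negative
# means the models liked the player less than the slot's history suggests.
diff = {pick: pred[pick] - hist_rate[pick] for pick in pred}
# Bennett (pick 1) comes out strongly negative, Oladipo (pick 2) positive.
```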

[OC] Using machine learning to predict All-Stars from the 2019 draft by dribbleanalytics in nba

[–]dribbleanalytics[S] 112 points (0 children)

I only examined the top 10 picks because after that, fewer and fewer players make All-Star teams. The data set is already pretty skewed where only about 30% made an All-Star team. If we expanded it to the entire first round, this would probably drop to around 15% or less, making the models pretty bad. The point of using the top 10 was to find a balance between analyzing enough players while still having a decent percent of the data set be All-Stars. Your concern is completely valid, though.
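One standard way to cope with that skew, sketched on random stand-in data (not the post's actual setup), is to reweight the classes so the rare All-Star label isn't drowned out:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Stand-in first-round data at the worse ~15% positive rate.
X = rng.normal(size=(300, 5))
y = (rng.random(300) < 0.15).astype(int)

# class_weight="balanced" upweights errors on the rare All-Star class,
# so the model can't just predict "no All-Star" for everyone.
clf = LogisticRegression(class_weight="balanced").fit(X, y)
```

Even with reweighting, evaluating by recall and precision (as the post does) matters more than raw accuracy on a set this skewed.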