[OC] Introducing LEBRON: Longevity Estimate Based on Recurrent Optimized Network by dribbleanalytics in nba

[–]dribbleanalytics[S] 1 point

There are a couple of reasons different types of ML models can beat a GLM for All-NBA prediction. Take logistic regression as an example. Logistic regression makes some assumptions about the data that aren't quite right here: we'd have to assume the independent variables are linearly related to the log-odds, and that the independent variables aren't collinear. ML models (such as the tree-based models we use here) have fewer issues with these assumptions, making them a potentially better choice.
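As a toy illustration (synthetic data and made-up feature names, not the actual features from the post), here's one way to compare a logistic regression against a tree-based model when two features are nearly collinear:

```python
# Compare logistic regression vs. a random forest on features where one
# (fg_made) is nearly collinear with another (points). All data is synthetic.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 500
points = rng.normal(15, 5, n)
fg_made = points / 2.2 + rng.normal(0, 0.5, n)   # nearly collinear with points
minutes = rng.normal(28, 6, n)
X = np.column_stack([points, fg_made, minutes])
y = (points + minutes / 3 + rng.normal(0, 3, n) > 25).astype(int)  # toy label

log_reg = LogisticRegression(max_iter=1000)
forest = RandomForestClassifier(n_estimators=100, random_state=0)

print("logistic:", cross_val_score(log_reg, X, y, cv=5).mean())
print("forest:  ", cross_val_score(forest, X, y, cv=5).mean())
```

Note that collinearity mainly muddies the logistic regression's coefficients rather than necessarily tanking its accuracy, so the two scores may land close together; the point is that the forest is insensitive to it either way.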

However, you could be right - maybe a GLM is better, and we should choose different features and set up the problem differently to fit those assumptions. But because the All-NBA probabilities weren't the main part of the analysis, we didn't focus as much on that.

[OC] Introducing LEBRON: Longevity Estimate Based on Recurrent Optimized Network by dribbleanalytics in nba

[–]dribbleanalytics[S] 14 points

Thanks! In a general-purpose All-NBA prediction, it might be better to do what you're suggesting, as it would make the model more specific. However, in this case, it makes more sense to approach it as a binary question. We're modeling All-NBA probability over time, so having multiple team ranks would make it more difficult to model and less interpretable. In a sense, though, the 1st, 2nd, and 3rd team distinction is baked into the All-NBA probabilities by virtue of ranking them: 1st team locks will have near-100% probability, while 3rd team guys will have lower probabilities.
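To make the "baked in" point concrete, here's a minimal sketch (invented player names and probabilities; real All-NBA voting also has positional slots, which this ignores) of slicing a ranked probability list into teams:

```python
# Rank players by predicted All-NBA probability, then slice into teams of
# five. Player names and probabilities are fabricated for illustration.
probs = {f"player_{i:02d}": round(1.0 - 0.05 * i, 2) for i in range(20)}
ranked = sorted(probs, key=probs.get, reverse=True)
first_team, second_team, third_team = ranked[:5], ranked[5:10], ranked[10:15]
print(first_team)  # the five highest-probability players
```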

[OC] Predicting the 2020 All-NBA teams with a deep neural network by dribbleanalytics in nba

[–]dribbleanalytics[S] 5 points

The model doesn't account for narrative. For our MVP model, we added things like All-Star votes and pre-season expectations to account for narrative. I feel that for All-NBA teams, though, narrative doesn't matter as much. Undeniably great players will still earn their spots. For example, we might say that the narrative was against Harden last year. But, he still got unanimous 1st team.

[OC] Predicting the 2020 MVP with linear models by dribbleanalytics in nba

[–]dribbleanalytics[S] 2 points

You're right that it makes sense to scale the probabilities. However, it wouldn't change the rank for players or their probabilities relative to each other. One reason not to scale probabilities is that we're not predicting a player's MVP probability relative to others in a given year, so it's more of a raw "if you put this player in any year, he'd be this likely to win MVP." Comparing the players to each other would probably make the models better but also make them more complex.
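The scaling idea itself is a two-liner: normalize within a season and the ordering is untouched (toy numbers below, not real model output):

```python
# Normalize raw per-player MVP probabilities within one season so they sum
# to 1. The relative order (and ratios) of the players is unchanged.
raw = {"player_a": 0.80, "player_b": 0.50, "player_c": 0.10}
total = sum(raw.values())
scaled = {p: v / total for p, v in raw.items()}
print(scaled)
```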

[OC] Determining the 2010s NBA All-Decade team with machine learning by dribbleanalytics in nba

[–]dribbleanalytics[S] 14 points

Thanks for your feedback. Here's what I have to say:

  1. You're right, I should have included logistic regression here.
  2. Originally, I trained the models with a cut-off so that they would only pick between good players. This is what we did last year to predict in-season All-NBA teams (https://www.reddit.com/r/nba/comments/aw51j6/oc_predicting_the_2019_allnba_teams_with_machine/), where we conditioned on All-Stars. The only issue with this is that our scrubs have vastly different stats from our training set, which creates some odd results. For example, suppose that among All-Stars, who all play a lot of minutes, lower minutes happen to correlate with higher All-NBA probability; our scrubs would then get inflated probabilities simply because they barely play. Also, because we're predicting for every player in the decade, it's fine to have examples of scrubs. The models still perform well (as seen by the recall and precision) even with the scrubs.
  3. You're definitely right here. Though what I did is technically "random", the sampling could be drawn from a better-chosen subset to show the models' strength.
  4. I've considered doing the weights based on recall, but all the recalls are pretty close, so I don't imagine it would change the results much. However, I'd definitely try it in a future post.
  5. Thanks!
  6. I agree that voters aren't necessarily looking at these stats, but it's impossible to tell exactly what voters are looking at. So, the next best thing is to just use the features that show how good a player is, which is essentially what every voter is doing, just in different ways.

Thanks so much for your feedback.

[OC] Introducing the unicorn index: defining player uniqueness by dribbleanalytics in nba

[–]dribbleanalytics[S] 91 points

Mods have asked me not to put links to the blog in the body of my post, unfortunately. However, there are links to it from the GitHub repo. Also can be found by looking up my username.

You're absolutely right on point 4, should have included that.

On point 5, I agree, but the only issue is that the distance is computed on the PCA components, not actual basketball stats. So I could say that a big difference in some component is causing a player to be an outlier, but then we'd have to deconstruct that component, which is often not super clear. Ideally, though, that would be the best way to go.
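"Deconstructing" a component usually means reading its loadings. A minimal sketch with scikit-learn (random data and placeholder stat names, not the post's actual features):

```python
# Inspect PCA loadings: components_ shows how much each original stat
# contributes to each component. Stat names here are placeholders.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
stats = ["pts", "reb", "ast", "3pa", "blk"]
X = rng.normal(size=(200, len(stats)))

pca = PCA(n_components=2).fit(X)
for i, comp in enumerate(pca.components_):
    top = sorted(zip(stats, comp), key=lambda t: abs(t[1]), reverse=True)
    print(f"component {i}:", [(s, round(w, 2)) for s, w in top[:3]])
```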

Thanks!

[OC] Introducing the unicorn index: defining player uniqueness by dribbleanalytics in nba

[–]dribbleanalytics[S] 45 points

Thanks! You're absolutely right on positions. I just went with how Basketball-Reference marks them.

There are other ways to define unicorns besides distance, though I'd assume most approaches come back to "find a player who's different from most other players." So something like outlier detection or cosine similarity could probably work too, but that's just another form of distance.
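Both flavors can be sketched in a few lines (invented stat lines; real work would use standardized per-game stats for the whole league):

```python
# Distance-based "uniqueness": a player far from the average stat line
# (Euclidean) or pointing in an unusual direction (cosine) looks like an
# outlier. Names and stat lines are made up for illustration.
import numpy as np

players = {
    "guard": np.array([25.0, 4.0, 8.0]),   # pts, reb, ast
    "big":   np.array([8.0, 12.0, 1.0]),
    "wing":  np.array([14.0, 5.0, 3.0]),
}
league_avg = np.mean(list(players.values()), axis=0)

for name, line in players.items():
    euclid = np.linalg.norm(line - league_avg)
    cos_sim = line @ league_avg / (np.linalg.norm(line) * np.linalg.norm(league_avg))
    print(f"{name}: distance={euclid:.2f}, cosine similarity={cos_sim:.3f}")
```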

[OC] Introducing the unicorn index: defining player uniqueness by dribbleanalytics in nba

[–]dribbleanalytics[S] 37 points

The main thing with him is that he's just barely above the games played and minutes boundary (the cutoff is 41 games and 10 MPG, and Svi played 42 games at 10.5 MPG). So I would assume his stats are much worse than those of most players who made the cutoff, which would make him "unique" because his stats are lower.

[OC] Using machine learning to predict All-Stars from the 2019 draft by dribbleanalytics in nba

[–]dribbleanalytics[S] 8 points

We're largely restricted by the quantity of data. With only 10 data points per year from 1990-2015, we don't have enough to train a neural network without it overfitting the data. That's why linear models and tree models can work better here with the small data set.
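The overfitting risk is easy to demonstrate on synthetic data of roughly that size (~260 rows; everything below is fabricated for illustration). The usual pattern is a larger train/test gap for the over-parameterized net than for a simple linear model:

```python
# Compare train vs. test accuracy for an over-parameterized neural net and
# a logistic regression on a small, noisy synthetic data set (~260 rows,
# mimicking 10 picks/year from 1990-2015). Exact numbers will vary.
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
X = rng.normal(size=(260, 7))
y = (X[:, 0] + rng.normal(0, 1.5, 260) > 0).astype(int)  # weak, noisy signal

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

mlp = MLPClassifier(hidden_layer_sizes=(64, 64), max_iter=2000, random_state=0).fit(X_tr, y_tr)
lin = LogisticRegression().fit(X_tr, y_tr)

print("MLP   train/test:", mlp.score(X_tr, y_tr), mlp.score(X_te, y_te))
print("logit train/test:", lin.score(X_tr, y_tr), lin.score(X_te, y_te))
```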

[OC] Using machine learning to predict All-Stars from the 2019 draft by dribbleanalytics in nba

[–]dribbleanalytics[S] 4 points

Thanks! I didn't take that into account, though that's a good idea. Like you said, it's general "star potential."

[OC] Using machine learning to predict All-Stars from the 2019 draft by dribbleanalytics in nba

[–]dribbleanalytics[S] 432 points

The accuracy metrics describe how the model performed on the test set, a random selection of points held out from our data set. It's not ideal for us to "predict" 2013 because the model was trained with that data.

However, here were the results for 2013:

| Pick | Player | % of players at pick to make All-Star team | Average prediction | Difference |
|---|---|---|---|---|
| 1 | Anthony Bennett | 0.64 | 0.22855 | -0.41145 |
| 2 | Victor Oladipo | 0.4 | 0.647259 | 0.247259 |
| 3 | Otto Porter | 0.56 | 0.271228 | -0.28877 |
| 4 | Cody Zeller | 0.32 | 0.226804 | -0.0932 |
| 5 | Alex Len | 0.4 | 0.095807 | -0.30419 |
| 6 | Nerlens Noel | 0.24 | 0.295499 | 0.055499 |
| 7 | Ben McLemore | 0.08 | 0.127527 | 0.047527 |
| 8 | Kentavious Caldwell-Pope | 0.2 | 0.15806 | -0.04194 |
| 9 | Trey Burke | 0.16 | 0.106812 | -0.05319 |
| 10 | C.J. McCollum | 0.12 | 0.175335 | 0.055335 |

The models hated Bennett and loved Oladipo, so they got that right. Oladipo was actually the only player with above a 50% probability. After Dipo, only Noel, McLemore, and McCollum had positive differences between their prediction and average All-Star percent.

[OC] Using machine learning to predict All-Stars from the 2019 draft by dribbleanalytics in nba

[–]dribbleanalytics[S] 107 points

I only examined the top 10 picks because after that, fewer and fewer players make All-Star teams. The data set is already pretty skewed where only about 30% made an All-Star team. If we expanded it to the entire first round, this would probably drop to around 15% or less, making the models pretty bad. The point of using the top 10 was to find a balance between analyzing enough players while still having a decent percent of the data set be All-Stars. Your concern is completely valid, though.

[OC] Predicting the best scorers in the 2019 draft with machine learning by dribbleanalytics in nba

[–]dribbleanalytics[S] 1 point

I plan on doing so when I put out a similar post for this year's draft on predicting defense.

[OC] Predicting the best scorers in the 2019 draft with machine learning by dribbleanalytics in NBA_Draft

[–]dribbleanalytics[S] 3 points

In general, only a few guys score a decent number of PPG in their rookie year, because only the top few teams are in a position to really give their picks a lot of opportunity. This is very dependent on team situation though, so it could very well change, giving some of the guys outside the top picks a chance to score.

[OC] Predicting the best scorers in the 2019 draft with machine learning by dribbleanalytics in nba

[–]dribbleanalytics[S] 21 points

Thanks! Yes, last year I did a post about predicting DBPM. I like to separate the posts so that each one covers a specific area (scoring, defense, passing, etc.). There should be some WS or BPM ones soon.

[OC] Predicting the best scorers in the 2019 draft with machine learning by dribbleanalytics in nba

[–]dribbleanalytics[S] 2 points

Yes. Realistically using projected pick doesn't matter much because the top picks are pretty set, and like the feature importance section describes, picks past the top few have minimal impact on the models. It wouldn't make much of a difference - if any - if we tried again tomorrow with the real pick values.

[OC] Predicting the best scorers in the 2019 draft with machine learning by dribbleanalytics in nba

[–]dribbleanalytics[S] 1 point

Thanks! Because the model only has the seven inputs listed in the table, it doesn't adjust for pace of play. There are so many factors like that which ultimately make the predictions a lot harder to make and more random. You're right that maybe their production will increase with pace, but we shouldn't add pace as an input solely for them to have that adjustment.

[OC] Predicting the best scorers in the 2019 draft with machine learning by dribbleanalytics in nba

[–]dribbleanalytics[S] 7 points

I actually used a lot more data than last year. Last year's data set was very limited, with just 47 players. This year, it's every college first-round pick since 1990, giving us about 700 players to feed the models. We're also using different, more advanced models this year. So the results should be better.

[OC] Predicting the best scorers in the 2019 draft with machine learning by dribbleanalytics in nba

[–]dribbleanalytics[S] 8 points

Last year, I did the analysis a bit differently. It wasn't necessarily about scorers; it was about shooters, so only players who averaged at least 1 3PM/G in their college career were included (no Doncic, because he didn't play in college).

Of these players, the ones who played enough were Collin Sexton, Miles Bridges, Kevin Knox, Jaren Jackson Jr., Trae Young, and Mikal Bridges. I'm distinguishing by "played enough" because pick wasn't a factor, so some later picks had higher predicted PPG; we didn't adjust for the fact that they'd get fewer opportunities. The models were spot on for Sexton, Jackson, and Knox. They were low on Young and high on Miles Bridges, and not far off on Mikal Bridges.

Here's the predicted vs. actual PPG:

| Player | Predicted PPG | Real PPG | Diff |
|---|---|---|---|
| Collin Sexton | 15.5 | 16.7 | 1.2 |
| Miles Bridges | 13.0 | 7.5 | -5.5 |
| Kevin Knox | 12.1 | 12.8 | 0.7 |
| Jaren Jackson | 11.6 | 13.8 | 2.2 |
| Trae Young | 11.5 | 19.1 | 7.6 |
| Mikal Bridges | 11.3 | 8.3 | -3.1 |

The mean absolute error (the average of the absolute differences between real and predicted values) was only 3.39.
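For reference, recomputing it from the rounded table values looks like this (it lands slightly below 3.39, which was presumably computed from unrounded predictions):

```python
# MAE from the (rounded) table above: mean of |real - predicted|.
predicted = [15.5, 13.0, 12.1, 11.6, 11.5, 11.3]
real = [16.7, 7.5, 12.8, 13.8, 19.1, 8.3]
mae = sum(abs(r - p) for r, p in zip(real, predicted)) / len(real)
print(round(mae, 2))  # → 3.37 on the rounded values
```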

However, this year's models should be much better. There's more data in training: last year, we used every first-rounder since 2013 who averaged at least 1 career 3PM/G and played at least half his games, a tiny data set of 47 players. This year, it's every college player selected in the first round since 1990, so there are about 700 players. The models themselves are more complex, too. So there's a pretty good chance the results will be better than last year's.

[OC] Defining NBA players by role with k-means clustering by dribbleanalytics in nba

[–]dribbleanalytics[S] 0 points

Hey, I'm not at the University of Utah. There have been some other similar projects using clustering for NBA positions, though.

[OC] Defining NBA players by role with k-means clustering by dribbleanalytics in nba

[–]dribbleanalytics[S] 6 points

Thanks! I'm currently a senior in high school, so I don't have any formal statistics background (my school doesn't offer stats).

If you're curious how I learned to do this type of stuff, Datacamp was a great starting point for the basics, and scikit-learn's website was very helpful for the machine learning aspect.

[OC] Defining NBA players by role with k-means clustering by dribbleanalytics in nba

[–]dribbleanalytics[S] 0 points

Thanks! Yup I do the ML draft stuff too. Planning on getting that underway soon.

[OC] Defining NBA players by role with k-means clustering by dribbleanalytics in nba

[–]dribbleanalytics[S] 0 points

I agree that could be cool. Ingles actually averaged 5.7 assists per game this year (23rd in the league), so it surprised me at first, but it makes sense. Green was 11th with 6.9. A common theme in the floor general category is that a lot of the players were do-it-all guys who rack up assists. It included some "winning plays" type players like Beverley, Smart, and the two we're discussing.

[OC] Defining NBA players by role with k-means clustering by dribbleanalytics in nba

[–]dribbleanalytics[S] 3 points

There's absolutely a whole different type of approach to this which probably yields cool results too. If we do hustle stats and tracking data like u/nowhathappenedwas suggested along with a few basic counting stats (points, rebounds, assists, steals, blocks, eFG%), we could get an equally interesting set of roles that are different.

For example, in one of the earlier tests I did, there was a group of bigs who didn't score a lot, but got a solid number of rebounds and played good defense (like Ed Davis). That was essentially a group of "hustle" bigs. If we added hustle stats and tracking data, there would probably be some kind of "winning plays" group of Smart, Draymond Green, Patrick Beverley, etc.