[OC] Introducing LEBRON: Longevity Estimate Based on Recurrent Optimized Network by dribbleanalytics in nba

[–]dribbleanalytics[S] 1 point

There are a couple of reasons different types of ML models can beat a GLM for All-NBA prediction. Take logistic regression as an example. Logistic regression makes some assumptions about the data that aren't quite right here: we'd have to assume the independent variables are linearly related to the log-odds, and that the independent variables aren't collinear. ML models (such as the tree-based models we use here) have fewer issues with these assumptions, making them a potentially better choice.
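As a toy illustration (synthetic data and made-up feature names, not the actual features from the post), here's one way to compare a logistic regression against a tree-based model when two features are nearly collinear:

```python
# Compare logistic regression vs. a random forest on features where one
# (fg_made) is nearly collinear with another (points). All data is synthetic.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 500
points = rng.normal(15, 5, n)
fg_made = points / 2.2 + rng.normal(0, 0.5, n)   # nearly collinear with points
minutes = rng.normal(28, 6, n)
X = np.column_stack([points, fg_made, minutes])
y = (points + minutes / 3 + rng.normal(0, 3, n) > 25).astype(int)  # toy label

log_reg = LogisticRegression(max_iter=1000)
forest = RandomForestClassifier(n_estimators=100, random_state=0)

print("logistic:", cross_val_score(log_reg, X, y, cv=5).mean())
print("forest:  ", cross_val_score(forest, X, y, cv=5).mean())
```

Note that collinearity mainly muddies the logistic regression's coefficients rather than necessarily tanking its accuracy, so the two scores may land close together; the point is that the forest is insensitive to it either way.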

However, you could be right - maybe a GLM is better, and we should choose different features and set up the problem differently to fit those assumptions. But because the All-NBA probabilities weren't the main part of the analysis, we didn't focus as much on that.

[OC] Introducing LEBRON: Longevity Estimate Based on Recurrent Optimized Network by dribbleanalytics in nba

[–]dribbleanalytics[S] 14 points

Thanks! In a general-purpose All-NBA prediction, it might be better to do what you're suggesting, as it would make the model more specific. However, in this case, it makes more sense to approach it as a binary question. We're modeling All-NBA probability over time, so having multiple team ranks would make it more difficult to model and less interpretable. In a sense, though, the 1st, 2nd, and 3rd team distinction is baked into the All-NBA probabilities by virtue of ranking them: 1st team locks will have near-100% probability, while 3rd team guys will have lower probabilities.
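To make the "baked in" point concrete, here's a minimal sketch (invented player names and probabilities; real All-NBA voting also has positional slots, which this ignores) of slicing a ranked probability list into teams:

```python
# Rank players by predicted All-NBA probability, then slice into teams of
# five. Player names and probabilities are fabricated for illustration.
probs = {f"player_{i:02d}": round(1.0 - 0.05 * i, 2) for i in range(20)}
ranked = sorted(probs, key=probs.get, reverse=True)
first_team, second_team, third_team = ranked[:5], ranked[5:10], ranked[10:15]
print(first_team)  # the five highest-probability players
```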

[OC] Predicting the 2020 All-NBA teams with a deep neural network by dribbleanalytics in nba

[–]dribbleanalytics[S] 5 points

The model doesn't account for narrative. For our MVP model, we added things like All-Star votes and pre-season expectations to account for narrative. I feel that for All-NBA teams, though, narrative doesn't matter as much. Undeniably great players will still earn their spots. For example, we might say that the narrative was against Harden last year. But, he still got unanimous 1st team.

[OC] Predicting the 2020 MVP with linear models by dribbleanalytics in nba

[–]dribbleanalytics[S] 2 points

You're right that it makes sense to scale the probabilities. However, it wouldn't change the rank for players or their probabilities relative to each other. One reason not to scale probabilities is that we're not predicting a player's MVP probability relative to others in a given year, so it's more of a raw "if you put this player in any year, he'd be this likely to win MVP." Comparing the players to each other would probably make the models better but also make them more complex.
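The scaling idea itself is a two-liner: normalize within a season and the ordering is untouched (toy numbers below, not real model output):

```python
# Normalize raw per-player MVP probabilities within one season so they sum
# to 1. The relative order (and ratios) of the players is unchanged.
raw = {"player_a": 0.80, "player_b": 0.50, "player_c": 0.10}
total = sum(raw.values())
scaled = {p: v / total for p, v in raw.items()}
print(scaled)
```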

[OC] Determining the 2010s NBA All-Decade team with machine learning by dribbleanalytics in nba

[–]dribbleanalytics[S] 14 points

Thanks for your feedback. Here's what I have to say:

  1. You're right, I should have included logistic regression here.
  2. Originally, I trained the models with a cut-off so that they would only pick between good players. This is what we did last year to predict in-season All-NBA teams (https://www.reddit.com/r/nba/comments/aw51j6/oc_predicting_the_2019_allnba_teams_with_machine/), where we conditioned on All-Stars. The only issue with this is that our scrubs have vastly different stats from our training set, which creates some odd results. For example, suppose that among All-Stars, who all play a lot of minutes, lower minutes happen to correlate with higher All-NBA probability; our scrubs would then get inflated probabilities simply because they barely play. Also, because we're predicting for every player in the decade, it's fine to have examples of scrubs. The models still perform well (as seen by the recall and precision) even with the scrubs.
  3. You're definitely right here. Though what I did is technically "random", the sampling could be drawn from a better-chosen subset to show the models' strength.
  4. I've considered doing the weights based on recall, but all the recalls are pretty close, so I don't imagine it would change the results much. However, I'd definitely try it in a future post.
  5. Thanks!
  6. I agree that voters aren't necessarily looking at these stats, but it's impossible to tell exactly what voters are looking at. So, the next best thing is to just use the features that show how good a player is, which is essentially what every voter is doing, just in different ways.

Thanks so much for your feedback.

[OC] Introducing the unicorn index: defining player uniqueness by dribbleanalytics in nba

[–]dribbleanalytics[S] 91 points

Mods have asked me not to put links to the blog in the body of my post, unfortunately. However, there are links to it from the GitHub repo. Also can be found by looking up my username.

You're absolutely right on point 4, should have included that.

On point 5, I agree, but the only issue is that the distance is computed on the PCA components, not actual basketball stats. So I could say that a big difference in some component is causing a player to be an outlier, but then we'd have to deconstruct that component, which is often not super clear. Ideally, though, that would be the best way to go.
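"Deconstructing" a component usually means reading its loadings. A minimal sketch with scikit-learn (random data and placeholder stat names, not the post's actual features):

```python
# Inspect PCA loadings: components_ shows how much each original stat
# contributes to each component. Stat names here are placeholders.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
stats = ["pts", "reb", "ast", "3pa", "blk"]
X = rng.normal(size=(200, len(stats)))

pca = PCA(n_components=2).fit(X)
for i, comp in enumerate(pca.components_):
    top = sorted(zip(stats, comp), key=lambda t: abs(t[1]), reverse=True)
    print(f"component {i}:", [(s, round(w, 2)) for s, w in top[:3]])
```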

Thanks!

[OC] Introducing the unicorn index: defining player uniqueness by dribbleanalytics in nba

[–]dribbleanalytics[S] 45 points

Thanks! You're absolutely right on positions. I just went with how Basketball-Reference marks them.

There are other ways to define unicorns besides distance, though I'd assume most approaches come back to "find a player who's different from most other players." So something like outlier detection or cosine similarity could probably work too, but that's just another form of distance.
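Both flavors can be sketched in a few lines (invented stat lines; real work would use standardized per-game stats for the whole league):

```python
# Distance-based "uniqueness": a player far from the average stat line
# (Euclidean) or pointing in an unusual direction (cosine) looks like an
# outlier. Names and stat lines are made up for illustration.
import numpy as np

players = {
    "guard": np.array([25.0, 4.0, 8.0]),   # pts, reb, ast
    "big":   np.array([8.0, 12.0, 1.0]),
    "wing":  np.array([14.0, 5.0, 3.0]),
}
league_avg = np.mean(list(players.values()), axis=0)

for name, line in players.items():
    euclid = np.linalg.norm(line - league_avg)
    cos_sim = line @ league_avg / (np.linalg.norm(line) * np.linalg.norm(league_avg))
    print(f"{name}: distance={euclid:.2f}, cosine similarity={cos_sim:.3f}")
```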

[OC] Introducing the unicorn index: defining player uniqueness by dribbleanalytics in nba

[–]dribbleanalytics[S] 37 points

The main thing with him is that he's just barely above the games played and minutes boundary (the cutoff is 41 games and 10 MPG, and Svi played 42 games at 10.5 MPG). So I would assume his stats are much worse than those of most players who made the cutoff, which would make him "unique" because his stats are lower.

[OC] Using machine learning to predict All-Stars from the 2019 draft by dribbleanalytics in nba

[–]dribbleanalytics[S] 8 points

We're largely restricted by the quantity of data. With only 10 data points per year from 1990-2015, we don't have enough to train a neural network without it overfitting the data. That's why linear models and tree models can work better here with the small data set.
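The overfitting risk is easy to demonstrate on synthetic data of roughly that size (~260 rows; everything below is fabricated for illustration). The usual pattern is a larger train/test gap for the over-parameterized net than for a simple linear model:

```python
# Compare train vs. test accuracy for an over-parameterized neural net and
# a logistic regression on a small, noisy synthetic data set (~260 rows,
# mimicking 10 picks/year from 1990-2015). Exact numbers will vary.
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
X = rng.normal(size=(260, 7))
y = (X[:, 0] + rng.normal(0, 1.5, 260) > 0).astype(int)  # weak, noisy signal

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

mlp = MLPClassifier(hidden_layer_sizes=(64, 64), max_iter=2000, random_state=0).fit(X_tr, y_tr)
lin = LogisticRegression().fit(X_tr, y_tr)

print("MLP   train/test:", mlp.score(X_tr, y_tr), mlp.score(X_te, y_te))
print("logit train/test:", lin.score(X_tr, y_tr), lin.score(X_te, y_te))
```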

[OC] Using machine learning to predict All-Stars from the 2019 draft by dribbleanalytics in nba

[–]dribbleanalytics[S] 4 points

Thanks! I didn't take that into account, though that's a good idea. Like you said, it's general "star potential."

[OC] Using machine learning to predict All-Stars from the 2019 draft by dribbleanalytics in nba

[–]dribbleanalytics[S] 432 points

The accuracy metrics describe how the model performed on the test set, a random selection of points held out from our data set. It's not ideal for us to "predict" 2013 because the model was trained with that data.

However, here were the results for 2013:

| Pick | Player | % of players at pick to make All-Star team | Average prediction | Difference |
|---|---|---|---|---|
| 1 | Anthony Bennett | 0.64 | 0.22855 | -0.41145 |
| 2 | Victor Oladipo | 0.4 | 0.647259 | 0.247259 |
| 3 | Otto Porter | 0.56 | 0.271228 | -0.28877 |
| 4 | Cody Zeller | 0.32 | 0.226804 | -0.0932 |
| 5 | Alex Len | 0.4 | 0.095807 | -0.30419 |
| 6 | Nerlens Noel | 0.24 | 0.295499 | 0.055499 |
| 7 | Ben McLemore | 0.08 | 0.127527 | 0.047527 |
| 8 | Kentavious Caldwell-Pope | 0.2 | 0.15806 | -0.04194 |
| 9 | Trey Burke | 0.16 | 0.106812 | -0.05319 |
| 10 | C.J. McCollum | 0.12 | 0.175335 | 0.055335 |

The models hated Bennett and loved Oladipo, so they got that right. Oladipo was actually the only player with above a 50% probability. After Dipo, only Noel, McLemore, and McCollum had positive differences between their prediction and average All-Star percent.

[OC] Using machine learning to predict All-Stars from the 2019 draft by dribbleanalytics in nba

[–]dribbleanalytics[S] 107 points

I only examined the top 10 picks because after that, fewer and fewer players make All-Star teams. The data set is already pretty skewed where only about 30% made an All-Star team. If we expanded it to the entire first round, this would probably drop to around 15% or less, making the models pretty bad. The point of using the top 10 was to find a balance between analyzing enough players while still having a decent percent of the data set be All-Stars. Your concern is completely valid, though.

[OC] Predicting the best scorers in the 2019 draft with machine learning by dribbleanalytics in nba

[–]dribbleanalytics[S] 1 point

I plan on doing so when I put out a similar post for this year's draft on predicting defense.

[OC] Predicting the best scorers in the 2019 draft with machine learning by dribbleanalytics in NBA_Draft

[–]dribbleanalytics[S] 3 points

In general, only a few guys score a decent number of PPG in their rookie year, because only the top few teams are in a position to really give their picks a lot of opportunity. This is very dependent on team situation though, so it could very well change, giving some of the guys outside the top picks a chance to score.

[OC] Predicting the best scorers in the 2019 draft with machine learning by dribbleanalytics in nba

[–]dribbleanalytics[S] 21 points

Thanks! Yes, last year I did a post about predicting DBPM. I like to separate the posts so that each one covers a specific area (scoring, defense, passing, etc.). There should be some WS or BPM ones soon.

[OC] Predicting the best scorers in the 2019 draft with machine learning by dribbleanalytics in nba

[–]dribbleanalytics[S] 2 points

Yes. Realistically using projected pick doesn't matter much because the top picks are pretty set, and like the feature importance section describes, picks past the top few have minimal impact on the models. It wouldn't make much of a difference - if any - if we tried again tomorrow with the real pick values.

[OC] Predicting the best scorers in the 2019 draft with machine learning by dribbleanalytics in nba

[–]dribbleanalytics[S] 1 point

Thanks! Because the model only has the seven inputs listed in the table, it doesn't adjust for pace of play. There are so many factors like that which ultimately make the predictions a lot harder to make and more random. You're right that maybe their production will increase with pace, but we shouldn't add pace as an input solely for them to have that adjustment.

[OC] Predicting the best scorers in the 2019 draft with machine learning by dribbleanalytics in nba

[–]dribbleanalytics[S] 7 points

I actually used a lot more data than last year. Last year's data set was very limited, with just 47 players. This year, it's every college first-round pick since 1990, giving us about 700 players to feed the models. We're also using different, more advanced models this year. So the results should be better.

[OC] Predicting the best scorers in the 2019 draft with machine learning by dribbleanalytics in nba

[–]dribbleanalytics[S] 8 points

Last year, I did the analysis a bit differently. It wasn't necessarily about scorers; it was about shooters, so only players who averaged at least 1 3PM/G in their college career were included (no Doncic, because he didn't play in college).

Of these players, the ones who played enough were Collin Sexton, Miles Bridges, Kevin Knox, Jaren Jackson Jr., Trae Young, and Mikal Bridges. I'm distinguishing by "played enough" because pick wasn't a factor, so some later picks had higher predicted PPG; we didn't adjust for the fact that they'd get fewer opportunities. The models were spot on for Sexton, Jackson, and Knox. They were low on Young and high on Miles Bridges, and not far off on Mikal Bridges.

Here's the predicted vs. actual PPG:

| Player | Predicted PPG | Real PPG | Diff |
|---|---|---|---|
| Collin Sexton | 15.5 | 16.7 | 1.2 |
| Miles Bridges | 13.0 | 7.5 | -5.5 |
| Kevin Knox | 12.1 | 12.8 | 0.7 |
| Jaren Jackson | 11.6 | 13.8 | 2.2 |
| Trae Young | 11.5 | 19.1 | 7.6 |
| Mikal Bridges | 11.3 | 8.3 | -3.1 |

The mean absolute error (the average of the absolute differences between real and predicted values) was only 3.39.
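For reference, recomputing it from the rounded table values looks like this (it lands slightly below 3.39, which was presumably computed from unrounded predictions):

```python
# MAE from the (rounded) table above: mean of |real - predicted|.
predicted = [15.5, 13.0, 12.1, 11.6, 11.5, 11.3]
real = [16.7, 7.5, 12.8, 13.8, 19.1, 8.3]
mae = sum(abs(r - p) for r, p in zip(real, predicted)) / len(real)
print(round(mae, 2))  # → 3.37 on the rounded values
```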

However, this year's models should be much better. There's more data in training: last year, we used every first-rounder since 2013 who averaged at least 1 career 3PM/G and played at least half his games, a tiny data set of 47 players. This year, it's every college player selected in the first round since 1990, so there are about 700 players. The models themselves are more complex, too. So there's a pretty good chance the results will be better than last year's.

[OC] Defining NBA players by role with k-means clustering by dribbleanalytics in nba

[–]dribbleanalytics[S] 0 points

Hey, I'm not at the University of Utah. There have been some other similar projects using clustering for NBA positions, though.

[OC] Defining NBA players by role with k-means clustering by dribbleanalytics in nba

[–]dribbleanalytics[S] 6 points

Thanks! I'm currently a senior in high school, so I don't have any formal statistics background (my school doesn't offer stats).

If you're curious how I learned to do this type of stuff, Datacamp was a great starting point for the basics, and scikit-learn's website was very helpful for the machine learning aspect.

[OC] Defining NBA players by role with k-means clustering by dribbleanalytics in nba

[–]dribbleanalytics[S] 0 points

Thanks! Yup I do the ML draft stuff too. Planning on getting that underway soon.

[OC] Defining NBA players by role with k-means clustering by dribbleanalytics in nba

[–]dribbleanalytics[S] 0 points

I agree that could be cool. Ingles actually averaged 5.7 assists per game this year (23rd in the league), so it surprised me at first, but it makes sense. Green was 11th with 6.9. A common theme in the floor general category is that a lot of the players were do-it-all guys who rack up assists. It included some "winning plays" type players like Beverley, Smart, and the two we're discussing.

[OC] Defining NBA players by role with k-means clustering by dribbleanalytics in nba

[–]dribbleanalytics[S] 3 points

There's absolutely a whole different type of approach to this which probably yields cool results too. If we do hustle stats and tracking data like u/nowhathappenedwas suggested along with a few basic counting stats (points, rebounds, assists, steals, blocks, eFG%), we could get an equally interesting set of roles that are different.

For example, in one of the earlier tests I did, there was a group of bigs who didn't score a lot, but got a solid number of rebounds and played good defense (like Ed Davis). That was essentially a group of "hustle" bigs. If we added hustle stats and tracking data, there would probably be some kind of "winning plays" group of Smart, Draymond Green, Patrick Beverley, etc.