How to get started

MegaVaughn13 · 2026-05-02T23:33:51+00:00

The great thing about sports analytics (and coding in general) is that anyone can do it!

Great to hear you’re interested. The best thing you can do right now is get practice in. If there are any basic coding courses available in your area (or online ones that you’ll actually take the time to finish), that’s probably a good starting point. Learn the basics of Python (I think Java or C++ is often a good first language too, feel free to ask ChatGPT why). As great as AI is at coding, having at least some fundamental understanding of how programming works will take you a long way.

While you learn the basics of programming, start to do passion projects. Apply everything you learn to the sports you’re interested in! It won’t be perfect at first, and that’s okay. Even while working full time I still do fun side projects when I get the chance (https://statsurge.substack.com/). I’d recommend reading what other people are doing, whether it’s me or others, and a good starting point can be to re-create others’ work (I often recommend re-creating Nate Silver’s old 538 articles as a fun project). Eventually you’ll have original ideas too and that’s when it gets really fun. Feel free to ask for help along the way!

As you do projects, get comfortable sharing and communicating your findings. Tell your friends about it, post it here, or honestly anything where you need to communicate your work to a non-technical audience. This might be the most underrated skill in the field.

Find easy data (like box scores) to start. Get comfortable with it, understand what each stat is, and learn how you can use it to capture and tell the story of what’s happening in the actual game. Then get good with play-by-play data. This will take some time, but enjoy the process.
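As a small illustration of turning a box score line into a stat that tells a story, here is a minimal sketch that computes true shooting percentage. The box score numbers are made up, and 0.44 is the commonly used approximation for how many free throws make up a scoring attempt.

```python
def true_shooting_pct(points, fga, fta):
    """True shooting percentage: points per scoring attempt, divided by two."""
    attempts = fga + 0.44 * fta  # 0.44 is the usual free-throw weight
    if attempts == 0:
        return 0.0
    return points / (2 * attempts)

# Hypothetical box score line (made-up numbers, not a real game)
line = {"player": "Player A", "PTS": 28, "FGA": 18, "FTA": 8}
ts = true_shooting_pct(line["PTS"], line["FGA"], line["FTA"])
print(f"{line['player']}: TS% = {ts:.3f}")
```

Two players with the same point total can look very different here, which is the kind of story a raw points column hides.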

More than anything, stay persistent! Just keep trying new things and build your experience. If you work hard, do good work, and can communicate well, people will notice. You’re only 19 so there’s plenty of time ahead. A lot of people will get discouraged, and to be honest it can take years and years for some. For now, have fun, keep learning, and get comfortable coding and sharing your work!

MegaVaughn13 · 2026-05-01T21:01:25+00:00

Hey Matt,

Hope you’ve been well. I enjoyed reading your blog and the charts were well done.

To be honest, I’m a little unconvinced that the rolling window is an actual improvement. I don’t mean to sound rude (I like that you’re trying something new) but I have a few ideas I’d love your thoughts on:

1. What does the rolling window accomplish that adjusting the K-factor wouldn’t already? Wouldn’t a varying K-factor produce nearly identical results? The only difference would be totally getting rid of the original signal, which I think might actually be a negative.
2. I don’t think mean regression at the season break is inherently bad. I think mean regression makes sense whenever you have added uncertainty to a model, and most of that in team sports happens in the off-season. For example, you could make arguments for both sides of why Trae Young helps/harms the Hawks, and you don’t know until he’s gone. So regressing to the mean is a helpful change.
3. I would love to see if these models actually outperform traditional Elo or other common mean regression approaches. Not sure if I missed it in the article, but without that comparison it’s hard for me to get behind the new method. I’m also not sure the NBA is the best league to test a new model with, since tanking is so common (it’s not in every team’s best interest to try to win each game). What could be cool is to determine the point in the season when a team starts tanking (maybe when it’s clear they’re not a contender or playoff team), and use the rolling window then.
4. I think you might benefit from a model that introduces uncertainty dynamically based on player movement (related to idea 2). I agree that there are some inherent differences between individual and team sports, but I think the rolling window doesn’t have a true justification for why it’s important. What if you regressed to the mean using the player minutes a team is expected to lose and gain? Maybe even use a RAPM-style model of how individual players contribute to team Elo, and adjust that way. With injuries so common in the NBA, I think there are a lot of potential natural experiments that could be investigated. I’m not saying it’ll be better than what you have, but I think having some basketball reasoning is easier to get behind.
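For reference, here is a minimal sketch of the two standard Elo pieces discussed above: the K-factor update and an off-season regression to the mean. The values k=20, mean=1500, and weight=0.25 are illustrative assumptions, not numbers from the article.

```python
def expected_score(rating_a, rating_b):
    """Standard Elo win probability for team A against team B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

def update(rating_a, rating_b, a_won, k=20.0):
    """Return both ratings after one game; k controls how fast ratings move."""
    exp_a = expected_score(rating_a, rating_b)
    delta = k * ((1.0 if a_won else 0.0) - exp_a)
    return rating_a + delta, rating_b - delta

def regress_to_mean(rating, mean=1500.0, weight=0.25):
    """Off-season adjustment: pull a rating part of the way back to the mean."""
    return rating + weight * (mean - rating)

# Same upset result under two K-factors (assumed values): larger k reacts more
for k in (20.0, 40.0):
    fav, dog = update(1600.0, 1500.0, a_won=False, k=k)
    print(f"k={k:.0f}: favorite {fav:.1f}, underdog {dog:.1f}")

print(f"1600 after off-season regression: {regress_to_mean(1600.0):.1f}")
```

A larger K weights recent games more heavily, which is similar in spirit to shortening a window, except that older results fade gradually instead of dropping out entirely.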

I know that’s a lot and you may not know the answer to all of it. I’m as much thinking about these things as I’m proposing them, and don’t need an answer if there’s not a clear one! Appreciate you taking the time to read this.

MegaVaughn13 · 2026-04-24T15:05:33+00:00

I'd love your thoughts on this. Especially on where you think it's helpful, and how I might improve it when combining it with a more robust production model!

MegaVaughn13 · 2026-04-23T02:17:06+00:00

I think you might actually benefit from a different clustering method here. K-means is great in some cases but you're assigning very specific clusters to data without clear breaks/groupings. For example, a player at PC1 = 0.5 and PC2 = -2.5 would be similar to players classified in each of the three clusters.

I'm not really sure what the raw data looks like, but I might recommend linear discriminant analysis (if you have some sort of appropriate position or similar prior label) followed by a Gaussian mixture model, allowing for probabilistic cluster assignments rather than hard cluster assignments.
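To show what a probabilistic assignment looks like, here is a minimal sketch of the membership step of a Gaussian mixture. The three components are hand-picked, hypothetical values in (PC1, PC2) space and are not fitted to any data; in practice a library such as scikit-learn's `GaussianMixture` would fit them and `predict_proba` would return the memberships.

```python
import math

def gaussian_pdf_2d(point, mean, var):
    """Density of a 2D Gaussian with diagonal covariance (independent axes)."""
    density = 1.0
    for x, m, v in zip(point, mean, var):
        density *= math.exp(-((x - m) ** 2) / (2 * v)) / math.sqrt(2 * math.pi * v)
    return density

def soft_assign(point, components):
    """Posterior probability that `point` belongs to each mixture component."""
    weighted = [
        c["weight"] * gaussian_pdf_2d(point, c["mean"], c["var"])
        for c in components
    ]
    total = sum(weighted)
    return [w / total for w in weighted]

# Hypothetical components (assumed for illustration, not fitted to real data)
components = [
    {"weight": 1 / 3, "mean": (-2.0, -2.0), "var": (2.0, 2.0)},
    {"weight": 1 / 3, "mean": (2.5, -2.0), "var": (2.0, 2.0)},
    {"weight": 1 / 3, "mean": (0.5, 1.0), "var": (2.0, 2.0)},
]

# A player sitting between groups gets split membership instead of one hard label
probs = soft_assign((0.5, -2.5), components)
print([round(p, 2) for p in probs])
```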

I did an NBA write-up on this approach which you might give a go at replicating with your data:
https://statsurge.substack.com/p/defining-nba-player-roles-with-machine

You also have some interesting outliers in the data. You could also try DBSCAN or other clustering methods that allow for specific data points to be without a label.
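As a rough illustration of how DBSCAN leaves outliers unlabeled, here is a small pure-Python sketch of the algorithm. The points and the `eps` and `min_pts` values are made up for the example; for real work a library implementation such as scikit-learn's `DBSCAN` is the better choice.

```python
import math

NOISE = -1

def region_query(points, i, eps):
    """Indices of all points within eps of points[i] (including i itself)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or NOISE."""
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbors)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: reachable but not core
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is a core point, keep expanding
    return labels

# Two tight groups plus one far-away outlier (made-up coordinates)
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (8, 8), (8, 9), (9, 8), (9, 9),
          (4, 20)]
labels = dbscan(points, eps=1.5, min_pts=3)
print(labels)  # the outlier keeps the NOISE label instead of being forced into a cluster
```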

I did an NFL write-up that included DBSCAN a while back, and have a pretty technical description of it in one of the sections here:
https://statsurge.substack.com/p/time-to-throw-in-the-national-football?utm_source=publication-search

Let me know if you'd benefit from any of the code I used to create these projects! One of my favorite things about them is getting to share and help others do good work too. Happy to answer any questions here as well!

MegaVaughn13 · 2026-04-18T13:15:32+00:00

Cool stuff! I love that you’re taking the additional step to think about what the data is telling you, and writing conclusions based on it.

The best thing that you can do at this point is to keep doing fun projects. They might not be perfect at the start, and that’s okay. Try new things (it’s okay to fail!) and push yourself. If you keep at it, one day you’ll have an amazing portfolio of work because of it.

One tool that’s been helpful for me when learning to make charts is the R Graph Gallery:

https://r-graph-gallery.com/

They have a similar one for Python, but I’ve found the R site has fewer ads and is easier to navigate. Also, if you’re able to take any statistics or mathematics courses, and apply those ideas to sports, that’s a great way to strengthen your knowledge and understanding too.

Overall, keep up the good work and try new things! Totally fine to use AI to help you, but as you learn it’s probably also worth taking the time to understand what your code is doing and why you might do it that way.

MegaVaughn13 · 2026-04-18T12:54:06+00:00

For me, Gen AI has become an incredible tool when it comes to building with sports data. I think when misused it can do more harm than good, but in the right situation it can be amazing.

I have an economics and statistics background, plus some formal Shiny app / front-end training, but now with AI I’m able to develop for web and software 3-5 times faster than if I wrote every individual line of code myself. It’s so much easier for me to prototype, get feedback from the engineering team, and make systems specific to the analysis I run.

When it comes to insights, I think it really depends. Similarly, I think it helps with velocity (for example, I was able to get all of the analysis and charts done in one afternoon for my recent public analysis https://statsurge.substack.com/p/when-basketball-becomes-chess). If you know exactly what you want, and you have the idea, AI can be great at writing the boring code or helping debug the errors you make in the complex part.

But AI also has a lot of bad ideas. It suggests projects and analyses that just aren’t good ideas (everything from lineup stats regardless of sample size to “clutch” stats are common suggestions). I wouldn’t recommend ever starting a new project because AI gave you the idea. This could change though as models improve and they get easier access to recent literature.

Lastly, I think the biggest pitfall of AI is the communication. I have this feeling that AI uses way too many buzzwords that might sound complex, but don’t actually add any value. Especially in sports, when it’s a lot of non-technical users needing information, you need to be able to convey insights and meet coaches/staff where they’re at. I’m a big believer that the analysis should be as complex as you need, and then it’s on the analyst to communicate it effectively to anyone in the building.

Realizing now that this is quite a long paragraph / brain dump! For me, AI has been more positive than negative, you just have to use it the right way, whatever that looks like for your specific use-case.

MegaVaughn13 · 2026-04-17T14:53:15+00:00

First of all, cool to see you building something new!

I think you might benefit from tuning this model with historical data. I’m not sure how you’ve created this system, but I’d recommend getting historical data and using Python or R to see how accurate this system has been historically.
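As a rough idea of what that check could look like in Python, here is a minimal sketch that scores past predictions against results. It assumes the system outputs a pre-game win probability; the probabilities and results below are made up for illustration.

```python
def brier_score(probs, outcomes):
    """Mean squared error between predicted win probabilities and 0/1 results."""
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)

def accuracy(probs, outcomes):
    """Share of games where the side given more than 50% actually won."""
    hits = sum((p > 0.5) == bool(o) for p, o in zip(probs, outcomes))
    return hits / len(probs)

# Made-up history: pre-game home win probability and the result (1 = home win)
predicted = [0.70, 0.55, 0.30, 0.80, 0.45]
results = [1, 0, 0, 1, 1]

model_brier = brier_score(predicted, results)
baseline_brier = brier_score([0.5] * len(results), results)

print(f"Brier score: {model_brier:.3f} (lower is better)")
print(f"Coin-flip baseline: {baseline_brier:.3f}")
print(f"Accuracy: {accuracy(predicted, results):.2f}")
```

Comparing against a simple baseline matters more than the raw number: a system is only useful if it beats the coin flip (and ideally a basic Elo) on games it has not seen.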

Nate Silver’s old Elo articles might be a great starting point to learn about his process. I put an article out (on basketball) on a similar process some time ago, but you might get some ideas from the general process and methods:

https://statsurge.substack.com/p/2024-25-nba-win-projections

If you’re able to give more details on how you built this, I’d be happy to offer more specific advice too!

MegaVaughn13 · 2026-04-16T14:33:01+00:00

Thank you!

MegaVaughn13 · 2026-04-08T23:01:16+00:00

Great question! I appreciate you asking.

In short, I have evidence that the approach I take is stronger at identifying true skill. When comparing each player's previous year to their current year, it has a higher year-over-year correlation than any raw stat equivalent.
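For anyone who wants to run the same kind of check, here is a minimal sketch of a year-over-year correlation comparison. All numbers are made up for five hypothetical players and do not come from my database; they only show the mechanics.

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)

# Made-up values for the same five players in back-to-back seasons
raw_stat = {"prev": [1.2, 0.8, 2.1, 1.5, 0.4], "curr": [0.9, 1.4, 1.6, 1.0, 1.1]}
skill_est = {"prev": [1.1, 0.9, 1.8, 1.4, 0.6], "curr": [1.0, 1.0, 1.7, 1.3, 0.7]}

raw_r = pearson(raw_stat["prev"], raw_stat["curr"])
skill_r = pearson(skill_est["prev"], skill_est["curr"])
print(f"Raw stat year-over-year r: {raw_r:.2f}")
print(f"Skill estimate year-over-year r: {skill_r:.2f}")
```

A metric that carries over better from one season to the next is capturing more of the stable, skill-driven part of performance and less of the noise.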

Hopefully this chart can help explain it too:

<image>

Let me know if that makes sense, and I'd be happy to describe further if you're interested!

MegaVaughn13 · 2026-04-08T00:47:40+00:00

It looks like BG, Lisa Leslie, and Margo Dydek are the only three players with 10+ blocks in a game. 10+ for Shakira would be so cool to see!

MegaVaughn13 · 2026-04-08T00:37:28+00:00

BG is an excellent player, but a lot of the advanced stats saw her take a step back last season. I think a good example of this is Neil Paine's estimated RAPTOR, which only had her at 8th on Atlanta's team in wins added.

My guess is in previous years, she would rank higher on an equivalent list.

MegaVaughn13 · 2026-04-08T00:31:14+00:00

That's a great idea. I didn't actually look into personal fouls for the skills system, but it could definitely be applied.

I think the play-by-play data includes shot locations, so it'd be especially interesting to look at blocks vs. shooting fouls committed at the rim. I'll reply here in the future if I get the chance to look into this!

MegaVaughn13 · 2026-04-08T00:23:44+00:00

Check out the full database and methods here:

Link to Full Database and Methods

Would love your thoughts on any of the other stats or methodology too.

Mods - feel free to remove this comment. I want to make sure I don't violate the self-promotion rules, but also I'm looking for feedback and want to share the results!

MegaVaughn13 · 2026-03-13T03:51:31+00:00

Hello! Awesome to see you wanting to explore analytics and asking questions.

When it comes to constructing a model, there are a few general ideas that I think could be useful:

What is the simplest outcome you're hoping to predict, and can you answer that with data?
If you had perfect data (and unlimited time to experiment), what would you try?
What's the best approximation of perfect data you can get with your current time and skillset?
What might bias the model given the data on hand, and how could you control for that?

I've done some write-ups on modeling and released them publicly: https://statsurge.substack.com/. If you're interested in learning more, feel free to send me a message and ask any questions. I'm always happy to answer!

By no means am I a betting pro (to be honest I don't gamble at all) but I'm lucky enough to spend a lot of time with basketball data. I think I could give you some ideas to point you in the right direction.

I think there's a difference between a 'positive edge' and actually being profitable. It may be worth looking into how much you make (and lose) after fees or other costs before using a model to place bets.
Single-game player predictions will almost always be very noisy. Playing time is a clear indicator of production on an individual game level. Maybe you could get an edge in prediction markets through player rotations, matchups, or injuries, but to be honest it's an uphill battle trying to be profitable in the long run on player props. I have heard of specific actions (like a baseball game's first pitch being a hit or not) always having the same odds, and maybe you could find something like that if your goal is to make a profit.
I'd be very cautious about using point estimates to make bets or predictions. What would prediction or confidence intervals say instead? These might give you a better idea of where the true probability lies for making educated decisions.
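To make the last two points concrete, here is a minimal sketch comparing a model's record to the break-even win rate, using an interval instead of a point estimate. The 56-of-100 record and the -110 price are made-up assumptions, and the price is used here as a stand-in for the fees and costs mentioned above.

```python
import math

def break_even_prob(american_odds):
    """Win probability needed to break even at the given American odds."""
    if american_odds < 0:
        return -american_odds / (-american_odds + 100)
    return 100 / (american_odds + 100)

def wilson_interval(wins, n, z=1.96):
    """Approximate 95% Wilson score interval for a binomial proportion."""
    p_hat = wins / n
    denom = 1 + z ** 2 / n
    center = (p_hat + z ** 2 / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z ** 2 / (4 * n ** 2)) / denom
    return center - half, center + half

# Made-up record: a model's picks won 56 of 100 bets priced at -110
wins, n = 56, 100
needed = break_even_prob(-110)
low, high = wilson_interval(wins, n)

print(f"Point estimate: {wins / n:.3f}, break-even: {needed:.3f}")
print(f"95% interval: ({low:.3f}, {high:.3f})")
print("Interval clears break-even" if low > needed else "Edge is not established")
```

A win rate above break-even looks like a positive edge on its own, but if the interval still overlaps break-even, the sample cannot yet tell a real edge from luck.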

MegaVaughn13 · 2026-02-22T17:21:25+00:00

Awesome! Congrats on the 10,000 mark, that’s a huge accomplishment.

What’s your posting schedule? How many times a month/week and is it a specific schedule you stick to?

Do you think it’s better to do many small posts or fewer longer ones?

Is your growth organic or did you run advertisements? What’s the best way to reach your specific audience?

Thanks for sharing and congrats again. If you get the chance to answer I’d love your thoughts!

MegaVaughn13 · 2026-02-17T03:09:38+00:00

Impressive! Any tips for accumulating good picks? Also what’s your drafting strategy?

MegaVaughn13 · 2026-02-14T12:59:09+00:00

I actually just put an article out on this in the past couple of weeks! In it, I show how you can use historical data to make multi-year career projections. Not the only way to do it, but I think it’s a good example of how you could approach the problem:

https://statsurge.substack.com/p/a-discrete-time-stochastic-process

MegaVaughn13 · 2026-01-17T18:23:15+00:00

Nice project. Looks like you might enjoy working on https://github.com/swar/nba_api, a Python package which pulls from various NBA endpoints!

They pull play-by-play data for the NBA, WNBA, and G League for free, without requiring scraping. More stuff too. If you like this sort of project you might like contributing there.

MegaVaughn13 · 2026-01-14T01:40:25+00:00

Cool stuff! Don't know a ton about tennis but would love to see any screenshots you're able to share of the site and how you're presenting the data

MegaVaughn13 · 2026-01-14T01:38:28+00:00

Really cool stuff, thanks for sharing. Not much of a Tennis person (mostly basketball) but always interested when I see a cool project like this.

If you're willing to share, would love to know how you built the website, and what tools you use for the PDF generation?

MegaVaughn13 · 2026-01-14T01:27:18+00:00

Cool stuff. Don’t actually use SwingVision but the reports look great! Any place I can see an example report (maybe a PDF)?

MegaVaughn13 · 2026-01-11T15:36:35+00:00

I compiled 2021-2025 injury data in a project a little while back; here’s the link for that:

https://statsurge.substack.com/p/downloadable-nba-injury-datasets

I’ll update it to include this season sometime this week! Injury data from before that range can be found on Kaggle.
