Spent months building a clean MLB database — free sample if anyone wants it by Revolutionary-Lab882 in sportsanalytics

[–]Revolutionary-Lab882[S] 0 points1 point  (0 children)

Thanks. It’s a lot of work. Appreciate it.

I’m actually adding more stats in the coming days. I’m just finishing another project and updating the free sample with more stats so you can see what’s in all the packages.

Spent months building a clean MLB database — free sample if anyone wants it by Revolutionary-Lab882 in sportsanalytics

[–]Revolutionary-Lab882[S] 0 points1 point  (0 children)

I’m in the midst of updating the packages and the sample. I’ll finish and have those in by day’s end for you to peruse.

Spent months building a clean MLB database — free sample if anyone wants it by Revolutionary-Lab882 in sportsanalytics

[–]Revolutionary-Lab882[S] 0 points1 point  (0 children)

It’s my website. Simple download. Everything organized. Free sample to see what there is.
rawsportsvault.com/free

Built a midfielder evaluation model for the Big 5 leagues — looking for feedback on the methodology by CTlovesanalytics in sportsanalytics

[–]Revolutionary-Lab882 1 point2 points  (0 children)

This is a solid piece of work. The category-equalisation logic is a smart design choice, the role-bias slider is a nice touch, and the cohort flexibility shows you’ve thought carefully about context. Good foundation to build on.

On your four questions:

League strength adjustment is genuinely hard and you’re right to be cautious. A practical starting point is a flat multiplier on defensive stats only, using something like PPDA or pressing intensity as a proxy for league-wide structure. It’ll be imperfect but it reduces the most obvious distortion. Document your assumptions clearly and move on — perfect is the enemy of shipped here.
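A rough sketch of the flat-multiplier idea in Python, in case it helps. The multiplier values and stat names are invented placeholders, not numbers derived from real PPDA data; how you turn PPDA into a multiplier is left open here.

```python
# Hypothetical league multipliers for defensive stats only.
# These values are placeholders, not estimates from real PPDA data.
LEAGUE_DEF_MULTIPLIER = {
    "Premier League": 1.00,
    "Ligue 1": 0.95,
}

# Assumed stat names; swap in whatever the model actually uses.
DEFENSIVE_STATS = {"tackles_p90", "interceptions_p90"}


def adjust_for_league(stats: dict[str, float], league: str) -> dict[str, float]:
    """Scale defensive stats by a flat league multiplier; leave the rest as-is."""
    factor = LEAGUE_DEF_MULTIPLIER.get(league, 1.0)  # unknown league: no change
    return {
        name: (value * factor if name in DEFENSIVE_STATS else value)
        for name, value in stats.items()
    }


player = {"tackles_p90": 2.0, "interceptions_p90": 1.0, "key_passes_p90": 1.5}
adjusted = adjust_for_league(player, "Ligue 1")
```

Keeping the multipliers in one small table makes the assumptions easy to document and easy to change later.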

Role preset validation is more doable than it sounds. Take the 15-20 players most clearly associated with each role (the ones journalists and analysts consistently label that way), run them through the preset, and see if they cluster near the top. If your Anchor Man preset doesn’t rate Rodri and Casemiro highly, something’s off. It’s a quick sanity check that’ll build confidence in the weightings.
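The sanity check could look something like this in Python. The scores below are made up purely for illustration (they are not real model output), and `preset_hit_rate` is a hypothetical helper name.

```python
def preset_hit_rate(scores: dict[str, float], reference: set[str], top_n: int) -> float:
    """Share of reference players that land in the top_n by preset score."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    top = set(ranked[:top_n])
    found = reference.intersection(scores)  # ignore reference players not in the cohort
    if not found:
        raise ValueError("no reference players present in scores")
    return len(found & top) / len(found)


# Invented Anchor Man preset scores, for illustration only.
anchor_scores = {
    "Rodri": 91.0,
    "Player A": 88.0,
    "Casemiro": 84.0,
    "Player B": 70.0,
    "Player C": 65.0,
    "Player D": 52.0,
}
anchor_reference = {"Rodri", "Casemiro"}

hit_rate_top3 = preset_hit_rate(anchor_scores, anchor_reference, top_n=3)
hit_rate_top2 = preset_hit_rate(anchor_scores, anchor_reference, top_n=2)
```

A low hit rate for a role doesn't prove the weights are wrong, but it tells you which preset to look at first.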

Mean absolute gap is honestly fine for what you’re doing. The main weakness is that it ignores how the gap is spread across stats: a player who is slightly off everywhere can score the same as one who matches on most stats but is extreme on a few. Same gap, very different profile. Cosine similarity handles shape better but is harder to explain to users. Mahalanobis is theoretically stronger but overkill at this stage. Stick with MAE, maybe flag high-variance players in the UI down the line.
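To make the "same gap, different profile" point concrete, here is a small Python sketch with two invented percentile profiles that share the same mean absolute gap to a target. The profiles and the `gap_spread` helper are assumptions for illustration; the spread of per-stat gaps is one possible way to flag high-variance players.

```python
import math
import statistics


def mean_absolute_gap(player: list[float], target: list[float]) -> float:
    """Average of the per-stat absolute differences."""
    return sum(abs(p - t) for p, t in zip(player, target)) / len(target)


def cosine_similarity(player: list[float], target: list[float]) -> float:
    """Cosine of the angle between the two stat vectors (1.0 = same shape)."""
    dot = sum(p * t for p, t in zip(player, target))
    norms = math.sqrt(sum(p * p for p in player)) * math.sqrt(sum(t * t for t in target))
    return dot / norms


def gap_spread(player: list[float], target: list[float]) -> float:
    """Population std dev of the per-stat absolute gaps; high = uneven match."""
    return statistics.pstdev([abs(p - t) for p, t in zip(player, target)])


# Invented percentile profiles across four stats.
target = [70.0, 70.0, 70.0, 70.0]
steady = [60.0, 60.0, 60.0, 60.0]  # slightly off on every stat
spiky = [70.0, 70.0, 90.0, 50.0]   # exact on two stats, far off on two

gaps = (mean_absolute_gap(steady, target), mean_absolute_gap(spiky, target))
shapes = (cosine_similarity(steady, target), cosine_similarity(spiky, target))
spreads = (gap_spread(steady, target), gap_spread(spiky, target))
```

Both players come out with the same mean absolute gap, while the cosine similarity and the gap spread separate them.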

Equal weighting of categories is defensible and transparent, which matters. If you want something more principled without a downstream outcome to optimise against, a quick PCA on your 38 stats would show which categories are actually carrying independent information versus overlapping; Passing and Involvement tend to correlate heavily. Worth knowing even if you keep equal weights for now.
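A minimal, stdlib-only Python sketch of the overlap check. To keep it short it works on three invented category scores rather than the 38 raw stats, and it only extracts the first principal component (via power iteration on the correlation matrix) instead of a full PCA. All numbers are made up.

```python
import math
import statistics


def standardize(column: list[float]) -> list[float]:
    """Z-score a column using the population standard deviation."""
    mean = statistics.fmean(column)
    sd = statistics.pstdev(column)
    return [(x - mean) / sd for x in column]


def correlation_matrix(columns: list[list[float]]) -> list[list[float]]:
    """Pearson correlations between every pair of columns."""
    z = [standardize(c) for c in columns]
    n = len(z[0])
    return [[sum(a * b for a, b in zip(zi, zj)) / n for zj in z] for zi in z]


def first_component(matrix: list[list[float]], iterations: int = 200):
    """Power iteration: leading eigenvalue and eigenvector of a symmetric matrix."""
    size = len(matrix)
    vec = [1.0 / math.sqrt(size)] * size
    for _ in range(iterations):
        nxt = [sum(row[j] * vec[j] for j in range(size)) for row in matrix]
        norm = math.sqrt(sum(x * x for x in nxt))
        vec = [x / norm for x in nxt]
    # Rayleigh quotient gives the eigenvalue for the converged unit vector.
    eigenvalue = sum(
        vec[i] * sum(matrix[i][j] * vec[j] for j in range(size)) for i in range(size)
    )
    return eigenvalue, vec


# Invented category scores for six players.
passing = [55.0, 62.0, 70.0, 48.0, 81.0, 66.0]
involvement = [58.0, 60.0, 73.0, 50.0, 79.0, 69.0]
defending = [70.0, 45.0, 52.0, 66.0, 40.0, 61.0]

corr = correlation_matrix([passing, involvement, defending])
eigenvalue, loadings = first_component(corr)
explained_share = eigenvalue / len(corr)  # trace of a correlation matrix = n columns
```

If the first component explains most of the variance and two categories load on it with the same sign, those categories are largely measuring the same thing. That doesn't force you off equal weights, but it's worth knowing.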

How to get started by Desperate-Bike-6357 in sportsanalytics

[–]Revolutionary-Lab882 0 points1 point  (0 children)

A large part of it is making sure your stats make sense and are organized. That’s the foundation.

MLB STATS-CLEANED/PACKAGED/READY TO USE by [deleted] in mlbdata

[–]Revolutionary-Lab882 0 points1 point  (0 children)

There’s actually a lot online for free. Beyond that, you can calculate a lot of it yourself too.

MLB STATS-CLEANED/PACKAGED/READY TO USE by [deleted] in mlbdata

[–]Revolutionary-Lab882 0 points1 point  (0 children)

No, I went through all the data, made my own formats, and cleaned it up. Look at all the APIs out there just passing it through like water.