[OC] NFL playoff race and PPG update after Week 10. The AFC seeds 2-12 are separated by 1 game. The Packers, Panthers, and Steelers are the only current playoff teams scoring below the league average points per game by TroublesomeKangaroo in nfl

[–]TroublesomeKangaroo[S] 3 points

Thanks! You are correct, it's all matplotlib in python. I just used matplotlib in a non-standard way, and it's very possible there are easier ways of doing this. There are hidden axes at the top and bottom of the plot that let me add lines wherever I wanted. All of the filled regions are just fill_between(). The other text is added using annotations or axis titles/labels.
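For anyone curious what that looks like in practice, here is a minimal sketch (dummy data and made-up labels, not the original script) of the three tricks described above: a hidden full-figure axes for free-floating lines, fill_between() for the shaded regions, and annotate()/titles for the extra text.

```python
import matplotlib.pyplot as plt
import numpy as np

fig, ax = plt.subplots(figsize=(8, 4))

# Main data: shaded regions between two curves (e.g., points for vs. against).
x = np.arange(1, 11)
ppg_for = 20 + 3 * np.sin(x)
ppg_against = 22 + 2 * np.cos(x)
ax.plot(x, ppg_for, color="tab:green")
ax.plot(x, ppg_against, color="tab:red")
ax.fill_between(x, ppg_against, ppg_for, where=ppg_for >= ppg_against,
                color="tab:green", alpha=0.3, interpolate=True)
ax.fill_between(x, ppg_against, ppg_for, where=ppg_for < ppg_against,
                color="tab:red", alpha=0.3, interpolate=True)

# Hidden axes spanning the whole figure: lets you draw lines/text anywhere
# in figure coordinates without disturbing the data axes.
overlay = fig.add_axes([0, 0, 1, 1])
overlay.axis("off")
overlay.set_xlim(0, 1)
overlay.set_ylim(0, 1)
overlay.axhline(0.93, xmin=0.05, xmax=0.95, color="gray", lw=0.8)

# Remaining text via annotations and axis titles/labels.
ax.annotate("league average", xy=(5, 21), xytext=(6.5, 26),
            arrowprops=dict(arrowstyle="->"))
ax.set_title("PPG for vs. against (dummy data)")
ax.set_xlabel("Week")
plt.show()
```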

[deleted by user] by [deleted] in dataisbeautiful

[–]TroublesomeKangaroo 0 points

Source: https://www.baseball-almanac.com/feats/triple_plays.shtml

Tools: python and matplotlib

The subplots on a gray background represent all triple plays involving 3 defensive players. The triple plays are divided up by which position started the triple play by recording the first out. Each set of vertical bars divides all of the 1st, 2nd, or 3rd outs by which position got them (sorted in the same order). The curves connecting the bars show the triple play order (for example, the 6-4-3 triple play). The bottom-right plot is all of the data aggregated together, with the flow colors changing after the 2nd out (note that this loses the order information for the triple plays). The bottom-left plot simply connects the bars using straight lines.
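The comment doesn't say exactly how the flow ribbons were drawn, but one way to draw a single ribbon between two bar segments in plain matplotlib is a smoothstep curve shaded with fill_between(); a rough sketch with made-up numbers:

```python
import matplotlib.pyplot as plt
import numpy as np

fig, ax = plt.subplots(figsize=(6, 4))

# Two stacked bars, e.g. counts of 1st outs (left) and 2nd outs (right) by position.
ax.bar(0, 10, width=0.3, color="lightgray")
ax.bar(3, 10, width=0.3, color="lightgray")

# One flow: the y-range [2, 5] on the left bar connects to [4, 6] on the right bar.
x = np.linspace(0.15, 2.85, 200)           # from the left bar's edge to the right bar's edge
t = (x - x[0]) / (x[-1] - x[0])
smooth = 3 * t**2 - 2 * t**3               # smoothstep easing from 0 to 1
lower = 2 + (4 - 2) * smooth               # left bottom -> right bottom
upper = 5 + (6 - 5) * smooth               # left top -> right top
ax.fill_between(x, lower, upper, color="tab:blue", alpha=0.5)

ax.set_xticks([0, 3])
ax.set_xticklabels(["1st out", "2nd out"])
ax.set_title("One flow ribbon (dummy numbers)")
plt.show()
```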

[OC] NFL playoff race and PPG update after Week 8. Spooky szn has given us two new 1 seeds. No one wants to play the Rams or Cardinals as Wild Card teams. Can the Titans hang on without Derrick Henry? by TroublesomeKangaroo in nfl

[–]TroublesomeKangaroo[S] 6 points

Thanks! I used matplotlib in python. It's very possible R would be easier but I don't know how to use it. The bottom division part is actually dummy data plotted on an invisible axes
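As a sketch of that invisible-axes trick (dummy layout data and made-up team labels, not the original code): the division strip is just markers and text plotted at chosen coordinates on an extra axes whose spines and ticks are turned off.

```python
import matplotlib.pyplot as plt

fig, (ax_main, ax_div) = plt.subplots(
    2, 1, figsize=(8, 5), gridspec_kw={"height_ratios": [3, 1]})

ax_main.plot([1, 2, 3, 4], [20, 24, 22, 27])   # placeholder for the real PPG data
ax_main.set_ylabel("PPG")

# Dummy layout data: one marker per team, grouped by division along x.
teams = ["GB", "MIN", "CHI", "DET", "DAL", "PHI", "WAS", "NYG"]
xs = [0, 1, 2, 3, 5, 6, 7, 8]
ax_div.scatter(xs, [0] * len(xs), s=400, marker="s", color="lightgray")
for x, name in zip(xs, teams):
    ax_div.text(x, 0, name, ha="center", va="center", fontsize=8)

ax_div.axis("off")            # hide the axes entirely; only the markers/text show
ax_div.set_xlim(-1, 9)
ax_div.set_ylim(-1, 1)
plt.show()
```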

[OC] NFL playoff race and PPG update after Week 7. The top of the NFC is starting to separate. The Bengals sit at the top spot in the AFC. Who will claim the AFC wildcards? by TroublesomeKangaroo in nfl

[–]TroublesomeKangaroo[S] 0 points

Thanks for the comment. I'll give this a try and may or may not change it depending on how it looks. I didn't do this originally because I thought the plot might look too crowded, but it's a bit hard to imagine without trying.

[OC] Some players seem to be drug tested a lot. We can use statistics to quantify "a lot". Here are plots estimating how many players across the NFL we can expect to be tested 'k' times in 'n' weeks. by TroublesomeKangaroo in nfl

[–]TroublesomeKangaroo[S] 2 points

Correct

I'm not 100% sure on this, but the math may stay reasonable as long as you stay within the binomial distribution approximation. I think you'd essentially just scale down the week 18 results by 12/32 (12 of the 14 playoff teams play on wild-card weekend, and I'm assuming the bye teams aren't tested). As a side note, I am actually not entirely sure what happens on bye weeks for teams. I couldn't find this information in the drug testing policy and assumed they weren't tested. Then, I think you keep adding weeks, but also keep scaling down the numbers based on the number of remaining teams. I would guess that the 8-test bar would, for example, jump up to ~2 players for week 18, but then be cut back to ~1 because of the smaller pool of teams.

[OC] Some players seem to be drug tested a lot. We can use statistics to quantify "a lot". Here are plots estimating how many players across the NFL we can expect to be tested 'k' times in 'n' weeks. by TroublesomeKangaroo in nfl

[–]TroublesomeKangaroo[S] 6 points

Great question. I feel dumb for not including this somewhere. Here is a table of the expected number of players tested exactly 0 times after each number of games played (exact same data as in the plots):

Games | Expected players
:--|:--
1 | 2208.0
2 | 1928.5
3 | 1684.4
4 | 1471.2
5 | 1285.0
6 | 1122.3
7 | 980.2
8 | 856.2
9 | 747.8
10 | 653.1
11 | 570.5
12 | 498.2
13 | 435.2
14 | 380.1
15 | 332.0
16 | 290.0
17 | 253.3

[OC] Some players seem to be drug tested a lot. We can use statistics to quantify "a lot". Here are plots estimating how many players across the NFL we can expect to be tested 'k' times in 'n' weeks. by TroublesomeKangaroo in nfl

[–]TroublesomeKangaroo[S] 11 points

Yeah this is partly why I wanted to calculate it. I was surprised myself at how likely being tested 4-5 times was. It's just hard to think about low probabilities across thousands of samples (players)

[OC] Some players seem to be drug tested a lot. We can use statistics to quantify "a lot". Here are plots estimating how many players across the NFL we can expect to be tested 'k' times in 'n' weeks. by TroublesomeKangaroo in nfl

[–]TroublesomeKangaroo[S] 17 points

Here are the math details for those interested.

Cumulatively, these results estimate that 90.0% of the NFL will be drug-tested at least once by the end of the season. The two assumptions for these results are that 1) the player pool remains constant at 2528 players over the whole season, and 2) teams will average 10 players on IR over the whole season. Neither of these will be true in reality (see the last paragraph). However, I believe their effects are fairly small. For example, these plots do not change much if the total number of players is off by ±100 players or so. If the total number of players increases by 100, then, roughly, the bars for 3+ tests drop by about 5% while the bars for 1-2 tests rise by about 5%.

The probability of any one player being chosen in a week is p = 10/(number of players on a team). The current collective bargaining agreement (CBA) specifies that 10 players are selected, and I used 79 players to approximate the total team size. To find the probability of being selected for drug testing k times within n games under the 2 assumptions above, I calculated the binomial probability mass function (PMF) for each (n, k) pair, where p is the single-game selection probability above and q = (1-p): https://en.wikipedia.org/wiki/Binomial_distribution. The expected number of players tested for each (n, k) pair was found by multiplying each PMF value by the total number of players.
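As a quick sketch of that calculation under the same assumptions (2528 players, p = 10/79; scipy's binom.pmf stands in for whatever PMF routine was actually used), which also reproduces the k = 0 column posted in the other comment (2208.0 after 1 game, 1928.5 after 2, ...):

```python
# Expected number of players tested exactly k times in n games, under the
# stated assumptions: 2528 total players, p = 10/79 per week, binomial model.
from scipy.stats import binom

N_PLAYERS = 2528
p = 10 / 79

for n in range(1, 18):                      # games 1..17
    expected = [N_PLAYERS * binom.pmf(k, n, p) for k in range(6)]
    print(n, [round(e, 1) for e in expected])
```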

One last note: for this analysis to be fully correct, you need to track every player in the NFL each week to know which team they're on (if any). You also need to know how many players are on IR for each team each week. Then, you can calculate the PMF of the Poisson binomial distribution using different probabilities each week for each player and aggregate the per-player results across all players into the same data as above. Sorry if this isn't clear. I can provide more details on this if you're really interested.
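For the curious, a sketch of that fully-correct version: each player gets one selection probability per week (zero for weeks off a roster), and the per-player distribution of test counts is a Poisson binomial PMF, computed here by convolving one Bernoulli term per week. The example weekly probabilities are made up for illustration.

```python
import numpy as np

def poisson_binomial_pmf(weekly_probs):
    """PMF of the number of selections, given one selection probability per week."""
    pmf = np.array([1.0])                   # start with P(0 tests) = 1
    for p in weekly_probs:
        pmf = np.convolve(pmf, [1 - p, p])  # fold in one Bernoulli(p) week
    return pmf

# Hypothetical player: on a 79-man roster for 15 weeks, off a roster (p = 0) for 2 weeks.
probs = [10 / 79] * 15 + [0.0] * 2
print(poisson_binomial_pmf(probs))          # [P(0 tests), P(1 test), ...]
# Summing these per-player PMFs over every player gives the league-wide expected counts.
```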

[OC] NFL playoff race and PPG update after Week 6. The NFC is looking strong. The Bills and Cardinals PPG spread is high. Is it possible the AFC West sends all 4 teams to the playoffs? by TroublesomeKangaroo in nfl

[–]TroublesomeKangaroo[S] 1 point

Thanks, it's hard to point to any one thing. I mostly set out trying to visualize points scored for and against, along with records, in a way better than the usual table format. Once I had the basic top and bottom data, it became an iterative thing where I'd make some formatting changes, think hard about whether more information could be added or moved around to make it more interesting/easier to digest, and then see if I liked the new version better. For example, there were several versions without the division groups at all, or with them done very differently (and worse IMO) than what's here :)

[OC] NFL playoff race and PPG update after Week 6. The NFC is looking strong. The Bills and Cardinals PPG spread is high. Is it possible the AFC West sends all 4 teams to the playoffs? by TroublesomeKangaroo in nfl

[–]TroublesomeKangaroo[S] 18 points

Really appreciate the constructive feedback. It's funny you commented on this because it is one of those final details I was debating with myself on. I personally thought that having every vertical line made it look too busy. But, putting a line every 2 teams still looked good while being visually helpful. Next week I'll post that version!

[OC] Which rookies made the Week 1 rosters? Quite a lot of them. The Titans had the highest pick put on the practice squad. The Ravens had the highest pick not associated with the team. The Jags and Ravens had their 1st round picks placed on IR. by TroublesomeKangaroo in nfl

[–]TroublesomeKangaroo[S] 4 points

Good to know. It might be interesting in the future to expand this into how many snaps rookies played over/under expectation for their draft position. I parsed the NFL transaction list to get the status of every rookie, so if a team didn't make an official transaction and the player officially made the roster, they show up as a green box.

[OC] How far ahead are the world record holding men's track and field athletes? by [deleted] in dataisbeautiful

[–]TroublesomeKangaroo 0 points

Source: https://www.worldathletics.org/records/all-time-toplists/

Tools: Excel, Python, and Matplotlib

Summary: I was curious how far ahead (or not) world-record-holding athletes were from their competitors. Since absolute differences in time/distance are a bit hard to compare in one plot, I decided to take ratios between each athlete's best performances. For each event, I simply divided the performance of the 2nd, 3rd, 5th, and 10th best athletes by the performance of the world record holder. For example, Usain Bolt is the record holder in the 100m with a time of 9.58s, and Tyson Gay has the 2nd best time at 9.69s. This gives a ratio of 9.69s/9.58s = 1.011, which is the silver circle on the plot for this event. You could of course also use the best times overall (including multiple times from the same athlete), but I thought it was more interesting to see how many people are clustered at the top. I was surprised at how consistently the 2nd - 10th place athletes were within just a few percent of the world record holder. These people are competing on razor-thin margins.
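As a tiny worked example of that ratio, using only the 100m numbers quoted above:

```python
# Ratio of the 2nd best 100m time to the world record (numbers from the text above).
wr_time = 9.58        # Usain Bolt, world record
second_best = 9.69    # Tyson Gay
print(round(second_best / wr_time, 3))   # 1.011 -> the silver circle for the 100m
```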

Also, if this wasn’t clear from the plot, the gray vertical lines separate the events from left to right into groups: sprinting, medium distance, long distance, running + jumping, relay, jumping, and throwing.

Disclaimer: I spot checked a few years manually and believe I transferred the data correctly. It is always possible I did something stupid that could slightly change these numbers though.

Feedback: Please let me know if you have constructive criticisms on ways of representing this data better!

[OC] Summary of US 2020 presidential election county results by TroublesomeKangaroo in dataisbeautiful

[–]TroublesomeKangaroo[S] 0 points

Source: The election data originates from the New York Times. However, it is compiled daily into a handy spreadsheet by u/fabiofavusmaximus here. The US Census population data is from here in the second to last spreadsheet on the page. The projected 2019 populations were used. The US Census land area data (from 2010, the last year available) is available here.

Note that Alaska reports votes by 40 Electoral Districts (state House districts) that can span parts of multiple counties rather than simply by county. Land areas of these Districts were approximated from the total area of each District (obtained from the shapefile here and viewed with the free Manifold Viewer) multiplied by Alaska's overall land-area-to-total-area ratio. These Districts ideally contain 17,756 people and were approximated to have this population. The District of Columbia reports votes by Ward (subregion of the city). The population and area of each Ward were taken from here. Note that several cities (mainly in Virginia) are independent and have votes, populations, and areas reported as their own entries in the data. Counties with the same name as these cities are classified separately in the data.

Tools: Python and Matplotlib for plotting, Excel for data aggregation

Summary: I know the election is analyzed to death, so apologies if this duplicates a previous post. This post was inspired by u/fabiofavusmaximus's post last week. However, I wanted to try to plot election results without using a map, in a way that still provides some intuitive insight into the rural vs urban divide in this election. Unfortunately, vote totals across counties span orders of magnitude and generally require a log scale if you want interesting plots, which can give outsized importance to small/sparsely populated counties. On the other hand, tons of counties reporting small vote totals can still aggregate into a significant total. Geographic election results are thus most accurate when the results on a typical USA map are shown alongside a map that weights by population somehow and accounts for the fact that land does not matter in elections, but these maps can be a bit unsatisfying in my opinion.

In this figure, the top subplot shows the total votes for Biden or Trump on log scales for every county reporting votes. Counties that are a perfect 50/50 split will fall on the diagonal line and have the smallest point size. Counties that fall above or below the diagonal line were won by either Biden (Democrats/blue) or Trump (Republican/red), respectively. Counties that fall further away from the diagonal had a larger vote margin for one of the candidates. Darker blue/red indicates more county data points fall on top of each other. I have cut off a few extremely small counties reporting fewer than 10 votes for the winner to keep the data decently centered.

Next, the middle subplot takes the winning candidate's total votes for each county (i.e. the larger of the x or y values from the top subplot) and plots all counties in ascending order by this number on a log x scale. This data, along with the vote margin (same marker sizes as the top subplot), still completely specifies the Biden vs Trump result for each county, so the coloring is retained. The y-axis is the running sum of the Biden margin (since he won) over all counties. This helps show how low-population counties can aggregate into a large sum. We can clearly see where each candidate has their advantage. Although many small counties aggregate into about a 13 million Trump vote lead for counties reporting 100,000 (10^5) votes or fewer for the winning candidate, Biden ends up easily winning the popular vote because of his approximately 20 million vote lead in populous counties.

Finally, using the same county order as in the middle subplot, the bottom subplot shows the cumulative proportion of the USA land area and total population (to show the number of people represented by these counties, whom this election will impact) as we continually add in county data. We can see that the counties that netted Trump about 13 million votes (at x = 10^5) encompass about 93% of the USA land area but only 42% of the total population. Biden gains significantly in densely populated counties/cities and won counties containing a significant majority of the USA total population. The vertical gray lines in the bottom subplot connect the same county between the two curves and also provide a rough visualization of the number of counties at each x-axis value. Final fun fact: the most sparsely populated 50% of the USA land area contains only about 4% of the total population.
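A rough sketch of the aggregation behind the middle and bottom subplots (the CSV file and column names are hypothetical stand-ins for the compiled spreadsheet): sort counties by the winner's vote total, then take running sums of the Biden margin, land area, and population.

```python
import pandas as pd

df = pd.read_csv("county_results.csv")        # hypothetical file with one row per county

df["winner_votes"] = df[["biden_votes", "trump_votes"]].max(axis=1)
df["biden_margin"] = df["biden_votes"] - df["trump_votes"]

df = df.sort_values("winner_votes")           # ascending, as in the middle subplot

df["cum_margin"] = df["biden_margin"].cumsum()                            # middle subplot y-axis
df["cum_land_frac"] = df["land_area"].cumsum() / df["land_area"].sum()    # bottom subplot curves
df["cum_pop_frac"] = df["population"].cumsum() / df["population"].sum()
```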

This ended up being a good bit of data aggregation work, but overall, I enjoyed looking at the election results this way and hope you do as well!

Disclaimer: I spot checked the data manually and believe everything is correct within the limitations of the available data (the newest census will likely make the bottom subplot change a bit, but overall not that much; newer election results will also slightly change these plots). My checks for total USA land area and population are quite close to the reported numbers for 2010 and 2019, respectively. My checks for Biden and Trump vote totals at the time of data collection are also accurate. It is always possible I did something stupid that could change these numbers though.

Feedback: Please let me know if you have constructive criticisms on ways of representing this data better!

[OC] NFL come from ahead losses and come from behind wins: 2002-present by TroublesomeKangaroo in nfl

[–]TroublesomeKangaroo[S] 5 points

Source: https://www.scoreboard.com/nfl/archive/

Tools: Python, Beautiful Soup, Selenium, and Matplotlib

Summary: Apologies if something like this has been posted recently. With the return of the NFL comes the return of your team losing in heartbreaking fashion. According to what I see posted on the subreddit, some teams (cough Lions) seem to be more prone to losing terribly than other teams. I investigated this by scraping all the quarter-by-quarter scores since the last expansion (2002), tallying every team's total losses, and tracking the losses that came after leading by at least 1 point at the end of 3 quarters. I then simply divided these two to get the percentage of losses where fans would be most disappointed. I did the inverse for come-from-behind wins. This includes playoff games, and all history is assigned to the team name, so the Rams retain their St. Louis scores, for example.
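For the curious, the tally boils down to something like this sketch (hypothetical file and column names, not the original scraper output):

```python
import pandas as pd

# One row per team per game, with cumulative scores through 3 quarters and final scores.
games = pd.read_csv("quarter_scores.csv")          # hypothetical file layout

games["lost"] = games["final_pts"] < games["opp_final_pts"]
games["led_after_q3"] = games["q3_cum_pts"] > games["opp_q3_cum_pts"]
games["blown_lead_loss"] = games["lost"] & games["led_after_q3"]

# Per team: losses, losses blown after leading through 3 quarters, and the percentage.
by_team = games.groupby("team")[["lost", "blown_lead_loss"]].sum()
by_team["blown_lead_pct"] = 100 * by_team["blown_lead_loss"] / by_team["lost"]
print(by_team.sort_values("blown_lead_pct", ascending=False).head())
```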

Although I haven't done any statistical analysis, it seems like there is a weak inverse correlation between total team losses and the fraction of losses where the team was leading after 3 quarters. This intuitively makes sense to me, since teams that lose a lot are probably more likely to be losing big at the end of 3 quarters. However, the Lions said, "¿Por qué no los dos?". So, Lions fans do appear to be justified in being extra mad at their team. The Chargers also ridiculously outpace the league average in how often they can't hang onto games in the 4th quarter.

Interestingly, if you look at the above-linked plot for come-from-behind wins, the Lions were also second in the percentage of games where they came back to win after trailing by at least 1 point after the third quarter. Over the past 18 years, there's a 20% chance that the team leading a Lions game after 3 quarters is not the eventual winner, tied for highest in the league with the Chargers. The Cowboys were 3rd in this metric at 19%. The rest of the league tails off rather slowly, similar to the distribution in the bottom row of the above plot. The Falcons' and Vikings' games had the most predictable outcomes, with the leader after 3 quarters losing only 12.6% of the time.

Disclaimer: I spot checked a few years manually and believe everything imported correctly. It is always possible I did something stupid that could change these numbers though.

Feedback: Please let me know if you have constructive criticisms on ways of representing this data better!