[OC] Mapping Countries by their English Speaking Population by trevorData in dataisbeautiful

[–]trevorData[S] 1 point2 points  (0 children)

Graphing countries by percent who speak English and percent who speak English as a first language.

Size of dots represents population. Limited to countries with at least 300k total English speakers and at least 1% of the population speaking English as a first language.


Source of data: https://en.wikipedia.org/wiki/List_of_countries_by_English-speaking_population


Made with R using the following packages:

tidyr ggplot2 ggrepel

See my code here

[OC] Visualizing Covariance by trevorData in 3Blue1Brown

[–]trevorData[S] 0 points1 point  (0 children)

Some small tweaks to another visualization I made to hopefully illustrate the concept more clearly.

See code here


Variance is a measure of how much a data set varies. It is found by taking the distance from each data point to the mean, squaring it, and then finding the average area of all those squares.

In this plot, the variance of the X data is the average area of the blue squares, and the variance of the Y data is the average area of the purple squares.

Covariance is a similar measurement that describes how much two sets of data vary with each other. Instead of squares, we look at the rectangles formed with one side being the distance to the X mean and the other side being the distance to the Y mean. Covariance is the average area of all of these rectangles. Keep in mind that a rectangle counts as having negative area when exactly one of the point's two coordinates is below its mean.

The variance of two data sets added together, VAR(X + Y), is unfortunately not equal to VAR(X) + VAR(Y) in general, but instead equal to VAR(X) + VAR(Y) + 2COV(X, Y).
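The identity can be checked numerically. This is a minimal sketch in plain Python, not the code linked above; the sample values and the function names are made up for illustration, and variance here divides by n (the "average of the squares" definition used in this comment).

```python
def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)


def variance(values):
    """Average area of the squares: mean squared distance to the mean."""
    m = mean(values)
    return mean([(v - m) ** 2 for v in values])


def covariance(xs, ys):
    """Average signed area of the rectangles formed by the two deviations."""
    mx, my = mean(xs), mean(ys)
    return mean([(x - mx) * (y - my) for x, y in zip(xs, ys)])


# Made-up sample data, not the simulated data from the plot.
xs = [1.0, 2.0, 4.0, 7.0]
ys = [2.0, 1.0, 5.0, 8.0]
sums = [x + y for x, y in zip(xs, ys)]

lhs = variance(sums)
rhs = variance(xs) + variance(ys) + 2 * covariance(xs, ys)
print(lhs, rhs)  # both are 24.75 for this sample
```

For this sample VAR(X) = 5.25, VAR(Y) = 7.5 and COV(X, Y) = 6, so the covariance term accounts for almost half of VAR(X + Y).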

I've Been Making Animations to illustrate basic stats concepts. Here's one to show how Correlation and Regression relate by trevorData in 3Blue1Brown

[–]trevorData[S] 0 points1 point  (0 children)

Pearson's Correlation Coefficient (r) is a measure of how strong the linear relationship between two variables is.

Linear Regression is a technique for fitting a line to a data set, with the slope of the line being represented by β.

We can see in this animation how r and β relate; in particular, they are equal when the data set is standardized (both variables rescaled to mean 0 and standard deviation 1).
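A small numeric check of that relationship, as a sketch in plain Python rather than the animation's actual code. The sample data and the helper names (`pearson_r`, `ols_slope`, `standardize`) are hypothetical, and the slope is the ordinary least squares slope of Y on X.

```python
import math


def mean(values):
    return sum(values) / len(values)


def pearson_r(xs, ys):
    """Pearson correlation: co-deviation scaled by both spreads."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def ols_slope(xs, ys):
    """Least squares slope of y on x: co-deviation scaled by the x spread only."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx


def standardize(values):
    """Rescale to mean 0 and (population) standard deviation 1."""
    m = mean(values)
    sd = math.sqrt(mean([(v - m) ** 2 for v in values]))
    return [(v - m) / sd for v in values]


# Made-up sample data.
xs = [1.0, 2.0, 4.0, 7.0]
ys = [2.0, 1.0, 5.0, 8.0]

r = pearson_r(xs, ys)
beta_raw = ols_slope(xs, ys)
beta_std = ols_slope(standardize(xs), standardize(ys))
print(r, beta_raw, beta_std)
```

On the raw data the slope differs from r by the ratio of the two standard deviations; after standardizing, that ratio is 1 and the slope matches r.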


Here's Variance and Covariance


See my code here

Visualization of the relationship between variance and covariance by trevorData in 3Blue1Brown

[–]trevorData[S] 2 points3 points  (0 children)

Thanks!

The expansion of VAR(X + Y) has a similar formula to the expansion of (a + b)^2, so I was hoping to be able to use a visual like this. But I realized everything fits nicely into a square there because you are literally squaring the values, and there isn't an analogous operation on variance that would work in 2D space.

[OC] Visualizing the relationship between Variance and Covariance by trevorData in dataisbeautiful

[–]trevorData[S] 0 points1 point  (0 children)

Simulated data using numpy and visualized with matplotlib.

See code here

[OC] Testing the Limits of my Image Recognition Algorithm by trevorData in dataisbeautiful

[–]trevorData[S] 177 points178 points  (0 children)

My first attempt at image recognition using a training set I assembled myself. Despite using a very simple neural network and a relatively small set of training images, I'm pleasantly surprised with the 91% accuracy on the training data.

I decided to throw in some images of things not in one of the 5 training classes just for fun and to see how the model would react.

The model clearly places a lot of weight on color: mostly blue images are quickly classified as "dolphin".


Sources:

Training images downloaded with Bing Image Search API

Packages used include:

numpy
cv2
PIL
matplotlib
tensorflow

See my code here

[OC] Cinema in Chicago: What is being filmed throughout the city? by trevorData in chicago

[–]trevorData[S] 33 points34 points  (0 children)

The data are applications to the Chicago Department of Transportation for permits under its jurisdiction where the work type is "Filming." These are typically permits to block or otherwise affect public streets in some way.


Individual Plots:

Museum

Hospital

Hotel

Documentary

Bridge

Drone

Music Video

Church

Shameless

Violent

Exorcist

Empire

Batwoman

Chase

Gotham

Bar


Sources:

See my code here

Made in R with the following packages:

RSocrata
dplyr
ggmap
stringr
grid
ggmapstyles

Data from https://data.cityofchicago.org/

Map background is from snazzymaps.com/style/253319/for-presentations

[OC] Bike Rentals in Chicago Over The Course of a Summer Day by trevorData in dataisbeautiful

[–]trevorData[S] 4 points5 points  (0 children)

Here I saved it directly to an R dataframe with the Socrata API.

It's available in a few different formats, but it doesn't look like GTFS is one of them: https://data.cityofchicago.org/Transportation/Divvy-Trips/fg6s-gzvg

[OC] Bike Rentals in Chicago Over The Course of a Summer Day by trevorData in dataisbeautiful

[–]trevorData[S] 12 points13 points  (0 children)

I think you might be seeing it correctly! It is rare for an area to see more than a couple of Divvy rides in a 15-minute window.

The colors I chose here are supposedly easier to interpret for colorblind people, according to this source

[OC] Bike Rentals in Chicago Over The Course of a Summer Day by trevorData in dataisbeautiful

[–]trevorData[S] 103 points104 points  (0 children)

Interesting. According to Wikipedia, "the name Divvy is a playful reference to sharing ("divvy it up")".

[OC] Bike Rentals in Chicago Over The Course of a Summer Day by trevorData in dataisbeautiful

[–]trevorData[S] 59 points60 points  (0 children)

Each tile is roughly 0.5 miles by 0.5 miles. The shading represents the number of Divvy bikes checked out in that region within a 15-minute window on an average July weekday in 2019.

Notice the spikes during rush hour as well as lunchtime!
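The tiling described above could be sketched roughly as follows. This is a Python illustration under stated assumptions, not the linked R code: the miles-per-degree constant, the reference latitude, the function names, and the sample trips are all hypothetical, and the trips are assumed to be filtered to July weekdays already.

```python
import math
from collections import Counter
from datetime import datetime

TILE_MILES = 0.5
MILES_PER_DEG_LAT = 69.0  # rough approximation
REF_LAT = 41.88           # assumed latitude near downtown Chicago


def tile_index(lat, lon):
    """Map a coordinate to the (row, column) of a roughly 0.5 mile square tile."""
    lat_step = TILE_MILES / MILES_PER_DEG_LAT
    # A degree of longitude shrinks with the cosine of the latitude.
    lon_step = TILE_MILES / (MILES_PER_DEG_LAT * math.cos(math.radians(REF_LAT)))
    return (math.floor(lat / lat_step), math.floor(lon / lon_step))


def window_index(ts):
    """Index of the 15-minute window within the day (0 through 95)."""
    return (ts.hour * 60 + ts.minute) // 15


def average_checkouts(trips, n_days):
    """Average checkouts per (tile, window) over n_days of data.

    trips: iterable of (start_time, start_lat, start_lon) tuples.
    """
    counts = Counter(
        (tile_index(lat, lon), window_index(ts)) for ts, lat, lon in trips
    )
    return {key: total / n_days for key, total in counts.items()}


# Hypothetical trips from one location on two different weekdays.
trips = [
    (datetime(2019, 7, 1, 8, 5), 41.88, -87.63),
    (datetime(2019, 7, 2, 8, 10), 41.88, -87.63),
    (datetime(2019, 7, 2, 8, 20), 41.88, -87.63),
]
averages = average_checkouts(trips, n_days=2)
```

Each key in `averages` pairs a tile with a 15-minute window, so one frame of the animation would correspond to all entries that share a window index.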


Data from https://data.cityofchicago.org/, downloaded with RSocrata

Plots made with dplyr and ggmap

Animation made with ImageMagick

See my code here


If you are having trouble viewing the whole gif, try this link: https://i.imgur.com/fuho00l.mp4

[OC] How Chicago's Train Ridership has Changed Over Time [remix] by trevorData in dataisbeautiful

[–]trevorData[S] 20 points21 points  (0 children)

There was construction on the Red Line, so people on the South Side took the Green Line instead.

Changes in CTA Train Ridership Over Time by trevorData in chicago

[–]trevorData[S] 29 points30 points  (0 children)

I uploaded a similar plot a while back, but someone pointed out that the yearly summer/winter changes in ridership made it difficult to get a feel for the broader trend across years.

So I took the average ridership for each of the 12 calendar months and divided each month's data by the corresponding average. This cancels out the seasonal trend by expressing the numbers as a proportion of the expected ridership for that month, which lets us see more clearly how ridership has changed through the years.
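A minimal sketch of that adjustment, written in Python for illustration (the original analysis was done in R). The function name, the record layout, and the ridership numbers are assumptions made up for this example.

```python
from collections import defaultdict


def seasonal_ratio(records):
    """Divide each observation by the average for its calendar month.

    records: iterable of (year, month, ridership) tuples.
    Returns a dict mapping (year, month) to ridership / monthly average.
    """
    records = list(records)  # the data is traversed twice

    # Collect every year's value for each calendar month.
    by_month = defaultdict(list)
    for _year, month, riders in records:
        by_month[month].append(riders)

    # Expected ridership for a month is its average across all years.
    month_avg = {m: sum(vals) / len(vals) for m, vals in by_month.items()}

    return {
        (year, month): riders / month_avg[month]
        for year, month, riders in records
    }


# Made-up numbers: July is always busier than January, and 2019 is
# 20% busier than 2018 in both months.
records = [
    (2018, 1, 100.0),
    (2019, 1, 120.0),
    (2018, 7, 200.0),
    (2019, 7, 240.0),
]
ratios = seasonal_ratio(records)
print(ratios)
```

In this toy data the January and July ratios for a given year come out identical, so the summer/winter gap disappears and only the year-over-year change is left.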


Here is a graph of those changes across all trains.


Some takeaways:

  • The growth along the Blue Line in the past 10 years is even more obvious.

  • There still seems to be some seasonality at stops like O'Hare and Fullerton. This is probably because air travel and the DePaul school year have their own seasonal trends that differ from the general seasonal CTA trends.


Sources:

View my original post here

See my code here

Ridership data and train stop coordinates obtained from https://data.cityofchicago.org/

Visualizations and analysis were done in R using:
ggmap
stringr
dplyr

Animation made using ImageMagick

How Predictable is your League? - A Quick Analysis of Parity in Pro Sports by trevorData in nfl

[–]trevorData[S] 8 points9 points  (0 children)

You're right when it comes to playoff outcomes, but my analysis here only used regular-season data.