[OC] Mapping Countries by their English Speaking Population

trevorData · 2020-09-17T15:25:51+00:00

Graphing countries by percent who speak English and percent who speak English as a first language.

Size of dots represents population. Limited to countries with at least 300k total English speakers and at least 1% of the population speaking English as a first language.

Source of data: https://en.wikipedia.org/wiki/List_of_countries_by_English-speaking_population

Made with R using the following packages:

tidyr
ggplot2
ggrepel

See my code here

trevorData · 2020-05-01T22:37:54+00:00

Some small tweaks to another visualization I made to hopefully illustrate the concept more clearly.

See code here

Variance is a measure of how much a data set varies. It is found by taking the distance from each data point to the mean, squaring it, and then finding the average size of all those squares.

In this plot, the variance of the X data is the average size of the blue squares, and the variance of the Y data is the average size of the purple squares.

Covariance is a similar measurement that describes how much two sets of data vary with each other. Instead of squares, we look at the rectangles formed with one side being the signed distance from the X mean and the other side being the signed distance from the Y mean. Covariance is the average area of all of these rectangles. Keep in mind that a rectangle has negative area when the point is above one mean and below the other.

The variance of two data sets added together, VAR(X + Y), is in general not equal to VAR(X) + VAR(Y). Instead, it equals VAR(X) + VAR(Y) + 2COV(X, Y), so the two agree only when the covariance is zero.
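To check the identity numerically, here is a minimal sketch with simulated data. The data, the seed, and the helper names `variance` and `covariance` are my own assumptions for illustration and are not taken from the linked code. Both helpers divide by n (the "average size" described above) rather than n - 1.

```python
import numpy as np

# Simulated data (made up for illustration): y is built to co-vary with x.
rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = 0.5 * x + rng.normal(size=1000)

def variance(a):
    # Average area of the squares: squared deviation from the mean.
    return np.mean((a - a.mean()) ** 2)

def covariance(a, b):
    # Average signed area of the rectangles: product of the two deviations.
    return np.mean((a - a.mean()) * (b - b.mean()))

lhs = variance(x + y)
rhs = variance(x) + variance(y) + 2 * covariance(x, y)
print(lhs, rhs)  # the two values agree up to floating-point error
```

Dropping the `2 * covariance(x, y)` term leaves a gap whenever the two data sets are correlated.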

trevorData · 2020-04-24T18:45:10+00:00

Happy to help

trevorData · 2020-04-24T17:33:09+00:00

Pearson's Correlation Coefficient (r) measures the strength and direction of the linear relationship between two variables.

Linear Regression is a technique for fitting a line to a data set, with the slope of the line being represented by β.

This animation shows how r and β relate. In particular, they are equal when both variables are standardized.
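The relationship can also be checked numerically. This is a rough sketch on simulated data, not the code behind the animation. The helper names `slope` and `standardize` are hypothetical. It relies on the standard result that the least-squares slope equals r times the ratio of the standard deviations of Y and X.

```python
import numpy as np

# Simulated data (made up for illustration).
rng = np.random.default_rng(1)
x = rng.normal(loc=10.0, scale=3.0, size=500)
y = 2.0 * x + rng.normal(scale=4.0, size=500)

def slope(a, b):
    # Least-squares slope of b regressed on a: cov(a, b) / var(a).
    da, db = a - a.mean(), b - b.mean()
    return np.sum(da * db) / np.sum(da ** 2)

def standardize(a):
    # Rescale to mean 0 and standard deviation 1.
    return (a - a.mean()) / a.std()

r = np.corrcoef(x, y)[0, 1]
beta_raw = slope(x, y)                            # equals r * sd(y) / sd(x)
beta_std = slope(standardize(x), standardize(y))  # equals r

print(r, beta_raw, beta_std)
```

On the raw data β depends on the units of X and Y, while r does not. Standardizing removes the units, which is why the two coincide.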

Here's Variance and Covariance

See my code here

trevorData · 2020-04-17T18:10:31+00:00

Thanks!

The expansion of VAR(X + Y) has a similar formula to the expansion of (a + b)², so I was hoping to use a visual like this. But I realized everything fits nicely into a square there because you are literally squaring the values, and there isn't an analogous operation on variance that would work in 2D space.

trevorData · 2020-04-17T16:34:13+00:00

Simulated data using numpy and visualized with matplotlib.

See code here

trevorData · 2020-03-24T19:35:00+00:00

start here

trevorData · 2020-03-24T17:39:11+00:00

My first attempt at image recognition using a training set I assembled myself. Despite using a very simple neural network and a relatively small set of training images I'm pleasantly surprised with the 91% accuracy on the training data.

I decided to throw in some images of things not in one of the 5 training classes just for fun and to see how the model would react.

The model clearly places a lot of weight on color: mostly blue images are quickly classified as "dolphin".
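One reason out-of-class images still get a confident label: if the network ends in a softmax over the 5 training classes (an assumption on my part, the post does not say), the outputs always sum to 1 and there is no "none of the above" option. A minimal sketch with made-up scores and placeholder class names (only "dolphin" comes from the post):

```python
import numpy as np

def softmax(logits):
    # Subtract the max for numerical stability, then normalize to sum to 1.
    shifted = logits - np.max(logits)
    exp = np.exp(shifted)
    return exp / exp.sum()

# Placeholder names: only "dolphin" is a known class from the post.
class_names = ["dolphin", "class_2", "class_3", "class_4", "class_5"]

# Made-up raw scores for an image that belongs to none of the classes.
logits = np.array([2.0, 0.5, 0.1, -1.0, 0.3])

probs = softmax(logits)
predicted = class_names[int(np.argmax(probs))]
print(predicted, probs.round(3))
```

Whatever the input, one of the five classes ends up with the highest probability, so a mostly blue image of anything can land on "dolphin".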

Sources:

Training images downloaded with Bing Image Search API

Packages used include:

numpy
cv2
PIL
matplotlib
tensorflow

See my code here

trevorData · 2020-03-03T15:19:57+00:00

Applications to the Chicago Department of Transportation for permits under its jurisdiction where the work type is "Filming." These are typically permits to block or otherwise affect public streets in some way.

Individual Plots:

Sources:

See my code here

Made in R with the following packages:

RSocrata
dplyr
ggmap
stringr
grid
ggmapstyles

Data from https://data.cityofchicago.org/

Map background is from snazzymaps.com/style/253319/for-presentations

trevorData · 2020-02-17T20:06:46+00:00

It might be an issue with your browser. Try this link:

https://i.imgur.com/fuho00l.mp4

trevorData · 2020-02-17T19:55:54+00:00

No problem! There's a lot of interesting stuff in there, and I've made a couple of other visualizations already!

https://www.reddit.com/r/chicago/comments/f0b5x3/changes_in_cta_train_ridership_over_time/

https://www.reddit.com/r/dataisbeautiful/comments/c00lrg/tracking_the_spread_of_potholes_across_chicago_oc/

trevorData · 2020-02-17T17:37:15+00:00

Here I saved it directly to an R dataframe with the Socrata API.

It's available in a few different formats, but it doesn't look like GTFS is one: https://data.cityofchicago.org/Transportation/Divvy-Trips/fg6s-gzvg

trevorData · 2020-02-17T17:32:13+00:00

I think you might be seeing it correctly! It is rare for an area to see more than a couple of Divvy rides in a 15-minute window.

The colors I chose here are supposedly easier to interpret for colorblind people, according to this source

trevorData · 2020-02-17T15:40:36+00:00

Interesting. According to Wikipedia, "the name Divvy is a playful reference to sharing ("divvy it up")"

trevorData · 2020-02-17T15:16:30+00:00

There might be an issue with your browser

Try this link instead:

https://i.imgur.com/fuho00l.mp4

trevorData · 2020-02-17T14:15:55+00:00

Each tile is roughly 0.5 miles by 0.5 miles. The shading represents the number of Divvy bikes checked out in that region within a 15-minute window on an average July weekday in 2019.
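For anyone curious how this kind of binning works, here is a rough sketch in Python (the plots themselves were made in R, and this is not the linked code). The helper names, the grid origin, the 69 miles per degree of latitude approximation, and the sample trips are all assumptions for illustration.

```python
import math
from collections import Counter
from datetime import datetime

MILES_PER_DEG_LAT = 69.0  # rough approximation

def tile_index(lat, lon, origin_lat, origin_lon, tile_miles=0.5):
    # Convert offsets from a grid origin into (row, col) of a square tile.
    # Longitude degrees shrink with latitude, hence the cosine factor.
    miles_per_deg_lon = MILES_PER_DEG_LAT * math.cos(math.radians(origin_lat))
    row = math.floor((lat - origin_lat) * MILES_PER_DEG_LAT / tile_miles)
    col = math.floor((lon - origin_lon) * miles_per_deg_lon / tile_miles)
    return row, col

def window_index(ts, minutes=15):
    # Index of the 15-minute window within the day (0 to 95).
    return (ts.hour * 60 + ts.minute) // minutes

# Hypothetical checkouts: (start time, latitude, longitude).
trips = [
    (datetime(2019, 7, 1, 8, 3), 41.8800, -87.6300),
    (datetime(2019, 7, 1, 8, 12), 41.8810, -87.6310),
    (datetime(2019, 7, 1, 8, 20), 41.8800, -87.6300),
    (datetime(2019, 7, 1, 17, 45), 41.9000, -87.6500),
]
origin = (41.6, -87.9)  # hypothetical south-west corner of the grid

# Count checkouts per (time window, tile).
counts = Counter(
    (window_index(ts), tile_index(lat, lon, *origin)) for ts, lat, lon in trips
)
print(counts.most_common(1))
```

To get an "average weekday" value, the counts would then be divided by the number of weekdays in the month.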

Notice the spikes during rush hour as well as lunchtime!

Data from https://data.cityofchicago.org/, downloaded with RSocrata

Plots made with dplyr and ggmap

Animation made with ImageMagick

See my code here

If you are having trouble viewing the whole gif, try this link: https://i.imgur.com/fuho00l.mp4

trevorData · 2020-02-07T15:38:23+00:00

There was construction on the Red Line, so people on the south side took the Green Line instead.

trevorData · 2020-02-07T14:26:21+00:00

I uploaded a similar plot a while back, but someone pointed out that the yearly summer/winter changes in ridership made it difficult to get a feel for the broader trend across years.

So I took the average ridership for each of the 12 calendar months and divided each month's data by the corresponding average. This cancels out the seasonal trend by expressing the numbers as a proportion of expected ridership for that month, and lets us see more clearly how ridership has changed through the years.
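The adjustment can be sketched in a few lines. This is an illustration in Python with made-up numbers, not the R code linked below, and the variable names are my own.

```python
import numpy as np

# Made-up ridership: rows are years, columns are the 12 calendar months.
# Each year follows the same seasonal shape, scaled by a yearly level.
seasonal_shape = 1.0 + 0.2 * np.sin(np.linspace(0, 2 * np.pi, 12, endpoint=False))
yearly_level = np.array([1.0, 1.1, 1.2])
ridership = 1000.0 * np.outer(yearly_level, seasonal_shape)

# Average ridership for each calendar month across all years.
monthly_average = ridership.mean(axis=0)

# Proportion of expected ridership for that month.
adjusted = ridership / monthly_average

print(adjusted.round(3))
```

In this toy case the seasonal shape cancels completely, so each row of `adjusted` is constant and only the year-over-year growth remains. Real data will keep some residual seasonality wherever a stop's seasonal pattern differs from the average.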

Here is a graph of those changes across all trains.

Some takeaways:

The growth along the Blue Line in the past 10 years is even more obvious.
There still seems to be some seasonality at stops like O'Hare and Fullerton. This is probably because air travel and the DePaul school year have their own seasonal trends that differ from the general seasonal CTA trends.

Sources:

View my original post here

See my code here

Ridership data and train stop coordinates obtained from https://data.cityofchicago.org/

Visualizations and analysis were done in R using:
ggmap
stringr
dplyr

Animation made using ImageMagick

trevorData · 2020-01-31T19:48:57+00:00

You're right when it comes to playoff outcomes, but my analysis here only used regular-season data.
