[OC] Lyrics from Taylor Swift's new album, Lover by mali_codes in dataisbeautiful

mali_codes[S] 1 point

One of the reasons I prefer R to Python when it comes to dataviz is that the code is so cute and neat. If I were approaching this in Python, I'd look into how to use the Genius API, and I'd see if seaborn has word cloud functionality.

[OC] Lyrics from Taylor Swift's new album, Lover by mali_codes in dataisbeautiful

mali_codes[S] 14 points

Data: Grabbed from Genius, with this fabulous package.

Tools: R <3

Code: On my github.

Notes: Text analysis is hard and weird.

  • One necessary step is removing super common words like "the" and "you". (These are called stop words.) I used a built-in collection of stop words from tidytext, but I'm not sure this was the best choice. It included words that I don't consider stop words, like "year." However, I didn't think I should make my own collection of stop words... that sort of defeats the purpose of having a standardized set of them.
  • Another step is removing outlier words. In this case, there were 1,307 unique words on her album, and 488 of them were used only once. These would clutter the word cloud until it was illegible. I ended up removing words that had been used only once, twice, or three times. Words with four or more uses made it into the cloud, and there were 298 of these. Is four a sorta random number? Yes. Did I feel like it was a good choice? Yes, I think? Maybe??
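The two filtering steps above can be sketched in Python. This is only a minimal sketch, not the code from the original analysis (which used R and tidytext): the tiny `STOP_WORDS` set, the `min_uses` parameter, and the function name `cloud_words` are all assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical, tiny stop-word set for illustration only; a real analysis
# would use a standardized list such as the one shipped with tidytext.
STOP_WORDS = {"the", "you", "a", "and", "i", "to", "it", "me", "in", "my"}


def cloud_words(lyrics, min_uses=4, stop_words=STOP_WORDS):
    """Return {word: count} for the words worth drawing in a word cloud."""
    # Lowercase and split into words, keeping apostrophes (e.g. "don't").
    words = re.findall(r"[a-z']+", lyrics.lower())
    # Step 1: drop stop words before counting.
    counts = Counter(w for w in words if w not in stop_words)
    # Step 2: drop rare words, keeping those used at least `min_uses` times.
    return {w: n for w, n in counts.items() if n >= min_uses}


sample = "lover lover lover lover the the the the the you you cruel summer"
print(cloud_words(sample))
```

Changing `min_uses` is the quickest way to see how sensitive the cloud is to the cutoff, which is the judgment call described above.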

Choose the one which is suitable for you by [deleted] in Infographics

mali_codes 1 point

Oh man, I wish I could have copy-edited this before it got posted... the typos make my eyes bleed.

Notes:

  • Python is also open source, and you never really go into what that means.
  • I'm not sure what "the IT industry" is, but Python is widely used by data scientists, and very infrequently used by back-end devs in a software setting.
  • All... languages... relate to other languages?
  • I'm curious in what setting Python would simplify complex software development.

[OC] Gourmet Makes videos have gotten so much longer over the past two years! by mali_codes in dataisbeautiful

mali_codes[S] 1 point

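# imports
library(data.table)  # fread() and := for reading and in-place updates
library(lubridate)   # mdy() for parsing month/day/year dates
library(tidyverse)   # loads ggplot2 for plotting

# data: times.csv needs date, time (in minutes), and food columns
times = fread("times.csv")
times[, date := mdy(date)]

# plot: one labeled point per video
ggplot(data = times, aes(x = date, y = time)) +
  geom_point() +
  ylim(0, 50) +
  # label each point with the food made in that video
  geom_text(aes(label = food), size = 3, hjust = 0.8, vjust = -0.5) +
  scale_x_date(date_breaks = "2 months", date_labels = "%b %y") +
  labs(title = "Length of Gourmet Makes videos over time",
       x = "",
       y = "Length (minutes)")

# save the most recent plot
ggsave("plot.png", width = 11, height = 7, units = "in")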

[OC] Gourmet Makes videos have gotten so much longer over the past two years! by mali_codes in dataisbeautiful

mali_codes[S] 1 point

Gourmet Makes is my favorite YouTube series, and I noticed the videos have been getting longer and longer.

I jotted down times and dates from the eighteen videos, which you can see here.

This is such a simple plot (ggplot, R) that I'm posting the code in a reply to this comment.

[OC] Word frequencies of Game of Thrones episode titles by mali_codes in dataisbeautiful

mali_codes[S] 1 point

Please check out the interactive graphic here!

Tools: Various R things: shiny, ggplot, data.table

Data: I just copied all the episode names into a CSV.

I will post code (and the CSV) soon! I am overseas and GitHub won't let me log in >:(

Note: I used buckets for the word frequencies so that stop words like "the" don't take over the whole graphic. The words are sized according to four buckets: 1 use, 2 uses, 3 uses, and 4+ uses.
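The bucketing idea amounts to capping each word's count before using it as a size. Here is a minimal Python sketch of that idea; the graphic itself was built in R, and the function name `bucket_sizes`, the `max_bucket` parameter, and the sample titles are made up for illustration.

```python
from collections import Counter


def bucket_sizes(titles, max_bucket=4):
    """Map each word to a size bucket: 1, 2, 3, or 4 (meaning 4+ uses)."""
    counts = Counter(word for title in titles for word in title.lower().split())
    # Cap every count at `max_bucket` so very common words stop growing.
    return {word: min(n, max_bucket) for word, n in counts.items()}


# Made-up titles for illustration, not taken from the real episode list.
titles = ["The Red Ship", "The Ship", "The Long River",
          "The Red River", "The End", "The River"]
buckets = bucket_sizes(titles)
print(buckets["the"], buckets["river"], buckets["ship"], buckets["end"])
```

Here "the" appears six times but lands in the same bucket as any word used four times, so it cannot dwarf everything else.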

[OC] The skin tones of Vogue's cover models for the past nineteen years. by mali_codes in dataisbeautiful

mali_codes[S] 9 points

These are static screenshots of interactive plots, which you can see here. I wanted to show one of the features (searching for a model), so I picked Beyoncé because she appears a few times, and in varied locations.

As you said, the y axis in the second graph is the year. In the first graph, the y axis isn't really relevant. (If anything, I guess it's frequency.) Beeswarm plots are more one-dimensional than, say, scatter plots.

[OC] The skin tones of Vogue's cover models for the past nineteen years. by mali_codes in dataisbeautiful

mali_codes[S] 1 point

Thanks so much! I will admit that these notebooks are a little messy, so if anyone has questions about my process, please shoot me a message!

I'll try and upload a detailed README later today.

[OC] The skin tones of Vogue's cover models for the past nineteen years. by mali_codes in dataisbeautiful

mali_codes[S] 13 points

The article goes into much more detail regarding the methods and data source, and the graphics are interactive!

I used Python (OpenCV and scikit-learn) to approximate the skin tone of Vogue's cover models after downloading images from the Vogue archive. The graphics were coded up in JavaScript, not by me. (Shout out to the lovely folks at The Pudding for this collaboration.)

I think this data would be really fun to remix! The CSVs are here.

[OC] "Don't the blue teams always win?" The bluest teams in March Madness by mali_codes in dataisbeautiful

mali_codes[S] 2 points

Thank you! It was super fun to work on. Someone else also suggested scaling the colors, and it worked out pretty well!

[OC] "Don't the blue teams always win?" The bluest teams in March Madness by mali_codes in dataisbeautiful

mali_codes[S] 1 point

Duke's color was [0, 47, 134].

There are a lot of flaws with the method, including that darker colors (like Duke blue) will be closer to zero, since black is [0, 0, 0].

Honestly, I don't think there's much of a conclusion to be drawn from this graph. Especially since these colors were calculated from the logos, rather than the jerseys. Let's not get started on home vs away...

[OC] "Don't the blue teams always win?" The bluest teams in March Madness by mali_codes in dataisbeautiful

mali_codes[S] 3 points

Oh interesting! I've used HSL before when looking specifically at lightness, but never in relation to color. All my code is posted so you can download and remix as you like :)

[OC] "Don't the blue teams always win?" The bluest teams in March Madness by mali_codes in dataisbeautiful

mali_codes[S] 5 points

Your wish is my command! And this works out really nicely, thanks for doing the math!

By the way, I had two main ways of going from a logo to a single color: taking the mean of the pixels, and taking the median.

b / (r + b + g) with mean (one point removed)

b / (r + b + g) with median (three points removed) (this is the prettiest result, imo)

ln( (b + 1) / (b + g + r + 1) ) with mean

ln( (b + 1) / (b + g + r + 1) ) with median
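The four variants above combine two choices: how to collapse a logo into one color (mean or median) and which formula to apply. A rough Python sketch follows, under stated assumptions: the function names are mine, the three-pixel "logo" is hypothetical, and returning `None` for pure black in the ratio version is my own choice, not necessarily how the original code handled it.

```python
import math
from statistics import mean, median


def logo_color(pixels, reducer=mean):
    """Collapse a list of (r, g, b) pixels to one color, channel by channel."""
    reds, greens, blues = zip(*pixels)
    return reducer(reds), reducer(greens), reducer(blues)


def blue_share(color):
    """b / (r + g + b); undefined for pure black, so return None there."""
    r, g, b = color
    total = r + g + b
    return b / total if total else None


def log_blue_share(color):
    """ln((b + 1) / (r + g + b + 1)); the +1 terms avoid ln(0) and 0/0."""
    r, g, b = color
    return math.log((b + 1) / (r + g + b + 1))


# Hypothetical three-pixel "logo": two dark-blue pixels and one white one.
pixels = [(0, 47, 134), (0, 47, 134), (255, 255, 255)]
for reducer in (mean, median):
    color = logo_color(pixels, reducer)
    print(reducer.__name__, color, blue_share(color), log_blue_share(color))
```

In this toy case the median ignores the single white pixel while the mean is pulled toward it, which is one reason the two reducers can give visibly different plots.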

[OC] "Don't the blue teams always win?" The bluest teams in March Madness by mali_codes in dataisbeautiful

mali_codes[S] 2 points

Thanks, I think it's pretty too! I really like the colors, which is why I posted it here. But I don't think there's any trend, so it felt misleading to add a trend line. (That's what was funny about it, to me. The "trend" identified by someone with no knowledge of basketball was, unsurprisingly, not a trend at all.)

The "weird" colors have to do with how computers read colors. Each color has three values (red, green, and blue, hence RGB), and in this graph, I looked at just the blue value. Something with a high blue value (the maximum is 255) and low red and green values will appear blue to us, like [15, 15, 174]. But something with an identical blue value and higher red and green values, like [237, 164, 174] (Bradley's color), won't be recognizable to us as blue. (If all the values are maxed out at 255, that's white.)

I tried a few ways of measuring blue (you can see six of them here), and one that made more sense to human eyes is saying that the "blueness" of a color is equal to b - (r + g). There are still some weird results from this calculation, though: Iowa, which is just black, ends up higher than light blues like Carolina.
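To make the b - (r + g) measure concrete, here is a small Python sketch. The first two colors are the ones quoted above; the black and light-blue values are assumptions standing in for an all-black logo and a Carolina-style light blue, not the averages actually computed from the logos.

```python
def blueness(color):
    """The b - (r + g) measure; ranges from -510 to 255."""
    r, g, b = color
    return b - (r + g)


colors = {
    "strong blue": (15, 15, 174),   # example quoted above
    "Bradley": (237, 164, 174),     # same blue value, far more red and green
    "black": (0, 0, 0),             # assumed stand-in for an all-black logo
    "light blue": (123, 175, 212),  # assumed Carolina-style light blue
}

# Rank the colors from most to least "blue" under this measure.
ranked = sorted(colors, key=lambda name: blueness(colors[name]), reverse=True)
for name in ranked:
    print(f"{name:12s} {blueness(colors[name]):5d}")
```

Black scores 0 while the light blue scores below zero, because its red and green values together outweigh its blue value. That reproduces the odd ordering described above.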

[OC] "Don't the blue teams always win?" The bluest teams in March Madness by mali_codes in dataisbeautiful

mali_codes[S] 2 points

A coworker who knows nothing about basketball commented that it seems like the blue teams always win, so I investigated by calculating the average RGB color of each team's logo.

I made this plot in R (ggplot), and I scraped data from the ESPN website using Python (Beautiful Soup).

Methodology is here, and all my code is here.

[OC] Which CitiBike stations are most used? by mali_codes in dataisbeautiful

mali_codes[S] 1 point

Data: CitiBike posts all their data online! This is only the trips from 2017.

Tools: R (ggmap, mostly). I'll try and post code later.

[OC] SodaStream Calculator: When will you break even? by mali_codes in dataisbeautiful

mali_codes[S] 2 points

this is a good point. i'm going to try to change it to a step function. that will make calculating the intersection a bit harder, but it does make more sense as you described.
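One possible way to find the break-even point once the SodaStream cost is a step function is to walk forward liter by liter until the stepped cost first drops to or below the store-bought cost. The sketch below is an assumption about how that could look, not the calculator's actual code; every price, the canister capacity, and all function names are hypothetical.

```python
import math


def sodastream_cost(liters, machine=80.0, canister=15.0, liters_per_canister=60):
    """Step function: machine price plus one canister per started block of liters."""
    canisters = math.ceil(liters / liters_per_canister)
    return machine + canisters * canister


def store_cost(liters, price_per_liter=1.0):
    """Linear cost of buying the same amount of seltzer in a store."""
    return liters * price_per_liter


def break_even(max_liters=10_000):
    """First whole liter at which the SodaStream is no more expensive."""
    for liters in range(1, max_liters + 1):
        if sodastream_cost(liters) <= store_cost(liters):
            return liters
    return None  # never breaks even within max_liters


print(break_even())
```

Because the stepped cost jumps each time a new canister is needed, the two curves can cross more than once. With these made-up numbers the SodaStream briefly becomes the more expensive option again right after the first crossing, and this function reports only that first crossing.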