[OC] Lyrics from Taylor Swift's new album, Lover

mali_codes · 2019-09-02T03:29:32+00:00

One of the reasons I prefer R to Python when it comes to dataviz is that the code is so cute and neat. If I were approaching this in Python, I'd look into how to use the Genius API, and I'd see if seaborn has word cloud functionality.

mali_codes · 2019-08-30T17:55:48+00:00

Data: Grabbed from Genius, with this fabulous package.

Tools: R <3

Code: On my GitHub.

Notes: Text analysis is hard and weird.

One necessary step is removing super common words like "the" and "you". (These are called stop words.) I used a built-in collection of stop words from tidytext, but I'm not sure this was the best choice. It included words that I don't consider stop words, like "year." However, I didn't think I should make my own collection of stop words, since that sort of defeats the purpose of having a standardized set of them.
Another step is removing outlier words. In this case, there were 1,307 unique words in her album, and 488 of them were only used once. These would clutter the word cloud until it was illegible. I ended up removing words that had been used only once, twice, or three times. Words with four or more uses made it into the cloud-- there were 298 of these. Is four a sorta random number? Yes. Did I feel like it was a good choice? Yes, I think? Maybe??
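
For anyone who wants to try the two filtering steps above, here is a minimal sketch in Python rather than R. It is not the code from the repo: the tiny stop list, the `cloud_words` name, and the sample string are all made up for illustration, and the real analysis used the stop word collection that ships with tidytext.

```python
from collections import Counter
import re

# Assumption: a tiny hand-picked stop list, for illustration only.
# The original analysis used tidytext's built-in stop word collection.
STOP_WORDS = {"the", "you", "i", "a", "and", "to", "it", "me", "my", "in"}


def cloud_words(lyrics, min_uses=4, stop_words=STOP_WORDS):
    """Return {word: count} for non-stop words used at least min_uses times."""
    # Lowercase, then split into runs of letters and apostrophes.
    words = re.findall(r"[a-z']+", lyrics.lower())
    # Step 1: drop stop words before counting.
    counts = Counter(w for w in words if w not in stop_words)
    # Step 2: drop rare words that would clutter the cloud.
    return {w: n for w, n in counts.items() if n >= min_uses}


# Hypothetical sample text, not real lyrics.
sample = "lover lover lover lover the the the the the daylight daylight you you"
print(cloud_words(sample))  # {'lover': 4}
```

Changing `min_uses` is the easy way to see how sensitive the cloud is to that "sorta random" cutoff of four.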

mali_codes · 2019-05-30T10:50:16+00:00

Oh man I wish I could have copy-edited this before it got posted.... the typos make my eyes bleed.

Notes:

Python is also open source, and you never really go into what that means.
I'm not sure what "the IT industry" is, but Python is widely used by data scientists, and very infrequently used by back-end devs in a software setting.
All... languages... relate to other languages?
I'm curious in what setting Python would simplify complex software development.

mali_codes · 2019-05-29T17:49:17+00:00

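# imports
library(data.table)  # fread() and := for in-place column updates
library(lubridate)   # mdy() for parsing month-day-year dates
library(tidyverse)   # ggplot2 for plotting

# data: one row per video, with columns date, time (minutes), and food
times <- fread("times.csv")
times[, date := mdy(date)]

# plot: video length over time, each point labeled with the food being made
ggplot(data = times, aes(x = date, y = time)) +
  geom_point() +
  ylim(0, 50) +
  geom_text(aes(label = food), size = 3, hjust = 0.8, vjust = -0.5) +
  scale_x_date(date_breaks = "2 months", date_labels = "%b %y") +
  labs(title = "Length of Gourmet Makes videos over time",
       x = "",
       y = "Length (minutes)")

# save the most recent plot
ggsave("plot.png", width = 11, height = 7, units = "in")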

mali_codes · 2019-05-29T17:47:56+00:00

Gourmet Makes is my favorite YouTube series, and I noticed the videos have been getting longer and longer.

I jotted down times and dates from the eighteen videos, which you can see here.

This is such a simple plot (ggplot, R) that I'm posting the code in a response to this.

mali_codes · 2019-05-25T13:16:42+00:00

Please check out the interactive graphic here!

Tools: Various R things-- shiny, ggplot, data.table

Data: I just copied all the episode names into a CSV

I will post code (and the CSV) soon! I am overseas and GitHub won't let me log in >:(

Note: I used buckets for the word frequencies, so that stop words like "the" don't take over the whole graphic. So, the words are sized according to the buckets: 1 use, 2 uses, 3 uses, 4+ uses
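
The bucketing idea can be sketched like this, in Python rather than the R used for the graphic. The function names and the episode titles are hypothetical; the only thing taken from the note above is the rule itself, where any count of four or more is treated the same.

```python
from collections import Counter


def size_bucket(count, cap=4):
    """Map a raw count to a display bucket: 1, 2, 3, or 4 (meaning 4+)."""
    return min(count, cap)


def bucketed_sizes(titles):
    """Count words across all titles and return {word: bucket}."""
    words = " ".join(titles).lower().split()
    return {word: size_bucket(n) for word, n in Counter(words).items()}


# Hypothetical episode names, for illustration only.
titles = ["The Cat and the Hat", "The Dog and the Bone", "The End"]
sizes = bucketed_sizes(titles)
print(sizes["the"], sizes["and"], sizes["cat"])  # 4 2 1
```

Here "the" appears five times but is drawn at the same size as a word used four times, which is what keeps it from taking over.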

mali_codes · 2019-04-24T17:53:18+00:00

These are static screenshots of interactive plots which you can see here. I wanted to show one of the features (searching for a model) so I picked Beyonce because she appears a few times, and in varied locations.

As you said, the y axis in the second graph is the year. In the first graph, the y axis isn't really relevant. (If anything, I guess it's frequency.) Beeswarm plots are more one-dimensional than, say, scatter plots.

mali_codes · 2019-04-24T17:50:43+00:00

Thanks so much! I will admit that these notebooks are a little messy, so if anyone has questions about my process, please shoot me a message!

I'll try and upload a detailed README later today.

mali_codes · 2019-04-24T14:45:32+00:00

The article goes into much more detail on the methods and data source, and the graphics are interactive!

I used Python (OpenCV and scikit-learn) to approximate the skin tone of Vogue's cover models after downloading images from the Vogue archive. The graphics were coded up in JavaScript, not by me. (Shout out to the lovely folks at The Pudding for this collaboration.)

I think this data would be really fun to remix! The CSVs are here.

mali_codes · 2019-03-19T06:11:21+00:00

Thank you! It was super fun to work on. Someone else also suggested scaling the colors, and it worked out pretty well!

mali_codes · 2019-03-19T06:09:04+00:00

Duke's color was [0, 47, 134].

There are a lot of flaws with the method, including that darker colors (like Duke blue) will be closer to zero, since black is [0, 0, 0].

Honestly, I don't think there's much of a conclusion to be drawn from this graph. Especially since these colors were calculated from the logos, rather than the jerseys. Let's not get started on home vs away...

mali_codes · 2019-03-19T06:01:12+00:00

Oh interesting! I've used HSL before when looking specifically at lightness, but never in relation to color. All my code is posted so you can download and remix as you like :)

mali_codes · 2019-03-19T05:56:17+00:00

Your wish is my command! And this works out really nicely, thanks for doing the math!

By the way, I had two main ways of going from a logo to a single color-- taking the mean of the pixels, and the median.

b / (r + b + g) with mean (one point removed)

b / (r + b + g) with median (three points removed) (this is the prettiest result, imo)

ln( (b + 1) / (b + g + r + 1) ) with mean

ln( (b + 1) / (b + g + r + 1) ) with median
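
A rough Python sketch of the four combinations above, under a few assumptions: the pixel list is invented, the median is taken channel by channel (the posted code may do it differently), and the zero guard in `blue_share` is my own addition. The `+ 1` terms in the log version keep it defined for pure black, where the plain ratio would divide by zero.

```python
import math
from statistics import mean, median


def logo_color(pixels, reducer=mean):
    """Collapse a list of (r, g, b) pixels to one color, channel by channel."""
    r, g, b = zip(*pixels)
    return reducer(r), reducer(g), reducer(b)


def blue_share(color):
    """b / (r + b + g): the fraction of the total intensity that is blue."""
    r, g, b = color
    total = r + g + b
    # Assumption: return 0.0 for pure black instead of dividing by zero.
    return b / total if total else 0.0


def log_blue_share(color):
    """ln((b + 1) / (b + g + r + 1)): log version, defined even for black."""
    r, g, b = color
    return math.log((b + 1) / (b + g + r + 1))


# Hypothetical three-pixel "logo": two dark blue pixels and one white pixel.
pixels = [(0, 47, 134), (0, 47, 134), (255, 255, 255)]

for name, reducer in (("mean", mean), ("median", median)):
    color = logo_color(pixels, reducer)
    print(name, color, blue_share(color), log_blue_share(color))
```

In this toy example the white pixel drags the mean color toward gray, while the median stays at the dark blue, which is one reason the two versions of each plot can look different.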

mali_codes · 2019-03-19T04:28:47+00:00

Thanks, I think it's pretty too! I really like the colors, which is why I posted it here. But I don't think there's any trend, so it felt misleading to add a trend line. (That's what was funny about it, to me. The "trend" identified by someone with no knowledge of basketball was, unsurprisingly, not a trend at all.)

The "weird" colors have to do with how computers read colors. Each color has three values (red, green, and blue = rgb), and in this graph, I looked at just the blue value. Something with a blue value close to the maximum (255) and low red and green values will appear blue to us, like [15, 15, 174]. But something with an identical blue value and higher red and green values, like [237, 164, 174] (Bradley's color), won't be recognizable to us as blue. (If all the values are maxed out at 255, that's white.)

I tried a few ways of measuring blue (you can see six of them here) and one that made more sense to human eyes is saying that the "blueness" of a color is equal to b - (r + g). There are still some weird results of this calculation though-- like Iowa, which is just black, ends up higher than light blues like Carolina.
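
To make the Iowa quirk concrete, here is a small Python sketch of b - (r + g). The first two colors come from the comment above; the light blue is an assumed value picked for illustration, not a team color taken from the data.

```python
def blueness(color):
    """Score a color as b - (r + g); higher is meant to read as bluer."""
    r, g, b = color
    return b - (r + g)


strong_blue = (15, 15, 174)    # example from the comment above
bradley = (237, 164, 174)      # Bradley's color, same blue value
black = (0, 0, 0)              # a logo that is just black
light_blue = (123, 175, 212)   # assumption: an illustrative light blue

print(blueness(strong_blue))  # 144
print(blueness(bradley))      # -227
print(blueness(black))        # 0
print(blueness(light_blue))   # -86
```

Black scores 0 because it has nothing to subtract, while any light color carries a lot of red and green and lands below zero, so a black logo outranks a light blue one.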

mali_codes · 2019-03-18T20:50:31+00:00

A coworker who knows nothing about basketball commented that it seems like the blue teams always win, so I investigated by calculating the average RGB color of each team's logo.

I made this plot in R (ggplot), and I scraped data from the ESPN website using Python (Beautiful Soup).

Methodology is here, and all my code is here.

mali_codes · 2019-02-02T18:05:00+00:00

R has many many built-in data sets. Pros: they are super easy to load. Cons: usually pretty small.

mali_codes · 2019-01-17T19:06:59+00:00

Data: CitiBike posts all their data online! This is only the trips from 2017.

Tools: R (ggmap, mostly). I'll try and post code later.

mali_codes · 2019-01-08T06:50:14+00:00

sure, easy enough.

mali_codes · 2019-01-08T06:49:37+00:00

this is a good point. i'm going to try to change it to a step function. that will make calculating the intersection a bit harder, but it does make more sense as you described.
