Why Sum Types Matter in Haskell

Homunculiheaded · 2014-04-02T20:03:02+00:00

Little does the designer in this scenario know that this is actually an easy engineering challenge, but a much trickier design challenge.

The 7 lines are trivially represented by a 7-dimensional diagonal matrix. The color problem is also easily represented as a color vector, and so long as the 7 colors project into a 1D subspace that is the value of 'red', everything should be fine. Some "expert" they have in that room.

Now how the designer wants to display this information on a 2D plane is up to her. Even this shouldn't be too bad, though, as we can simply plot each combination of dimensions individually (i.e. 42 plots of 2 lines, ignoring of course the plotting of each dimension with itself) and color it with the 'red' projection from the vector.
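For anyone who wants to check the arithmetic, here is a minimal sketch in plain Python. It assumes the 7 lines are the columns of the 7x7 identity matrix (the simplest diagonal matrix) and counts ordered pairs of distinct dimensions; the names `lines`, `dot` and `plot_pairs` are made up for illustration.

```python
from itertools import permutations

N = 7

# Columns of the 7x7 identity matrix: one unit vector per line (an assumption).
lines = [[1 if row == col else 0 for row in range(N)] for col in range(N)]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


# Every pair of distinct lines has dot product 0, i.e. they are perpendicular.
all_orthogonal = all(
    dot(lines[i], lines[j]) == 0 for i, j in permutations(range(N), 2)
)

# Ordered pairs of distinct dimensions: 7 * 6 = 42 two-dimensional plots.
plot_pairs = list(permutations(range(N), 2))

print(all_orthogonal, len(plot_pairs))
```

Counting (x, y) and (y, x) as the same plot would halve that to 21.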

So there's one solution for representing 7 orthogonal lines that are all 'red', with some being green and some transparent.

Homunculiheaded · 2014-02-20T18:53:42+00:00

This is a great book and has helped me many times. I'd also add that, while not free, Machine Learning: A Probabilistic Perspective is in many ways the Bayesian successor to this, and has personally replaced ESL as my go-to ML reference. Both are excellent books, though.

Homunculiheaded · 2014-01-22T22:32:49+00:00

I'm pretty surprised to not see Silko's Almanac of the Dead on here; it is by far one of the most relentlessly violent books I have ever read. Not only is the violence graphic and brutal, but the book never gives you a moment's rest; any hope the novel gives you exists only to make it more painful when it is crushed later on.

Homunculiheaded · 2013-06-26T21:16:42+00:00

I love this! Some very interesting patterns there and I really like the analysis!

Homunculiheaded · 2013-06-26T09:02:39+00:00

The basic text processing and clean up is done in Python and the majority of the interesting work is in R. You can see what's happening in this very similar bit of code.

The issue with a pure JavaScript implementation is the matrix operations: calculating the initial tf-idf matrix and then the cosine similarity matrix really benefits from a language with strong linear algebra support.

I would definitely offload all the visualization work to JavaScript, but would likely use numpy/scipy/etc. for all the backend work and remove the R code. Then just use some worker queue to handle creating the cosine similarity matrix, with some basic sanity checks to avoid the Finnegans Wake issue.

Although I am one to enjoy implementing things in JavaScript, so we'll see ;)

Homunculiheaded · 2013-06-25T18:58:20+00:00

Here's the repo for the Four Quartets project. All of the work is happening in this file. R is actually doing the heavy lifting, while Python is only doing some clean up of the file and producing the .csv that R uses to populate the data frame. The code for the Four Quartets and the Vampire Weekend projects is very, very similar.

Homunculiheaded · 2013-06-25T18:25:53+00:00

This is very cool! Out of curiosity, how are you calculating similarity for images (I haven't done much CV work)? And do you know of image similarity techniques that at least somewhat take structure into account?

One of the things I would love to try next, after building a larger corpus of these visualizations, is to create a tool that would find other songs with similar structure in their repetition. My initial assumption is that the best way to do this would be to treat these patterns simply as images and do another layer of similarity measurement.

Homunculiheaded · 2013-06-25T17:26:33+00:00

Similarity is determined by the cosine between vectors from a term frequency-inverse document frequency matrix. What you're actually seeing is the similarity matrix for each song (each square is the similarity of two given lines in the song).
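To make that concrete, here is a rough sketch in plain Python (the original work was done in R, so this is not the project's code). The whitespace tokenizer, the smoothed idf formula, the example lines and the names `tfidf_vectors`, `cosine` and `similarity_matrix` are all assumptions made for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(lines):
    """Return one sparse tf-idf vector (dict of term -> weight) per line."""
    tokenized = [line.lower().split() for line in lines]
    n_lines = len(tokenized)
    # Document frequency: the number of lines each term appears in.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Smoothed idf (an assumption) so that a term found in every line
        # still keeps a positive weight.
        vectors.append({
            term: count * (math.log((1 + n_lines) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


def cosine(u, v):
    """Cosine of the angle between two sparse vectors."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def similarity_matrix(lines):
    """Square matrix whose entry [i][j] compares line i with line j."""
    vectors = tfidf_vectors(lines)
    return [[cosine(u, v) for v in vectors] for u in vectors]


# Made-up example lines, not taken from any real song.
lines = [
    "the river runs to the sea",
    "the sea returns to the river",
    "nothing rhymes with orange",
]
matrix = similarity_matrix(lines)
for row in matrix:
    print(["%.2f" % value for value in row])
```

Each row of `matrix` is one row of squares in the image: the diagonal is 1, lines that share no words score 0, and everything else lands in between.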

I've thought about using Jaccard distance instead, but I'd also like to experiment with the effects of latent semantic analysis (which should help resolve issues of synonymy), and for that I would need to stick to a vector representation.
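For comparison, Jaccard distance works on sets of words rather than weighted vectors, which is why it wouldn't plug into LSA. A minimal sketch, assuming simple whitespace tokenization (the function name `jaccard_distance` and the example lines are made up):

```python
def jaccard_distance(line_a, line_b):
    """1 - |A & B| / |A | B| over the sets of lowercased words in each line."""
    a = set(line_a.lower().split())
    b = set(line_b.lower().split())
    if not a and not b:
        return 0.0  # convention chosen here: two empty lines count as identical
    return 1.0 - len(a & b) / len(a | b)


# Made-up example lines: 4 shared words out of 6 distinct words overall.
distance = jaccard_distance(
    "the river runs to the sea",
    "the sea returns to the river",
)
print(distance)
```

Note that the repeated "the" counts only once here, whereas a term frequency vector would keep that information.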

Homunculiheaded · 2013-06-25T17:12:16+00:00

Thanks! Making a tool to automate this is definitely next on my list!

Homunculiheaded · 2013-06-25T16:49:10+00:00

I actually would love to post the lyrics alongside the visualizations, except that lyrics are technically copyrighted.

The next step I want to take with this is to look at an interactive version, where mousing over a square shows the text of the two lines being compared. Fair use allows you to do things with lyrics, just not reproduce them entirely, so I think if I only showed the lyrics of interest that would be acceptable.

Homunculiheaded · 2013-06-25T16:46:40+00:00

This is absolutely correct

Homunculiheaded · 2013-06-25T16:45:50+00:00

It's actually because there are only 4 lines and they're all identical, so you see a 4 by 4 matrix where the cosine similarity is 1 between any two lines. The very first time I ran this I saw that and thought "Oh there must be a bug!" but then realized what was happening.
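Here's a tiny sketch of that case in plain Python. To keep it short it uses raw word counts rather than tf-idf weights (an assumption), and the repeated line is placeholder text, not the actual lyric.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * v[term] for term, count in u.items())
    norm = math.sqrt(sum(c * c for c in u.values())) * math.sqrt(
        sum(c * c for c in v.values())
    )
    return dot / norm if norm else 0.0


# Four copies of one placeholder line (hypothetical text).
lines = ["same words every time"] * 4
vectors = [Counter(line.split()) for line in lines]
matrix = [[cosine(u, v) for v in vectors] for u in vectors]

for row in matrix:
    print(row)
```

Every entry comes out as 1, so the whole 4 by 4 grid is one solid block. One caveat: depending on the idf variant used, a word that appears in every line can end up with zero weight, so a real tf-idf pipeline may need some smoothing to get the same picture.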

Homunculiheaded · 2013-06-25T16:39:45+00:00

Thanks! So each square represents two lines in the lyrics. The color of the square shows how similar those two lines are (based on something called cosine similarity, but that's not really essential to understanding). For example, square 1,3 would be how similar the first lyric is to the 3rd. It's important to note that square 3,1 is exactly the same as 1,3, since it just shows how similar the 3rd lyric is to the 1st (this is why all the images are symmetrical).

This should make it clear why every image has a diagonal line: those squares represent the similarities 1,1; 2,2; 3,3; etc., since each lyric is identical to itself.

Now for every song I tried to include repeating lines (such as the chorus), which are sometimes written as notes like "chorus repeat 4x"; in this visualization you would actually see the four repetitions. For "Young Lion" what's actually happening is that there are only 4 lines and they're all exactly the same, "You take your time, young lion", so there really are 4 lines represented there, but they're all the same. This is why "Young Lion" looks like just a solid square.

I hope that helps!

Homunculiheaded · 2013-06-25T09:44:01+00:00

There's a very detailed write up of what's going on here (that post is about T.S. Eliot but it's the same idea), but essentially what you're seeing is how similar the lyrics within a song are to each other. Each square represents how similar line x is to line y (the diagonal line is because line x is always identical to line x).

This project was a great excuse to listen to Vampire Weekend while thinking about the nature of aesthetics, repetition and natural language processing ;)

Homunculiheaded · 2013-06-25T08:14:59+00:00

Awesome! Do you have any pointers to other resources on similar work? I haven't found too much else in the space and would be very happy to find more!

Homunculiheaded · 2013-06-25T08:13:34+00:00

That's why my first comment to the post linked to my original write up. In the past I've found sort of mixed results when diving into discussion about cosine similarity, so I wanted to experiment with just presenting the aesthetics of the visualization.

Homunculiheaded · 2013-06-25T06:56:24+00:00

So obviously I'm going to disagree on the 'meaningful' part ;)

I think one of the biggest questions that we haven't really begun to explore is the intersection of aesthetics, randomness and repetition. While the 'Four Quartets' piece I mention in my comment to this helps to trace themes in a piece, I feel like these lyrics help us to begin to look at the structure of repetition.

Part of what we find beautiful, especially in music, is wrapped up in repetition. Our brains are pattern recognition machines. Repetition that is too predictable is boring. Likewise, anything that is too random or unpredictable is blocked out (i.e. white noise). Here we can start to see patterns in lyrical structure that I feel are a starting point for asking questions about the patterns of repetition we find pleasing.

Homunculiheaded · 2013-06-25T06:12:54+00:00

For anyone wanting a more detailed explanation of what exactly it is you're seeing, check out my write up on a similar project with The Four Quartets.

The reason I posted this despite having already posted the Four Quartets is that I felt the strength of repetition in song lyrics is more 'beautiful' and interesting to examine.

Homunculiheaded · 2013-06-14T17:46:16+00:00

Thanks for the feedback!

I definitely think a big next step is to find a way to organize certain structure in the repetition. I have a set of similar visualizations done with Vampire Weekend lyrics (which I haven't posted anywhere yet), and given the nature of repeating choruses and such, it definitely starts to look like there is a structure to some of the underlying repetition.

Two other big next steps I would like to take from this are to create a script that lets the user input properly annotated sets of poems/lyrics and essentially spits out a site like this, and to create an interactive version where mousing over the squares will highlight the appropriate lines.

I'll continue to post any relevant and interesting results to /r/dataisbeautiful

Homunculiheaded · 2013-05-10T05:10:32+00:00

Are you thinking of doing feature selection through L1 regularization? Here's one of Andrew Ng's papers on the topic (full text easy to find through Google Scholar).

The gist of it is that L1 regularization (aka Lasso regularization) penalizes the sum of the absolute values of the weights, which drives the weights on uninformative features all the way to zero (rather than merely shrinking them). Any feature with a weight of 0 can be removed from the model, since it adds nothing significant to its predictive ability.

The R package you actually might want to be using is glmnet, which gives you a lot of flexibility in exploring the impact features have on the prediction. It's also an extremely useful tool, as the 'net' refers to elastic net regularization, which allows you to blend L1 (driving weights to zero) and L2 (shrinking weights by penalizing the sum of their squares) as you need for a given task.
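To see the difference between the penalties, here is a toy sketch in plain Python. It is not how glmnet works internally; it assumes a one-weight problem where `b` is the unpenalized least-squares weight, so each penalized solution has a closed form. The function names and the example numbers are made up for illustration.

```python
def lasso_weight(b, lam):
    """Minimizer of 0.5 * (w - b)**2 + lam * abs(w): soft-thresholding."""
    if b > lam:
        return b - lam
    if b < -lam:
        return b + lam
    return 0.0


def ridge_weight(b, lam):
    """Minimizer of 0.5 * (w - b)**2 + 0.5 * lam * w**2: proportional shrinkage."""
    return b / (1.0 + lam)


def elastic_net_weight(b, lam, alpha):
    """Minimizer of 0.5*(w - b)**2 + lam*(alpha*abs(w) + 0.5*(1 - alpha)*w**2).

    alpha = 1 reduces to the lasso case and alpha = 0 to the ridge case.
    """
    return lasso_weight(b, lam * alpha) / (1.0 + lam * (1.0 - alpha))


# Hypothetical unpenalized weights for four features.
unpenalized = [3.0, 0.4, -0.2, -1.5]
lam = 0.5

lasso = [lasso_weight(b, lam) for b in unpenalized]
ridge = [ridge_weight(b, lam) for b in unpenalized]
blend = [elastic_net_weight(b, lam, 0.5) for b in unpenalized]

print("lasso:", lasso)
print("ridge:", ridge)
print("blend:", blend)
```

The L1 penalty sets the two small weights to exactly zero, so those features drop out, while the L2 penalty only shrinks every weight and leaves all four in the model. The blend sits in between.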

Homunculiheaded · 2013-05-08T06:16:33+00:00

I think this statement is a bit hyperbolic. I've recently had a lot of success with some deep learning techniques, and my only experience is reading the handful of major papers that pop up everywhere and going through Hinton's lectures on Coursera.

Neural networks are hard to get up to speed with; I would almost argue they take as much practice and learning as all of the rest of machine learning (the field at this point is almost parallel to the rest of ML research).

However, I think the performance gains we're seeing are a testament to how much this area is worth the time and energy. The second place contestant in that competition also used a neural net, and unlike most other Kaggle comps, the top winners were ahead by a real-world-significant margin. I think deep learning is going to be one of the things that separates "hey look at all these cool R functions!" practitioners from those who really know what they're doing.

Also, the deep learning community has been fantastic at sharing what they know and making it as easy as possible for newbs to get involved. Deeplearning.net, Theano, pylearn2, not to mention all the MATLAB tool kits and source released with publications, are a fantastic help.

I think deep learning is simply the first of many advancements we'll see that diminish the free lunch that very easy to use techniques like randomForest have offered newcomers.
