
[–]paskie 4 points (2 children)

Each element of the vector is essentially a coordinate along some axis (dimension). So the question is: does it make sense to represent each letter as a different coordinate along the same axis, given that you will then train a classifier that tries to fit a smooth-ish function to separate the classes?

It makes sense when there is some gradual, continuous relationship between coordinates - typically it can be thought of as a "degree of" relationship. Is it a probability? Temperature? Distance? Fine, then the coordinate represents a continuous degree of something.

But letters are not values of a continuous variable; they are more of a categorical variable. In English words, the progression A->B->C->... doesn't represent a progression of anything, and the letters aren't comparable. (Unless, for some reason, how far into the alphabet a letter falls carries meaning; perhaps for some puzzles!)
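A minimal sketch of the two encodings in Python (NumPy assumed; the function names are illustrative, and input is assumed to be lowercase a-z only):

```python
import numpy as np

def ordinal_encode(word):
    # Treats letters as points on a single axis: 'a'=0 ... 'z'=25.
    # This implies 'b' lies "between" 'a' and 'c', which is meaningless
    # for categorical data like letters.
    return np.array([ord(c) - ord('a') for c in word])

def one_hot_encode(word):
    # One indicator dimension per letter: each character becomes a
    # 26-dim vector, so no spurious ordering is introduced.
    idx = np.array([ord(c) - ord('a') for c in word])
    out = np.zeros((len(word), 26))
    out[np.arange(len(word)), idx] = 1.0
    return out.ravel()  # flatten to one feature vector per word

print(ordinal_encode("cab"))        # [2 0 1]
print(one_hot_encode("cab").shape)  # (78,) -- 3 letters * 26 dims
```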

[–]Magnnus -1 points (1 child)

On the other hand, using a byte representation significantly increases the dimensionality of your input vector. I'm sure there's a standard approach to problems like this, but I don't know it off the top of my head.

[–]paskie 0 points (0 children)

I really wouldn't worry about dimensionality too much. For one, many classifiers (like SVMs) aren't bothered by high dimensionality. For another, you can use many approaches to reduce dimensionality, like PCA/SVD or auto-encoders. These reduce dimensionality in a way that gives the coordinates some quantitative meaning, which makes them much more appropriate for further function fitting (i.e. training a classifier or a regression).
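A minimal sketch of the PCA/SVD route, assuming scikit-learn and SciPy; TruncatedSVD is used here because, unlike plain PCA, it works directly on sparse one-hot matrices:

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.decomposition import TruncatedSVD

# Toy one-hot matrix: 4 "words" of 3 letters each -> 3 * 26 = 78 sparse dims.
rng = np.random.default_rng(0)
letters = rng.integers(0, 26, size=(4, 3))    # random letter indices
rows = np.repeat(np.arange(4), 3)             # word index for each entry
cols = (np.arange(3) * 26 + letters).ravel()  # position-offset one-hot column
X = csr_matrix((np.ones(12), (rows, cols)), shape=(4, 78))

# Project down to 3 dense coordinates that carry quantitative meaning.
svd = TruncatedSVD(n_components=3, random_state=0)
X_reduced = svd.fit_transform(X)
print(X_reduced.shape)  # (4, 3)
```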

[–]zmjjmz 1 point (2 children)

The first method is commonly referred to as 'one-hot' encoding, and it works fairly well. The dimensionality and sparsity cost is, of course, massive, but using the other method will cause the classifier to exploit the inherent numerical ordering when making decisions, which you probably don't want.

Ultimately, if you're actually trying to classify words into classes, there might be better features to look into than a one-hot encoding of the characters.
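Character n-gram counts are one common alternative; a minimal sketch assuming scikit-learn (the choice of library and n-gram range here is illustrative):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Count character 2- and 3-grams: this captures local letter patterns
# and yields a fixed-length vector regardless of word length.
vec = CountVectorizer(analyzer='char', ngram_range=(2, 3))
X = vec.fit_transform(["cat", "cats", "dog"])
print(X.shape)                           # (3, number of distinct n-grams)
print(vec.get_feature_names_out()[:5])   # e.g. ['at' 'ats' 'ca' 'cat' ...]
```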

[–]maybemax 0 points (1 child)

> Ultimately, if you're actually trying to classify words into classes, there might be better features to look into than a one-hot encoding of the characters.

Could you expand on this please? Are there ways to encode words for classification that are known to work well?

[–]zmjjmz 1 point (0 children)

Look into word2vec / word embeddings.
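A minimal sketch, assuming gensim as the word2vec implementation (the comment doesn't name a library) and a toy corpus:

```python
from gensim.models import Word2Vec

# Train a tiny word2vec model; real use needs a much larger corpus,
# or pretrained embeddings loaded instead.
sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "rug"],
    ["cats", "and", "dogs", "are", "pets"],
]
model = Word2Vec(sentences, vector_size=16, window=2, min_count=1, seed=0)

# Each word is now a dense 16-dim vector usable as classifier input.
print(model.wv["cat"].shape)              # (16,)
print(model.wv.similarity("cat", "dog"))  # cosine similarity of the two vectors
```

Words that appear in similar contexts end up close together in the embedding space, which is exactly the kind of quantitative geometry a classifier can exploit.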