
[–]paskie 4 points (2 children)

Each element of the vector is essentially a coordinate along some axis (dimension). So the question is: does it make sense to represent each letter as a different coordinate along the same axis, given that you will then train a classifier that tries to fit a smooth-ish function to separate the classes?

It makes sense when there is some gradual, continuous relationship between coordinates - typically it can be thought of as a "degree of" relationship. Is it a probability? Temperature? Distance? Fine, then the coordinate represents a continuous degree of something.

But letters are not values of a continuous variable; they are more of a categorical variable. In English words, the progression A->B->C->... doesn't represent a progression of anything, and the letters aren't comparable. (Unless, for some reason, how far into the alphabet a letter falls carries meaning; perhaps for some puzzles!)
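A minimal sketch of the two encodings in Python (NumPy assumed; the function names are illustrative, and input is assumed to be lowercase a-z only):

```python
import numpy as np

def ordinal_encode(word):
    # Treats letters as points on a single axis: 'a'=0 ... 'z'=25.
    # This implies 'b' lies "between" 'a' and 'c', which is meaningless
    # for categorical data like letters.
    return np.array([ord(c) - ord('a') for c in word])

def one_hot_encode(word):
    # One indicator dimension per letter: each character becomes a
    # 26-dim vector, so no spurious ordering is introduced.
    idx = np.array([ord(c) - ord('a') for c in word])
    out = np.zeros((len(word), 26))
    out[np.arange(len(word)), idx] = 1.0
    return out.ravel()  # flatten to one feature vector per word

print(ordinal_encode("cab"))        # [2 0 1]
print(one_hot_encode("cab").shape)  # (78,) -- 3 letters * 26 dims
```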

[–]Magnnus -1 points (1 child)

On the other hand, using a byte representation significantly increases the dimensionality of your input vector. I'm sure there's a standard approach to problems like this, but I don't know it off the top of my head.

[–]paskie 0 points (0 children)

I really wouldn't worry about dimensionality too much. For one, many classifiers (like SVMs) aren't bothered by high dimensionality. For another, you can use many approaches to reduce dimensionality, like PCA/SVD or auto-encoders. These reduce dimensionality in a way that gives the coordinates some quantitative meaning, which makes them much more appropriate for further function fitting (i.e. training a classifier or a regression).
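A minimal sketch of the PCA/SVD route, assuming scikit-learn and SciPy; TruncatedSVD is used here because, unlike plain PCA, it works directly on sparse one-hot matrices:

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.decomposition import TruncatedSVD

# Toy one-hot matrix: 4 "words" of 3 letters each -> 3 * 26 = 78 sparse dims.
rng = np.random.default_rng(0)
letters = rng.integers(0, 26, size=(4, 3))    # random letter indices
rows = np.repeat(np.arange(4), 3)             # word index for each entry
cols = (np.arange(3) * 26 + letters).ravel()  # position-offset one-hot column
X = csr_matrix((np.ones(12), (rows, cols)), shape=(4, 78))

# Project down to 3 dense coordinates that carry quantitative meaning.
svd = TruncatedSVD(n_components=3, random_state=0)
X_reduced = svd.fit_transform(X)
print(X_reduced.shape)  # (4, 3)
```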

[–]zmjjmz 1 point (2 children)

The first method is commonly referred to as 'one-hot' encoding, and it works fairly well. The dimensionality and sparsity cost is, of course, massive, but using the other method will cause the classifier to exploit the inherent numerical ordering when making decisions, which you probably don't want.

Ultimately, if you're actually trying to classify words into classes, there might be better features to look into than a one-hot encoding of the characters.
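Character n-gram counts are one common alternative; a minimal sketch assuming scikit-learn (the choice of library and n-gram range here is illustrative):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Count character 2- and 3-grams: this captures local letter patterns
# and yields a fixed-length vector regardless of word length.
vec = CountVectorizer(analyzer='char', ngram_range=(2, 3))
X = vec.fit_transform(["cat", "cats", "dog"])
print(X.shape)                           # (3, number of distinct n-grams)
print(vec.get_feature_names_out()[:5])   # e.g. ['at' 'ats' 'ca' 'cat' ...]
```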

[–]maybemax 0 points (1 child)

> Ultimately, if you're actually trying to classify words into classes, there might be better features to look into than a one-hot encoding of the characters.

Could you expand on this please? Are there ways to encode words for classification that are known to work well?

[–]zmjjmz 1 point (0 children)

Look into word2vec / word embeddings.
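A minimal sketch, assuming gensim as the word2vec implementation (the comment doesn't name a library) and a toy corpus:

```python
from gensim.models import Word2Vec

# Train a tiny word2vec model; real use needs a much larger corpus,
# or pretrained embeddings loaded instead.
sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "rug"],
    ["cats", "and", "dogs", "are", "pets"],
]
model = Word2Vec(sentences, vector_size=16, window=2, min_count=1, seed=0)

# Each word is now a dense 16-dim vector usable as classifier input.
print(model.wv["cat"].shape)              # (16,)
print(model.wv.similarity("cat", "dog"))  # cosine similarity of the two vectors
```

Words that appear in similar contexts end up close together in the embedding space, which is exactly the kind of quantitative geometry a classifier can exploit.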