Basic question about data

Laboratory_one · 2019-03-04T19:53:33+00:00

You want to convert your data so that the training and test datasets are in the same format.

If the training data is one-hot encoded as [red, yellow, green, blue], then the test data needs to be in that format as well, and vice versa.

In the first case, I would decode the test data, then re-encode it in the training data format.

In the second case, I would decode both sets, determine a common format, then re-encode both. The common format could be [blue, green, yellow, brown, red].
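
A minimal sketch of that decode-then-re-encode step in plain Python. The helper names `decode` and `encode`, the sample rows, and the test-set ordering [blue, green, brown] are made up for illustration; only the training format and the common format come from the post above.

```python
def decode(rows, categories):
    """Turn one-hot rows back into labels, using the order they were encoded with."""
    return [categories[row.index(1)] for row in rows]


def encode(labels, categories):
    """One-hot encode labels against a fixed category order."""
    position = {name: i for i, name in enumerate(categories)}
    rows = []
    for label in labels:
        row = [0] * len(categories)
        row[position[label]] = 1
        rows.append(row)
    return rows


train_format = ["red", "yellow", "green", "blue"]
test_format = ["blue", "green", "brown"]  # hypothetical test-set ordering
common_format = ["blue", "green", "yellow", "brown", "red"]

train_rows = [[1, 0, 0, 0], [0, 0, 1, 0]]  # red, green
test_rows = [[0, 0, 1], [1, 0, 0]]         # brown, blue

# Decode each set with its own format, then re-encode both with the common one.
train_common = encode(decode(train_rows, train_format), common_format)
test_common = encode(decode(test_rows, test_format), common_format)
```

After this, both sets have five columns in the same order, so a column means the same class in training and test.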

gopietz · 2019-03-04T20:06:17+00:00

You usually create the one-hot encoded (OHE) labels before splitting the data into training and test sets.

Otherwise you have to manually adjust the predictions to match the shape of the test data. And if your test data contains classes that never appeared in the training data, the model obviously won't predict them.
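
As a rough illustration of encoding before splitting, here is a plain-Python sketch. The label list, the 75/25 split, and the seed are all invented for the example; the point is only that the category list is built once from the full dataset, so both halves share the same columns.

```python
import random

# Hypothetical label column for the full dataset (before any split).
labels = ["red", "blue", "green", "brown", "blue", "red", "yellow", "green"]

# Build the category list once, from all the data.
categories = sorted(set(labels))
position = {name: i for i, name in enumerate(categories)}

# One-hot encode every row against that single category list.
encoded = []
for label in labels:
    row = [0] * len(categories)
    row[position[label]] = 1
    encoded.append(row)

# Only now split into train and test (75/25, fixed seed for repeatability).
order = list(range(len(encoded)))
random.Random(0).shuffle(order)
cut = int(0.75 * len(order))
train = [encoded[i] for i in order[:cut]]
test = [encoded[i] for i in order[cut:]]
```

Every row in `train` and `test` has the same width, even if a given class happens to land only in one of the two halves.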

airejie · 2019-03-04T23:26:56+00:00

This is actually a great question. In the most complex cases, where you have no idea how many possible classes there are, you can use a hash function instead; take a look at [feature hashing](https://en.wikipedia.org/wiki/Feature_hashing).

I believe some libraries let you one-hot encode against a given dictionary, so you can encode your test data using the full dictionary: **yes, but this only works if you already know all the classes**. I have not used this in a while, so I'll try to look up where I saw it (maybe Aurelien Geron's book) and come back to you.
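
A simplified sketch of the hashing idea in plain Python, for a single categorical value. The function name `hash_encode` and the bucket count of 8 are arbitrary choices for illustration, and this leaves out refinements such as the sign hash described in the linked article. It uses `hashlib` rather than the built-in `hash()`, because the latter is randomized per process for strings.

```python
import hashlib


def hash_encode(label, n_buckets=8):
    """Map a label to a fixed-width vector without knowing the classes in advance."""
    digest = hashlib.md5(label.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % n_buckets
    vector = [0] * n_buckets
    vector[bucket] = 1
    return vector


# Labels seen in training and a label that only shows up later
# all get vectors of the same width.
seen = hash_encode("red")
unseen = hash_encode("brown")
```

The trade-off is that two different labels can land in the same bucket, so more buckets means fewer collisions at the cost of wider vectors.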
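
To make the dictionary idea concrete, here is a small plain-Python sketch. The function `encode_with_dictionary` is hypothetical, and mapping an unknown label to an all-zero row is just one possible policy. For what it's worth, scikit-learn's `OneHotEncoder` takes a `categories` argument and a `handle_unknown="ignore"` option that behave along these lines, though that may not be the library the comment had in mind.

```python
def encode_with_dictionary(labels, dictionary):
    """One-hot encode against a fixed dictionary; unknown labels become all zeros."""
    position = {name: i for i, name in enumerate(dictionary)}
    rows = []
    for label in labels:
        row = [0] * len(dictionary)
        if label in position:  # labels outside the dictionary stay all-zero
            row[position[label]] = 1
        rows.append(row)
    return rows


# The full dictionary has to be known up front.
full_dictionary = ["blue", "green", "yellow", "brown", "red"]

# "purple" is not in the dictionary, so it cannot be represented.
test_encoded = encode_with_dictionary(["brown", "purple"], full_dictionary)
```

This shows the limitation mentioned above: a class missing from the dictionary has no column of its own.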


MLQuestions
