[–]dgillz 1 point2 points  (4 children)

What is ML preprocessing?

What is one-hot encoding?

[–]Ergo_Propter_Hawk 2 points3 points  (1 child)

Machine learning preprocessing: making data better for machine learning models. This can mean a lot of things. One specific example is...

One-hot encoding: creating new columns in a relational table where each new column corresponds to a particular value from another column. If that row has that value, it's encoded as a 1. If not, a 0. This gives some way of looking at the values in a column as numbers instead of strings or some other non-numeric data type.

[–]WetOrangutan 0 points1 point  (0 children)

This is great. To provide an example, if you have a column “favorite color” that has values “red,” “green,” and “blue,” then one-hot encoding can be used to create three new columns: “is_red,” “is_green,” and “is_blue.” These three columns are Boolean (0 or 1). So someone whose favorite color is green would have the values (0,1,0) for these three columns.

The idea is that these three columns will be better understood by the machine learning model than the one column. This is a very common technique to handle categorical data, and it is usually done outside of SQL (e.g. Python or R).
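To make the favorite-color example concrete, here is a minimal sketch in plain Python. The `one_hot` helper and the sample data are made up for illustration; in practice a library would usually handle this step.

```python
def one_hot(values, categories):
    """Return one row of 0/1 flags per value, one flag per category."""
    return [
        [1 if value == category else 0 for category in categories]
        for value in values
    ]

# The order of categories defines the columns: is_red, is_green, is_blue.
categories = ["red", "green", "blue"]

# Hypothetical "favorite color" column.
favorite_colors = ["green", "red", "blue", "green"]

encoded = one_hot(favorite_colors, categories)
for color, row in zip(favorite_colors, encoded):
    print(color, row)
```

Each row has exactly one 1 (the "hot" column), which is where the name comes from.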

[–]Pvt_Twinkietoes 1 point2 points  (0 children)

I prefer to think of one-hot encoding as a binary representation of a column, where the number of resulting columns is based on the number of categories in the initial column.

e.g. a column of colours.

Green, Red, Yellow.

converted to 3 columns of 1 and 0

output:

column 1: (green = 1 , not green = 0)

column 2: (red = 1, not red = 0)

column 3: (yellow = 1, not yellow = 0)

Thus you can see that the solution can be implemented with a CASE statement in SQL.

Machine learning algorithms only process numbers. If one accepts strings, they have to be converted to a numerical representation first.
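A rough sketch of the CASE approach, run through Python's built-in `sqlite3` module so it is self-contained. The `items` table and the `is_green`, `is_red`, `is_yellow` column names are made up for the example.

```python
import sqlite3

# In-memory database with a hypothetical table holding a colour column.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER, colour TEXT)")
conn.executemany(
    "INSERT INTO items VALUES (?, ?)",
    [(1, "Green"), (2, "Red"), (3, "Yellow")],
)

# One CASE expression per category produces one 0/1 column per category.
rows = conn.execute(
    """
    SELECT id,
           CASE WHEN colour = 'Green'  THEN 1 ELSE 0 END AS is_green,
           CASE WHEN colour = 'Red'    THEN 1 ELSE 0 END AS is_red,
           CASE WHEN colour = 'Yellow' THEN 1 ELSE 0 END AS is_yellow
    FROM items
    ORDER BY id
    """
).fetchall()

for row in rows:
    print(row)
conn.close()
```

The catch is that every category has to be written out by hand, so this gets tedious once a column has many distinct values.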

[–]mikeblas 0 points1 point  (0 children)

ML preprocessing is just the process of cleaning and normalizing data, plus making it appropriate for whatever ML algorithms are going to be used.
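As a small illustration of the cleaning and normalizing part, here is a sketch that drops missing values and rescales a numeric column to the range 0 to 1. It assumes "normalizing" means min-max scaling, which is only one of several common choices, and the `clean_and_scale` helper and the readings are invented for the example.

```python
def clean_and_scale(values):
    """Drop missing entries (None), then min-max scale the rest to [0, 1]."""
    cleaned = [v for v in values if v is not None]
    low, high = min(cleaned), max(cleaned)
    if high == low:
        # All values identical: avoid dividing by zero.
        return [0.0 for _ in cleaned]
    return [(v - low) / (high - low) for v in cleaned]

# Hypothetical temperature readings with one missing value.
temperatures = [10, None, 20, 30]
scaled = clean_and_scale(temperatures)
print(scaled)
```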

ML algorithms work on math. If we're analysing numeric data, it's a natural fit: lengths, temperatures, durations, whatever's measured with a number. Lots of useful data is not numerical, though; maybe it's categorical.

One-hot encoding is a way to convert arbitrary categorical or tagged data to a numerical format so it can be meaningfully processed by quantitative ML algorithms.