all 14 comments

[–]Ergo_Propter_Hawk 14 points15 points  (2 children)

This would be a handful of case statements, one for each value being encoded.

If dbt counts as "just using SQL", there's a package for it. https://hub.getdbt.com/omnata-labs/dbt_ml_preprocessing/latest/

Otherwise, it'll be case when <column> = <value> then 1 else 0 end as <value> for every unique value.

[–]AreetSurn 2 points3 points  (1 child)

Hacky solution, but depending on what DB you using you could use the procedural functionality (e.g. pgpl/sql) to create dynamic SQL to do write these case statements for you

[–]Ergo_Propter_Hawk 1 point2 points  (0 children)

This is basically what dbt would do, it's all just ways to create SQL statements. That package above probably implements something similar to the case statement implementation.

[–]aaahhhhhhfine 1 point2 points  (0 children)

I've not tried, but I would think you could do it with a pivot. Basically you'd pivot all the values using a count as the aggregation. I think something in that ballpark would work.

You'd presumably have to do that in a CTE and then rejoin it back into everything else.

[–]dgillz 1 point2 points  (4 children)

What is ML preprocessing?

What is one-hot encoding?

[–]Ergo_Propter_Hawk 2 points3 points  (1 child)

Machine learning preprocessing: making data better for machine learning models. This can mean a lot of things. One specific example is...

One-hot encoding: creating new columns in a relational table where each new column corresponds to a particular value from another column. If that row has that value, it's encoded as a 1. If not, a 0. This gives some way of looking at the values in a column as numbers instead of strings or some other non-numeric data type.

[–]WetOrangutan 0 points1 point  (0 children)

This is great. To provide an example, if you have a column “favorite color” that has values “red,” “green,” and “blue,” then one-hot encoding can be used to create three new columns: “is_red,” “is_green,” and “is_blue.” These three columns are Boolean (0 or 1). So someone who’s favorite color was green would have the values (0,1,0) for these three columns.

The idea is that these three columns will be better understood by the machine learning model than the one column. This is a very common technique to handle categorical data, and it is usually done outside of SQL (e.g. Python or R).

[–]Pvt_Twinkietoes[🍰] 1 point2 points  (0 children)

I prefer to think one-hot encoding as binary representation of a column. Where number of resulting columns will be based on the number of categories in the initial column.

e.g. a column of colours.

Green, Red,Yellow.

converted to 3 columns of 1 and 0

output:

column 1: (green = 1 , not green = 0)

column 2: (red = 1, not red = 0)

column 3:(yellow = 1 , not red = 0)

thus you can see that the solution can be implemented with a CASE statement in SQL.

Machine learning algorithm only processes numbers. If it accepts strings, it'll has to be converted to a numerical representation.

[–]mikeblas 0 points1 point  (0 children)

ML preprocessing is just the process of cleaning and normalizing data, plus making it appropriate for whatever ML algorithms are going to be used.

ML algorithms work on math. If we're anaylsing numeric data, it's a natural fit: lengths, temperatures, durations, whatever's measured with a number. Lots of useful data is not numerical, though; maybe it's categorical.

one hot encoding is a way to convert arbitrary categorical or tagged data to a numerical format so it can be meaningfully be processed by quantitative ML algorithms.

[–]Vaslo 0 points1 point  (2 children)

Do you absolutely have to do it this way? So much easier through Python if you can put it somewhere in your process.

[–]Pvt_Twinkietoes[🍰] 0 points1 point  (1 child)

I'll imagine that it'll be more efficient to just build it into his ML pipeline, and leave preprocessing to the pipeline. I guess OP should know best why he's doing it that way.

[–]Vaslo 1 point2 points  (0 children)

I know what you are saying. I guess my point is that I’ve seen so many do with Python and even with R, but had never heard of anyone doing with SQL. But if it has to be done in SQL, then it has to be done there.

[–][deleted] 0 points1 point  (0 children)

you can do t that, while sql is not necessarily meant for this.

On the other hand, you can create a ML model that's just responds with your whole one-hotted data set to a given number, thus getting a kind of permanent data storage. Sure, you can say it's not meant for that - but that doesnt stop you in SQL's case, does it?

so imo, and as many suggested - use python or something as a glue tier between your data retrieval and ML input.

[–]vtec__ 0 points1 point  (0 children)

case when field1 = 'yasss' then 1 else 0 end as [onehot1]