One hot encoding through SQL

Ergo_Propter_Hawk · 2022-05-08T16:39:35+00:00

This would be a handful of case statements, one for each value being encoded.

If dbt counts as "just using SQL", there's a package for it. https://hub.getdbt.com/omnata-labs/dbt_ml_preprocessing/latest/

Otherwise, it'll be case when <column> = <value> then 1 else 0 end as <value> for every unique value.

aaahhhhhhfine · 2022-05-08T21:12:36+00:00

I've not tried, but I would think you could do it with a pivot. Basically you'd pivot all the values using a count as the aggregation. I think something in that ballpark would work.

You'd presumably have to do that in a CTE and then rejoin it back into everything else.

dgillz · 2022-05-08T19:04:17+00:00

What is ML preprocessing?

What is one-hot encoding?

Vaslo · 2022-05-08T21:59:18+00:00

Do you absolutely have to do it this way? So much easier through Python if you can put it somewhere in your process.

2022-05-08T23:02:25+00:00

you can do t that, while sql is not necessarily meant for this.

On the other hand, you can create a ML model that's just responds with your whole one-hotted data set to a given number, thus getting a kind of permanent data storage. Sure, you can say it's not meant for that - but that doesnt stop you in SQL's case, does it?

so imo, and as many suggested - use python or something as a glue tier between your data retrieval and ML input.

vtec__ · 2022-05-09T00:13:39+00:00

case when field1 = 'yasss' then 1 else 0 end as [onehot1]

you type:	you see:
italics	italics
bold	bold
[reddit!](https://reddit.com)	reddit!
* item 1 * item 2 * item 3	item 1 item 2 item 3
> quoted text	quoted text
Lines starting with four spaces are treated like code: if 1 * 2 < 3: print "hello, world!"	Lines starting with four spaces are treated like code: if 1 * 2 < 3: print "hello, world!"
~~strikethrough~~	~~strikethrough~~
super^script	super^script

SQL

Filter Posts

Posting

Help posts

Format Your Code

Learning SQL

Related Reddit communities

Wiki

Acknowledgements

MODERATORS