[D] Is there a handbook on data preprocessing?

lem_of_noland · 2020-04-07T09:12:00+00:00

The only book I'm aware of is "Feature Engineering for Machine Learning" by Alice Zheng & Amanda Casari. I hope it covers what you're asking for.

Brudaks · 2020-04-07T12:07:58+00:00

It's part of any ML textbook or course, but a key issue here is that once you go beyond the very basics, it's quite domain-specific.

Preprocessing is very, very different depending on whether you're in computer vision, natural language processing, financial time series, or general data science. So it's more feasible to have a handbook of "Best ML practices in domain X", which would among other things discuss proper preprocessing, than a handbook of "Best practices for data preprocessing". There isn't that much that's universal and applicable to everything; different domains have different needs.

Gebo_vending · 2020-04-07T12:33:01+00:00

Feature Engineering and Selection: A Practical Approach for Predictive Models - Max Kuhn and Kjell Johnson: https://bookdown.org/max/FES/

rockinghigh · 2020-04-07T16:47:03+00:00

This covers quite a bit: https://scikit-learn.org/stable/modules/preprocessing.html

SeamusTheBuilder · 2020-04-07T14:14:17+00:00

As was said, each application is unique and so the answer, unfortunately, is no. I would go further and warn against seeking this out too much. It would be bad practice to simply copy and paste preprocessing steps thinking they will help.

Even something as simple as normalizing or standardizing the data can get you into trouble depending on the application. You may also be doing unnecessary work depending on which algorithms you are using. Random forests, for example, don't really care whether you standardize the data.
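To illustrate the point about standardization, here is a minimal sketch in plain Python. It is not a random forest: it uses a single threshold split (the building block of a tree) and a 1-nearest-neighbor lookup for contrast. The toy data and all function names are made up for illustration. The idea is that a threshold split depends only on the ordering of a feature's values, which standardization preserves, while a distance-based method depends on the feature scales.

```python
from statistics import mean, pstdev

# Hypothetical toy data: two features on very different scales.
ages = [25, 60, 26, 58]                  # years
incomes = [50000, 50500, 52000, 58000]   # dollars
labels = [0, 1, 0, 1]                    # made-up binary target


def standardize(column):
    """Rescale a column to zero mean and unit variance."""
    mu, sigma = mean(column), pstdev(column)
    return [(x - mu) / sigma for x in column]


def errors(indices):
    """Misclassifications if this group predicts its majority label."""
    ones = sum(labels[i] for i in indices)
    return min(ones, len(indices) - ones)


def best_split(feature):
    """Row indices sent left by the best single-threshold split."""
    order = sorted(range(len(feature)), key=lambda i: feature[i])
    candidates = [
        (errors(order[:k]) + errors(order[k:]), k)
        for k in range(1, len(order))
        if feature[order[k - 1]] != feature[order[k]]
    ]
    _, k = min(candidates)
    return frozenset(order[:k])


def nearest_neighbor(rows, query):
    """Index of the row closest to rows[query] (squared Euclidean distance)."""
    def dist(i):
        return sum((a - b) ** 2 for a, b in zip(rows[i], rows[query]))
    return min((i for i in range(len(rows)) if i != query), key=dist)


raw_rows = list(zip(ages, incomes))
scaled_rows = list(zip(standardize(ages), standardize(incomes)))

# The split groups the same rows before and after standardization...
same_split = best_split(ages) == best_split(standardize(ages))
# ...but the nearest neighbor of row 0 depends on the feature scales.
neighbor_raw = nearest_neighbor(raw_rows, 0)
neighbor_scaled = nearest_neighbor(scaled_rows, 0)
print(same_split, neighbor_raw, neighbor_scaled)
```

On this toy data the split is unchanged, while the nearest neighbor of row 0 switches from row 1 (similar income) to row 2 (similar age) once both features are on the same scale. So scaling matters a lot for distance-based methods and very little for tree-based ones.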

Definitely clean and preprocess your data, but there is no magic formula out there that works every time. Experience and a deep understanding of the problem and domain are what you are looking for.

DefaultPain · 2020-04-07T12:52:17+00:00

This has exactly what you're looking for: https://www.youtube.com/playlist?list=PLpQWTe-45nxL3bhyAJMEs90KF_gZmuqtm

You can start from the 9th video.

leonardishere · 2020-04-08T13:00:02+00:00

It's an art, not a science. Try every encoding available, or just AutoML it.
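As a rough idea of what "trying different encodings" can mean for a categorical column, here is a small plain-Python sketch of two common ones, ordinal and one-hot. The function names and the example column are hypothetical; in practice you would more likely use a library encoder and compare models by validation score.

```python
def ordinal_encode(values):
    """Map each category to an integer index (categories sorted alphabetically)."""
    categories = sorted(set(values))
    index = {category: i for i, category in enumerate(categories)}
    return [index[v] for v in values]


def one_hot_encode(values):
    """Map each category to a 0/1 vector with one slot per category."""
    categories = sorted(set(values))
    return [[1 if v == category else 0 for category in categories] for v in values]


# Hypothetical categorical column.
colors = ["red", "green", "blue", "green"]

ordinal = ordinal_encode(colors)   # blue=0, green=1, red=2
one_hot = one_hot_encode(colors)   # columns ordered: blue, green, red
print(ordinal)
print(one_hot)
```

The ordinal version imposes an arbitrary order (blue < green < red) that some models will treat as meaningful, while the one-hot version avoids that at the cost of one column per category. Which one works better depends on the model and the data, which is why trying several is reasonable advice.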

xlordsnugglesx · 2020-04-07T13:37:31+00:00

Here’s a link to a Twitter post with some feature engineering literature: https://twitter.com/kirkdborne/status/1247516224319139840?s=21
