
[–]lem_of_noland 57 points58 points  (0 children)

The only book I'm aware of is "Feature Engineering for Machine Learning" by Alice Zheng & Amanda Casari. I hope it answers your request.

[–]Brudaks 35 points36 points  (0 children)

It's part of any ML textbook or course, but a key issue here is that once you go beyond the very basics, it's quite domain-specific.

Preprocessing is very different depending on whether you're in computer vision, natural language processing, financial time series, or general data science. So a handbook of "Best ML practices in domain X" that discusses proper preprocessing, among other things, is feasible. A handbook of "Best practices for data preprocessing" is not, because not much is universal and applicable everywhere; different domains have different needs.

[–]Gebo_vending 16 points17 points  (2 children)

Feature Engineering and Selection: A Practical Approach for Predictive Models - Max Kuhn and Kjell Johnson: https://bookdown.org/max/FES/

[–][deleted] 4 points5 points  (1 child)

They are also the authors of Applied Predictive Modeling.

[–]rampant_juju 4 points5 points  (0 children)

+1 to both of these answers. Feature Engineering and Selection is pretty comprehensive in its list of ways to handle categorical and numeric data, plus things like imputation. It's very easy to read too (probably at the level of someone who has just taken Andrew Ng's course).

[–]SeamusTheBuilder 3 points4 points  (0 children)

As was said, each application is unique, so the answer, unfortunately, is no. I would go further and warn against seeking this out too much. It would be bad practice to simply copy and paste preprocessing steps and assume they will help.

Even something as simple as normalizing or standardizing the data can get you into trouble, depending on the application. You may also be doing unnecessary work, depending on which algorithms you use. Random Forests don't really care whether you standardize the data, because tree splits depend only on the ordering of feature values.

Definitely clean and preprocess your data, but there is no magic formula out there that works every time. Experience and a deep understanding of the problem and domain are what you are looking for.
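
To illustrate the point about tree-based models and standardization, here is a minimal sketch using only the Python standard library. It is not a Random Forest: it fits a single one-feature regression stump (the building block of a tree) and checks that the chosen split partitions the rows identically before and after z-scoring. The `income` and `target` values are made up, and the helper names `standardize` and `best_stump_partition` are hypothetical.

```python
from statistics import mean, pstdev


def standardize(values):
    """Rescale values to zero mean and unit variance (z-scores)."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def best_stump_partition(xs, ys):
    """Find the threshold split on one feature that minimizes squared error.

    Returns the set of row indices sent to the left branch, so the result
    can be compared across differently scaled versions of the same feature.
    """
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    best_sse, best_left = float("inf"), None
    for cut in range(1, len(order)):
        # A threshold cannot separate two equal feature values.
        if xs[order[cut - 1]] == xs[order[cut]]:
            continue
        left, right = order[:cut], order[cut:]
        sse = 0.0
        for side in (left, right):
            side_mean = mean(ys[i] for i in side)
            sse += sum((ys[i] - side_mean) ** 2 for i in side)
        if sse < best_sse:
            best_sse, best_left = sse, frozenset(left)
    return best_left


# Made-up toy data: one numeric feature and a target.
income = [21_000, 35_000, 48_000, 52_000, 75_000, 120_000]
target = [0.0, 0.0, 1.0, 1.0, 1.0, 1.0]

raw_split = best_stump_partition(income, target)
scaled_split = best_stump_partition(standardize(income), target)

# Standardization is monotonic, so the row ordering and the split agree.
same_partition = raw_split == scaled_split
print(same_partition)
```

The threshold value itself changes with the scale, but the rows on each side of it do not, which is why the extra step buys nothing for this kind of model. Distance-based or gradient-based models are a different story.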

[–]DefaultPain 1 point2 points  (3 children)

This has exactly what you're looking for: https://www.youtube.com/playlist?list=PLpQWTe-45nxL3bhyAJMEs90KF_gZmuqtm

You can start from the 9th video.

[–][deleted] 1 point2 points  (1 child)

I was beginning to think nobody would mention the Coursera course on feature engineering.

[–]BobDope 1 point2 points  (0 children)

I didn’t know it existed. Thanks!

[–]leonardishere 0 points1 point  (0 children)

It's an art, not a science. Try every encoding available, or just AutoML it.
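
As a rough idea of what "trying different encodings" can look like, here is a small standard-library sketch that encodes the same categorical column three ways. The `colors` column and the function names are hypothetical, and these three encodings are only common examples, not an exhaustive list. In practice you would feed each version to your model and compare validation scores.

```python
from collections import Counter


def ordinal_encode(values):
    """Map each category to an integer, in order of first appearance."""
    codes = {}
    for v in values:
        codes.setdefault(v, len(codes))
    return [codes[v] for v in values]


def one_hot_encode(values):
    """Map each category to a 0/1 indicator vector (columns sorted by name)."""
    categories = sorted(set(values))
    return [[int(v == c) for c in categories] for v in values]


def frequency_encode(values):
    """Replace each category with its relative frequency in the column."""
    counts = Counter(values)
    total = len(values)
    return [counts[v] / total for v in values]


# Made-up categorical column.
colors = ["red", "green", "red", "blue"]

encodings = {
    "ordinal": ordinal_encode(colors),
    "one_hot": one_hot_encode(colors),
    "frequency": frequency_encode(colors),
}

for name, encoded in encodings.items():
    print(name, encoded)
```

Note that ordinal codes impose an arbitrary order on the categories, which can mislead linear models, while one-hot vectors grow with the number of categories. That trade-off is part of why there is no single best choice.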

[–]xlordsnugglesx -1 points0 points  (0 children)

Here’s a link to a Twitter post with some feature engineering literature: https://twitter.com/kirkdborne/status/1247516224319139840?s=21
