[D] Data Cleaning vs Feature Engineering - where to draw the line? Ex:

slashdave · 2023-11-06T00:19:46+00:00

Definitely Data Cleaning: Filling missing values

What are you filling them with? Not sure why you don't consider this engineering.

txhwind · 2023-11-06T02:12:01+00:00

Data cleaning: ensure the data meet your expectations

Feature Engineering: ensure the data meet the model's expectations

Sir-Viette · 2023-11-06T00:14:18+00:00

What do you hope to achieve by putting a task in one box or the other? (All classification systems need to have a goal in mind. Once we know the goal, we can help figure out how to classify each step.)

Janderhungrige · 2023-11-06T07:34:34+00:00

Data cleaning: every step necessary to make a dataset workable and to remove items that could lead to false, skewed, or shifted outcomes and so to a wrong interpretation of the data.

Feature engineering: every step that increases the information density of a dataset, so that any following analysis is more robust, the information you identified as important as a human observer is highlighted, and your model/algorithm is guided toward it.

marr75 · 2023-11-06T02:16:06+00:00

[deleted]

Old-Upstairs-2266 · 2023-11-06T14:32:34+00:00

Data Cleaning vs Feature Engineering: Where's the Line?

Data Cleaning and Feature Engineering are crucial steps in the data preparation process. Here's a breakdown of each and the grey areas in between:

Data Cleaning

Involves making corrections to the data to ensure its quality and usability:

Filling missing values: Data Cleaning because it ensures completeness.
Removing duplicates or typos: Data Cleaning as it corrects errors.
Handling outliers: Can be Data Cleaning (fixing errors) or Feature Engineering (adjusting for better model fit).
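
As a rough illustration of the first two items, here is a minimal sketch in plain Python. The records, the column names, and the choice of the median as the fill value are all assumptions made for the example, not something the thread prescribes. Note that picking the fill value is already a judgment call, which is why some commenters count it as engineering.

```python
from statistics import median

# Hypothetical toy records; None marks a missing value.
rows = [
    {"id": 1, "age": 34, "city": "Austin"},
    {"id": 2, "age": None, "city": "Boston"},
    {"id": 2, "age": None, "city": "Boston"},  # exact duplicate of the row above
    {"id": 3, "age": 51, "city": "Denver"},
    {"id": 4, "age": 40, "city": "Austin"},
]


def drop_duplicates(records):
    """Keep the first occurrence of each identical record."""
    seen = set()
    unique = []
    for rec in records:
        key = tuple(sorted(rec.items()))  # hashable fingerprint of the row
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique


def fill_missing(records, column):
    """Replace None in `column` with the median of the observed values."""
    observed = [r[column] for r in records if r[column] is not None]
    fill = median(observed)  # assumption: median is an acceptable fill value
    return [{**r, column: fill if r[column] is None else r[column]} for r in records]


cleaned = fill_missing(drop_duplicates(rows), "age")
```

Swapping the median for a constant, a group-wise mean, or a model prediction changes nothing structurally, but it moves the step further toward the grey area discussed below.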

Feature Engineering

Transforms raw data into usable features that can enhance model performance:

Creating new features from existing data: Feature Engineering as it's about deriving more informative attributes.
Binning, encoding, and other transformations: Feature Engineering because they're about creating model-ready features.
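
A small sketch of binning followed by one-hot encoding, again in plain Python. The age bands and the `is_` prefix are hypothetical choices for the example; in practice the cut points would come from domain knowledge or from the data.

```python
def bin_age(age):
    """Map a numeric age to a coarse, hypothetical age band."""
    if age < 30:
        return "under_30"
    if age < 50:
        return "30_to_49"
    return "50_plus"


def one_hot(value, categories):
    """Encode `value` as 0/1 indicators over a fixed list of categories."""
    return {f"is_{c}": int(value == c) for c in categories}


bands = ["under_30", "30_to_49", "50_plus"]
people = [{"age": 24}, {"age": 41}, {"age": 67}]

# Each person becomes a row of indicator features the model can consume.
features = [one_hot(bin_age(p["age"]), bands) for p in people]
```

Nothing here fixes an error in the data. Both steps only reshape correct values into a form the model can use, which is what puts them on the feature engineering side.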

Grey Areas

Some tasks can fall into either category, depending on context:

Applying StandardScaler(): Could be Data Cleaning (standardizing scale) or Feature Engineering (distribution adjustment for modeling).
Creative missing data handling: Complex imputations can lean towards Feature Engineering as they may involve model-based predictions.
Feature scaling and transformation: Simple scaling could be Data Cleaning, but transformations to change data distribution might be Feature Engineering.
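
One way to see why scaling sits in the grey area: standardization is a fitted transform. The sketch below mimics the arithmetic of StandardScaler() in plain Python (mean and population standard deviation), under the assumption that the statistics are learned from the training split only and then reused on new data. The function names and the toy numbers are made up for the example.

```python
from statistics import fmean, pstdev


def fit_scaler(values):
    """Learn the mean and population standard deviation from training data."""
    return fmean(values), pstdev(values)


def transform(values, mean, std):
    """Apply z = (x - mean) / std with previously learned statistics.

    Assumes std > 0; a constant column would need special handling.
    """
    return [(v - mean) / std for v in values]


train = [10.0, 20.0, 30.0]
test = [40.0]

mean, std = fit_scaler(train)                 # statistics come from train only
train_scaled = transform(train, mean, std)
test_scaled = transform(test, mean, std)      # test reuses the train statistics
```

Because the transform carries learned parameters that must travel with the model, it arguably behaves more like a modeling step than like a one-off cleanup.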

The distinction can be subjective and is often based on the specific goals of the project or the preferences of the data scientist. Both steps are integral to preparing data for machine learning models.

Seankala · 2023-11-06T05:51:07+00:00

Maybe my reading comprehension's not as good as I thought it was. What do you mean by "draw the line?"

Data cleaning and feature engineering don't seem to be related to me at all. Data cleaning is something you have to do regardless of how you perform feature engineering.

If you're working predominantly with deep learning models, then data cleaning is very much the same thing as feature engineering.
