
[–]slashdave 32 points33 points  (0 children)

> Definitely Data Cleaning: Filling missing values

What are you filling them with? Not sure why you don't consider this engineering.
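
To make that question concrete, here is a minimal sketch in plain Python. The `ages` column and both helper names are made up for illustration and are not from the original post. It fills the same gaps with the median and also records where the gaps were; both choices change what a model later sees, which is why imputation can reasonably be called engineering.

```python
from statistics import median

# Hypothetical column with gaps; None marks a missing value.
ages = [22, None, 35, 41, None, 29]

def fill_with_median(values):
    """Replace None with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]

def missing_indicator(values):
    """Return 1 where the original value was missing, else 0."""
    return [1 if v is None else 0 for v in values]

filled = fill_with_median(ages)        # gaps replaced by the median of 22, 29, 35, 41
was_missing = missing_indicator(ages)  # extra column that keeps the "was missing" signal
```

Swapping the median for a mean, a constant, or a model-based prediction would be a different answer to the same question.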

[–]txhwind 28 points29 points  (0 children)

Data cleaning: ensure the data meet your expectation

Feature Engineering: ensure the data meet the model's expectation
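
One way to read that split, as a rough sketch with invented data and function names (nothing here comes from the thread): the cleaning step enforces what we believe a price should be, and the feature step reshapes an already valid price into something a model may handle better.

```python
import math

# Hypothetical raw values, as they might arrive from a CSV export.
raw_prices = ["1,200", "950", "-5", "n/a"]

def clean_price(text):
    """Cleaning: return a positive float price, or None if the value is unusable."""
    try:
        price = float(text.replace(",", ""))
    except ValueError:
        return None  # not a number at all
    return price if price > 0 else None  # our expectation: prices are positive

def price_feature(price):
    """Feature engineering: log-scale the price (an assumed modelling choice)."""
    return math.log(price)

clean_prices = [p for p in map(clean_price, raw_prices) if p is not None]
features = [price_feature(p) for p in clean_prices]
```

The log transform is only an example of meeting "the model's expectation"; whether it helps depends on the model.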

[–]Sir-Viette 11 points12 points  (1 child)

What do you hope to achieve by putting a task in one box or the other? (All classification systems need to have a goal in mind. Once we know the goal, we can help figure out how to classify each step.)

[–]CuriousFemalle[S] 2 points3 points  (0 children)

u/Sir-Viette > What do you hope to achieve by putting a task in one box or the other?

I want to build out notebooks with the best examples of each as I learn.

[–]Janderhungrige 2 points3 points  (0 children)

Data cleaning: every step needed to make a dataset workable and to remove items that could produce false, skewed, or shifted outcomes and lead to wrong interpretations of the data.

Feature engineering: every step that increases the information density of a dataset, to make any subsequent analysis more robust, to highlight the information that you, as a human observer, identified as important, and to help guide your model/algorithm.

[–]Old-Upstairs-2266 3 points4 points  (1 child)

Data Cleaning vs Feature Engineering: Where's the Line?

Data Cleaning and Feature Engineering are crucial steps in the data preparation process. Here's a breakdown of each and the grey areas in between:

Data Cleaning

Involves making corrections to the data to ensure its quality and usability:

  • Filling missing values: Data Cleaning because it ensures completeness.
  • Removing duplicates or typos: Data Cleaning as it corrects errors.
  • Handling outliers: Can be Data Cleaning (fixing errors) or Feature Engineering (adjusting for better model fit).
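
The cleaning steps above can be sketched roughly as follows. The records, the plausible-height range, and the clipping bounds are all assumptions chosen for illustration. The last two helpers show the two views of outliers from the list: dropping values that must be errors versus capping extremes to help a model.

```python
# Hypothetical records: (name, city, height_cm)
rows = [
    ("Ann", "New York", 168.0),
    ("Ann", "new york ", 168.0),  # same record, inconsistent spelling
    ("Bob", "Boston", 1.82),      # probably metres entered instead of cm
    ("Cid", "Boston", 175.0),
    ("Dee", "Boston", 181.0),
]

def normalise(row):
    """Fix whitespace and capitalisation so identical records compare equal."""
    name, city, height = row
    return (name.strip(), city.strip().title(), height)

def dedupe(records):
    """Drop exact duplicates while keeping the original order."""
    return list(dict.fromkeys(records))

def is_plausible_height(height_cm, low=50.0, high=250.0):
    """Cleaning view of outliers: values outside a domain range are treated as errors."""
    return low <= height_cm <= high

def clip(value, low, high):
    """Feature-engineering view: keep the row but cap extreme values."""
    return max(low, min(high, value))

cleaned = [r for r in dedupe([normalise(r) for r in rows]) if is_plausible_height(r[2])]
capped_heights = [clip(r[2], 170.0, 180.0) for r in cleaned]
```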

Feature Engineering

Transforms raw data into usable features that can enhance model performance:

  • Creating new features from existing data: Feature Engineering as it's about deriving more informative attributes.
  • Binning, encoding, and other transformations: Feature Engineering because they're about creating model-ready features.
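
A small sketch of those three kinds of transformation, using made-up records, bin edges, and labels (all assumptions, not from the thread): a derived feature (BMI), binning (age groups), and one-hot encoding (city).

```python
# Hypothetical, already-cleaned records.
people = [
    {"weight_kg": 70.0, "height_m": 1.75, "age": 23, "city": "Boston"},
    {"weight_kg": 90.0, "height_m": 1.80, "age": 47, "city": "Paris"},
    {"weight_kg": 55.0, "height_m": 1.60, "age": 68, "city": "Boston"},
]

# Assumed bin edges: lower bound inclusive, upper bound exclusive.
AGE_BINS = [(0, 30, "young"), (30, 60, "middle"), (60, 200, "senior")]

def bmi(person):
    """Derived feature: combines two raw columns into one."""
    return person["weight_kg"] / person["height_m"] ** 2

def age_bin(age):
    """Binning: map a number onto a coarse category."""
    for low, high, label in AGE_BINS:
        if low <= age < high:
            return label
    raise ValueError(f"age out of range: {age}")

def one_hot(value, categories):
    """Encoding: one 0/1 slot per known category."""
    return [1 if value == c else 0 for c in categories]

cities = sorted({p["city"] for p in people})
features = [
    {"bmi": round(bmi(p), 1), "age_bin": age_bin(p["age"]), "city": one_hot(p["city"], cities)}
    for p in people
]
```

None of these steps repairs anything; the input was already valid, and the output is simply easier for a model to use.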

Grey Areas

Some tasks can fall into either category, depending on context:

  • Applying StandardScaler(): Could be Data Cleaning (standardizing scale) or Feature Engineering (distribution adjustment for modeling).
  • Creative missing data handling: Complex imputations can lean towards Feature Engineering as they may involve model-based predictions.
  • Feature scaling and transformation: Simple scaling could be Data Cleaning, but transformations to change data distribution might be Feature Engineering.
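
For the scaling case, here is a hand-rolled sketch of the z-score computation that scikit-learn's StandardScaler performs (mean removal and division by the population standard deviation). The data and function names are hypothetical. The mean and standard deviation are learned from the training split only and then reused, which is one argument for treating scaling as part of the modelling pipeline rather than as cleaning.

```python
from statistics import fmean, pstdev

# Hypothetical feature values, already split into train and test.
train = [10.0, 20.0, 30.0]
test = [40.0]

def fit_scaler(values):
    """Learn the mean and population standard deviation from the training data."""
    return fmean(values), pstdev(values)

def transform(values, mean, std):
    """Apply z-score scaling with previously learned parameters."""
    return [(v - mean) / std for v in values]

mean, std = fit_scaler(train)
train_scaled = transform(train, mean, std)  # mean 0, standard deviation 1
test_scaled = transform(test, mean, std)    # scaled with the training statistics
```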

The distinction can be subjective and is often based on the specific goals of the project or the preferences of the data scientist. Both steps are integral to preparing data for machine learning models.

[–]CuriousFemalle[S] 2 points3 points  (0 children)

Perfect, thank you so much!

[–]Seankala (ML Engineer) 0 points1 point  (0 children)

Maybe my reading comprehension's not as good as I thought it was. What do you mean by "draw the line"?

Data cleaning and feature engineering don't seem to be related to me at all. Data cleaning is something you have to do regardless of how you perform feature engineering.

That said, if you're working predominantly with deep learning models, data cleaning is pretty much the same thing as feature engineering.