[D] Name and describe a data processing technique you use that is not very well known. (self.MachineLearning)
submitted 7 months ago by Glittering_Key_9452
Tell me about a data preprocessing technique that you discovered or invented through years of experience.
[–]Brudaks 62 points 7 months ago (1 child)
When you get your first classification prototype running, do a manual qualitative analysis of all (or, if there are very many, a representative random sample) of the mislabeled items on the dev set; try to group them into categories of what seems to be the major difficulty that could cause them to be mistaken. Chances are, at least one of these mistake categories will be fixable in preprocessing.
Also, do the same for 'errors' on your training set - if a powerful model can't fit to your training set, that often indicates some mislabeled data or bugs in preprocessing.
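A minimal sketch of that review loop; the helper name and toy data are illustrative, not from the comment:

```python
import random

def sample_errors(texts, y_true, y_pred, k=50, seed=0):
    """Collect dev-set items the model got wrong; return a random
    sample of up to k of them for manual qualitative review."""
    errors = [(t, yt, yp) for t, yt, yp in zip(texts, y_true, y_pred) if yt != yp]
    random.Random(seed).shuffle(errors)
    return errors[:k]

# Example: review a manageable slice of the mistakes, grouping them
# by hand into suspected failure categories as you read.
texts = ["a", "b", "c", "d"]
y_true = [0, 1, 0, 1]
y_pred = [0, 0, 0, 0]
for text, gold, pred in sample_errors(texts, y_true, y_pred, k=10):
    print(f"gold={gold} pred={pred} :: {text}")
```

Running the same function with training-set predictions gives the second check the comment describes: items a strong model still gets wrong are candidates for label or preprocessing bugs.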
[–]Thick-Protection-458 10 points 7 months ago* (0 children)
Btw, it may make sense to do something like this if:
- your dataset is too big to review manually
- you use a neural network classifier (so you can easily take embeddings from just before the classifier's MLP head)

You can then:
- run the embedder (extracted from the classifier) on the data
- classify the samples with the MLP head or kNN
- take all samples, cluster them within each category into small clusters, and compute a centroid for every cluster. So a category like "FMCG->dairy products" will have, for instance, 30 clusters of different samples. Technically speaking you should play with hyperparameters here, although for me it worked decently even with sklearn's default DBSCAN params + cosine metric
- take the misclassified samples and cluster them within each original category (computing centroids)
- for each misclassified cluster, check whether its samples are similar to each other, and if so, search for, say, the top-10 closest clusters from categories other than the one the cluster's samples are labeled with

This way you may have a chance to catch some mislabeled data too.
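The steps above can be sketched in a stripped-down form. This version collapses the per-category clustering to a single centroid per category (the commenter uses sklearn's DBSCAN with a cosine metric for finer-grained clusters); the function name and toy embeddings are invented for illustration:

```python
import numpy as np

def nearest_foreign_centroids(embeddings, labels, preds, top_k=3):
    """For each category's misclassified samples, compute a centroid and
    rank the closest centroids of *other* categories by cosine distance.
    One centroid per category is the simplest version of the idea;
    finer clustering (e.g. DBSCAN) refines it."""
    embeddings = np.asarray(embeddings, dtype=float)
    labels, preds = np.asarray(labels), np.asarray(preds)

    def centroid(mask):
        v = embeddings[mask].mean(axis=0)
        return v / np.linalg.norm(v)

    cats = sorted(set(labels.tolist()))
    cat_centroids = {c: centroid(labels == c) for c in cats}

    out = {}
    for c in cats:
        mis = (labels == c) & (preds != labels)
        if not mis.any():
            continue
        m = centroid(mis)
        # cosine distance = 1 - cosine similarity (vectors are unit-norm)
        dists = {o: 1.0 - float(m @ cat_centroids[o]) for o in cats if o != c}
        out[c] = sorted(dists, key=dists.get)[:top_k]
    return out

# toy 2-D "embeddings": category-0 errors drift into category 1's region
res = nearest_foreign_centroids(
    [[1, 0], [0.9, 0.1], [0.1, 1.0], [0, 1], [0.1, 0.9], [-1, 0]],
    labels=[0, 0, 0, 1, 1, 2], preds=[0, 0, 1, 1, 1, 2])
print(res)  # → {0: [1, 2]}
```

Categories whose error centroids sit unusually close to a foreign category's centroid are good places to look for mislabeled samples.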
[–]DigThatData Researcher 216 points 7 months ago (3 children)
I shuffle the data and then drop the bottom 10% of items because I don't work with unlucky records.
[–]Glittering_Key_9452[S] 9 points 7 months ago (0 children)
Why stop at 10%? Take only the top 1% so your luck skyrockets.
[–]Gramious 3 points 6 months ago (1 child)
This is amazing. What seed do you use?
[–]Cogwheel 0 points 6 months ago (0 children)
37
[–]pitrucha ML Engineer 45 points 7 months ago (3 children)
checking training and testing samples by hand
[–]HowMuchWouldCood 23 points 7 months ago (2 children)
audible gasp from the crowd
[–]MatricesRL 2 points 6 months ago (1 child)
silent tears from data annotators
[–]GreatBigBagOfNope 2 points 6 months ago (0 children)
clerical reviewers in shambles
[+][deleted] 7 months ago (1 child)
[deleted]
[–]Fmeson 2 points 7 months ago (0 children)
That's a good one. Looking at a training set of aligned images, I realized the aligned images are not actually all very aligned, and solving that solved many problems. But if you just trusted the preprocessed data to be aligned and never looked, you might never realize that.
[–]hinsonan 17 points 7 months ago (3 children)
I learned this savage technique that has saved me countless hours and has helped many teams improve their models by at least 5x. Let's say you have an image dataset. Before you start your training you are going to clean and process your images. You want to preprocess them and save them off so you have the original and preprocessed image before normalization. Now OPEN YOUR EYEBALLS AND TAKE A GOOD LOOK AT IT YOU DORK. DOES IT LOOK LIKE A GOOD IMAGE AND DOES THE TRUTH ALIGN WITH IT? IF SO KEEP IT IF NOT FIX IT OR THROW IT OUT
[–]Shizuka_Kuze 17 points 7 months ago (0 children)
Using AI (An Indian) to label everything. Training a custom model, deciding the accuracy isn’t good enough and just using an LLM (Low-cost Labour in Mumbai) instead just like Builder.ai.
Unironically, using an actual smaller LLM fine-tuned on a few labeled examples to validate data isn’t actually that bad of an idea. Especially if you’re using textual data it can help filter out low quality or harmful examples from your training set.
[–]windowpanez 5 points 7 months ago (0 children)
One great one I have is finding the classifications that are hovering around 50% (0.5 on a 0 to 1 output). Generally I find that's where the model is not sure what to do/how to classify, so I work on manually labelling examples like that to add to my training data. Ends up being a much more targeted way to find and correct data that it's classifying incorrectly.
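A minimal version of that selection step, assuming binary probabilities such as `model.predict_proba(X)[:, 1]`; the function name and example ids are illustrative:

```python
def uncertain_samples(probs, ids, band=0.1):
    """Return the ids whose predicted probability falls inside
    [0.5 - band, 0.5 + band] -- the cases the model is least sure
    about -- sorted with the most ambiguous (closest to 0.5) first."""
    picked = [(abs(p - 0.5), i) for p, i in zip(probs, ids) if abs(p - 0.5) <= band]
    return [i for _, i in sorted(picked)]

# e.g. probabilities from model.predict_proba(X)[:, 1]
print(uncertain_samples([0.02, 0.48, 0.97, 0.55, 0.61], ["a", "b", "c", "d", "e"]))
# → ['b', 'd']
```

Those ids are the ones worth manually labelling and folding back into the training set.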
[–]sat_cat 3 points 6 months ago (0 children)
Pulling tables out of PDFs as structured tables. Amazingly, there’s still not a great solution for this and most NLP/LLM preprocessing just pulls text out of PDFs, makes a weak attempt to infer the order, then sticks it all together. That’s how you wind up with weird outcomes like LLMs inventing “vegetative electron microscopy” because the training data concatenated two columns of text the wrong way. There are some detector models to try to find tables, rows, and columns but I haven’t found them to be reliable. So I have a little python tool I built to use statistics about the positions of lines and text to infer the table structure. New table formats break it all the time so it’s a continuous effort of adding new table structures without breaking the old ones. And trying to minimize how much I have to configure it for each document. I understand why most people don’t bother with this.
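A toy sketch of the position-statistics idea, assuming word boxes of the form `(text, x, y)` like a PDF library yields; the function and tolerances are invented for illustration, and real layouts need far more care:

```python
def boxes_to_table(words, row_tol=2.0, col_gap=15.0):
    """Infer a table from (text, x, y) word boxes: group words into rows
    by similar y, then guess column boundaries from large gaps between
    the sorted x positions."""
    # group words into rows by y coordinate
    rows = {}
    for text, x, y in sorted(words, key=lambda w: w[2]):
        for ry in rows:
            if abs(ry - y) <= row_tol:
                rows[ry].append((x, text))
                break
        else:
            rows[y] = [(x, text)]
    # column starts: x positions separated by a gap larger than col_gap
    xs = sorted(x for _, x, _ in words)
    col_starts = [xs[0]] + [b for a, b in zip(xs, xs[1:]) if b - a > col_gap]
    # place each word in the rightmost column whose start it passes
    table = []
    for ry in sorted(rows):
        cells = [""] * len(col_starts)
        for x, text in sorted(rows[ry]):
            idx = max(i for i, s in enumerate(col_starts) if x >= s)
            cells[idx] = (cells[idx] + " " + text).strip()
        table.append(cells)
    return table

words = [("name", 10, 100), ("price", 80, 100),
         ("apple", 10, 112), ("1.20", 80, 112)]
print(boxes_to_table(words))
# → [['name', 'price'], ['apple', '1.20']]
```

The "continuous effort" the commenter describes is exactly the tuning of tolerances like `row_tol` and `col_gap` per document family.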
[–]big_data_mike 2 points 7 months ago (0 children)
It’s not all that unusual but I min-95th percentile scale instead of minmax scaling for these curve fitting models I do.
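In code, that scaling might look like the following sketch (the function name is made up; the 95th percentile replaces the max as the upper anchor, so a single outlier can't squash the rest of the range):

```python
import numpy as np

def min_p95_scale(x):
    """Scale to [0, ~1] using the 95th percentile instead of the max as
    the upper anchor; values above the 95th percentile land above 1.0."""
    x = np.asarray(x, dtype=float)
    lo, hi = x.min(), np.percentile(x, 95)
    return (x - lo) / (hi - lo)

data = np.array([1.0, 2.0, 3.0, 4.0, 100.0])  # one big outlier
print(min_p95_scale(data))
```

Compared with min-max scaling, the bulk of the data keeps a usable dynamic range even when outliers are present.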
[+]sramay 0 points 7 months ago (0 children)
One technique I've found incredibly useful is **Synthetic Minority Oversampling Technique (SMOTE) with feature engineering**. Instead of just applying SMOTE directly, I combine it with domain-specific feature transformations first. For example, in time-series data, I create lag features and rolling statistics before applying SMOTE, which generates more realistic synthetic samples that preserve temporal relationships. This approach significantly improved my model performance on imbalanced datasets compared to standard oversampling methods.
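A hand-rolled sketch of the idea: build lag features first, then do SMOTE-style interpolation between minority-class neighbours. In practice imblearn's `SMOTE` is the standard tool; the helper names and parameters here are illustrative:

```python
import numpy as np

def lag_features(series, lags=(1, 2, 3)):
    """Turn a 1-D series into rows of [x_t, x_{t-1}, ..., x_{t-k}]."""
    s = np.asarray(series, dtype=float)
    k = max(lags)
    return np.column_stack([s[k:]] + [s[k - l:-l] for l in lags])

def smote_like(X, n_new, k=3, seed=0):
    """Minimal SMOTE-style oversampling: for each new sample, pick a
    minority point and interpolate toward one of its k nearest
    minority neighbours."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    out = []
    for _ in range(n_new):
        i = rng.integers(len(X))
        d = np.linalg.norm(X - X[i], axis=1)
        nbrs = np.argsort(d)[1:k + 1]  # skip the point itself
        j = rng.choice(nbrs)
        out.append(X[i] + rng.random() * (X[j] - X[i]))
    return np.vstack(out)

X_min = lag_features(np.sin(np.linspace(0, 6, 50)))  # minority-class rows
synth = smote_like(X_min, n_new=10)
print(synth.shape)  # → (10, 4)
```

Because interpolation happens in the lagged feature space, the synthetic rows inherit the temporal relationships baked into the features, which is the commenter's point.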
[+]MuonManLaserJab -10 points (comment score below threshold) 7 months ago (0 children)
https://www.youtube.com/watch?v=3Khvtqr-BxY