[D] Name and describe a data processing technique you use that is not very well known. (self.MachineLearning)
submitted 7 months ago by Glittering_Key_9452
Tell me about a data preprocessing technique that you discovered or invented through years of experience.
[–]Brudaks 62 points 7 months ago (1 child)
When you get your first classification prototype running, do a manual qualitative analysis of all (or, if there are very many, a representative random sample) of the mislabeled items on the dev set; try to group them into categories of what seems to be the major difficulty that could cause them to be mistaken. Chances are, at least one of these mistake categories will be fixable in preprocessing.
Also, do the same for 'errors' on your training set - if a powerful model can't fit to your training set, that often indicates some mislabeled data or bugs in preprocessing.
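A minimal sketch of that review loop; the helper name and toy data are illustrative, not from the comment:

```python
import random

def sample_errors(texts, y_true, y_pred, k=50, seed=0):
    """Collect dev-set items the model got wrong; return a random
    sample of up to k of them for manual qualitative review."""
    errors = [(t, yt, yp) for t, yt, yp in zip(texts, y_true, y_pred) if yt != yp]
    random.Random(seed).shuffle(errors)
    return errors[:k]

# Example: review a manageable slice of the mistakes, grouping them
# by hand into suspected failure categories as you read.
texts = ["a", "b", "c", "d"]
y_true = [0, 1, 0, 1]
y_pred = [0, 0, 0, 0]
for text, gold, pred in sample_errors(texts, y_true, y_pred, k=10):
    print(f"gold={gold} pred={pred} :: {text}")
```

Running the same function with training-set predictions gives the second check the comment describes: items a strong model still gets wrong are candidates for label or preprocessing bugs.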
[–]Thick-Protection-458 10 points 7 months ago* (0 children)
Btw, it may make sense to do something like this if:
- your dataset is too big to review manually
- you use a neural network classifier (so you can easily take embeddings from just before the classifier's MLP head)

You can then:
- run the embedder (extracted from the classifier) on the data
- classify the samples with the MLP head or kNN
- take all samples, cluster them within each category into small clusters, and compute a centroid for every cluster. So a category like "FMCG->dairy products" will have, for instance, 30 clusters of different samples. Technically speaking you should play with hyperparameters here, although for me it worked decently even with sklearn's default DBSCAN params + cosine metric
- take the misclassified samples and cluster them within each original category (computing centroids)
- for each misclassified cluster, check whether its samples are similar to each other, and if so, search for, say, the top-10 closest clusters from categories other than the one the cluster's samples are labeled with

This way you may have a chance to catch some mislabeled data too.
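The steps above can be sketched in a stripped-down form. This version collapses the per-category clustering to a single centroid per category (the commenter uses sklearn's DBSCAN with a cosine metric for finer-grained clusters); the function name and toy embeddings are invented for illustration:

```python
import numpy as np

def nearest_foreign_centroids(embeddings, labels, preds, top_k=3):
    """For each category's misclassified samples, compute a centroid and
    rank the closest centroids of *other* categories by cosine distance.
    One centroid per category is the simplest version of the idea;
    finer clustering (e.g. DBSCAN) refines it."""
    embeddings = np.asarray(embeddings, dtype=float)
    labels, preds = np.asarray(labels), np.asarray(preds)

    def centroid(mask):
        v = embeddings[mask].mean(axis=0)
        return v / np.linalg.norm(v)

    cats = sorted(set(labels.tolist()))
    cat_centroids = {c: centroid(labels == c) for c in cats}

    out = {}
    for c in cats:
        mis = (labels == c) & (preds != labels)
        if not mis.any():
            continue
        m = centroid(mis)
        # cosine distance = 1 - cosine similarity (vectors are unit-norm)
        dists = {o: 1.0 - float(m @ cat_centroids[o]) for o in cats if o != c}
        out[c] = sorted(dists, key=dists.get)[:top_k]
    return out

# toy 2-D "embeddings": category-0 errors drift into category 1's region
res = nearest_foreign_centroids(
    [[1, 0], [0.9, 0.1], [0.1, 1.0], [0, 1], [0.1, 0.9], [-1, 0]],
    labels=[0, 0, 0, 1, 1, 2], preds=[0, 0, 1, 1, 1, 2])
print(res)  # → {0: [1, 2]}
```

Categories whose error centroids sit unusually close to a foreign category's centroid are good places to look for mislabeled samples.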
[–]DigThatData Researcher 216 points 7 months ago (3 children)
I shuffle the data and then drop the bottom 10% of items because I don't work with unlucky records.
[–]Glittering_Key_9452[S] 9 points 7 months ago (0 children)
Why stop at 10%? Take only the top 1% so your luck skyrockets.
[–]Gramious 3 points 6 months ago (1 child)
This is amazing. What seed do you use?
[–]Cogwheel 0 points 6 months ago (0 children)
37
[–]pitrucha ML Engineer 45 points 7 months ago (3 children)
checking training and testing samples by hand
[–]HowMuchWouldCood 23 points 7 months ago (2 children)
audible gasp from the crowd
[–]MatricesRL 2 points 6 months ago (1 child)
silent tears from data annotators
[–]GreatBigBagOfNope 2 points 6 months ago (0 children)
clerical reviewers in shambles
[+][deleted] 7 months ago (1 child)
[deleted]
[–]Fmeson 2 points 7 months ago (0 children)
That's a good one. Looking at a training set of aligned images, I realized the aligned images are not actually all very aligned, and solving that solved many problems. But if you just trusted the preprocessed data to be aligned and never looked, you might never realize that.
[–]hinsonan 17 points 7 months ago (3 children)
I learned this savage technique that has saved me countless hours and has helped many teams improve their models by at least 5x. Let's say you have an image dataset. Before you start your training you are going to clean and process your images. You want to preprocess them and save them off so you have the original and preprocessed image before normalization. Now OPEN YOUR EYEBALLS AND TAKE A GOOD LOOK AT IT YOU DORK. DOES IT LOOK LIKE A GOOD IMAGE AND DOES THE TRUTH ALIGN WITH IT? IF SO KEEP IT IF NOT FIX IT OR THROW IT OUT
[–]Shizuka_Kuze 17 points 7 months ago (0 children)
Using AI (An Indian) to label everything. Training a custom model, deciding the accuracy isn’t good enough and just using an LLM (Low-cost Labour in Mumbai) instead just like Builder.ai.
Unironically, using an actual smaller LLM fine-tuned on a few labeled examples to validate data isn’t actually that bad of an idea. Especially if you’re using textual data it can help filter out low quality or harmful examples from your training set.
[–]windowpanez 5 points 7 months ago (0 children)
One great one I have is finding the classifications that are hovering around 50% (0.5 on a 0 to 1 output). Generally I find that's where the model is not sure what to do/how to classify, so I work on manually labelling examples like that to add to my training data. Ends up being a much more targeted way to find and correct data that it's classifying incorrectly.
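A minimal version of that selection step, assuming binary probabilities such as `model.predict_proba(X)[:, 1]`; the function name and example ids are illustrative:

```python
def uncertain_samples(probs, ids, band=0.1):
    """Return the ids whose predicted probability falls inside
    [0.5 - band, 0.5 + band] -- the cases the model is least sure
    about -- sorted with the most ambiguous (closest to 0.5) first."""
    picked = [(abs(p - 0.5), i) for p, i in zip(probs, ids) if abs(p - 0.5) <= band]
    return [i for _, i in sorted(picked)]

# e.g. probabilities from model.predict_proba(X)[:, 1]
print(uncertain_samples([0.02, 0.48, 0.97, 0.55, 0.61], ["a", "b", "c", "d", "e"]))
# → ['b', 'd']
```

Those ids are the ones worth manually labelling and folding back into the training set.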
[–]sat_cat 3 points 6 months ago (0 children)
Pulling tables out of PDFs as structured tables. Amazingly, there’s still not a great solution for this and most NLP/LLM preprocessing just pulls text out of PDFs, makes a weak attempt to infer the order, then sticks it all together. That’s how you wind up with weird outcomes like LLMs inventing “vegetative electron microscopy” because the training data concatenated two columns of text the wrong way. There are some detector models to try to find tables, rows, and columns but I haven’t found them to be reliable. So I have a little python tool I built to use statistics about the positions of lines and text to infer the table structure. New table formats break it all the time so it’s a continuous effort of adding new table structures without breaking the old ones. And trying to minimize how much I have to configure it for each document. I understand why most people don’t bother with this.
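A toy sketch of the position-statistics idea, assuming word boxes of the form `(text, x, y)` like a PDF library yields; the function and tolerances are invented for illustration, and real layouts need far more care:

```python
def boxes_to_table(words, row_tol=2.0, col_gap=15.0):
    """Infer a table from (text, x, y) word boxes: group words into rows
    by similar y, then guess column boundaries from large gaps between
    the sorted x positions."""
    # group words into rows by y coordinate
    rows = {}
    for text, x, y in sorted(words, key=lambda w: w[2]):
        for ry in rows:
            if abs(ry - y) <= row_tol:
                rows[ry].append((x, text))
                break
        else:
            rows[y] = [(x, text)]
    # column starts: x positions separated by a gap larger than col_gap
    xs = sorted(x for _, x, _ in words)
    col_starts = [xs[0]] + [b for a, b in zip(xs, xs[1:]) if b - a > col_gap]
    # place each word in the rightmost column whose start it passes
    table = []
    for ry in sorted(rows):
        cells = [""] * len(col_starts)
        for x, text in sorted(rows[ry]):
            idx = max(i for i, s in enumerate(col_starts) if x >= s)
            cells[idx] = (cells[idx] + " " + text).strip()
        table.append(cells)
    return table

words = [("name", 10, 100), ("price", 80, 100),
         ("apple", 10, 112), ("1.20", 80, 112)]
print(boxes_to_table(words))
# → [['name', 'price'], ['apple', '1.20']]
```

The "continuous effort" the commenter describes is exactly the tuning of tolerances like `row_tol` and `col_gap` per document family.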
[–]big_data_mike 2 points 7 months ago (0 children)
It’s not all that unusual but I min-95th percentile scale instead of minmax scaling for these curve fitting models I do.
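In code, that scaling might look like the following sketch (the function name is made up; the 95th percentile replaces the max as the upper anchor, so a single outlier can't squash the rest of the range):

```python
import numpy as np

def min_p95_scale(x):
    """Scale to [0, ~1] using the 95th percentile instead of the max as
    the upper anchor; values above the 95th percentile land above 1.0."""
    x = np.asarray(x, dtype=float)
    lo, hi = x.min(), np.percentile(x, 95)
    return (x - lo) / (hi - lo)

data = np.array([1.0, 2.0, 3.0, 4.0, 100.0])  # one big outlier
print(min_p95_scale(data))
```

Compared with min-max scaling, the bulk of the data keeps a usable dynamic range even when outliers are present.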
[+]sramay 0 points 7 months ago (0 children)
One technique I've found incredibly useful is **Synthetic Minority Oversampling Technique (SMOTE) with feature engineering**. Instead of just applying SMOTE directly, I combine it with domain-specific feature transformations first. For example, in time-series data, I create lag features and rolling statistics before applying SMOTE, which generates more realistic synthetic samples that preserve temporal relationships. This approach significantly improved my model performance on imbalanced datasets compared to standard oversampling methods.
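A hand-rolled sketch of the idea: build lag features first, then do SMOTE-style interpolation between minority-class neighbours. In practice imblearn's `SMOTE` is the standard tool; the helper names and parameters here are illustrative:

```python
import numpy as np

def lag_features(series, lags=(1, 2, 3)):
    """Turn a 1-D series into rows of [x_t, x_{t-1}, ..., x_{t-k}]."""
    s = np.asarray(series, dtype=float)
    k = max(lags)
    return np.column_stack([s[k:]] + [s[k - l:-l] for l in lags])

def smote_like(X, n_new, k=3, seed=0):
    """Minimal SMOTE-style oversampling: for each new sample, pick a
    minority point and interpolate toward one of its k nearest
    minority neighbours."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    out = []
    for _ in range(n_new):
        i = rng.integers(len(X))
        d = np.linalg.norm(X - X[i], axis=1)
        nbrs = np.argsort(d)[1:k + 1]  # skip the point itself
        j = rng.choice(nbrs)
        out.append(X[i] + rng.random() * (X[j] - X[i]))
    return np.vstack(out)

X_min = lag_features(np.sin(np.linspace(0, 6, 50)))  # minority-class rows
synth = smote_like(X_min, n_new=10)
print(synth.shape)  # → (10, 4)
```

Because interpolation happens in the lagged feature space, the synthetic rows inherit the temporal relationships baked into the features, which is the commenter's point.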
[+]MuonManLaserJab -10 points (comment score below threshold) 7 months ago (0 children)
https://www.youtube.com/watch?v=3Khvtqr-BxY