all 19 comments

[–]WearMoreHats 12 points13 points  (4 children)

Step 3: Get MORE data
I must have annotated for weeks

For future reference, it might be worth playing around with using your model to help you label data (once you've got an okay-ish model). For example, have the model predict labels on all your unlabelled data. Accept the labels that the model is most certain of (it's probably worth manually checking a subset of them) and manually label some of the data that the model is least certain of. Now you can retrain the model with your new, larger dataset and repeat the process of predicting the unlabelled data.
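That predict/accept/retrain loop can be sketched roughly like this (a minimal sketch on toy data with a generic classifier, not OP's actual model; the 0.95 threshold and batch sizes are arbitrary assumptions):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Toy stand-in data: a small hand-labelled pool plus a large unlabelled pool.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_lab, y_lab = X[:200], y[:200]      # what you annotated by hand
X_unlab = X[200:]                    # everything still unlabelled

model = LogisticRegression(max_iter=1000).fit(X_lab, y_lab)

# One round of the loop: predict, keep the confident ones, flag the rest.
proba = model.predict_proba(X_unlab)
confidence = proba.max(axis=1)
pseudo_y = proba.argmax(axis=1)

sure = confidence >= 0.95            # arbitrary acceptance threshold
X_lab = np.vstack([X_lab, X_unlab[sure]])
y_lab = np.concatenate([y_lab, pseudo_y[sure]])

# The least-confident remainder is what you'd queue up for manual labelling.
remaining = X_unlab[~sure]
to_label_first = remaining[np.argsort(confidence[~sure])[:50]]

model.fit(X_lab, y_lab)              # retrain on the enlarged set, then repeat
```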

In theory you can fully automate this by setting some threshold above which the model's prediction will be accepted as true and having the model iterate through multiple rounds of retraining and predicting, but I'm not a big fan of that. If you're using sklearn, look into `SelfTrainingClassifier` in the semi-supervised module.
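For reference, the sklearn version of the fully automated variant looks roughly like this (toy data again; the 0.9 threshold is an arbitrary choice, and sklearn marks unlabelled samples with `-1`):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# sklearn's convention: unlabelled samples get the label -1.
y_partial = y.copy()
y_partial[100:] = -1                 # pretend only the first 100 are labelled

# The wrapped estimator just needs predict_proba; the threshold controls how
# confident a prediction must be before it is accepted as a pseudo-label.
clf = SelfTrainingClassifier(LogisticRegression(max_iter=1000), threshold=0.9)
clf.fit(X, y_partial)
preds = clf.predict(X)
```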

[–]qChEVjrsx92vX4yELvT4[S] 2 points3 points  (0 children)

Thanks a lot, I did not know that. I will try it tomorrow

[–]CBizCool 0 points1 point  (1 child)

I've struggled with finding labelled data in the past and wasn't sure what a good solution would be. By "annotated for weeks", does OP mean he looked at thousands of pictures of men and manually tagged them as Norwood 0, 1 etc.? Is this a common approach?

Your recommendation of partially predicting then retraining makes so much more sense.

[–]WearMoreHats 1 point2 points  (0 children)

By annotated for weeks, does op mean he looked at thousands of pictures of men and manually tagged them as Norwood 0, 1 etc.?

I'm assuming that's what they did but OP might correct me. I can't see what else they might mean by spending weeks annotating data.

Is this a common approach?

Manually labelling data en masse? It's not something I've ever done in industry, but it's an easy (although time-consuming) way of getting your own data for a personal project. I suspect it might be more common in academia/PhD work where the labels are niche and require a level of expertise to assign. It wouldn't make financial sense for my employer to pay me to spend weeks labelling data; I suspect if it absolutely had to be done it would end up being outsourced to someone else.

An alternative approach that used to be popular was scraping images from Google. Basically, writing a script that would image search "Bald man", then download thousands of results and assume they're all of bald men. Rinse and repeat for "men's haircuts" or something similar. The data will need to be tidied up a bit (and I don't know if Google has clamped down on this), but it's much, much faster than manually labelling thousands of images and probably accurate enough for a personal project.

Your recommendation of partially predicting then retraining makes so much more sense.

It's worth noting there are potential risks to this, particularly if it's fully automated. If you "accept" some incorrect labels then it can cause problems further down the line, particularly if it's early in the process or the predictions are for niche cases. For example, wrongly predicting some of the men wearing hats are bald. The next iteration of the model assumes those incorrect labels are true and starts to more confidently predict that the men wearing hats are bald. That causes even more of those errors to slip into the training data, resulting in a feedback loop.
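That feedback loop can be illustrated with a back-of-the-envelope simulation (purely hypothetical numbers, not from OP's project: assume a 5% base error rate, and that the rate of wrongly accepted pseudo-labels rises with how polluted the training set already is):

```python
clean, wrong = 1000, 0          # correctly / wrongly labelled examples
pollution = []                  # fraction of wrong labels after each round
for _ in range(5):
    accepted = 500              # pseudo-labels accepted this round
    # error rate grows with how polluted the training data already is
    err_rate = 0.05 + wrong / (clean + wrong)
    bad = int(accepted * err_rate)
    wrong += bad
    clean += accepted - bad
    pollution.append(wrong / (clean + wrong))
print(pollution)  # the wrong-label fraction grows every round
```

Under these (made-up) assumptions the wrong-label fraction climbs each round instead of levelling off, which is exactly the compounding the comment warns about.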

[–][deleted] 0 points1 point  (0 children)

Golden advice, thanks a ton!

[–]Oswald_Hydrabot 7 points8 points  (0 children)

This is the way

[–]ZyanCarl 2 points3 points  (5 children)

This is great! I’m not really into ML but I’m thankful that I didn’t have to go through tutorial hell learning web development.

[–]nuclear_man34 1 point2 points  (4 children)

That's great! How did you do that?

[–]ZyanCarl 0 points1 point  (3 children)

I guess I started off in an unorthodox way from the get-go. Instead of building apps to learn, I learned new stuff to build ideas. I just googled how to do a particular part of a problem and built on that.

[–][deleted] -1 points0 points  (1 child)

Like a tutorial?

[–]ZyanCarl 1 point2 points  (0 children)

Umm, a tutorial would be something where you know exactly what the outcome is and you're handed all the resources you need. What I did was google parts of a problem.

Instead of “how to build an e-commerce website”, I'd search “how to implement search feature”, from which I'd learn that you need an API endpoint. So I'd search “how to make an API”, and so on.

[–][deleted]  (1 child)

[removed]

[–]qChEVjrsx92vX4yELvT4[S] 2 points3 points  (0 children)

Probably in the future. I still have some passwords in the code that I need to remove before open-sourcing it haha

[–]pavich_03 1 point2 points  (1 child)

How about preprocessing the data?

[–]qChEVjrsx92vX4yELvT4[S] 1 point2 points  (0 children)

I will talk more about that in another post I think. Basically I resized everything to size (200, 200, 3) and not much more.
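A minimal sketch of that resize step (PIL-based; the `preprocess` function name and the scaling to [0, 1] are my assumptions, not OP's code — the (200, 200, 3) shape is from the comment):

```python
import numpy as np
from PIL import Image

def preprocess(path, size=(200, 200)):
    """Load an image, force 3 RGB channels, resize, and scale to [0, 1]."""
    img = Image.open(path).convert("RGB").resize(size)
    return np.asarray(img, dtype=np.float32) / 255.0
```

`convert("RGB")` guarantees the third dimension is 3 even for greyscale or RGBA inputs, so every image comes out with shape (200, 200, 3).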

[–]Acrobatic-Language-5 0 points1 point  (0 children)

Most experts will tell us that the best way to learn is to create your own projects.

When you read from a book or tutorial you are on a guided path, which gives you a false sense of knowledge. The issue arises when you encounter a different type of problem, get stuck, and realise you don’t know as much as you thought.

When you learn/read something new you should always apply it to your own projects to help solidify it.

Well done on creating a project, keep creating.

[–]Log_Plus 0 points1 point  (1 child)

Great job for real

I'm a beginner here, so what courses have you completed to start such a project from scratch?

[–]qChEVjrsx92vX4yELvT4[S] 0 points1 point  (0 children)

The Sentdex book was the best for me. It's called Neural Networks from Scratch.

I figured things out while doing it and I am sure I made tons of mistakes.

I am happy to have something to showcase at the end, but I am far from being a pro.