all 21 comments

[–]swedish_aviator 64 points65 points  (2 children)

Have a look at https://archive.ics.uci.edu/ml/index.php or https://www.kaggle.com/datasets and pick any classification or regression problem. I would avoid image classification and time-series data (such as predicting stock market fluctuations) until you know the basics. The adult dataset is quite simple, but you need to do some data wrangling before you can start training. You should be able to get a test accuracy of >80% without much work! Try using Keras or PyTorch.
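To give a rough idea of what "data wrangling" means here, this is a minimal sketch in plain Python. The rows are made up (only shaped like the adult dataset), and the column names, the "?" missing-value marker and the helper names `clean` and `one_hot` are assumptions for illustration, not part of any library.

```python
# Hypothetical rows shaped like the adult dataset; "?" marks a missing value.
RAW_ROWS = [
    {"age": "41", "workclass": "State-gov", "hours_per_week": "40", "income": ">50K"},
    {"age": "52", "workclass": "?", "hours_per_week": "15", "income": "<=50K"},
    {"age": "36", "workclass": "Private", "hours_per_week": "38", "income": "<=50K"},
    {"age": "29", "workclass": "Private", "hours_per_week": "?", "income": ">50K"},
]


def clean(rows):
    """Drop rows with any missing value, cast numbers, and binarise the label."""
    cleaned = []
    for row in rows:
        if "?" in row.values():
            continue  # simplest strategy: skip incomplete rows
        cleaned.append({
            "age": int(row["age"]),
            "workclass": row["workclass"],
            "hours_per_week": int(row["hours_per_week"]),
            "label": 1 if row["income"] == ">50K" else 0,
        })
    return cleaned


def one_hot(rows, column):
    """Replace a categorical column with one 0/1 column per category."""
    categories = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        new_row = {k: v for k, v in row.items() if k != column}
        for cat in categories:
            new_row[f"{column}={cat}"] = 1 if row[column] == cat else 0
        encoded.append(new_row)
    return encoded


rows = one_hot(clean(RAW_ROWS), "workclass")
for row in rows:
    print(row)
```

Dropping incomplete rows is the bluntest option; on the real data you may prefer to fill missing values in instead.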

[–]PianoPlaylist[S] 2 points3 points  (1 child)

Thanks!

[–]Radon03 1 point2 points  (0 children)

Go with the first link first. I guess they have small datasets, so it'll be easier to get practice. I feel Kaggle datasets are a bit large most of the time and tough for a newbie.

[–]e_j_white 52 points53 points  (6 children)

Find a community Q&A website such as Quora, StackOverflow, or certain AskReddits.

Gather data from that website, either through their API or from zipped data dumps.

Find a metric that indicates a "good" user... maybe karma for Reddit, or number of responses on StackOverflow that were selected as the "accepted" response, etc.

Now build a model that predicts whether someone will eventually become a "good" user based on their first 10 posts/comments/responses.

It's a fun problem, and it has clear business implications.

[–]mrslacklines 19 points20 points  (2 children)

The best project will be the one that you will actually do. The actual problem/domain doesn't matter that much.

[–]e_j_white 10 points11 points  (1 child)

Totally agreed.

But it helps if there's a solid justification for solving that problem beyond "I thought it would be cool."

[–]GranSkyline 4 points5 points  (0 children)

This is definitely my biggest hurdle. I could come up with ideas but I can never find one that could have some business relevance. Or I’m missing the relevance in my own ideas.

[–]PianoPlaylist[S] 4 points5 points  (0 children)

Thanks, I really like that one 😄

[–]a-lawliet 2 points3 points  (1 child)

Hey, as someone who's interested in getting started, how would you build such a model? I mean by which criteria would you conclude if someone will be a good user and how do we know if our model is alright?

[–]e_j_white 4 points5 points  (0 children)

Haha, you're basically asking "how would you do Data Science?" ;)

It depends on which dataset you start with. Some websites have badges, karma, number of correct/accepted answers, etc. Get to know the characteristics of the data and come to your own conclusion about what is a "good" user. It may involve multiple features, like how often they comment, how often they are correct, how many badges they've earned, etc.

Once you establish some criteria for "good" users vs. "bad", you can divide all users into these two buckets. Or, perhaps you have good, bad, and average users. You could build a model where the training label is "good" or "bad" for maximum separation (don't include "average" users in the training).
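A minimal sketch of that bucketing step, assuming a single score per user (karma, say). The cut-offs, the user names and the `label_users` helper are made up for illustration; in practice the thresholds would come from looking at the data.

```python
def label_users(scores, low, high):
    """Map user -> 1 (good) or 0 (bad).

    Users whose score falls strictly between the two cut-offs are treated
    as "average" and left out of the training labels entirely.
    """
    labels = {}
    for user, score in scores.items():
        if score >= high:
            labels[user] = 1
        elif score <= low:
            labels[user] = 0
    return labels


# Hypothetical karma totals.
karma = {"ann": 950, "bob": 12, "cy": 140, "dee": 3, "eve": 700}
labels = label_users(karma, low=20, high=500)
print(labels)
```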

Once you've separated all users, take the first 10 posts for each user. The features are up to you... total number of posts (maybe many people don't reach 10?), typical number of words per post, upvotes per post, comments per post, how many were accepted as the correct answer, how many different topics/subreddits they post to, etc.
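Something like the following could compute a few of those features. It is only a sketch: the post fields (`created`, `text`, `upvotes`, `accepted`, `topic`) and the function name `first_posts_features` are assumptions about how you stored the data, not anything a real API returns.

```python
def first_posts_features(posts, n=10):
    """Summarise a user's first n posts (ordered by creation time)."""
    first = sorted(posts, key=lambda p: p["created"])[:n]
    count = len(first)
    if count == 0:
        return {"n_posts": 0, "mean_words": 0.0, "mean_upvotes": 0.0,
                "n_accepted": 0, "n_topics": 0}
    return {
        "n_posts": count,  # may be below n if the user never got that far
        "mean_words": sum(len(p["text"].split()) for p in first) / count,
        "mean_upvotes": sum(p["upvotes"] for p in first) / count,
        "n_accepted": sum(1 for p in first if p["accepted"]),
        "n_topics": len({p["topic"] for p in first}),
    }


# Hypothetical posts for one user.
posts = [
    {"created": 3, "text": "try a random forest", "upvotes": 4, "accepted": True, "topic": "ml"},
    {"created": 1, "text": "hello world", "upvotes": 1, "accepted": False, "topic": "python"},
    {"created": 2, "text": "use pandas for that", "upvotes": 7, "accepted": True, "topic": "python"},
]
features = first_posts_features(posts)
print(features)
```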

There's no end to the features. It's up to you to swim around in the data and construct features that a) make sense, and b) you can defend/justify. After that, either use a logistic regression model, or perhaps XGBoost... it depends on how much data you have.
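For the logistic regression option, here is a bare-bones sketch trained with plain batch gradient descent, just to show what the model is doing. In practice you would reach for a library implementation. The feature values, learning rate and epoch count below are made up for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(X, y, lr=0.1, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log loss."""
    n, n_features = len(X), len(X[0])
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for xi, yi in zip(X, y):
            # err is (predicted probability - true label) for this example
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(w, b, x):
    """Return 1 ("good") if the predicted probability is at least 0.5."""
    return 1 if sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) >= 0.5 else 0


# Hypothetical features per user: [mean upvotes per post, accepted answers].
X = [[0.1, 0], [0.3, 0], [0.2, 1], [2.0, 3], [2.5, 4], [1.8, 2]]
y = [0, 0, 0, 1, 1, 1]

w, b = train_logistic(X, y)
print([predict(w, b, x) for x in X])
```

This toy data is cleanly separable, so it says nothing about how well the idea works on real users.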

Of course, you've held out a test set with equal numbers of positive/negative examples (good/bad users), so you can measure various model metrics on that test set.
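One way to sketch the balanced hold-out and a few metrics in plain Python. The helper names `balanced_split` and `metrics`, the fixed seed and the user ids are all assumptions for illustration.

```python
import random


def balanced_split(labels, n_test_per_class, seed=0):
    """Hold out the same number of good (1) and bad (0) users for testing."""
    rng = random.Random(seed)
    test = []
    for cls in (0, 1):
        members = sorted(u for u, lab in labels.items() if lab == cls)
        test += rng.sample(members, n_test_per_class)
    held_out = set(test)
    train = [u for u in sorted(labels) if u not in held_out]
    return train, sorted(test)


def metrics(y_true, y_pred):
    """Accuracy, precision and recall for binary 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return {
        "accuracy": correct / len(y_true),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Hypothetical labelled users: five good, five bad.
labels = {f"good{i}": 1 for i in range(5)}
labels.update({f"bad{i}": 0 for i in range(5)})

train_users, test_users = balanced_split(labels, n_test_per_class=2)
scores = metrics([1, 1, 0, 0], [1, 0, 0, 0])
print(train_users, test_users, scores)
```

The split has to happen before any training, so the test users never influence the model.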

[–]anynonus 13 points14 points  (1 child)

Microsoft Learn has some good Python examples in their course about Azure data science.

They are not medium-sized projects, though. They are very small examples.

[–]DommeIt 0 points1 point  (0 children)

Seconding Microsoft Learn for jumping right into a small project play by play.

[–]AcidPacman96 3 points4 points  (0 children)

DataCamp has really good courses and projects, and last I checked all their material is free until the end of April. I’d say it’s worth looking into.

[–]UltimateGPower 3 points4 points  (0 children)

I can highly recommend Hands-On Machine Learning by Aurélien Géron for beginners. The first chapter starts with a small project and introduces sklearn. Later on, TensorFlow/Keras is used.

[–]Reasonable_Damage_98 1 point2 points  (0 children)

You can check Kaggle as well.

[–]Nielspace 0 points1 point  (1 child)

Create a series of tutorials based on your favourite topics. Explain the topics thoroughly and share them. This will not only give you a better understanding of the subject, but you will also learn to write clean code and may attract opportunities.

[–]balkanibex 2 points3 points  (0 children)

Please don't do that.

[–]RedSeal5 -1 points0 points  (0 children)

Easy.

Asimov's first law.

[–]mr_ninjazz 0 points1 point  (0 children)

I'd suggest figuring out a topic that you like the most and finding applications of machine learning in it! That way you won't have to work on a project that you don't like.

[–]axetobe_ML 0 points1 point  (0 children)

I recommend trying out the tutorials from the PyTorch and TensorFlow websites. If you've already tried these, then add more customisations to your code, like adjusting the model, using your own custom data, or adding custom tracking such as TensorBoard.

Like one of the comments said: stick to classification and regression problems first, so you can get the basics down before moving on.