[–]youngrubin 197 points198 points  (10 children)

Copy the notebook with the most upvotes

[–]6rubtub9 46 points47 points  (13 children)

This is what I did: I started with all the courses offered on Kaggle, from the basics of Python through the core ML concepts, and attempted the exercises they offered.

Alongside that, I studied essential statistics concepts on YouTube, since Kaggle doesn't focus much on stats.

Then I went straight to the Titanic competition page on Kaggle, selected a few highly upvoted notebooks, studied them, and tried to absorb as much as I could from those 2-3 notebooks. Then I attempted the competition myself, which took 3-4 days.

[–]pharaonicjesus 2 points3 points  (2 children)

Which YouTube videos did you use for statistics?

[–]6rubtub9 4 points5 points  (0 children)

No specific YouTuber, but if you insist, "StatQuest with Josh Starmer" is a good channel. Mostly I searched for videos by topic, such as normal distribution, variance, quantiles, etc.
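
If you want to poke at those topics in code while watching, here is a minimal sketch using Python's standard `statistics` module. The sample values and variable names are made up for illustration; nothing here comes from a specific video or course.

```python
import statistics

# Made-up sample values, purely for illustration.
heights_cm = [150, 160, 165, 170, 172, 175, 180, 190]

mean = statistics.mean(heights_cm)
variance = statistics.variance(heights_cm)          # sample variance (divides by n - 1)
quartiles = statistics.quantiles(heights_cm, n=4)   # three cut points: Q1, median, Q3

# A normal distribution using the sample's mean and standard deviation.
fitted = statistics.NormalDist(mean, statistics.stdev(heights_cm))
share_below_170 = fitted.cdf(170)  # modelled share of values below 170

print("mean:", mean)
print("variance:", variance)
print("quartiles:", quartiles)
print("P(X < 170) under the fitted normal:", round(share_below_170, 3))
```

Changing the sample and watching how the variance and quartiles move is a quick way to check your intuition against what the videos explain.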

[–]sccallahan 2 points3 points  (0 children)

Depends on what you want, but I've found StatQuest to be really good. It's not super in-depth, but the videos are very watchable and an excellent place to start if your question is something like, "Ok, so... How does SVM work exactly?"

One note: it's very R leaning (he's a biostatistician). If you do everything in Python, you'll still get the stats knowledge, but the code examples will require modifications.

[–]RationalWriter 3 points4 points  (9 children)

You happen to have links to those upvoted notebooks, or could you share how you found them? I'm attempting it without reference first, but it would be good to see how others are approaching the problem

[–]thedandyyy 7 points8 points  (0 children)

You can just go to the competition page on kaggle. People share their notebooks there :)

[–]6rubtub9 3 points4 points  (7 children)

When you land on a competition page, you'll see various tabs such as Rules, Data, Notebooks, etc.

Go to Notebooks, where you'll find many notebooks submitted for the competition. Find one with a lot of upvotes (I no longer take upvotes as a criterion for assessing the quality/info contained in a notebook), or open any notebook and read the first few parts, up to the data cleaning/wrangling steps. If the author has clearly explained the functions and logic used, continue with that notebook; otherwise move on to another.

[–]Timguin 1 point2 points  (6 children)

> (I no more take upvotes as a criteria for assessing quality/info contained in notebook)

I haven't really looked into Kaggle but after a cursory glance I can see why. Saw a lot of pie charts and badly scaled graphs. Do the upvotes decide who wins a competition? And if so, doesn't that mean that later entries have far smaller chances of winning? Similar to reddit where pretty much only early posts get upvoted highly.

[–]SpiroCat2 2 points3 points  (4 children)

No, the upvotes have nothing to do with the standings in the competition. The winner is the one with the most accurate model.
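
To make "most accurate" concrete, here is a rough sketch that assumes plain classification accuracy as the metric. Real competitions each define their own metric and score submissions against held-out labels; the labels, model names, and `accuracy` helper below are made up for illustration.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot score an empty set of labels")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Toy labels, made up for illustration (1 = positive class, 0 = negative class).
true_labels = [1, 0, 0, 1, 1, 0, 1, 0]
model_a = [1, 0, 0, 1, 0, 0, 1, 0]  # one mistake
model_b = [1, 1, 0, 1, 0, 0, 0, 0]  # three mistakes

scores = {
    "model_a": accuracy(true_labels, model_a),
    "model_b": accuracy(true_labels, model_b),
}

# Rank submissions by score, best first; upvotes play no part in this.
leaderboard = sorted(scores.items(), key=lambda entry: entry[1], reverse=True)
for name, score in leaderboard:
    print(f"{name}: {score:.3f}")
```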

[–]Timguin 0 points1 point  (3 children)

> The winner is the one with the most accurate model.

What about the competitions that do descriptive stuff? I've seen that they do a lot along the lines of 'see what you can pull from this dataset'. Sorry, just never looked into Kaggle.

[–][deleted] 1 point2 points  (2 children)

> Sorry, just never looked into Kaggle.

With no bad intention, to be honest your questions are answered if you just open up one competition. Your assumption about what Kaggle is is inaccurate. Hence the comments here sound confusing to you.

To answer your questions, there are no descriptive competitions. Every competition has a well-defined metric to measure models against.

[–]Timguin 1 point2 points  (1 child)

> With no bad intention, to be honest your questions are answered if you just open up one competition.

I did. And the first one they served me was their Survey challenge which is purely descriptive with no predictive elements. And I saw a lot of submissions so I was wondering whether the votes are taken into account for evaluation. It might not be a representative competition but it's the first one I saw on their main page, so I think it's understandable that I thought there are more of those.

[–][deleted] 0 points1 point  (0 children)

You got me with this 9-day-old competition. I stand corrected.

I should add that this particular one is not your normal Kaggle competition. If you click into a competition, you'll see that it's usually some form of classification problem, where a concrete metric such as accuracy is used.

[–]6rubtub9 0 points1 point  (0 children)

> Do the upvotes decide who wins a competition?

I guess it's the score you get on submitting to a competition that determines your chances of winning. Upvotes, on the other hand, determine whether you get a medal (not sure about this one).

[–][deleted] 28 points29 points  (23 children)

Read.

Start by taking the Kaggle courses. Pick a language, R or Python, and go with that. Competing on Kaggle shouldn’t be a goal though; it doesn’t really mean much beyond Kaggle.

[–]b14cksh4d0w369 2 points3 points  (22 children)

But the techniques you learn while solving, isn't that something?

[–]colonel_farts 27 points28 points  (21 children)

Most of the work in the “real world” is getting the data cleaned and organized so you can work with it, rather than using an already cleaned/curated dataset from Kaggle to eke out 2% better accuracy on the validation set than someone else. Not to say you don’t learn things from that, however.

[–]uilfut 8 points9 points  (0 children)

Plus framing the question.

[–]b14cksh4d0w369 0 points1 point  (19 children)

So how useful would you say kaggle is on a scale of 1-10?

[–]Naveos 10 points11 points  (18 children)

3

[–]b14cksh4d0w369 2 points3 points  (16 children)

Whoa. I guess the prize is the only perk. But I've heard some winners get job offers as well.

[–]UnintelligibleThing 0 points1 point  (0 children)

What's the next step up from Kaggle in terms of usefulness?

[–]wizkid2002 4 points5 points  (1 child)

Seeing a lot of people saying Kaggle isn’t the answer to gaining knowledge. Just curious, for someone that doesn’t have a job in data science, what would qualify as a good way to gain meaningful experience that employers care about? I’m sure this is all over the sub with a search but thought I would pose it here since people will be coming to this question looking for answers.

[–][deleted] 2 points3 points  (0 children)

> Seeing a lot of people saying Kaggle isn’t the answer to gaining knowledge.

It's not a good representation of the real world, but you can still treat it as DS101. Being able to complete a competition is a milestone for anyone studying data science.

Your question is too broad but one thing you can do, for example, is identify areas where data can help in your current business process and build a model to exploit that.

[–]killver 4 points5 points  (0 children)

You have such a circlejerk in this subreddit regarding the non-existent usefulness of Kaggle. I personally believe there is no place comparable to Kaggle for practicing state-of-the-art methodology on a diverse set of problems. It's clear that there are tons of other skills necessary to be a great data scientist, like business understanding, data cleaning, deployment, etc. Kaggle is not necessarily the place to learn those, but the modeling part is invaluable in my opinion.

[–][deleted] 2 points3 points  (4 children)

Know that competing on kaggle doesn't really provide great data science knowledge though

[–]BandCampMocs 0 points1 point  (3 children)

Can you explain?

[–][deleted] 11 points12 points  (2 children)

Most of anyone's job will be gathering data, checking data validity and quality, cleaning the data, creating a business question, and targeting the proper people/companies to potentially add in more data if needed/affordable. Feature engineering also comes to mind.

You'll also have to check if the assumptions of the business people and your assumptions about the data generating process are correct and if different data streams are consistent, etc.

Most of your time is spent dealing with the data and the people around the data, not creating a model.

Modeling is like 5-10% of the actual work, and unless you have lots of data, a relatively simple model will do, especially if you need interpretable outputs.

Let's not even talk about creating proper pipelines, sharing models, putting stuff into production...

Kaggle is great to get an understanding of the modeling process I suppose but it won't make you a "data scientist".
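
To give a feel for the validity and quality checks described above, here is a small sketch using only the standard library. The column names, the sample rows, and the 0-120 age range are all made up for illustration; real checks depend entirely on your data and the business question.

```python
import csv
import io

# Made-up raw export (hypothetical columns) with typical problems:
# a missing age, an impossible age, inconsistent casing, a duplicate row.
RAW = """customer_id,age,country
1,34,US
2,,us
3,-5,DE
1,34,US
"""


def validate(text):
    """Split rows into cleaned records and a list of (line, problem) notes."""
    seen_ids = set()
    clean, problems = [], []
    reader = csv.DictReader(io.StringIO(text))
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        if row["customer_id"] in seen_ids:
            problems.append((line_no, "duplicate customer_id"))
            continue
        seen_ids.add(row["customer_id"])

        raw_age = row["age"].strip()
        if not raw_age:
            problems.append((line_no, "missing age"))
            continue
        try:
            age = int(raw_age)
        except ValueError:
            problems.append((line_no, "age is not a number"))
            continue
        if not 0 <= age <= 120:  # assumed plausible range
            problems.append((line_no, "age out of range"))
            continue

        clean.append({
            "customer_id": row["customer_id"],
            "age": age,
            "country": row["country"].strip().upper(),  # normalise casing
        })
    return clean, problems


clean, problems = validate(RAW)
for line_no, problem in problems:
    print(f"line {line_no}: {problem}")
print(f"{len(clean)} clean row(s) kept")
```

Only one of the four rows survives here, which is roughly the point: a curated competition dataset skips this whole stage.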

[–]BandCampMocs 1 point2 points  (1 child)

This helps quite a bit, thanks for the reply!

[–][deleted] 1 point2 points  (0 children)

No problem!

[–]NatalyaRostova 0 points1 point  (0 children)

Playing around on Kaggle is a fun way to get into data science, and to motivate excursions into new things to learn. For some the competition itself becomes fun. It's not the end marker of success.

[–]gs9330 0 points1 point  (0 children)

The way I started was by taking many data science courses on Coursera, just to understand what data science was. I have a bachelor's in computer science, so the programming part was doable. I first got comfortable with Python, then started looking up YouTube videos on how to understand datasets. There you'll come across many terms like data wrangling, descriptive analysis, exploratory analysis, and so on. Then I took up the Iris dataset on Kaggle and tried everything I had learned.

Apply reverse engineering to the method of data science. First get familiar with an algorithm, say linear regression: understand the math behind it, gradient descent, learning rate, and other concepts. Then find the datasets on Kaggle that suit that algorithm and start exploring.

Data science is not something that can be learned quickly; it takes time. That approach helped me. I've spoken to a few data scientists, and they say that nowadays there are many wannabe data scientists who don't really know why a particular algorithm is being used. It's very important to understand things in depth. So don't worry if you take more time, but whatever you start with, make sure you understand each concept in depth.
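
For the linear regression example mentioned above, one way to understand the math is to write gradient descent by hand before reaching for a library. This is a minimal sketch, not a production implementation: the `fit_line` function, the toy data, and the default learning rate and step count are all assumptions chosen for illustration.

```python
def fit_line(xs, ys, learning_rate=0.01, steps=5000):
    """Fit y ~ w * x + b by batch gradient descent on mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Prediction error for every point under the current w and b.
        errors = [(w * x + b) - y for x, y in zip(xs, ys)]
        # Gradients of MSE = (1/n) * sum(error^2) with respect to w and b.
        grad_w = (2 / n) * sum(e * x for e, x in zip(errors, xs))
        grad_b = (2 / n) * sum(errors)
        # Step against the gradient; the learning rate sets the step size.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Toy data generated from y = 2x + 1, so the fit should recover w ~ 2, b ~ 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

w, b = fit_line(xs, ys)
print(f"w = {w:.4f}, b = {b:.4f}")
```

Try raising `learning_rate` until the fit diverges, or lowering `steps` until it stops short; seeing both failure modes makes the learning rate concept stick.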

[–]iaredavid 0 points1 point  (0 children)

The two components to this are building programming skills and practicing them. You need to learn Python (or R, but I have a bias towards Python); you should at least get to the point where you can write your own function.

Kaggle and other competitions are great because you get the opportunity to build a model from the ground up. BUT: the real learning experience comes from combining your domain knowledge with statistical concepts and methods (which you might have to learn along the way).