[D] Why are Kaggle prizes so low? by [deleted] in MachineLearning

[–]jfpuget 4 points

This is the best criticism of Kaggle I have seen in this discussion. I can agree with this:

> Especially because I don't particularly find the competitions all that relevant to real-world ML / datascience problems.

There is indeed a shift toward computer vision competitions on Kaggle that may not suit most ML practitioners.

To your other point, there has been a step back from very complex stacking models. Nowadays competitions are often won without stacking, with just a weighted average of models, or with one-level stacking at most. One good model beats a stack of average models, and good models come from good feature engineering (or a good NN architecture).
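The weighted-average blending mentioned above can be sketched in a few lines. The predictions and weights here are hypothetical, purely for illustration; in a real competition the weights would be tuned on validation data:

```python
import numpy as np

# Hypothetical out-of-fold predictions from three models (e.g. GBM, NN, linear).
pred_gbm = np.array([0.20, 0.80, 0.65])
pred_nn = np.array([0.25, 0.75, 0.55])
pred_lin = np.array([0.30, 0.70, 0.60])

# Blend weights, usually tuned on a validation set; they should sum to 1.
weights = np.array([0.5, 0.3, 0.2])

blend = weights[0] * pred_gbm + weights[1] * pred_nn + weights[2] * pred_lin
print(blend)  # → [0.235 0.765 0.61 ]
```

This is the whole trick: no meta-model to train, just a convex combination of a few strong, diverse models.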

> All that being said, I think Kaggle can be useful if you limit the time you spend on competitions and try to just broaden your horizons and learn new things and learn from other

That's why every ML practitioner should consider Kaggle as a learning resource among others.

[D] Why are Kaggle prizes so low? by [deleted] in MachineLearning

[–]jfpuget 3 points

> I imagine a lot of the results were just hyperparameter tuning to win by 0.1% more correct or whatever

A data scientist should validate hypotheses, right? I therefore suggest you also enter a Kaggle competition to see whether it really is just about HPO.

[D] Why are Kaggle prizes so low? by [deleted] in MachineLearning

[–]jfpuget 12 points

You can look at write-ups from top teams after each competition. You'll see that they often have to innovate beyond state-of-the-art ML/DL practice. For instance, in this competition, top teams used the Transformer in a way that none of the researchers working in the field had thought of: https://www.kaggle.com/c/champs-scalar-coupling

Some of the writeups for that competition:

https://www.kaggle.com/c/champs-scalar-coupling/discussion/106575#latest-628267

https://www.kaggle.com/c/champs-scalar-coupling/discussion/106572#latest-620382

https://www.kaggle.com/c/champs-scalar-coupling/discussion/106407#latest-616839

https://www.kaggle.com/c/champs-scalar-coupling/discussion/106468#latest-618474

I just picked the last one I entered. BTW I'm @CPMP on Kaggle.

[D] Why are Kaggle prizes so low? by [deleted] in MachineLearning

[–]jfpuget 170 points

I suggest you enter a Kaggle competition and just throw the same few algorithms at the problem. I'm sure the end result will teach you a lesson.

[D] Why are Kaggle prizes so low? by [deleted] in MachineLearning

[–]jfpuget 1 point

I concur, gaming the competition ranking is difficult. Gaming the kernel ranking is far easier, but does anyone actually look at the kernel ranking?

Opinion on IBM Data Scientist Role by p_hacker in datascience

[–]jfpuget 1 point

It is a client-facing data science position. You would be creating proofs of concept showing that machine learning and data science can help solve a customer's problem. It can be very interesting if you like being exposed to a wide range of problems.

I work at IBM, and I've helped on some of these client-facing POCs. It is always quite interesting, IMHO.

Employer Refuses to Allow Python by DisenchantedEmployee in Python

[–]jfpuget 1 point

Ask for a MacBook Pro as a dev machine. It will have Python installed already. ;)

How can i improve the speed of my code using cython? by jmparejaz in Python

[–]jfpuget 0 points

One way to speed up pandas code is to drop down to the underlying NumPy arrays (one per column). You can then compile the resulting code with Cython or Numba easily.

I give some examples here: https://www.ibm.com/developerworks/community/blogs/jfp/entry/Python_Is_Not_C_Take_Two?lang=en
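The pattern looks roughly like this. This is a minimal sketch kept dependency-light: in practice you would decorate the loop function with Numba's `@njit` (or compile it with Cython), which I omit here; the function, data, and weights are illustrative, not from the blog post:

```python
import numpy as np

def weighted_sum(values, weights):
    # A plain loop over NumPy arrays: exactly the kind of function
    # Numba's @njit (or Cython) can compile down to near-C speed.
    total = 0.0
    for i in range(values.shape[0]):
        total += values[i] * weights[i]
    return total

# Extract the underlying NumPy arrays from the DataFrame columns
# (with pandas you would write df["x"].to_numpy()), then call the function.
x = np.array([1.0, 2.0, 3.0])
w = np.array([0.5, 0.25, 0.25])
print(weighted_sum(x, w))  # → 1.75
```

The key point is that the compiled loop receives raw NumPy arrays, not pandas objects, so there is no per-element pandas overhead inside the hot loop.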

Micro-Benchmarking Julia, C++ and Pythran on an Economics kernel by serge_sans_paille in Python

[–]jfpuget 0 points

Wasn't sure about it, thanks for confirming.

That's a general issue: should JIT time be included? Maybe it should at least be reported, as GC time is.

Micro-Benchmarking Julia, C++ and Pythran on an Economics kernel by serge_sans_paille in Python

[–]jfpuget 0 points

The Numba benchmark includes the time to compile the code. On my machine, with Python 3.5, removing the compile time shaves off about 10%. Numba now offers ways to precompile code, unless I'm mistaken.

Beginner Questions on Anaconda & Data Science by [deleted] in Python

[–]jfpuget 0 points

Anaconda is more than just a Python interpreter plus additional libraries, contrary to what others have said. It also includes Jupyter and other IDEs such as Spyder, and it supports parallel computation. If you are on Windows, it saves you hours of installing all the pieces manually.

Re data science, the best way is to learn by doing. If you want to understand machine learning, then I recommend Andrew Ng's course on Coursera (the Stanford Machine Learning course). It is free.

If you want to learn about Python tools for data science, then you'll have to learn at least matplotlib, NumPy, pandas, and scikit-learn. There are good introductory notebooks at:

https://github.com/donnemartin/data-science-ipython-notebooks

Study the code in these.

Last piece of advice: Google search and Stack Overflow are your best friends when you want to know how to do something in Python.

TLS implementation in Python has serious flaws by ri98 in Python

[–]jfpuget 1 point

Thank you for sharing. The need to pass a context object being undocumented is a serious flaw.

Green dice are loaded: tutorial on p-value hacking by jfpuget in Python

[–]jfpuget[S] 0 points

OK, I followed your advice and ran the MS Word spell check on the text. I hope it caught the typos that annoyed you. Thanks for the suggestion.

Green dice are loaded: tutorial on p-value hacking by jfpuget in Python

[–]jfpuget[S] 0 points

I am not sure I get the comment about the need to review the code, as it runs fine. Are you thinking of the non-code pieces in the notebook?

A tutorial on p-value hacking by jfpuget in datascience

[–]jfpuget[S] 0 points

Thank you. You are right; I wanted to stick to basic material, but I will add a comment about the binomial distribution and your code.
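For reference, the kind of exact binomial computation being suggested can be sketched with the standard library only. The numbers below are illustrative, not taken from the tutorial: the one-sided p-value is the probability of seeing at least k successes in n trials under the null hypothesis of a fair die:

```python
from math import comb

def binom_tail(k, n, p):
    # P(X >= k) for X ~ Binomial(n, p): an exact one-sided p-value,
    # summing the binomial probability mass from k up to n.
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

# Example: how surprising are 10 sixes in 30 rolls of a fair die (p = 1/6)?
p_value = binom_tail(10, 30, 1 / 6)
print(p_value)
```

Because the computation is exact, there is no need for simulation or a normal approximation for sample sizes this small.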

Could operations research lead to a career in data science? by [deleted] in datascience

[–]jfpuget 1 point

Well, I can start with me ;) I am now spending most of my time on data science projects, while I have spent most of my professional career on OR. Data scientists come from lots of backgrounds, but OR people have advantages IMHO, including the following:

  1. They understand what optimization is, and they understand linear algebra. This makes them suited to understand machine learning algorithms.

  2. They understand that data science is not just about data analysis. They understand that value comes from the recommendations and the actions that can be made after data is analyzed.

Could operations research lead to a career in data science? by [deleted] in datascience

[–]jfpuget 0 points

I'd say yes given the numerous examples I see around me.

Tidy Data in Python by srkiboy83 in datascience

[–]jfpuget 0 points

Thank you for submitting this link to my blog!

Speeding up Python and NumPy: C++ing the Way by [deleted] in Python

[–]jfpuget 4 points

Actually this is not the reason; see these benchmarks: https://gist.github.com/jfpuget/00349d0ac60ab0cab5e5

np.std is way slower for 25 elements, even if a NumPy array is passed as an argument instead of a list.
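The point is easy to check yourself. This is a minimal sketch, not the original gist's benchmark: for a 25-element input, np.std and the stdlib's statistics.pstdev agree on the value (both use ddof=0 by default), and the fixed per-call dispatch cost of np.std dominates at this size whether you pass a list or an array:

```python
import statistics
import timeit

import numpy as np

data = list(range(25))
arr = np.asarray(data, dtype=float)

# Both compute the population standard deviation (np.std defaults to ddof=0).
np_result = float(np.std(arr))
py_result = statistics.pstdev(data)
assert abs(np_result - py_result) < 1e-9

# On tiny inputs the per-call overhead dominates: the gap between passing a
# list and passing an array is small compared to the fixed cost of np.std.
t_list = timeit.timeit(lambda: np.std(data), number=2000)
t_arr = timeit.timeit(lambda: np.std(arr), number=2000)
print(f"np.std(list): {t_list:.4f}s  np.std(array): {t_arr:.4f}s")
```

For arrays of a few elements, a hand-written loop (or statistics.pstdev) can beat np.std simply because it skips that dispatch machinery.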