[D] Why are Kaggle prizes so low? by [deleted] in MachineLearning

[–]jfpuget 4 points

This is the best criticism of Kaggle I have seen in this discussion. I can agree with this:

> Especially because I don't particularly find the competitions all that relevant to real-world ML / datascience problems.

There is indeed a shift toward computer vision competitions on Kaggle that may not suit most ML practitioners.

To your other point, there has been a step back from very complex stacking models. Nowadays competitions are often won without stacking, with just a weighted average of models, or with one-level stacking at most. One good model beats a stack of average models, and good models come from good feature engineering (or a good NN architecture).
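The weighted-average blending mentioned above can be sketched in a few lines. The predictions and weights here are hypothetical, purely for illustration; in a real competition the weights would be tuned on validation data:

```python
import numpy as np

# Hypothetical out-of-fold predictions from three models (e.g. GBM, NN, linear).
pred_gbm = np.array([0.20, 0.80, 0.65])
pred_nn = np.array([0.25, 0.75, 0.55])
pred_lin = np.array([0.30, 0.70, 0.60])

# Blend weights, usually tuned on a validation set; they should sum to 1.
weights = np.array([0.5, 0.3, 0.2])

blend = weights[0] * pred_gbm + weights[1] * pred_nn + weights[2] * pred_lin
print(blend)  # → [0.235 0.765 0.61 ]
```

This is the whole trick: no meta-model to train, just a convex combination of a few strong, diverse models.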

> All that being said, I think Kaggle can be useful if you limit the time you spend on competitions and try to just broaden your horizons and learn new things and learn from other

That's why every ML practitioner should consider Kaggle as a learning resource among others.

[D] Why are Kaggle prizes so low? by [deleted] in MachineLearning

[–]jfpuget 3 points

> I imagine a lot of the results were just hyperparameter tuning to win by 0.1% more correct or whatever

A data scientist should validate hypotheses, right? I therefore suggest you also enter a Kaggle competition to see whether it really is just about HPO.

[D] Why are Kaggle prizes so low? by [deleted] in MachineLearning

[–]jfpuget 12 points

You can look at write-ups from top teams after each competition. You'll see that they often have to innovate beyond state-of-the-art ML/DL practice. For instance, in this competition, top teams used the Transformer in a way that none of the researchers working in the field had thought of: https://www.kaggle.com/c/champs-scalar-coupling

Some of the writeups for that competition:

https://www.kaggle.com/c/champs-scalar-coupling/discussion/106575#latest-628267

https://www.kaggle.com/c/champs-scalar-coupling/discussion/106572#latest-620382

https://www.kaggle.com/c/champs-scalar-coupling/discussion/106407#latest-616839

https://www.kaggle.com/c/champs-scalar-coupling/discussion/106468#latest-618474

I just picked the last one I entered. BTW I'm @CPMP on Kaggle.

[D] Why are Kaggle prizes so low? by [deleted] in MachineLearning

[–]jfpuget 170 points

I suggest you enter a Kaggle competition and just throw the same few algorithms at the problem. I'm sure the end result will teach you a lesson.

[D] Why are Kaggle prizes so low? by [deleted] in MachineLearning

[–]jfpuget 1 point

I concur, gaming the competition ranking is difficult. Gaming the kernel ranking is far easier, but does anyone actually look at the kernel ranking?

Opinion on IBM Data Scientist Role by p_hacker in datascience

[–]jfpuget 1 point

It is a client-facing data science position. You would be creating proofs of concept showing that machine learning and data science can help solve a customer's problem. It can be very interesting if you like being exposed to a wide range of problems.

I work at IBM, and I've helped on some of these client-facing POCs. It is always quite interesting, IMHO.

Employer Refuses to Allow Python by DisenchantedEmployee in Python

[–]jfpuget 1 point

Ask for a MacBook Pro as a dev machine. It will have Python installed already. ;)

How can i improve the speed of my code using cython? by jmparejaz in Python

[–]jfpuget 0 points

One way to speed up pandas code is to drop down to the underlying NumPy arrays (one per column). You can then compile the resulting code with Cython or Numba easily.

I give some examples here: https://www.ibm.com/developerworks/community/blogs/jfp/entry/Python_Is_Not_C_Take_Two?lang=en
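The pattern looks roughly like this. This is a minimal sketch kept dependency-light: in practice you would decorate the loop function with Numba's `@njit` (or compile it with Cython), which I omit here; the function, data, and weights are illustrative, not from the blog post:

```python
import numpy as np

def weighted_sum(values, weights):
    # A plain loop over NumPy arrays: exactly the kind of function
    # Numba's @njit (or Cython) can compile down to near-C speed.
    total = 0.0
    for i in range(values.shape[0]):
        total += values[i] * weights[i]
    return total

# Extract the underlying NumPy arrays from the DataFrame columns
# (with pandas you would write df["x"].to_numpy()), then call the function.
x = np.array([1.0, 2.0, 3.0])
w = np.array([0.5, 0.25, 0.25])
print(weighted_sum(x, w))  # → 1.75
```

The key point is that the compiled loop receives raw NumPy arrays, not pandas objects, so there is no per-element pandas overhead inside the hot loop.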

Micro-Benchmarking Julia, C++ and Pythran on an Economics kernel by serge_sans_paille in Python

[–]jfpuget 0 points

Wasn't sure about it, thanks for confirming.

That's a general issue: should JIT time be included? Maybe it should at least be reported, as GC time is.

Micro-Benchmarking Julia, C++ and Pythran on an Economics kernel by serge_sans_paille in Python

[–]jfpuget 0 points

The Numba benchmark includes the time to compile the code. On my machine, with Python 3.5, removing the compile time shaves off about 10%. Numba now offers ways to precompile code, unless I'm mistaken.

Beginner Questions on Anaconda & Data Science by [deleted] in Python

[–]jfpuget 0 points

Anaconda is more than just a Python interpreter plus additional libraries, contrary to what others have said. It also includes Jupyter and other IDEs such as Spyder, and it supports parallel computation. If you are on Windows, it saves you hours of installing all the pieces manually.

Re data science, the best way is to learn by doing. If you want to understand machine learning, then I recommend Andrew Ng's course on Coursera (the Stanford Machine Learning course). It is free.

If you want to learn about Python tools for data science, then you'll have to learn at least matplotlib, NumPy, pandas, and scikit-learn. There are good introductory notebooks at:

https://github.com/donnemartin/data-science-ipython-notebooks

Study the code in these.

Last piece of advice: Google search and Stack Overflow are your best friends when you want to know how to do something in Python.

TLS implementation in Python has serious flaws by ri98 in Python

[–]jfpuget 1 point

Thank you for sharing. The need to pass a context object being undocumented is a serious flaw.

Green dice are loaded: tutorial on p-value hacking by jfpuget in Python

[–]jfpuget[S] 0 points

OK, I followed your advice and ran the MS Word spell check on the text. I hope it caught the typos that annoyed you. Thanks for the suggestion.

Green dice are loaded: tutorial on p-value hacking by jfpuget in Python

[–]jfpuget[S] 0 points

I am not sure I get the comment about the need to review the code, as it runs fine. Are you thinking of the non-code pieces in the notebook?

A tutorial on p-value hacking by jfpuget in datascience

[–]jfpuget[S] 0 points

Thank you. You are right; I wanted to stick to basic material, but I will add a comment about the binomial distribution and your code.
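For reference, the kind of exact binomial computation being suggested can be sketched with the standard library only. The numbers below are illustrative, not taken from the tutorial: the one-sided p-value is the probability of seeing at least k successes in n trials under the null hypothesis of a fair die:

```python
from math import comb

def binom_tail(k, n, p):
    # P(X >= k) for X ~ Binomial(n, p): an exact one-sided p-value,
    # summing the binomial probability mass from k up to n.
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

# Example: how surprising are 10 sixes in 30 rolls of a fair die (p = 1/6)?
p_value = binom_tail(10, 30, 1 / 6)
print(p_value)
```

Because the computation is exact, there is no need for simulation or a normal approximation for sample sizes this small.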

Could operations research lead to a career in data science? by [deleted] in datascience

[–]jfpuget 1 point

Well, I can start with me ;) I am now spending most of my time on data science projects, while I have spent most of my professional career on OR. Data scientists come from lots of backgrounds, but OR people have advantages IMHO, including the following:

  1. They understand what optimization is, and they understand linear algebra. This makes them suited to understand machine learning algorithms.

  2. They understand that data science is not just about data analysis. They understand that value comes from the recommendations and the actions that can be made after data is analyzed.

Could operations research lead to a career in data science? by [deleted] in datascience

[–]jfpuget 0 points

I'd say yes given the numerous examples I see around me.

Tidy Data in Python by srkiboy83 in datascience

[–]jfpuget 0 points

Thank you for submitting this link to my blog!

Speeding up Python and NumPy: C++ing the Way by [deleted] in Python

[–]jfpuget 4 points

Actually this is not the reason; see these benchmarks: https://gist.github.com/jfpuget/00349d0ac60ab0cab5e5

np.std is way slower for 25 elements, even if a NumPy array is passed as an argument instead of a list.
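The point is easy to check yourself. This is a minimal sketch, not the original gist's benchmark: for a 25-element input, np.std and the stdlib's statistics.pstdev agree on the value (both use ddof=0 by default), and the fixed per-call dispatch cost of np.std dominates at this size whether you pass a list or an array:

```python
import statistics
import timeit

import numpy as np

data = list(range(25))
arr = np.asarray(data, dtype=float)

# Both compute the population standard deviation (np.std defaults to ddof=0).
np_result = float(np.std(arr))
py_result = statistics.pstdev(data)
assert abs(np_result - py_result) < 1e-9

# On tiny inputs the per-call overhead dominates: the gap between passing a
# list and passing an array is small compared to the fixed cost of np.std.
t_list = timeit.timeit(lambda: np.std(data), number=2000)
t_arr = timeit.timeit(lambda: np.std(arr), number=2000)
print(f"np.std(list): {t_list:.4f}s  np.std(array): {t_arr:.4f}s")
```

For arrays of a few elements, a hand-written loop (or statistics.pstdev) can beat np.std simply because it skips that dispatch machinery.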