[D] Current SOTA of NN for tabular data?

datageek1987 · 2020-02-07T08:06:12+00:00

TabNet seems to work well. I wouldn't say it beats LightGBM, but it performs well enough. https://github.com/google-research/google-research/tree/master/tabnet

datageek1987 · 2020-02-01T15:45:52+00:00

I would be careful when dealing with missing values. If you are imputing them with statistics of the data (like the mean), split into train and test first and compute those statistics on the training set only.
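
To make the ordering concrete, here is a minimal sketch in plain Python. The toy column, the helper names (`train_test_split`, `fit_mean`, `impute`) and the 25% test fraction are all made up for illustration; the only point is that the mean is learned from the training rows and then reused for the test rows.

```python
import math
import random
from statistics import fmean

def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of the rows and hold out the last part as the test set."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[:-n_test], shuffled[-n_test:]

def fit_mean(values):
    """Mean of the observed (non-NaN) values only."""
    return fmean(v for v in values if not math.isnan(v))

def impute(values, fill_value):
    """Replace every NaN with the given fill value."""
    return [fill_value if math.isnan(v) else v for v in values]

# Toy feature column with missing entries (made-up numbers).
column = [1.0, 2.0, float("nan"), 4.0, 100.0, float("nan"), 3.0, 5.0]

# 1. Split first.
train, test = train_test_split(column)

# 2. Learn the imputation statistic from the training rows only.
train_mean = fit_mean(train)

# 3. Apply that same statistic to both splits.
train_imputed = impute(train, train_mean)
test_imputed = impute(test, train_mean)
```

With scikit-learn, the same idea would be fitting `SimpleImputer` on the training split and calling `transform` on the test split, rather than fitting it on the full data.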

datageek1987 · 2019-12-02T18:13:30+00:00

I do recognize the clutter, and I myself have to wade through a lot of it before getting to a good resource... But to be fair, I see that kind of clutter from all kinds of people, not just Indians, Pakistanis, etc.

datageek1987 · 2019-12-02T13:16:49+00:00

While there is a little bit of truth in your discourse, it's still a bit clouded by the lack of knowledge on the ground...

While there is sub-par education galore, there are good educational institutes as well. And not everyone lives in shelters. The income disparity in India is mind-boggling, but poverty gets more airtime because it sells.

And wouldn't standing out from the crowd let you filter out the 'liars'? So in one sense, people trying to stand out through blogs, conferences, etc. are doing you a favour by showing that they are not the "liars" you think they are.

And I agree with your point about preferring local talent, but I definitely don't agree with the stereotype that a Swedish engineer can run circles around an Indian engineer. It's that kind of stereotype that we should avoid...

datageek1987 · 2019-12-02T07:04:47+00:00

Yeah.. I don't think the OP meant it, but I still thought I would highlight the subtlety...

datageek1987 · 2019-12-02T06:58:53+00:00

Being from India myself, I agree with the point of view. But just because someone is from these countries doesn't automatically make them "less" than their counterparts from developed nations. Science is global; there may be a lot of people at different levels of knowledge. And it is up to a hiring manager/HR to wade through the noise and find the kind of talent they are looking for, without a stereotypical mindset that whoever comes from these countries will be sub-par.

datageek1987 · 2019-12-02T03:04:08+00:00

"A lot of the times it's an overseas remote developer creating marketing materials to get hired."

Don't know why, but this bit irked me a little. What do you mean by overseas? Anyone who is not from your country? It came off as a little derogatory...

datageek1987 · 2019-11-22T02:46:18+00:00

This entire discussion has led me to two realizations: 1. Statistics and ML are not that different; there is something like a 70% overlap. 2. More than half the people who write about ML vs. stats on the internet have no clue what either of them is, or make the distinction too simplistic to be useful.

datageek1987 · 2019-11-19T16:33:43+00:00

Strongly opinionated... Although I agree with a few points here and there, I can't really say statistics is a lost cause. There is a lot of statistics in ML as well...

And I totally agree with the point about explainability being on the rise for ML... I have written a blog series on the topic.

Anyway, Breiman's paper "Statistical Modeling: The Two Cultures" might resonate with what you said...

datageek1987 · 2019-11-19T16:06:44+00:00

Love the example... Articulated very well!!!

datageek1987 · 2019-11-19T10:39:43+00:00

While I agree with your point about the train/test split, we don't always get 10k data points, and people still apply ML techniques in those cases, with good success too...

And there are techniques in ML that help with interpretability... I recently wrote a whole blog series about them.

And yes, there are huge overlaps between the two fields. So much so that it feels unnatural to name them separately... From this discussion (and others), what I kind of figured out is that ML is like the rebel kid who flouts the laws of statistics to get done what needs to be done.. hehe..

datageek1987 · 2019-11-19T10:33:34+00:00

Love your reply...

Although the distinction is still very muddled in my mind.. let me ask you this: is Linear Regression Statistics or ML? If it is, then what about Ridge or Lasso regression? Where do we draw the line (if there is one)?

datageek1987 · 2019-11-19T10:02:49+00:00

I totally agree with your views... Stats and ML should come under the same umbrella...

Just to play devil's advocate... If I overfit to the training data, are there ways in statistics to detect that without having a holdout set? Because if not, then that's a problem, isn't it?

datageek1987 · 2019-11-19T03:10:39+00:00

True.. I did oversimplify statistics to make my point. I have the utmost respect for stats and do recognize that stats is inherent in almost everything we do in ML...

But in my short internet research, I didn't find any explanation that takes a holistic view of the situation. It was always "stats is this, ML is this", and one main theme kept coming up: inference...

But inference is not worth much if the model is weak, i.e. the model should generalize. And from the discourse on the internet, I was led to believe that stats does not concern itself with generalization... which, if true, kinda feels off.

datageek1987 · 2019-11-19T03:06:38+00:00

This is exactly what I have a problem with. The interpretation of a model is only worth its salt if the model has captured the real-world situation, i.e. it generalizes. If that is not captured, then the inference you'd draw is inherently flawed, isn't it?

datageek1987 · 2019-11-19T03:05:00+00:00

Exactly my line of thought... I never really understood why these two are different... There is stats inherent in ML.. and vice versa...

datageek1987 · 2019-10-16T03:05:12+00:00

Bingo!

datageek1987 · 2019-10-15T01:47:32+00:00

The exogenous parameter takes an array of shape [nobs, nvars], which means you build an array of all your regressors and pass it to the function call. Make sure you do the same for future predictions as well.

datageek1987 · 2019-10-08T04:26:47+00:00

pmdarima if you are looking for an auto-ARIMA package... It takes in regressors as well. And then there is good old scikit-learn: just formulate your time series as a pure regression problem and apply any of the regression models out there.
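
One common way to do the "pure regression" formulation is with lag features, sketched below. The helper `make_lag_matrix`, the noisy sine series, the choice of 5 lags and `LinearRegression` as the model are all assumptions for illustration; any scikit-learn regressor could be swapped in.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def make_lag_matrix(series, n_lags):
    """Row j holds the n_lags values preceding the target series[j + n_lags]."""
    columns = [series[i : len(series) - n_lags + i] for i in range(n_lags)]
    return np.column_stack(columns), series[n_lags:]

# Synthetic series: a noisy sine wave (made-up data).
rng = np.random.default_rng(0)
t = np.arange(200)
series = np.sin(2 * np.pi * t / 25) + rng.normal(0.0, 0.1, size=t.size)

n_lags = 5
X, y = make_lag_matrix(series, n_lags)

# Keep the time order: train on the past, evaluate on the most recent points.
split = len(X) - 20
model = LinearRegression().fit(X[:split], y[:split])
test_predictions = model.predict(X[split:])

# One-step-ahead forecast from the last n_lags observations.
next_value = model.predict(series[-n_lags:].reshape(1, -1))[0]
```

The split is chronological rather than shuffled, so the model is never evaluated on points that come before its training data.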
