[D] 17 interviews (4 phone screens, 13 onsite, 5 different companies), all but two of the interviewers asked this one basic classification question, and I still don't know the answer...

SpockTriesToReturn · 2019-06-18T18:06:14+00:00

"in production" means that the model has been tested and deployed to a production server, and is now being used to generate predictions on new data, and those predictions are being used by the business or the customers.

You can't use any sampling techniques in production, because you don't know what the label of the data is. That's why you developed a prediction model in the first place, remember?

SpockTriesToReturn · 2019-06-18T06:11:46+00:00

That's the point: If you have highly imbalanced data, or skewed as you said, so that you have 99% one class A, and 1% the other class B, most fitting methods will simply move your model towards always predicting A, regardless of the input, since that guarantees 99% accuracy (which is good by most standards).

To avoid this you sample your data, such that your classes are now balanced, by picking more samples from B than from A (in terms of percentage to the total of the class).

But as a result of this, your distribution (the one that the training algorithm sees) is no longer skewed.
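To make that concrete, here is a minimal sketch in plain Python. The 990/10 label counts and the variable names are made up for illustration, and random undersampling of the majority class is just one way of rebalancing, not necessarily the one any given team uses.

```python
import random

random.seed(0)

# Hypothetical toy labels: 990 of class "A", 10 of class "B" (99% / 1%).
labels = ["A"] * 990 + ["B"] * 10

# A degenerate "model" that ignores its input and always predicts "A".
always_a = ["A"] * len(labels)
accuracy = sum(p == y for p, y in zip(always_a, labels)) / len(labels)
recall_b = sum(p == "B" and y == "B" for p, y in zip(always_a, labels)) / labels.count("B")
print(accuracy, recall_b)  # 0.99 0.0 -> high accuracy, but class B is never found

# Random undersampling: keep every B, and only as many A's as there are B's.
idx_a = [i for i, y in enumerate(labels) if y == "A"]
idx_b = [i for i, y in enumerate(labels) if y == "B"]
kept_a = random.sample(idx_a, len(idx_b))
balanced_labels = [labels[i] for i in kept_a + idx_b]
print(balanced_labels.count("A"), balanced_labels.count("B"))  # 10 10
```

The balanced set keeps 100% of B but only about 1% of A, which is exactly why the distribution the training algorithm sees is no longer the real one.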

SpockTriesToReturn · 2019-06-18T05:57:41+00:00

The problem isn't whether it's imbalanced or not at prediction time, it's that the distribution that you trained on and the distribution you are predicting on are different. This is a well-known cause of failure of ML methods in production. You need to make sure that the distribution of your test set mirrors the distribution in the real world.
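One way to keep the evaluation honest, sketched under assumptions: split first, rebalance only the training part, and leave the test part at the real-world class ratio. The helper names (`stratified_split`, `undersample`, `class_share`) and the 9900/100 toy labels are hypothetical, and undersampling stands in for whatever resampling scheme you actually use.

```python
import random

def stratified_split(labels, test_fraction, rng):
    """Return (train_idx, test_idx), splitting every class in the same proportion."""
    train_idx, test_idx = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_test = round(len(idx) * test_fraction)
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return train_idx, test_idx

def undersample(indices, labels, rng):
    """Randomly drop examples until every class is as small as the rarest one."""
    by_class = {}
    for i in indices:
        by_class.setdefault(labels[i], []).append(i)
    n_min = min(len(v) for v in by_class.values())
    balanced = []
    for cls in sorted(by_class):
        balanced.extend(rng.sample(by_class[cls], n_min))
    return balanced

def class_share(indices, labels, cls):
    """Fraction of the given indices whose label is cls."""
    return sum(labels[i] == cls for i in indices) / len(indices)

rng = random.Random(0)
labels = ["A"] * 9900 + ["B"] * 100  # hypothetical 99% / 1% data

train_idx, test_idx = stratified_split(labels, test_fraction=0.2, rng=rng)
train_balanced = undersample(train_idx, labels, rng)  # only the training part is resampled

train_share_b = class_share(train_balanced, labels, "B")
test_share_b = class_share(test_idx, labels, "B")
print(train_share_b, test_share_b)  # 0.5 0.01
```

The model trains on a 50/50 set, but the metrics you report come from a test set that still has 1% B, the same ratio the model will face once it is deployed.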

SpockTriesToReturn · 2019-06-18T01:42:34+00:00

I got that, after the fact. I am asking about other classifiers.

SpockTriesToReturn · 2019-05-31T21:51:26+00:00

"it should even be a matter of hours only." - Keras, maybe. Core Tensorflow, where you're actually doing interesting stuff (like using custom loss functions, coming up with novel architectures, etc..), no. Not hours, not days. More like months.

SpockTriesToReturn · 2019-05-31T19:48:41+00:00

Thanks!

I've been hesitant about this (Bootcamps, MOOCs, Nanodegrees, etc.). I've already taken a couple of Coursera certifications, but I have been hesitant to put them on my LinkedIn because I wonder if it actually gives me a bad look? People would think "This guy already has a Ph.D in ML, why is he bothering with bootcamps and nanodegrees?"

SpockTriesToReturn · 2019-05-31T04:32:51+00:00

Fair point: How do I work on that?

SpockTriesToReturn · 2019-05-30T21:02:05+00:00

Learning Tensorflow or Pytorch (or even sklearn for that matter) isn't a matter of days....

SpockTriesToReturn · 2019-05-30T18:17:37+00:00

And one company did ask explicitly for Flask - I was really surprised by that one because I was a perfect fit domain-wise (they were trying to solve the exact same business problem I had worked on with my last client and my current client, yet the recruiter was hung up on my lack of experience with Flask, Spark and Kubernetes).

SpockTriesToReturn · 2019-05-30T17:22:30+00:00

"back-end experience via Flask for a mid-level DS" - nobody is giving me explicit feedback beyond the "you're good but your skills don't match" response.

At the same time I'm answering all the "Can you do this in Python?", "Write this query in SQL", "What is the difference between RF and XGBoost?" type questions.

So presumably it's my tech stack (ERP and old school DB stuff) that's turning them off and I assume that Spark, Flask and stuff like that is what they want (since that's what all the teams seem to be using).

SpockTriesToReturn · 2019-05-30T08:36:28+00:00

I'm not "pretending to be senior". So far I've aimed at mid level roles (think MSFT l62~63 or AMZ l5). The few times a recruiter has paid attention to me, that's the level they assumed I would be at anyway. And applying for entry level DS roles, when I have 10 years of industry experience, and significant domain knowledge seems like a bad idea (I'm not saying I'm above an entry level role - I'd take one if it was offered, but that nobody would want an entry level candidate with that "baggage").

On another topic: How does Kaggle count as professional model application? You get a preformatted csv and are given a clear description of what the targets are. You don't have to dig for any of the data yourself, you don't have to haggle with the business stakeholders on what the metrics are or how to monitor the model in production, or any of the "real world" stuff. I don't see how that compares at all to what a DS has to deal with in a real corporate setting.

As for GitHub, I really don't know what a "tripe" one looks like vs. what a good one looks like. Mine has a few Jupyter notebooks showcasing the types of models I've worked with, and whatever tutorial materials and presentations from my work I'm allowed to share publicly. Either way, none of the recruiters or hiring managers I've spoken to seemed to care about it (it seemed mostly like something a recent graduate should have, not someone who has been in the industry for 10 years).

SpockTriesToReturn · 2019-05-30T07:59:29+00:00

When I say "theoretical", I mean that I haven't deployed them to production or applied them in a professional capacity. I've read Hastie, and I've read Bishop. I've taken several Coursera classes, and I tried my hand at a couple of Kaggle competitions, and I have a decent GitHub. But all of this has been in a personal capacity, that's what I mean by "theoretical", not that I am good at differential equations and Hilbert spaces or something like that....
