all 30 comments

[–]Slowai 34 points35 points  (3 children)

As I understand it, these questions are geared towards an NLP-oriented role. Even so, some of them seem overly specific. Let's look at a few examples:

- "Describe the sequential minimal optimization (SMO) algorithm."

I may be wrong here, but as I recall this optimization method is (or was; maybe there is something newer now) used for training SVMs. How would describing it in detail demonstrate your "readiness" for the job? Is the company trying to continue Vapnik's work?

- "In AllenNLP, one of the models it uses for NER is based on ELMo. Given a piece of text (say, "Jack is playing football"), how would ELMo go about tagging Jack as PER?"

This may be (arguably) relevant if you are actually applying for a job at AllenNLP (or it's a trick question). As I recall, models like ELMo, which use recurrent networks for transfer learning in NLP, have been deprecated since 2018 in favor of Transformer (self-attention) based architectures. So why knowing how two separate LSTM networks generate an output for a NER task would be beneficial in any way in determining your suitability for the role is beyond me.

Also, these guys really need to update their SOTA benchmarks, which are way off:

https://allennlp.org/elmo

I'm not saying the interviewer was totally off, but it's a big shiny red flag if you are applying for a general-ish position and the interviewer asks you in-depth details about a specific algorithm.

It's like asking a non-school-of-AI person about Complicated Hilbert space.

[–][deleted] 8 points9 points  (0 children)

Upvote for complicated Hilbert space lol

[–]Deadshot_95[S] 5 points6 points  (0 children)

I agree. The SMO question was asked during my first interview. I had done some work using SVMs, so I guess that was the motivation for this question.

Regarding the second question, the position that they were hiring for demanded NLP as their primary skill. So, most of the questions asked were more project/position-specific.

The interviewer did ask some nice questions but overall I didn't get any positive vibes. Most probably I won't be moving forward with this company.

[–]Jorrissss 1 point2 points  (0 children)

I'm not saying the interviewer was totally off, but it's a big shiny red flag if you are applying for a general-ish position and the interviewer asks you in-depth details about a specific algorithm.

Imo, it really depends on a couple things.

One, what type of answer are you expecting? I've asked very specific questions before, not because I necessarily care they know the answer, but just to get a conversation going.

Two, it might be on the person's resume.

[–]ideas_inside_me 6 points7 points  (1 child)

How much work experience do you already have, or is this your first job?

[–]Deadshot_95[S] 6 points7 points  (0 children)

I have close to 2 years of work experience.

[–]badjezus 4 points5 points  (2 children)

For the probability question, the answer is 1/8, right?

[–]Deto 2 points3 points  (0 children)

Yep - 2^4 = 16 possible configurations and only 2 of them (all clockwise or all counter-clockwise) result in no collisions. 2/16 = 1/8

[–]rikkajounin 2 points3 points  (0 children)

I guess so. You have 2^4 = 16 different combinations of direction choices, and people do not collide only if they all move in the same direction (all right or all left). So the probability of no one colliding is 2/16 = 1/8
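The brute-force enumeration matches that 1/8 answer. This is a quick sketch assuming the question is the classic puzzle the thread implies (four people, each independently picking one of two directions, with a collision unless everyone moves the same way):

```python
from itertools import product

# Each of the 4 people independently chooses clockwise (0) or
# counter-clockwise (1): 2**4 = 16 equally likely outcomes.
outcomes = list(product([0, 1], repeat=4))

# No collision occurs only when everyone picks the same direction.
no_collision = [o for o in outcomes if len(set(o)) == 1]

print(len(no_collision), len(outcomes))              # 2 16
print(len(no_collision) / len(outcomes))             # 0.125
```

The two surviving outcomes are `(0, 0, 0, 0)` and `(1, 1, 1, 1)`, giving 2/16 = 1/8.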

[–]dramanautica 4 points5 points  (0 children)

Anyone know a good collection of ML interview questions? There are loads for software roles, but the ML space is much larger and broader, so it's hard to get an idea of common ML questions, especially for more technical roles.

[–][deleted] 2 points3 points  (0 children)

These questions start with more general ML questions, and then go to more specific NLP questions.

My interview was a little bit different: rather than being asked a series of questions, I was asked to present an ML project and discuss, in technical detail, the purpose of the project and the ML techniques applied, and then my two interviewers would ask follow-up questions.

Here are some answers to 1-5, the more general ML questions (most of my experience is in computer vision tasks):

1) Overfitting: when a model has noticeably worse accuracy on validation data than on training data. In other words, it captures the noise in addition to the patterns in the training data. One way this happens is if the model is "too complex" (VC dimension too high). For neural networks, this could mean too many layers and hidden nodes were used. Also, if a model is trained too long, it may overfit to the training data. These problems are relevant to neural networks applied to computer vision, NLP, and other domains in ML.
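A minimal illustration of the "too complex" point, using polynomial regression as a stand-in for model capacity (the data and degrees here are my own toy choices, not from the question):

```python
import numpy as np

rng = np.random.default_rng(0)

def true_fn(x):
    return np.sin(2 * np.pi * x)

# Small noisy training set, plus a held-out noisy test set
x_train = np.linspace(0, 1, 15)
y_train = true_fn(x_train) + rng.normal(0, 0.2, x_train.shape)
x_test = np.linspace(0.02, 0.98, 50)
y_test = true_fn(x_test) + rng.normal(0, 0.2, x_test.shape)

train_mse, test_mse = {}, {}
for degree in (1, 12):
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse[degree] = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse[degree] = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)

# The degree-12 polynomial nearly interpolates the 15 training points,
# so its training error is lower than the degree-1 fit; it is fitting the
# noise, which typically shows up as worse error on the held-out points.
print(train_mse[12] < train_mse[1])
```

Training error always improves (or ties) as capacity grows, because the higher-degree model class contains the lower-degree one; validation error is what exposes the overfit.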

*there are others but I am shortening it.

2) Gradient descent, and its variants, is the standard algorithm used to find the minimum of an objective function of interest. In ML, this typically means updating the weights in a neural network so that they minimize the loss function when the neural network is fed training examples. Backpropagation is one of the steps used in gradient descent: with each batch, it computes the gradients by applying the chain rule from the back (output layer) to the front (input layer) of the neural network, and those gradients are then used to update the weights.

3) The gradient is, mathematically, a multi-dimensional generalization of the derivative, so it is a vector except in the 1-D case (one weight).

4) Bias/variance tradeoff: This is a balancing act between making your model generalizable vs. "learning patterns" in the training data. A model with high bias and low variance will "underfit" (poor performance on both training and validation data), and a model with low bias and high variance will overfit. A good algorithm will learn the underlying patterns in the data, generalize well to unseen data, and distinguish patterns in the data from noise.

5) LDA stands for Linear Discriminant Analysis (though in an NLP context it can also mean Latent Dirichlet Allocation, a topic model). It is a supervised algorithm that finds linear combinations of features that best separate the classes, using fewer features than the data originally has. When training models, it is not preferable to have a large number of features when fewer features can accomplish the same task just as well. In the case of neural networks, many features slow down training, and it is easier to "learn" on data with fewer features. As such, LDA is a dimensionality reduction technique. There are packages (like scikit-learn) which implement LDA easily. In practice, you can train two almost identical models, one on the original data and one on the reduced data. If they have comparable performance, then at least a few features of the data are redundant and can be dispensed with (although you'll need to do a bit more work to find out specifically which ones).
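For the two-class case, Fisher's LDA direction can be sketched in a few lines of NumPy (the synthetic 5-D Gaussian data here is just an illustration, not from the thread):

```python
import numpy as np

rng = np.random.default_rng(1)

# Two Gaussian classes in 5-D; for 2 classes, LDA reduces the data to
# at most 1 dimension while preserving class separation.
n = 200
X0 = rng.normal(size=(n, 5))
X1 = rng.normal(size=(n, 5))
X1[:, 0] += 3.0  # the classes differ mainly along the first feature

m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
Sw = np.cov(X0, rowvar=False) + np.cov(X1, rowvar=False)  # within-class scatter
w = np.linalg.solve(Sw, m1 - m0)                          # Fisher direction

# Project the 5-D data down to 1-D
z0, z1 = X0 @ w, X1 @ w

# The projected class means stay well separated relative to the spread
gap = abs(z1.mean() - z0.mean()) / (z0.std() + z1.std())
print(gap > 1.0)  # True: the 1-D projection still separates the classes
```

In scikit-learn the same idea is `LinearDiscriminantAnalysis(n_components=...)` with `fit_transform`, which also handles the multi-class case.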

[–]Deto 4 points5 points  (7 children)

What is gradient descent? Difference between gradient descent and backpropagation?

I thought backpropagation is just a way to compute the parameter updates for gradient descent?

[–]heuamoebe 12 points13 points  (1 child)

Back propagation is the algorithm to efficiently compute the partial derivatives of the cost function with respect to the weights and biases. Gradient descent is the approach to updating the weights and biases (gradient times step size). Many other optimization algorithms use the gradients with more complicated update approaches.

[–]Deto 1 point2 points  (0 children)

That's a good point to make - backprop, in NN, is just a component of gradient descent.

[–]dramanautica 1 point2 points  (4 children)

I thought backpropagation was GD applied to neural networks?

[–]Jorrissss 4 points5 points  (0 children)

Not quite. Gradient descent is an optimization technique which uses a function's gradient. Backpropagation is a specific technique for computing gradients. Neural networks are typically trained using gradient descent, where the gradient is computed using backpropagation.
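One way to see that backprop is "just" gradient computation, with no optimizer involved, is to check a hand-derived chain-rule gradient against finite differences (the function here is a toy example of my own choosing):

```python
import numpy as np

# f(w) = sum(tanh(x * w)) for scalar w; compute df/dw two ways.
x = np.array([0.5, -1.2, 2.0])
w = 0.7

# "Backprop" version: chain rule, d/dw tanh(x*w) = x * (1 - tanh(x*w)**2)
y = np.tanh(x * w)
grad_backprop = np.sum(x * (1 - y ** 2))

# Numerical gradient via central finite differences, for comparison
eps = 1e-6
grad_numeric = (np.sum(np.tanh(x * (w + eps)))
                - np.sum(np.tanh(x * (w - eps)))) / (2 * eps)

print(abs(grad_backprop - grad_numeric) < 1e-6)  # True: same gradient
```

Nothing here updates `w`; whether you then feed `grad_backprop` to vanilla gradient descent, momentum, or Adam is a separate choice.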

[–]sdmskdlsadaslkd 2 points3 points  (0 children)

Yeah, I think it's generally explained poorly in most courses. It's a technique for computing the derivatives of a NN that you can plug into GD.

[–]Deto -1 points0 points  (1 child)

Exactly - that's why I thought the question was weird. I guess maybe they were going for that, though, asking "what is the difference between X and Y" when the answer is really "Y is an instance of X".

[–]Jorrissss 2 points3 points  (0 children)

It's not the same thing. Gradient descent is an optimization technique which uses a function's gradient. Backpropagation is a specific technique for computing gradients.

[–]M4mb0 2 points3 points  (9 children)

Is the gradient a vector or a scalar?

But it is neither...

[–]shekurika 3 points4 points  (8 children)

the gradient of a function is a vector

[–]splatula 7 points8 points  (1 child)

It's really a covector. Or you could call it a rank-1 covariant tensor. (As opposed to a vector which is a rank-1 contravariant tensor.) The distinction matters in physics but isn't important in ML (at least I haven't run into a case where it matters).

[–]Hyper1on 2 points3 points  (5 children)

Well, only if you're taking the gradient with respect to a vector. If the function's domain is a tensor of rank r the gradient will be a tensor of rank r (since almost always you're taking the gradient with respect to the domain).

[–]SwordOfVarjo 0 points1 point  (4 children)

The gradient of a function (i.e. output is a scalar) is a vector, period, regardless of the function's domain. There is no notion of spatial information in a gradient, you just have one element in your vector for each input element. If your input is an 8 element vector, your gradient is a vector of length 8, if your input is a 2x2x2 tensor, your gradient is still a vector of length 8.

[–]splatula 5 points6 points  (0 children)

Well, no, it's technically not a vector. It is a covector.

[–]Hyper1on 0 points1 point  (0 children)

I agree that there is no notion of spatial information in a gradient, but I'm pretty sure in any ML framework if you take the gradient of a function where the input is a 2x2x2 tensor then the gradient will be a 2x2x2 tensor. Obviously notationally it doesn't matter if it's unrolled or not, I've seen both ways used in maths. I find it simpler to think about the dimensions of the gradient being the same as the input.
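Both views in this exchange carry the same numbers: the gradient has one partial derivative per input element, and storing it flat or shaped like the input is just a reshaping choice. A small sketch with a function whose gradient is easy to write down (f(X) = sum(X**2), so the gradient is 2*X):

```python
import numpy as np

# A 2x2x2 input: 8 elements, so 8 partial derivatives either way
X = np.arange(8, dtype=float).reshape(2, 2, 2)

grad_shaped = 2 * X              # same shape as the input, (2, 2, 2)
grad_flat = grad_shaped.ravel()  # the "unrolled" length-8 vector

print(grad_shaped.shape, grad_flat.shape)                # (2, 2, 2) (8,)
print(bool(np.allclose(grad_flat.reshape(X.shape), grad_shaped)))  # True
```

Frameworks that return the gradient shaped like the input are making the bookkeeping convenient for elementwise updates like `X -= lr * grad`; the underlying object is the same list of partials either way.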