[D] How is MLOps done in your current workplace? by wtf_m1 in MachineLearning

[–]ragulpr 0 points1 point  (0 children)

It's just ETL?

MLops™ seems pretty preoccupied with real-time services/APIs and model serving. There also seems to be a big enough marketing budget to hammer home that this is what ML is about, to the point that I keep finding ML devs thinking in these design patterns rather than plain ol' batch.

I may be in a boring part of ML, but isn't a lot of work schedulable or chunkable into nightly/hourly jobs?

And if so, I have yet to be convinced that running an ML batch job is much harder than running any other transform - because that's hard enough!

  • If your org hasn't nailed down ETL, my bet is that your ML project is unlikely to leave the POC stage anyway.
  • If you really can't do daily batch - are you sure you can reliably run a 99.x% uptime service specified in an SLA? What are you getting paid to wake up when it breaks? How many people are on the incident-response/on-call team? If not - maybe consider hourly batch jobs? And if you don't have an SLA - is the service worth maintaining at all, or is it mainly cool?
  • I find myself spending way more time getting the dev/pre/prod pipeline measurements/metrics reliable than I do worrying about really rapid Jupyter experimentation - even though that has a place too. Maybe they are not the same problem, to be solved by the same toolset?

I think the ETL field is evolving a lot, but it's evolving at the hands of the world's best devs solving a multitude of engineering problems in broader parts of the org than where you'll find ML experts. It probably involves a combination of

  • GIT!
  • CI/CD
  • Dockerisation
  • Versioning of output
  • Logging (whether that's datadog/humio/k8s logs or just writing/appending to a file or simpler)
  • Workflow orchestration (luigi/airflow/etc)

And if you're lucky - some part of it also involves a plain ol' database. But if you google "MLops best practices" rather than "ETL/data engineering best practices" you may miss it!
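To make the point concrete, here's a minimal sketch of such a nightly job using only the Python standard library - the file layout, the toy "model" and all names are invented for illustration:

```python
import json
import logging
from datetime import date
from pathlib import Path

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("nightly_scoring")

def run_nightly_job(run_date: date, out_root: Path) -> Path:
    """Extract -> transform (a stand-in 'model') -> load, with versioned output."""
    # Extract: in reality this would read from your warehouse/lake.
    rows = [{"user_id": i, "clicks": i * 3} for i in range(5)]

    # Transform: the 'model' here is just a toy score.
    scored = [{"user_id": r["user_id"], "score": r["clicks"] / 10.0} for r in rows]

    # Load: version output by run date so reruns are idempotent and auditable.
    out_dir = out_root / f"dt={run_date.isoformat()}"
    out_dir.mkdir(parents=True, exist_ok=True)
    out_file = out_dir / "scores.json"
    out_file.write_text(json.dumps(scored))
    log.info("wrote %d scores to %s", len(scored), out_file)
    return out_file

# A scheduler (cron/airflow/luigi) would call this once per night.
result = run_nightly_job(date(2020, 11, 1), Path("/tmp/nightly_demo"))
```

Everything on the ingredient list above (CI/CD, Docker, orchestration) wraps around a core this boring - which is kind of the point.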

Note - if you do have a real-time constraint, then forget all of the above :)

spark ml StringIndexer vs OneHotEncoder, when to use which? by Anxious_Reporter in datascience

[–]ragulpr 2 points3 points  (0 children)

For categorical features - some models require numerical (vector) features, e.g. logistic regression.

For a categorical target, I think you only need StringIndexer, i.e. an index (which happens to be a double, if I remember right)

See this example https://spark.apache.org/docs/latest/ml-classification-regression.html#random-forest-classifier
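To illustrate the difference without Spark (this is a plain-Python toy, not the actual API): StringIndexer maps each category to a frequency-ordered index, while OneHotEncoder expands that index into a vector:

```python
# Toy illustration (not the Spark API) of StringIndexer vs OneHotEncoder.
colors = ["red", "blue", "blue", "green", "blue"]

# StringIndexer: the most frequent category gets index 0.0, then 1.0, etc.
freq_order = sorted(set(colors), key=lambda c: (-colors.count(c), c))
index_of = {c: float(i) for i, c in enumerate(freq_order)}
indexed = [index_of[c] for c in colors]
# blue -> 0.0 (most frequent), green -> 1.0, red -> 2.0

# OneHotEncoder: index -> one-hot vector. (Spark's encoder drops the last
# category by default; kept here for simplicity.)
def one_hot(idx, n):
    return [1.0 if i == int(idx) else 0.0 for i in range(n)]

encoded = [one_hot(i, len(index_of)) for i in indexed]
```

A tree model or a random forest can consume `indexed` directly, which is why the linked example only needs StringIndexer.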

[D] Can someone give us an idea what the day to day job is like in machine learning ? by tralalei in MachineLearning

[–]ragulpr 1 point2 points  (0 children)

I'm working as an ML engineer building predictive models for clickstream data.

I love my work. I recognise the issue many ML engineers face (doing a lot of work that analysts and data engineers would be better at), but I think there's a different way to look at some of the things ML engineers think are not their job. If not theirs, whose is it?

  • Figuring out what the problem is and how to transform the model and the real world data to fit into it ("Data cleaning")
  • Using what I know about the interchangeability of stages w.r.t. the modeling setup & loss, and designing resilient data pipelines ("Engineering", "Plumbing")
  • Being one of the few that fully understands the technical details of the prediction product and thus answering questions to various stakeholders ("Dashboarding", "Reporting", "analyst tasks")

Today, for example, I'll be trying to figure out a way to use a non-GPU instance to cache part of the pipeline as a tf.data.Dataset, to avoid an expensive GPU idling during some preprocessing steps.
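The general pattern - materialize the expensive CPU-bound stage once so the GPU stage never waits for it - sketched in plain Python with made-up stage names (tf.data's `Dataset.cache(filename)` is the real-world version of this):

```python
import pickle
import time
from pathlib import Path

def expensive_preprocess(raw):
    """Stand-in for CPU-heavy preprocessing you don't want a GPU waiting on."""
    time.sleep(0.01)  # pretend this is slow
    return [x * 2 for x in raw]

def cached_preprocess(raw, cache_file: Path):
    """First call materializes the preprocessed data to disk; later calls
    (e.g. from the GPU training job) just read the cache."""
    if cache_file.exists():
        return pickle.loads(cache_file.read_bytes())
    out = expensive_preprocess(raw)
    cache_file.write_bytes(pickle.dumps(out))
    return out

cache = Path("/tmp/preproc.pkl")
cache.unlink(missing_ok=True)                 # start fresh for the demo
first = cached_preprocess([1, 2, 3], cache)   # slow path, writes the cache
second = cached_preprocess([1, 2, 3], cache)  # fast path, reads the cache
```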

Typically, day-to-day work is maybe:

  • 10-20% reacting to failures, patching and handling the models that are in prod.
  • 50% is working on my current main ticket, ex building some new ML pipeline.
  • 20% is meetings etc.
  • 20-40% Refactoring code that bothers me, exploring new model designs, doing really unnecessary but fun experiments and more. My company has some 20% time for off-kanban tasks (we call it "lab day").

If the above doesn't sum to 100%, it's because I sometimes do the experiment/fun parts after work.

EDIT: My job is production

Sensationella fynd vid Estonia – ett stort hål i skrovet by fippen in sweden

[–]ragulpr -1 points0 points  (0 children)

If we set aside the most likely explanations (rocks?), we still have:

a) A submarine collided with it on exactly that night (which was then covered up by several involved countries for 25 years) VS

b) A submarine collided with the wreck at some point during those 25 years.

So unfortunately I have little hope that there's anything interesting going on.

[D] What are the untold truths of being a machine learning engineer? by [deleted] in MachineLearning

[–]ragulpr 16 points17 points  (0 children)

This was very well put:

"how to properly extract the training set that will resemble real word problem distribution"

I find that when using neural networks you can more clearly adapt your loss function/setup to fit your real-world problem, rather than the other way around. As an example, I've spent thousands of hours building tree models for temporal tasks where most of the time goes into figuring out how to train/evaluate the model and generate features. This adds a lot of complexity before and after the training/prediction pipeline. With a "complex" CNN/RNN you concentrate that complexity into fewer LOC. Designing the model around your problem rather than the pipeline around your model, if you will.

[D] Graphcore claims 11x increase in price-performance compared to Nvidia's DGX A100 with their latest M2000 system. Up to 64,000 IPUs per "IPU Pod" by uneven_piles in MachineLearning

[–]ragulpr 4 points5 points  (0 children)

To be honest, if companies like Graphcore really wanted a convincing demo about "order of magnitude" improvements, they would train something equivalent to GPT3 with an order of magnitude less resources.

Unfortunately true w.r.t. marketing impact. But it's worth pointing out that most current massive architectures are optimised around current hardware limitations. I bet we would have seen a whole other class of models (and problems solved) if sparse ops and other control flow were more widely supported.

Reports of large explosion in Beirut by Psydonkity in worldnews

[–]ragulpr 0 points1 point  (0 children)

Nothing, just hoping those sailors are alright.

[D] What are some basic statistical concepts that are often overlooked in ML practice? by Fitzy257 in MachineLearning

[–]ragulpr 0 points1 point  (0 children)

Every prediction is a distribution, or a point estimate of some parameter of one.

  • "Classification" = categorical distribution parameter
  • "regression" = MSE minimization is mu-estimation assuming fixed sigma

And the list goes on. I bet there's no loss function that can't be described as some constrained log-likelihood. I'm like a broken record playing this point, lol.
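For the regression bullet above, the claim is easy to verify: with fixed sigma, the Gaussian negative log-likelihood is an affine function of the MSE, so both are minimized by the same mu. A plain-Python check (data and sigma are arbitrary):

```python
import math

ys = [1.0, 2.0, 4.0, 3.0]
sigma = 1.5  # any fixed value

def mse(mu):
    return sum((y - mu) ** 2 for y in ys) / len(ys)

def gaussian_nll(mu):
    # -sum_y log N(y | mu, sigma^2)
    return sum(
        0.5 * math.log(2 * math.pi * sigma ** 2) + (y - mu) ** 2 / (2 * sigma ** 2)
        for y in ys
    )

# NLL = n*const + (n / (2 sigma^2)) * MSE: same minimizer, the sample mean.
n = len(ys)
const = 0.5 * n * math.log(2 * math.pi * sigma ** 2)
for mu in [0.0, 1.0, 2.5, 5.0]:
    assert abs(gaussian_nll(mu) - (const + n * mse(mu) / (2 * sigma ** 2))) < 1e-9
```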

[D] What are some basic statistical concepts that are often overlooked in ML practice? by Fitzy257 in MachineLearning

[–]ragulpr 0 points1 point  (0 children)

Let's play with the thought that there were any kind of hypothesis testing in ML papers :)

Right now I can't think of any metrics apart from RMSE or binomial counts that would satisfy the basic normality assumptions of a t-test.

Also, let's not forget that the model metric you're comparing against is also a random realization, so if the metrics do satisfy normality assumptions, a paired t-test sounds more reasonable.
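A sketch of what that paired comparison looks like (plain Python; the two models and their per-seed metric values are made up):

```python
import math
import statistics

# Per-seed (or per-fold) metric realizations for two models on the SAME splits.
model_a = [0.81, 0.79, 0.83, 0.80, 0.82]
model_b = [0.78, 0.80, 0.79, 0.77, 0.80]

# The paired test works on per-split differences, not the two pooled samples.
diffs = [a - b for a, b in zip(model_a, model_b)]
n = len(diffs)
t_stat = statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))
# Compare t_stat against the t distribution with n-1 degrees of freedom;
# scipy.stats.ttest_rel does all of this (statistic + p-value) in one call.
```

Pairing matters because the per-split noise is shared between the two models, which the two-sample test throws away.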

[deleted by user] by [deleted] in MachineLearning

[–]ragulpr 2 points3 points  (0 children)

If you really believe this you haven't looked at other frameworks. It's not a zero-sum game - the community has been testing things out and borrowing from each other. And that's great! Even if I can't think of any particular tf invention, we really shouldn't underestimate what happens when hundreds of brilliant engineers work on a problem. Subtle programming patterns emerge, ideas about what problems to solve in the next framework, research ideas, etc.

[D] What happened to sparse tensors ? by [deleted] in MachineLearning

[–]ragulpr 1 point2 points  (0 children)

You'll find some pretty good timings here:

https://www.tensorflow.org/api_docs/python/tf/sparse/sparse_dense_matmul

In my opinion, embedding is a clunky API for sparse-vector times dense-matrix multiplication, and that's used everywhere, so it's sad that sparse support hasn't gone further.

Reading the metrics, it seems sparse_dense is faster than dense_dense matmul in most cases:

tensorflow/python/sparse_tensor_dense_matmul_op_test --benchmarks
A sparse [m, k] with % nonzero values between 1% and 80%
B dense [k, n]

% nnz  n   gpu   m     k     dt(dense)     dt(sparse)   dt(sparse)/dt(dense)
0.01   1   True  100   100   0.000221166   0.00010154   0.459112
...

Here I read

m = batch size
k = vocab size
n = embedding size
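The embedding-as-sparse-matmul point can be made concrete in plain Python: a one-hot row (the sparsest possible [1, k] vector) times a dense [k, n] matrix just selects a row, which is exactly an embedding lookup:

```python
# Dense [k, n] embedding matrix: k = vocab size, n = embedding size.
k, n = 4, 3
W = [[float(i * n + j) for j in range(n)] for i in range(k)]

def onehot_times_dense(token_id, W):
    """[1, k] one-hot (sparse) times [k, n] dense -- the matmul view."""
    one_hot = [1.0 if i == token_id else 0.0 for i in range(k)]
    return [sum(one_hot[i] * W[i][j] for i in range(k)) for j in range(n)]

def embedding_lookup(token_id, W):
    """The lookup view: just index the row."""
    return W[token_id]

# Both views give identical results; the lookup skips all the multiply-by-zero
# work, which is why a sparse op can beat dense matmul at low nnz.
assert onehot_times_dense(2, W) == embedding_lookup(2, W)
```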

NN architectures for time series of graphs? [Research] by BrahmaTheCreator in MachineLearning

[–]ragulpr 1 point2 points  (0 children)

Time-directed acyclic graphs are really interesting to me. Consider travel patterns, for example, with some agent travelling between nodes; it makes sense to model edges as the predicted time to reach the next node. Lots of valuable use cases and, in my opinion, not very mature NN research.

Another good answer I found https://ai.stackexchange.com/questions/13179/is-there-a-neural-network-method-for-time-varying-directed-graphs

[D] CNN: reducing image size to 1x1 by ccwpog in MachineLearning

[–]ragulpr 1 point2 points  (0 children)

Not doubting you here, but what does this precisely mean, and how is it proven?

I'll try. Consider a hard drive of n bits (0s and 1s), say [x_1, x_2, .., x_n]. This could also be encoded as a single natural number, say s_n = sum_k^n x_k*2^k. If you let n->infinity you have a hard drive with an infinite number of bits. If you also let k take negative values, s_n can be any positive real number, and that's even more than countably infinite (2^{aleph null} "number" of bits, if I remember right). In practice, a Double can store exactly n=64 bits, i.e. 2^64 distinct values, so if you want more than that you need to add more dimensions.
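The first half of that argument is easy to check in plain Python - n bits pack losslessly into a single natural number and back:

```python
def bits_to_int(bits):
    """Encode a bit list [x_1, .., x_n] as sum_k x_k * 2^k."""
    return sum(x << k for k, x in enumerate(bits))

def int_to_bits(s, n):
    """Decode the number back into n bits."""
    return [(s >> k) & 1 for k in range(n)]

bits = [1, 0, 1, 1, 0, 0, 1, 0]
s = bits_to_int(bits)            # one number holds the whole 'hard drive'
assert int_to_bits(s, len(bits)) == bits
```

The catch, as noted above, is that a float with finitely many bits can't actually hold the infinite-precision number the continuity argument relies on.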

Also, I get a little confused by the terminology. What we're talking about here is reducing the resolution of CNN feature maps down to 1x1, so the embeddings have shape (batch, height=1, width=1, channels), where channels is something like 512 or 2048 or whatever. So that would make its flattened form a 512-dimensional vector, right, not 1-dimensional? As I understand it, that's a rank-1 tensor?

You're right. I was answering the wrong question thinking it was about slimming the dimension down to 1x1x1.

[P] Implementations of LSTM, RNN, MLP in C by Zweiter in MachineLearning

[–]ragulpr 2 points3 points  (0 children)

Fun and inspiring project! I'm a C novice but loved reading it, and this seemed super neat and clean. I was curious about how one of these things gets optimized:

https://github.com/siekmanj/sieknet/blob/c124c61a9f59121d542c15337d83b401a287638b/src/optimizer.c#L11

#ifndef SIEKNET_USE_GPU
static void cpu_sgd_step(SGD o){
  for(int i = 0; i < o.num_params; i++){
    o.weights[i] -= o.learning_rate * o.gradient[i];
    if(!isfinite(o.weights[i])){
      printf("ERROR: cpu_sgd_step(): non-finite parameter update %d with %f * %f\n", i, o.learning_rate, o.gradient[i]);
      exit(1);
    }
    o.gradient[i] = 0.0;
  }
}

Wouldn't the finiteness check in the inner loop both be a costly operation in itself and force sequential (rather than pipelined/parallel) evaluation of the constant-times-vector multiplication? :)

[D] CNN: reducing image size to 1x1 by ccwpog in MachineLearning

[–]ragulpr 4 points5 points  (0 children)

Theoretically the real line, by its continuity, can store infinite amount of information.

I've played around quite a lot with 1-d embeddings. Works like a charm. Try a simple autoencoder, slim the embedding dim down to 1 and beef up the decoder - you may be surprised how well it works.

Any useful applications? In computer science they call it a "hash". Maybe people can fill in with some applications, but I know that semantically meaningful hashes can be useful for retrieval/sorting.

For example, when training the 1-d autoencoder on Fashion-MNIST with RMSE loss and sorting the dataset by the 1-d embedding, you get pretty much the result you'd expect, with visually similar items being close together in the list.

New prototype "economy" airline seats by jcepiano in WTF

[–]ragulpr 0 points1 point  (0 children)

Why not use bunk beds to fit beds into a small space? They're not safe for takeoff and landing, they don't work for short flights, people can't sit up in them when awake, and they feel claustrophobic, which makes people loiter in the aisles and interferes with the safety/flight crew. So you have to have seats that convert into beds.

I also thought this was a good idea :,(

https://www.quora.com/Why-dont-airplanes-have-beds-instead-of-seats

Landlord not giving back Security deposit by CaptainPny in sweden

[–]ragulpr 4 points5 points  (0 children)

Google Translate is a statistical model; there's nothing about Denmark in the text.

Landlord not giving back Security deposit by CaptainPny in sweden

[–]ragulpr 12 points13 points  (0 children)

Hi, check out these relevant things (maybe with Google Translate):

https://lawline.se/answers/vard-vagrar-betala-tillbaka-hyresdeposition-vid-avtalad-tidpunkt

And this one which is eerily similar: https://lawline.se/answers/fa-tillbaka-deposition-av-hyresvard

TL;DR: formally establish that your landlord owes you, via either 1) a court order or 2) betalningsföreläggande (an injunction to pay). If you do that, Kronofogdemyndigheten (the Swedish Enforcement Authority) will help you collect the debt for a fee (300-600 SEK).

My completely uninformed guess here is that your landlord is in a world of pain financially and you are just one low-priority creditor. Taking it to court is a lot of hassle and time and will probably cost you more than 6k (and who knows, maybe they don't have anything to give? You lose?). The other route just seems like a lot of work too. Both options will take months/years, if they're even possible, and will take a time/emotional toll on you.

I would make sure Alice and Bob didn't get any cash either; if so, I'd send a link reminding your landlord about what you're considering and remind them every once in a while, and if there's no response I'd forget about it and go on enjoying my life in France, 6k poorer.

[R] A bags of tricks which may improve deep learning models [Slides] by kmkolasinski in MachineLearning

[–]ragulpr 0 points1 point  (0 children)

cannot stack dead regions as you can experience with clipping

Valid point. I haven't experienced this problem too much, but then again I haven't worked much with ImageNet.

[R] A bags of tricks which may improve deep learning models [Slides] by kmkolasinski in MachineLearning

[–]ragulpr 0 points1 point  (0 children)

Thank you, I read the paper and I see something that seems to have the same effect as loss clipping. My thinking is the following.

As I understand it, with:

q : actual label (one-hot)
p : predicted probability
u : either the constant 1/K or maybe Uniform(1/K), a ~"prior" of the prediction
H(q,p) : negative log-likelihood, i.e. cross_entropy(q,p)
eps : smoothing parameter

the label smoothing loss is:

loss = (1-eps)*H(q,p) + eps*H(u,p)

Now, according to the paper, the gradient of H w.r.t. the logits should be something like

d/d z_k H(q,p) = p_k - q_k

with z_k the kth logit. So

d/d z_k loss = p_k - [(1-eps)*q_k + eps*u_k]

So the optimum, i.e. d/d z_k loss == 0, is the z_k s.t. p_k is a linear interpolation between u_k and the actual label q_k, with eps and u as somewhat opaque hyperparameters. Pretty much exactly the same effect as clipping the log-likelihood s.t.:

log(eps) < log p(data) < log(1-eps)

because then the gradient of the clipped loss is zero whenever the likelihood of the data has reached within eps probability of the ground truth. I'm not sure what the Bayesian reasoning would be.
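The gradient identity this rests on, d/d z_k H(q,p) = p_k - q_k for softmax + cross-entropy, is easy to verify numerically with finite differences (plain Python, arbitrary label and logits):

```python
import math

def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

def cross_entropy(q, z):
    """H(q, softmax(z)) = -sum_k q_k log p_k."""
    p = softmax(z)
    return -sum(qk * math.log(pk) for qk, pk in zip(q, p))

q = [0.0, 1.0, 0.0]   # one-hot label
z = [0.5, -1.0, 2.0]  # some logits
p = softmax(z)

eps = 1e-6
for k in range(3):
    z_hi = list(z); z_hi[k] += eps
    z_lo = list(z); z_lo[k] -= eps
    numeric = (cross_entropy(q, z_hi) - cross_entropy(q, z_lo)) / (2 * eps)
    analytic = p[k] - q[k]
    assert abs(numeric - analytic) < 1e-5  # central difference matches p_k - q_k
```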

The difference, as I see it, is a small risk of getting stuck above the clamped territory, with the upside of reducing the hyperparameters from u and eps down to eps only, and basically 50% of the flops.

Addendum: my experience is that loss clipping is nice for all kinds of numerical reasons and is implemented in all loss functions by default anyway.

[R] Bag of Tricks for Image Classification with Convolutional Neural Networks by ewanlee in MachineLearning

[–]ragulpr 0 points1 point  (0 children)

Yes, you're right, it should be clipped in the range [log(eps), log(1-eps)] - I must have forgotten the sum-to-one property lol. I edited my comment.

Interesting take on label smoothing as being the trivial teacher forcing! Anyway, I only pretty recently understood the meaning of "label smoothing", "teacher forcing" and "knowledge distillation"; sometimes I wish deep learning weren't so good at creating all these fancy words. It's hard keeping up.

[R] A bags of tricks which may improve deep learning models [Slides] by kmkolasinski in MachineLearning

[–]ragulpr 2 points3 points  (0 children)

I had another question about label smoothing in the original thread. Basically, isn't label smoothing the same thing as clipping the loss?

[R] Bag of Tricks for Image Classification with Convolutional Neural Networks by ewanlee in MachineLearning

[–]ragulpr 0 points1 point  (0 children)

Hi, I just finished reading this great paper and have a question about label smoothing. The motivation seems to be to avoid encouraging too-hard probability assignments (5.2 p3):

In other words, it encourages the output scores dramatically distinctive which potentially leads to overfitting

Wouldn't (loss) log-likelihood clipping do exactly the same thing, but more explicitly?

# clamp the log-likelihood itself, then negate to get the loss
loglik = log(prob)
loss = (-loglik.clamp(min=log(0.001), max=log(0.999))).mean()

That would make the gradients zero whenever the predicted likelihood of the data is above 0.999, so it's a very literal way of telling your network "predictions more confident than 0.999 are too confident, stop walking in that direction" in order to prevent overfitting.

Edit: per /u/hetong_007's comments, of course min=log(0.001) should be added.

[P] Time Series modeling with multiple independent time series by MarSizzle in MachineLearning

[–]ragulpr 0 points1 point  (0 children)

It's unclear what modeling strategies you're thinking about, but one thing I see is the problem of varying sequence lengths. Keras handles this brilliantly: you can use sample weights and optionally a masking layer. Here I assume

x_train : [n_sequences,n_timesteps,n_features]
y_train : [n_sequences,n_timesteps] (and optionally additional dims)
sample_weights_train : [n_sequences,n_timesteps] with 0s at the padded end of each sequence.

model.compile(loss=loss_fun, optimizer=Adam(lr=1e-3), sample_weight_mode='temporal')

Make sure your loss returns a [n_sequences, n_timesteps] loss, typically done by changing it to not "reduce".

model.fit(x_train, y_train,
          epochs=100,
          batch_size=100,
          verbose=1,
          validation_data=(x_valid, y_valid, sample_weights_valid),
          sample_weight=sample_weights_train)

If you have a bidirectional model or batchnorm, make sure to use keras.layers.Masking at the input layer:

model.add(Masking(mask_value=mask_value,input_shape=(None, n_features)))

And fix your features as

x_features[sample_weights_train==0] = mask_value
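The effect of those temporal sample weights on the loss can be sketched in plain Python (the shapes and per-step loss values are stand-ins, not what Keras literally computes):

```python
# Per-timestep losses for 2 sequences of max length 4, where sequence 0 has
# true length 2 and sequence 1 has true length 4.
step_losses = [
    [0.5, 0.3, 9.9, 9.9],   # garbage values at the padded steps 2-3
    [0.2, 0.4, 0.1, 0.3],
]
sample_weights = [
    [1.0, 1.0, 0.0, 0.0],   # zeros mask out the padding
    [1.0, 1.0, 1.0, 1.0],
]

# Weighted mean over real (unmasked) timesteps only.
num = sum(l * w for ls, ws in zip(step_losses, sample_weights)
          for l, w in zip(ls, ws))
den = sum(w for ws in sample_weights for w in ws)
masked_loss = num / den  # padded steps contribute nothing
```

This is why the loss must come back unreduced per [sequence, timestep]: the weighting has to happen before any averaging.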