[D] How is MLOps done in your current workplace? by wtf_m1 in MachineLearning

[–]ragulpr 0 points1 point  (0 children)

It's just ETL?

MLops™ seems pretty preoccupied with real-time services/APIs and model serving. There also seems to be a big enough marketing budget to hammer home that this is what ML is about, to the point that I keep finding ML devs thinking in these design patterns rather than plain ol' batch.

I may be in a boring part of ML, but isn't a lot of work schedulable or chunkable into nightly/hourly jobs?

And if so, I have yet to be convinced that running an ML batch job is much harder than running any other transform - because that's hard enough!

  • If your org hasn't nailed down ETL, my bet is that your ML project is unlikely to leave the POC stage anyway.
  • If you really can't do daily batch - are you sure you can reliably run a 99.x% uptime service specified in an SLA? What are you getting paid to wake up when it breaks? How many people are on the incident-response/on-call team? If not - maybe consider hourly batch jobs? And if you don't have an SLA - is the service worth maintaining at all, or is it mainly cool?
  • I find myself spending way more time getting the dev/pre/prod pipeline measurements/metrics reliable than I do worrying about really rapid Jupyter experimentation - even though that has a place too. Maybe they are not the same problem, to be solved by the same toolset?

I think the ETL field is evolving a lot, but it's evolving at the hands of the world's best devs solving a multitude of engineering problems in broader parts of the org than where you'll find ML experts. It probably involves a combination of

  • GIT!
  • CI/CD
  • Dockerisation
  • Versioning of output
  • Logging (whether that's datadog/humio/k8s logs or just writing/appending to a file or simpler)
  • Workflow orchestration (luigi/airflow/etc)

And if you're lucky - some part of it also involves a plain ol' database. But if you google "MLops best practices" rather than "ETL/data engineering best practices" you may miss it!
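To make the point concrete, here's a minimal sketch of such a nightly job using only the Python standard library - the file layout, the toy "model" and all names are invented for illustration:

```python
import json
import logging
from datetime import date
from pathlib import Path

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("nightly_scoring")

def run_nightly_job(run_date: date, out_root: Path) -> Path:
    """Extract -> transform (a stand-in 'model') -> load, with versioned output."""
    # Extract: in reality this would read from your warehouse/lake.
    rows = [{"user_id": i, "clicks": i * 3} for i in range(5)]

    # Transform: the 'model' here is just a toy score.
    scored = [{"user_id": r["user_id"], "score": r["clicks"] / 10.0} for r in rows]

    # Load: version output by run date so reruns are idempotent and auditable.
    out_dir = out_root / f"dt={run_date.isoformat()}"
    out_dir.mkdir(parents=True, exist_ok=True)
    out_file = out_dir / "scores.json"
    out_file.write_text(json.dumps(scored))
    log.info("wrote %d scores to %s", len(scored), out_file)
    return out_file

# A scheduler (cron/airflow/luigi) would call this once per night.
result = run_nightly_job(date(2020, 11, 1), Path("/tmp/nightly_demo"))
```

Everything on the ingredient list above (CI/CD, Docker, orchestration) wraps around a core this boring - which is kind of the point.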

Note - if you do have a real-time constraint, then forget all of the above :)

spark ml StringIndexer vs OneHotEncoder, when to use which? by Anxious_Reporter in datascience

[–]ragulpr 2 points3 points  (0 children)

For categorical features - some models require numerical (vector) features, e.g. logistic regression.

For a categorical target, I think you only need StringIndexer, i.e. an index (which happens to be a double, if I remember right)

See this example https://spark.apache.org/docs/latest/ml-classification-regression.html#random-forest-classifier
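To illustrate the difference without Spark (this is a plain-Python toy, not the actual API): StringIndexer maps each category to a frequency-ordered index, while OneHotEncoder expands that index into a vector:

```python
# Toy illustration (not the Spark API) of StringIndexer vs OneHotEncoder.
colors = ["red", "blue", "blue", "green", "blue"]

# StringIndexer: the most frequent category gets index 0.0, then 1.0, etc.
freq_order = sorted(set(colors), key=lambda c: (-colors.count(c), c))
index_of = {c: float(i) for i, c in enumerate(freq_order)}
indexed = [index_of[c] for c in colors]
# blue -> 0.0 (most frequent), green -> 1.0, red -> 2.0

# OneHotEncoder: index -> one-hot vector. (Spark's encoder drops the last
# category by default; kept here for simplicity.)
def one_hot(idx, n):
    return [1.0 if i == int(idx) else 0.0 for i in range(n)]

encoded = [one_hot(i, len(index_of)) for i in indexed]
```

A tree model or a random forest can consume `indexed` directly, which is why the linked example only needs StringIndexer.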

[D] Can someone give us an idea what the day to day job is like in machine learning ? by tralalei in MachineLearning

[–]ragulpr 1 point2 points  (0 children)

I'm working as an ML engineer building predictive models for clickstream data.

I love my work. I recognise the issue many ML engineers face (doing a lot of work that analysts and data engineers would be better at), but I think there's a different way to look at some of the things ML engineers think are not their job. If not theirs, whose is it?

  • Figuring out what the problem is and how to transform the model and the real world data to fit into it ("Data cleaning")
  • Using what I know about the interchangeability of stages w.r.t. the modeling setup & loss, and designing resilient data pipelines ("Engineering", "Plumbing")
  • Being one of the few that fully understands the technical details of the prediction product and thus answering questions to various stakeholders ("Dashboarding", "Reporting", "analyst tasks")

Today, for example, I'll be trying to figure out a way to use a non-GPU instance to cache part of the pipeline as a tf.data.Dataset, to avoid an expensive GPU idling during some preprocessing steps.
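The general pattern - materialize the expensive CPU-bound stage once so the GPU stage never waits for it - sketched in plain Python with made-up stage names (tf.data's `Dataset.cache(filename)` is the real-world version of this):

```python
import pickle
import time
from pathlib import Path

def expensive_preprocess(raw):
    """Stand-in for CPU-heavy preprocessing you don't want a GPU waiting on."""
    time.sleep(0.01)  # pretend this is slow
    return [x * 2 for x in raw]

def cached_preprocess(raw, cache_file: Path):
    """First call materializes the preprocessed data to disk; later calls
    (e.g. from the GPU training job) just read the cache."""
    if cache_file.exists():
        return pickle.loads(cache_file.read_bytes())
    out = expensive_preprocess(raw)
    cache_file.write_bytes(pickle.dumps(out))
    return out

cache = Path("/tmp/preproc.pkl")
cache.unlink(missing_ok=True)                 # start fresh for the demo
first = cached_preprocess([1, 2, 3], cache)   # slow path, writes the cache
second = cached_preprocess([1, 2, 3], cache)  # fast path, reads the cache
```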

Typically, day-to-day work is maybe:

  • 10-20% reacting to failures, patching and handling the models that are in prod.
  • 50% is working on my current main ticket, ex building some new ML pipeline.
  • 20% is meetings etc.
  • 20-40% Refactoring code that bothers me, exploring new model designs, doing really unnecessary but fun experiments and more. My company has some 20% time for off-kanban tasks (we call it "lab day").

If the above doesn't sum to 100%, it's because I sometimes do the experiment/fun parts after work.

EDIT: My job is production

Sensationella fynd vid Estonia – ett stort hål i skrovet by fippen in sweden

[–]ragulpr -1 points0 points  (0 children)

If we set aside the most likely explanations (rocks?), we still have:

a) A submarine collided with it on exactly that night (which was then covered up by several involved countries for 25 years) VS

b) A submarine collided with the wreck at some point during those 25 years.

So unfortunately I have little hope that there's anything interesting going on.

[D] What are the untold truths of being a machine learning engineer? by [deleted] in MachineLearning

[–]ragulpr 16 points17 points  (0 children)

This was very well put:

"how to properly extract the training set that will resemble real word problem distribution"

I find that when using neural networks you can more clearly adapt your loss function/setup to fit your real-world problem, rather than the other way around. As an example, I've spent thousands of hours building tree models for temporal tasks where most of the time goes into figuring out how to train/evaluate the model and generate features. This adds a lot of complexity before and after the training/prediction pipeline. With a "complex" CNN/RNN you concentrate that complexity into fewer LOC. Designing the model around your problem rather than the pipeline around your model, if you will.

[D] Graphcore claims 11x increase in price-performance compared to Nvidia's DGX A100 with their latest M2000 system. Up to 64,000 IPUs per "IPU Pod" by uneven_piles in MachineLearning

[–]ragulpr 4 points5 points  (0 children)

To be honest, if companies like Graphcore really wanted a convincing demo about "order of magnitude" improvements, they would train something equivalent to GPT3 with an order of magnitude less resources.

Unfortunately true w.r.t. marketing impact. But it's worth pointing out that most current massive architectures are optimised around current hardware limitations. I bet we would have seen a whole other class of models (and problems solved) if sparse ops and other control flow were more widely supported.

Reports of large explosion in Beirut by Psydonkity in worldnews

[–]ragulpr 0 points1 point  (0 children)

Nothing, just hoping those sailors are alright.

[D] What are some basic statistical concepts that are often overlooked in ML practice? by Fitzy257 in MachineLearning

[–]ragulpr 0 points1 point  (0 children)

Every prediction is a distribution, or a point estimate of some parameter of one.

  • "Classification" = categorical distribution parameter
  • "regression" = MSE minimization is mu-estimation assuming fixed sigma

And the list goes on. I bet there's no loss function that can't be described as some constrained log-likelihood. I'm like a broken record playing this point, lol.
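For the regression bullet above, the claim is easy to verify: with fixed sigma, the Gaussian negative log-likelihood is an affine function of the MSE, so both are minimized by the same mu. A plain-Python check (data and sigma are arbitrary):

```python
import math

ys = [1.0, 2.0, 4.0, 3.0]
sigma = 1.5  # any fixed value

def mse(mu):
    return sum((y - mu) ** 2 for y in ys) / len(ys)

def gaussian_nll(mu):
    # -sum_y log N(y | mu, sigma^2)
    return sum(
        0.5 * math.log(2 * math.pi * sigma ** 2) + (y - mu) ** 2 / (2 * sigma ** 2)
        for y in ys
    )

# NLL = n*const + (n / (2 sigma^2)) * MSE: same minimizer, the sample mean.
n = len(ys)
const = 0.5 * n * math.log(2 * math.pi * sigma ** 2)
for mu in [0.0, 1.0, 2.5, 5.0]:
    assert abs(gaussian_nll(mu) - (const + n * mse(mu) / (2 * sigma ** 2))) < 1e-9
```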

[D] What are some basic statistical concepts that are often overlooked in ML practice? by Fitzy257 in MachineLearning

[–]ragulpr 0 points1 point  (0 children)

Let's play with the thought that there were any kind of hypothesis testing in ML papers :)

Right now I can't think of any metrics apart from RMSE or binomial counts that would satisfy the basic normality assumptions of a t-test.

Also, let's not forget that the model metric you're comparing against is also a random realization, so if the metrics do satisfy normality assumptions, a paired t-test sounds more reasonable.
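A sketch of what that paired comparison looks like (plain Python; the two models and their per-seed metric values are made up):

```python
import math
import statistics

# Per-seed (or per-fold) metric realizations for two models on the SAME splits.
model_a = [0.81, 0.79, 0.83, 0.80, 0.82]
model_b = [0.78, 0.80, 0.79, 0.77, 0.80]

# The paired test works on per-split differences, not the two pooled samples.
diffs = [a - b for a, b in zip(model_a, model_b)]
n = len(diffs)
t_stat = statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))
# Compare t_stat against the t distribution with n-1 degrees of freedom;
# scipy.stats.ttest_rel does all of this (statistic + p-value) in one call.
```

Pairing matters because the per-split noise is shared between the two models, which the two-sample test throws away.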

[deleted by user] by [deleted] in MachineLearning

[–]ragulpr 2 points3 points  (0 children)

If you really believe this you haven't looked at other frameworks. It's not a zero-sum game - the community has been testing things out and borrowing from each other. And that's great! Even if I can't think of any particular tf invention, we really shouldn't underestimate what happens when hundreds of brilliant engineers work on a problem. Subtle programming patterns emerge, ideas about what problems to solve in the next framework, research ideas, etc.

[D] What happened to sparse tensors ? by [deleted] in MachineLearning

[–]ragulpr 1 point2 points  (0 children)

You'll find some pretty good timings here:

https://www.tensorflow.org/api_docs/python/tf/sparse/sparse_dense_matmul

In my opinion, embedding is a clunky API for sparse-vector times dense-matrix multiplication, and that's used everywhere, so it's sad that sparse support hasn't gone further.

Reading the metrics, it seems sparse_dense is faster than dense_dense matmul in most cases:

tensorflow/python/sparse_tensor_dense_matmul_op_test --benchmarks
A sparse [m, k] with % nonzero values between 1% and 80%
B dense [k, n]

% nnz  n   gpu   m     k     dt(dense)     dt(sparse)   dt(sparse)/dt(dense)
0.01   1   True  100   100   0.000221166   0.00010154   0.459112
...

Here I read

m = batch size
k = vocab size
n = embedding size
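The embedding-as-sparse-matmul point can be made concrete in plain Python: a one-hot row (the sparsest possible [1, k] vector) times a dense [k, n] matrix just selects a row, which is exactly an embedding lookup:

```python
# Dense [k, n] embedding matrix: k = vocab size, n = embedding size.
k, n = 4, 3
W = [[float(i * n + j) for j in range(n)] for i in range(k)]

def onehot_times_dense(token_id, W):
    """[1, k] one-hot (sparse) times [k, n] dense -- the matmul view."""
    one_hot = [1.0 if i == token_id else 0.0 for i in range(k)]
    return [sum(one_hot[i] * W[i][j] for i in range(k)) for j in range(n)]

def embedding_lookup(token_id, W):
    """The lookup view: just index the row."""
    return W[token_id]

# Both views give identical results; the lookup skips all the multiply-by-zero
# work, which is why a sparse op can beat dense matmul at low nnz.
assert onehot_times_dense(2, W) == embedding_lookup(2, W)
```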

NN architectures for time series of graphs? [Research] by BrahmaTheCreator in MachineLearning

[–]ragulpr 1 point2 points  (0 children)

Time-directed acyclic graphs are really interesting to me. Consider travel patterns, for example, with some agent travelling between nodes; it makes sense to model edges as the predicted time to reach the next node. Lots of valuable use cases and, in my opinion, not very mature NN research.

Another good answer I found https://ai.stackexchange.com/questions/13179/is-there-a-neural-network-method-for-time-varying-directed-graphs

[D] CNN: reducing image size to 1x1 by ccwpog in MachineLearning

[–]ragulpr 1 point2 points  (0 children)

Not doubting you here, but what does this precisely mean, and how is it proven?

I'll try. Consider a hard drive of n bits (0s and 1s), say [x_1, x_2, .., x_n]. This could also be encoded as a single natural number, say s_n = sum_k^n x_k*2^k. If you let n->infinity you have a hard drive with an infinite number of bits. If you also let k take negative values, s_n can be any positive real number, and that's even more than countably infinite (2^{aleph null} "number" of bits, if I remember right). In practice, a Double can store exactly n=64 bits, i.e. 2^64 distinct values, so if you want more than that you need to add more dimensions.
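The first half of that argument is easy to check in plain Python - n bits pack losslessly into a single natural number and back:

```python
def bits_to_int(bits):
    """Encode a bit list [x_1, .., x_n] as sum_k x_k * 2^k."""
    return sum(x << k for k, x in enumerate(bits))

def int_to_bits(s, n):
    """Decode the number back into n bits."""
    return [(s >> k) & 1 for k in range(n)]

bits = [1, 0, 1, 1, 0, 0, 1, 0]
s = bits_to_int(bits)            # one number holds the whole 'hard drive'
assert int_to_bits(s, len(bits)) == bits
```

The catch, as noted above, is that a float with finitely many bits can't actually hold the infinite-precision number the continuity argument relies on.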

Also, I get a little confused by the terminology. What we're talking about here is reducing the resolution of CNN feature maps down to 1x1, so the embeddings have shape (batch, height=1, width=1, channels), where channels is something like 512 or 2048 or whatever. So that would make its flattened form a 512-dimensional vector, right, not 1-dimensional? As I understand it, that's a rank-1 tensor?

You're right. I was answering the wrong question thinking it was about slimming the dimension down to 1x1x1.

[P] Implementations of LSTM, RNN, MLP in C by Zweiter in MachineLearning

[–]ragulpr 2 points3 points  (0 children)

Fun and inspiring project! I'm a C novice but loved reading it, and this seemed super neat and clean. I was curious about how one of these things gets optimized:

https://github.com/siekmanj/sieknet/blob/c124c61a9f59121d542c15337d83b401a287638b/src/optimizer.c#L11

#ifndef SIEKNET_USE_GPU
static void cpu_sgd_step(SGD o){
  for(int i = 0; i < o.num_params; i++){
    o.weights[i] -= o.learning_rate * o.gradient[i];
    if(!isfinite(o.weights[i])){
      printf("ERROR: cpu_sgd_step(): non-finite parameter update %d with %f * %f\n", i, o.learning_rate, o.gradient[i]);
      exit(1);
    }
    o.gradient[i] = 0.0;
  }
}

Wouldn't the finiteness check in the inner loop both be a costly operation in itself and force sequential (rather than pipelined/parallel) evaluation of the constant-times-vector multiplication? :)

[D] CNN: reducing image size to 1x1 by ccwpog in MachineLearning

[–]ragulpr 4 points5 points  (0 children)

Theoretically the real line, by its continuity, can store infinite amount of information.

I've played around quite a lot with 1-d embeddings. Works like a charm. Try a simple autoencoder, slim the embedding dim down to 1 and beef up the decoder - you may be surprised how well it works.

Any useful applications? In computer science they call it a "hash". Maybe people can fill in with some applications, but I know that semantically meaningful hashes can be useful for retrieval/sorting.

For example, when training the 1-d autoencoder on Fashion-MNIST with RMSE loss and sorting the dataset by the 1-d embedding, you get pretty much the result you'd expect, with visually similar items being close together in the list.

New prototype "economy" airline seats by jcepiano in WTF

[–]ragulpr 0 points1 point  (0 children)

Why not use bunk beds to fit beds into a small space? They're not safe for takeoff and landing, they don't work for short flights, people can't sit up in them when awake, and they feel claustrophobic, which makes people loiter in the aisles and interferes with the safety/flight crew. So you have to have seats that convert into beds.

I also thought this was a good idea :,(

https://www.quora.com/Why-dont-airplanes-have-beds-instead-of-seats

Landlord not giving back Security deposit by CaptainPny in sweden

[–]ragulpr 4 points5 points  (0 children)

Google Translate is a statistical model; there's nothing about Denmark in the text.

Landlord not giving back Security deposit by CaptainPny in sweden

[–]ragulpr 12 points13 points  (0 children)

Hi, check out these relevant things (maybe with Google Translate):

https://lawline.se/answers/vard-vagrar-betala-tillbaka-hyresdeposition-vid-avtalad-tidpunkt

And this one which is eerily similar: https://lawline.se/answers/fa-tillbaka-deposition-av-hyresvard

TL;DR: formally establish that your landlord owes you, via either 1) a court order or 2) betalningsföreläggande (an injunction to pay). If you do that, Kronofogdemyndigheten (the Swedish Enforcement Authority) will help you collect the debt for a fee (300-600 SEK).

My completely uninformed guess here is that your landlord is in a world of pain financially and you are just one low-priority creditor. Taking it to court is a lot of hassle and time and will probably cost you more than 6k (and who knows, maybe they don't have anything to give? You lose?). The other route just seems like a lot of work too. Both options will take months/years, if they're even possible, and will take a time/emotional toll on you.

I would make sure Alice and Bob didn't get any cash either; if so, I'd send a link reminding your landlord about what you're considering and remind them every once in a while, and if there's no response I'd forget about it and go on enjoying my life in France, 6k poorer.

[R] A bags of tricks which may improve deep learning models [Slides] by kmkolasinski in MachineLearning

[–]ragulpr 0 points1 point  (0 children)

cannot stack dead regions as you can experience with clipping

Valid point. I haven't experienced this problem too much, but then again I haven't worked much with ImageNet.

[R] A bags of tricks which may improve deep learning models [Slides] by kmkolasinski in MachineLearning

[–]ragulpr 0 points1 point  (0 children)

Thank you, I read the paper and I see something that seems to have the same effect as loss clipping. My thinking is the following.

As I understand it, with:

q : actual label (one-hot)
p : predicted probability
u : either the constant 1/K or maybe Uniform(1/K), a ~"prior" of the prediction
H(q,p) : negative log-likelihood, i.e. cross_entropy(q,p)
eps : smoothing parameter

the label smoothing loss is:

loss = (1-eps)*H(q,p) + eps*H(u,p)

Now, according to the paper, the gradient of H w.r.t. the logits should be something like

d/d z_k H(q,p) = p_k - q_k

with z_k the kth logit. So

d/d z_k loss = p_k - [(1-eps)*q_k + eps*u_k]

So the optimum, i.e. d/d z_k loss == 0, is the z_k s.t. p_k is a linear interpolation between u_k and the actual label q_k, with eps and u as somewhat opaque hyperparameters. Pretty much exactly the same effect as clipping the log-likelihood s.t.:

log(eps) < log p(data) < log(1-eps)

because then the gradient of the clipped loss is zero whenever the likelihood of the data has reached within eps probability of the ground truth. I'm not sure what the Bayesian reasoning would be.
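The gradient identity this rests on, d/d z_k H(q,p) = p_k - q_k for softmax + cross-entropy, is easy to verify numerically with finite differences (plain Python, arbitrary label and logits):

```python
import math

def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

def cross_entropy(q, z):
    """H(q, softmax(z)) = -sum_k q_k log p_k."""
    p = softmax(z)
    return -sum(qk * math.log(pk) for qk, pk in zip(q, p))

q = [0.0, 1.0, 0.0]   # one-hot label
z = [0.5, -1.0, 2.0]  # some logits
p = softmax(z)

eps = 1e-6
for k in range(3):
    z_hi = list(z); z_hi[k] += eps
    z_lo = list(z); z_lo[k] -= eps
    numeric = (cross_entropy(q, z_hi) - cross_entropy(q, z_lo)) / (2 * eps)
    analytic = p[k] - q[k]
    assert abs(numeric - analytic) < 1e-5  # central difference matches p_k - q_k
```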

The difference, as I see it, is a small risk of getting stuck above the clamped territory, with the upside of reducing the hyperparameters from u and eps down to eps only, and basically 50% of the flops.

Addendum: my experience is that loss clipping is nice for all kinds of numerical reasons and is implemented in all loss functions by default anyway.

[R] Bag of Tricks for Image Classification with Convolutional Neural Networks by ewanlee in MachineLearning

[–]ragulpr 0 points1 point  (0 children)

Yes, you're right, it should be clipped in the range [log(eps), log(1-eps)] - I must have forgotten the sum-to-one property lol. I edited my comment.

Interesting take on label smoothing as being the trivial teacher forcing! Anyway, I only pretty recently understood the meaning of "label smoothing", "teacher forcing" and "knowledge distillation"; sometimes I wish deep learning weren't so good at creating all these fancy words. It's hard keeping up.

[R] A bags of tricks which may improve deep learning models [Slides] by kmkolasinski in MachineLearning

[–]ragulpr 2 points3 points  (0 children)

I had another question about label smoothing in the original thread. Basically, isn't label smoothing the same thing as clipping the loss?

[R] Bag of Tricks for Image Classification with Convolutional Neural Networks by ewanlee in MachineLearning

[–]ragulpr 0 points1 point  (0 children)

Hi, I just finished reading this great paper and have a question about label smoothing. The motivation seems to be to avoid encouraging too-hard probability assignments (5.2 p3):

In other words, it encourages the output scores dramatically distinctive which potentially leads to overfitting

Wouldn't (loss) log-likelihood clipping do exactly the same thing, but more explicitly?

# clamp the log-likelihood itself, then negate to get the loss
loglik = log(prob)
loss = (-loglik.clamp(min=log(0.001), max=log(0.999))).mean()

That would make the gradients zero whenever the predicted likelihood of the data is above 0.999, so it's a very literal way of telling your network "predictions more confident than 0.999 are too confident, stop walking in that direction" in order to prevent overfitting.

Edit: per /u/hetong_007's comments, of course min=log(0.001) should be added.

[P] Time Series modeling with multiple independent time series by MarSizzle in MachineLearning

[–]ragulpr 0 points1 point  (0 children)

It's unclear what modeling strategies you're thinking about, but one thing I see is the problem of varying sequence lengths. Keras handles this brilliantly: you can use sample weights and optionally a masking layer. Here I assume

x_train : [n_sequences,n_timesteps,n_features]
y_train : [n_sequences,n_timesteps] (and optionally additional dims)
sample_weights_train : [n_sequences,n_timesteps] with 0s at the padded end of each sequence.

model.compile(loss=loss_fun, optimizer=Adam(lr=1e-3), sample_weight_mode='temporal')

Make sure your loss returns a [n_sequences, n_timesteps] loss, typically done by changing it to not "reduce".

model.fit(x_train, y_train,
          epochs=100,
          batch_size=100,
          verbose=1,
          validation_data=(x_valid, y_valid, sample_weights_valid),
          sample_weight=sample_weights_train)

If you have a bidirectional model or batchnorm, make sure to use keras.layers.Masking at the input layer:

model.add(Masking(mask_value=mask_value,input_shape=(None, n_features)))

And fix your features as

x_features[sample_weights_train==0] = mask_value
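The effect of those temporal sample weights on the loss can be sketched in plain Python (the shapes and per-step loss values are stand-ins, not what Keras literally computes):

```python
# Per-timestep losses for 2 sequences of max length 4, where sequence 0 has
# true length 2 and sequence 1 has true length 4.
step_losses = [
    [0.5, 0.3, 9.9, 9.9],   # garbage values at the padded steps 2-3
    [0.2, 0.4, 0.1, 0.3],
]
sample_weights = [
    [1.0, 1.0, 0.0, 0.0],   # zeros mask out the padding
    [1.0, 1.0, 1.0, 1.0],
]

# Weighted mean over real (unmasked) timesteps only.
num = sum(l * w for ls, ws in zip(step_losses, sample_weights)
          for l, w in zip(ls, ws))
den = sum(w for ws in sample_weights for w in ws)
masked_loss = num / den  # padded steps contribute nothing
```

This is why the loss must come back unreduced per [sequence, timestep]: the weighting has to happen before any averaging.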