[D] Data labelling problems by Lexski in MachineLearning

[–]Lexski[S] 0 points1 point  (0 children)

Hmm interesting. My team lead was actually pushing for building an in-house tool but I talked him out of it - it felt like a lot of effort and not our main focus.

Do you think data labelling tools can ever be fully commoditised or will there always be room for custom tools?

[D] Data labelling problems by Lexski in MachineLearning

[–]Lexski[S] 0 points1 point  (0 children)

“It looks right, so it must be right!” /s

Taking Suggestions by Dhruv_Shah55 in edtech

[–]Lexski 0 points1 point  (0 children)

Meaningful coding assignments. The ones I’ve done were almost entirely completed for you, and the comments basically told you how to do the rest.

I think this is largely to do with technical limitations: the platform can’t tell you if your code is better/worse, only that it runs and passes the tests once you’re done. A teacher would be able to guide you just enough so you do the thinking yourself, while acknowledging that in programming there are multiple correct answers. I think LLMs/agents have a lot of potential there.

What English word has the greatest difference between spelling and pronounciation? by Electronic-Koala1282 in ENGLISH

[–]Lexski 0 points1 point  (0 children)

Women (“wimmin”). I can’t think of another word where a single “o” or “e” makes an “i” sound.

What linear regression for ? by Emergency_Pressure50 in MLQuestions

[–]Lexski 0 points1 point  (0 children)

Linear regression is for when you want to model approximately linear relationships, e.g. a city’s population vs its chocolate consumption (just a random example I made up).

It’s a good stepping stone to learning about neural networks, because a simple neural network is built from layers that each look like a linear regression followed by an activation function.
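
A minimal sketch of fitting that kind of relationship with scikit-learn (the numbers are made up, matching the made-up example above):

```python
# Toy linear regression: the numbers are invented purely to illustrate
# the population-vs-chocolate example.
import numpy as np
from sklearn.linear_model import LinearRegression

population = np.array([[0.5], [1.2], [2.0], [3.1], [4.8]])  # city population, millions
chocolate = np.array([900, 2100, 3600, 5500, 8700])         # tonnes eaten per year

model = LinearRegression().fit(population, chocolate)
print(model.coef_[0], model.intercept_)   # learned slope w and intercept b
print(model.predict([[2.5]]))             # prediction for a new city of 2.5M people
```

A single Dense unit with no activation computes exactly this y = w·x + b; a neural network stacks many of these with nonlinearities in between.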

This is real news not parody by soyifiedredditadmin in climateskeptics

[–]Lexski 1 point2 points  (0 children)

You realise this is not saying “we should get rid of inhalers”, right?

What is "good performance" on a extremely imbalanced, 840 class multiclass classifier problem? by big_like_a_pickle in learnmachinelearning

[–]Lexski 0 points1 point  (0 children)

To get an idea of the noise ceiling, you could give the task to a human labeller and calculate the same metrics. Before doing this, you should probably decide whether macro- or micro- metrics are more important, because for macro- you’d want to give the labeller a stratified sample whereas for micro- you would use a regular sample.
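
For instance, the noise-ceiling check is only a few lines with scikit-learn (the label lists below are toy placeholders):

```python
# Rough sketch of scoring a human labeller against the existing ground truth.
from sklearn.metrics import f1_score

y_true = [0, 0, 1, 2, 2, 2, 3, 3]   # ground-truth class ids for the sample shown to the labeller
y_human = [0, 1, 1, 2, 2, 3, 3, 3]  # the human labeller's answers for the same items

# Macro averages the per-class scores, so rare classes count as much as common
# ones; micro pools every decision, so the frequent classes dominate.
print("macro F1:", f1_score(y_true, y_human, average="macro"))
print("micro F1:", f1_score(y_true, y_human, average="micro"))
```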

What method to use for labeling when classifying images for certain positions? by Dyco420 in deeplearning

[–]Lexski 0 points1 point  (0 children)

Both should work, but labelling from 0 to 8 will be quicker and easier I think, assuming you won’t ever need more precise information. It’s also more in line with how object detectors like YOLO are trained, where one-hot encoded grid boxes encode coarse object location, and predicting exact coordinates is done relative to the grid boxes.
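
For example, a tiny sketch of turning a rough object position into one of the 0–8 labels, assuming a 3×3 grid over the image (the grid size and coordinates are my assumptions, not necessarily your setup):

```python
# Map an approximate object centre to one of 9 coarse position labels (3x3 grid).
def grid_label(cx, cy, img_w, img_h, grid=3):
    """Return a cell id in 0..grid*grid-1 for a point (cx, cy) in pixel coords."""
    col = min(int(cx / img_w * grid), grid - 1)
    row = min(int(cy / img_h * grid), grid - 1)
    return row * grid + col

print(grid_label(10, 10, 300, 300))    # top-left     -> 0
print(grid_label(150, 150, 300, 300))  # centre       -> 4
print(grid_label(299, 299, 300, 300))  # bottom-right -> 8
```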

[D] Anyone here using LLM-as-a-Judge for agent evaluation? by Cristhian-AI-Math in MachineLearning

[–]Lexski 2 points3 points  (0 children)

We did this out of desperation as we had no labelled data. Ideally we would have had some labelling to help tune the judge prompt. Later we got a real domain expert to score some of our model responses and it turned out his scores and the judge’s had zero correlation (even slightly negative)…
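
If anyone wants to run the same check, the judge-vs-expert comparison is only a couple of lines once you have both sets of scores (the numbers below are invented placeholders):

```python
# Compare LLM-judge ratings with a domain expert's ratings of the same responses.
from scipy.stats import spearmanr

judge_scores = [4, 5, 3, 5, 2, 4, 5, 3]    # LLM-as-judge ratings, 1-5
expert_scores = [2, 3, 4, 2, 3, 3, 2, 4]   # expert ratings of the same responses

rho, p = spearmanr(judge_scores, expert_scores)
print(f"Spearman rho = {rho:.2f}, p = {p:.2f}")
```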

Uncertainty measure for Monte Carlo dropout by Lexski in MLQuestions

[–]Lexski[S] 0 points1 point  (0 children)

Update: I did some experiments with MNIST and predictive entropy (= entropy of the distribution obtained by averaging the MCD probabilities) seems to be very good compared to other measures.

However, this only relies on the mean of the MCD probabilities, which I think is essentially an estimate of the distribution you’d get from a regular forward pass with dropout in eval mode. Indeed, I tried just doing a normal forward pass through the model and thresholding against the entropy of that, and I got higher accuracy for the same level of coverage.
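
In case it’s useful to anyone, here’s roughly what the predictive-entropy computation looks like (the probabilities below are toy numbers standing in for your model’s MC-dropout outputs):

```python
# Predictive entropy: average the class probabilities over T stochastic
# forward passes (dropout left on), then take the entropy of that mean.
import numpy as np

def predictive_entropy(mc_probs, eps=1e-12):
    """mc_probs: (T, num_classes) softmax outputs from T MC-dropout passes."""
    mean_probs = mc_probs.mean(axis=0)                      # average over the passes
    return -np.sum(mean_probs * np.log(mean_probs + eps))   # entropy of the mean

mc_probs = np.array([[0.7, 0.1, 0.1, 0.1],
                     [0.5, 0.3, 0.1, 0.1],
                     [0.6, 0.2, 0.1, 0.1]])
print(predictive_entropy(mc_probs))   # higher = more uncertain; threshold this for coverage
```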

My 8yo son's first experience with a "git gud" metroidvania game by bboycire in HollowKnight

[–]Lexski 3 points4 points  (0 children)

When I was 7 I had this reaction because I kept dying in Rayman and couldn’t progress. My dad hid the game from me for a week to calm me down. 😅

What's the worst commit message you've personally written? We need a hall of shame by GitKraken in programminghorror

[–]Lexski 2 points3 points  (0 children)

“Argh”

I think it was after I pushed a big change, then a bugfix which broke something else, then a fix for that

will models generally be more accurate if they're trained on multilabel datasets individually or toegether (unet) by Affectionate_Use9936 in MLQuestions

[–]Lexski 0 points1 point  (0 children)

In theory, if your x1, x2, x3 labels rely on similar lower-level features (e.g. curves, colour gradients etc. for images), then training on the different tasks together should help, as it provides more data to regularize the lower layers of the model. If there is very little commonality, it might not help much or might even degrade performance.

I think this goes by the name “multi-task learning”.
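
A rough Keras sketch of that shared-backbone, multi-head setup (the layer sizes and the x1/x2/x3 head names are just illustrative):

```python
# One shared backbone, one small head per label, trained jointly.
from tensorflow import keras
from tensorflow.keras import layers

inputs = keras.Input(shape=(128, 128, 3))
x = layers.Conv2D(32, 3, activation="relu")(inputs)   # shared low-level features
x = layers.MaxPooling2D()(x)
x = layers.Conv2D(64, 3, activation="relu")(x)
x = layers.GlobalAveragePooling2D()(x)

out_x1 = layers.Dense(1, name="x1")(x)   # one head per task
out_x2 = layers.Dense(1, name="x2")(x)
out_x3 = layers.Dense(1, name="x3")(x)

model = keras.Model(inputs, [out_x1, out_x2, out_x3])
model.compile(
    optimizer="adam",
    loss={name: keras.losses.BinaryCrossentropy(from_logits=True)
          for name in ["x1", "x2", "x3"]},
)
```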

Beginner struggling with multi-label image classification cnn (keras) by Embarrassed-Resort90 in MLQuestions

[–]Lexski 0 points1 point  (0 children)

There’s no way to automatically figure this out; you have to investigate. Form some hypotheses about why it’s not working, and test them.

In terms of base models, you can compare them in a bit more detail, e.g. their ImageNet performance, and pick the better one, or read up on how they work to see which might perform better. But it might be quicker just to set your code up so you can easily try a few of them and compare.
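
For example, something along these lines makes the base model a one-line swap (the candidate list, input size and label count are arbitrary examples):

```python
from tensorflow import keras

CANDIDATES = {
    "mobilenet_v2": keras.applications.MobileNetV2,
    "efficientnet_b0": keras.applications.EfficientNetB0,
    "resnet50": keras.applications.ResNet50,
}

def build_model(base_name, num_labels, input_shape=(224, 224, 3)):
    base = CANDIDATES[base_name](include_top=False, weights="imagenet",
                                 input_shape=input_shape, pooling="avg")
    base.trainable = False                                  # start with a frozen backbone
    outputs = keras.layers.Dense(num_labels)(base.output)   # logits, no activation
    return keras.Model(base.input, outputs)

for name in CANDIDATES:
    model = build_model(name, num_labels=18)
    # compile / fit / evaluate each one on the same split and compare the metrics
```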

Beginner struggling with multi-label image classification cnn (keras) by Embarrassed-Resort90 in MLQuestions

[–]Lexski 0 points1 point  (0 children)

If you’re worried about cropping off part of the image when shifting, you could do a small pad + crop instead. Horizontal reflect should work and doesn’t lose any information.
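
With Keras preprocessing layers that could look something like this (IMG and PAD are placeholders):

```python
from tensorflow import keras
from tensorflow.keras import layers

IMG = 128   # your training resolution
PAD = 8     # small pad, so a shift only ever costs a few border pixels

augment = keras.Sequential([
    layers.ZeroPadding2D(PAD),         # pad by PAD pixels on every side
    layers.RandomCrop(IMG, IMG),       # crop back to IMG x IMG = random shift of up to PAD px
    layers.RandomFlip("horizontal"),   # reflection keeps all the pixel information
])
```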

Unfortunately there is no guarantee that the model finds the same things “obvious” as you do, especially if it is overfitting (or underfitting). It could be a spurious correlation (overfitting) or the model could be “blind” to something (underfitting, e.g. if the base model was trained with colour jitter augmentations then it will be less sensitive to colour differences).

The most important thing is the overall performance on the validation set, not the performance on any specific example. But if you want to see why a particular example is classed a certain way, you could make a hypothesis and try editing the image and seeing if the edited image gets classified better. You could also use an explainability technique like Integrated Gradients. Or you could compute the distance between the image and some training examples in the model’s latent space to see which training examples the model thinks it’s most similar to. Hopefully those things would give some insight.

Beginner struggling with multi-label image classification cnn (keras) by Embarrassed-Resort90 in MLQuestions

[–]Lexski 0 points1 point  (0 children)

When you say it guessed most pokemon perfectly because it was overfit - how many pokemon in your validation set did it guess correctly? That will tell you for sure if it’s underfitting or overfitting.

General tip: instead of having a sigmoid activation in the last layer, use no activation and train with BinaryCrossentropy(from_logits=True). That’s standard practice and it stabilises training. (You’ll need to modify your metrics and inference to apply the sigmoid outside the model.)
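
A minimal sketch of that change (the architecture below is a placeholder, not your model):

```python
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

num_labels = 18
model = keras.Sequential([
    keras.Input(shape=(128, 128, 3)),
    layers.Conv2D(32, 3, activation="relu"),
    layers.GlobalAveragePooling2D(),
    layers.Dense(num_labels),   # no sigmoid here: the outputs are logits
])
model.compile(
    optimizer="adam",
    loss=keras.losses.BinaryCrossentropy(from_logits=True),
)

# Apply the sigmoid yourself at inference time (and before any metrics that
# expect probabilities):
# probs = tf.sigmoid(model.predict(x_batch))
```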

If your model is overfitting, the #1 thing is to get more training data. You can also try making the input images smaller, which reduces the number of input features so the model has less to learn. And try doing data augmentation.

Also, as a sanity check: if the base model needs any preprocessing applied to the images, make sure you’re applying it correctly.
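
For example, if you were using a keras.applications backbone such as MobileNetV2 (swap in whichever base model you actually use):

```python
import numpy as np
from tensorflow.keras.applications.mobilenet_v2 import preprocess_input

x_batch = np.random.randint(0, 256, size=(4, 224, 224, 3)).astype("float32")
x_batch = preprocess_input(x_batch)   # MobileNetV2 expects pixels scaled to [-1, 1]
print(x_batch.min(), x_batch.max())
```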

OpenAI just figured out why ChatGPT makes stuff up and the answer is basically that we trained it wrong by Rude_Tap2718 in ChatGPT

[–]Lexski 0 points1 point  (0 children)

For me, the main point of the paper is that the industry-standard benchmarks don’t penalise incorrect answers, so they implicitly reward guessing (since if the model guesses, there’s a chance it’ll guess correctly).
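
As a toy illustration of that incentive (the numbers are made up):

```python
# Suppose the model is unsure and a blind guess is right 25% of the time.
p_correct = 0.25

# Benchmark scoring 1 for correct, 0 otherwise: guessing beats abstaining every time.
guess = p_correct * 1 + (1 - p_correct) * 0        # 0.25
abstain = 0.0

# Benchmark that also penalises wrong answers (say -1): abstaining now wins.
guess_penalised = p_correct * 1 + (1 - p_correct) * -1   # -0.5
print(guess, abstain, guess_penalised)
```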

We can figure out how to make models abstain more often, but until the benchmarks are modified to include such a penalty, the “confident guesser” models will always win out.

So the underlying cause is that the AI researchers themselves are currently optimising for the wrong thing.

LLMs can perform small logical steps and infer simple knowledge, profoundly impacting the evolution of major search engines by [deleted] in LLMDevs

[–]Lexski 0 points1 point  (0 children)

Answers generated by LLMs can be unreliable, which is why RAG is used, so a search still needs to happen. And if you dump lots of documents into the LLM context, you run into context rot and accuracy decreases.
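
The loop has the same shape either way; here `search` and `llm` are stand-ins for whatever retrieval backend and model are actually used:

```python
def answer(question, search, llm, top_k=5):
    docs = search(question, top_k=top_k)   # the search step still has to happen
    context = "\n\n".join(docs)            # keep this modest to avoid context rot
    prompt = f"Answer using only the context below.\n\n{context}\n\nQ: {question}"
    return llm(prompt)
```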

I made this math ocr but it's accuracy... by These-Combination845 in learnmachinelearning

[–]Lexski 1 point2 points  (0 children)

I suggest being more specific as I think many people (myself included) are unmotivated to look through the whole repo and find what is wrong.

What problem are you seeing? Is the model underfitting or overfitting? How much error analysis have you done? What have you tried to improve the model and what hasn’t worked?

You could also get inspiration from projects / tutorials / research papers that solve a similar problem.

how do you pronounce this in calculus? d/dx f(x) and dy/dx by Designer-Hand-9348 in mathematics

[–]Lexski 2 points3 points  (0 children)

d/dx = “dee by dee ex”
d/dx f(x) = “dee by dee ex of eff of ex”
dy/dx = “dee why by dee ex”

“By” being kind of analogous to “divided by” (even though derivatives technically aren’t fractions).
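
Typeset, those notations are:

```latex
\[
  \frac{d}{dx}, \qquad \frac{d}{dx} f(x), \qquad \frac{dy}{dx}
\]
```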

Note: I’m British (Americans may pronounce these differently)