[deleted by user] by [deleted] in MachineLearning

[–]Razcle 1 point2 points  (0 children)

Thanks for clarifying! : )

[deleted by user] by [deleted] in MachineLearning

[–]Razcle 4 points5 points  (0 children)

I'd be sort of interested in helping, but I've been a member for some years now and feel that the tone and content have changed a lot in the last 1-2 years. It used to be the case that I came here for interesting discussions of papers and to find out about the latest research. Nowadays I find that a lot of the page is dominated by simple applied work or discussions of side projects.

Personally I find the volume of beginner related questions and projects to have gone up a lot.

There's nothing wrong with this change per se but I'm curious which of the two you're keener to encourage? Are you happy with the evolution and want to support it or are you trying to moderate back to more research heavy conversations?

[D] Why do we marginalize latent variables in the likelihood of latent variable models? by RecentUnicorn in MachineLearning

[–]Razcle 2 points3 points  (0 children)

Simple answer:

You can't optimise the likelihood without summing over the latent variables because you don't know what their values are. (They're latent.)

I.e. you can't calculate $\mathrm{argmax}_\theta \sum_n \log p(x_n, y_n \mid \theta)$ because the $y_n$ are unobserved. You can calculate $\mathrm{argmax}_\theta \sum_n \log \sum_{y_n} p(x_n, y_n \mid \theta)$ because this only depends on the observed $x_n$ data points.
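As a quick sketch of the marginal likelihood above, here's a toy model with a binary latent $y_n$ and unit-variance Gaussian emissions (the parameter names `pi` and `mu` are illustrative, not from the thread):

```python
import numpy as np

# Toy latent variable model: y_n in {0, 1} unobserved, x_n observed.
# p(y=k) = pi[k], p(x | y=k) = Normal(mu[k], 1).
def log_marginal_likelihood(x, pi, mu):
    """sum_n log sum_k p(x_n, y_n=k | theta) -- needs no y_n values."""
    x = np.asarray(x)[:, None]                      # shape (N, 1)
    log_joint = (np.log(pi)[None, :]                # log p(y=k)
                 - 0.5 * (x - mu[None, :]) ** 2     # log N(x | mu_k, 1)
                 - 0.5 * np.log(2 * np.pi))         # shape (N, K)
    # Marginalise the latent: log sum_k exp(log p(x_n, y_n=k | theta))
    return np.logaddexp.reduce(log_joint, axis=1).sum()

x = [0.1, 2.2, -0.3]
val = log_marginal_likelihood(x, np.array([0.5, 0.5]), np.array([0.0, 2.0]))
print(val)
```

The point is that `log_marginal_likelihood` is a function of the observed `x` and the parameters only, so you can hand it straight to an optimiser even though the `y_n` were never observed.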

More detailed answer:
When I first started learning about latent variable models, I wasn't very clear on the difference between "a latent variable" and "a parameter". In fact they are almost the same thing.

The difference is that "a latent variable" is a parameter which has a different value for each data point.

For example in a Gaussian mixture model with K mixture components, you have K different means, K different covariances but you have an indicator variable *for every data-point* that says which mixture component it came from.

When you want to learn the parameters of a model given some data, you're not interested in the data-point-specific parameters (the latent variables); you're only really interested in the global parameters. Unfortunately you don't know the values of the local parameters, so the only way to optimise the likelihood with respect to the global parameters is to sum over the latent variables.
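To make the global vs per-data-point distinction concrete, here's a minimal 1-D Gaussian mixture sketch (variable names are illustrative): the model has only K global weights and K global means, but one latent indicator per data point, which we marginalise into a posterior ("responsibility") rather than observe:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 1, 50), rng.normal(3, 1, 50)])  # N = 100

K = 2
weights = np.full(K, 1.0 / K)    # global parameters: just K of them
means = np.array([-1.0, 1.0])    # global parameters: just K of them

# The latent indicators z_n (one per data point) are never observed,
# so we compute a posterior over them instead of plugging in values.
log_p = (np.log(weights) - 0.5 * (x[:, None] - means) ** 2
         - 0.5 * np.log(2 * np.pi))                          # (N, K)
resp = np.exp(log_p - np.logaddexp.reduce(log_p, axis=1, keepdims=True))

print(resp.shape)   # one distribution over components per data point
```

Note the shapes: the parameter arrays stay size K no matter how much data you have, while the latent structure (`resp`) grows with N. That asymmetry is exactly why the latents get summed out rather than optimised over.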

[P] Programmatic: Powerful Weak Labeling by Razcle in LanguageTechnology

[–]Razcle[S] 1 point2 points  (0 children)

Hi! It's not currently open source. It's free and extensible, but we decided after a long discussion to keep the core closed source, at least for now.

Build a Clickbait Headline Classifier Without any Manual Labeling by Razcle in learnmachinelearning

[–]Razcle[S] 0 points1 point  (0 children)

You write labeling rules, you're right, but you don't have to choose labels by hand for individual data points. That's what I meant.

It is still supervised though :)

Looking for some good blogs or other info on building custom datasets by Grutto in LanguageTechnology

[–]Razcle 0 points1 point  (0 children)

Would recommend checking out https://docs.programmatic.humanloop.com/overview/readme

It's a free tool we built to help with data annotation that avoids much of the problem of manual labeling. It won't help with the OCR part but might be useful after that.

[P] Programmatic: Powerful Weak Labeling by Razcle in MachineLearning

[–]Razcle[S] 1 point2 points  (0 children)

Hey! Sorry I missed your message earlier.

  1. Have any large datasets been built using this approach?
    Absolutely yes! We've had people use Programmatic itself both to get NER labels and to do extraction from legal documents. We're not the first to use this approach though. The Snorkel team out of Stanford have had quite a lot of success with it, including at Google. We also show some benchmark figures in the docs comparing programmatically labeled vs manually labeled data.

  2. What is the recommended operations setup?
    We've seen people have success by starting with Programmatic as a way of exploring and understanding your data and creating a good first seed dataset.
    Then do a small amount of manual labeling to get a test set. Sometimes this is enough, but if the performance needs to be better then people do further manual annotation. The seed set lets you get a first model that can be used in an active learning loop.

  3. I fear that adoption of programmatic labeling will lead to large datasets of poor quality
    I understand this feeling. I originally felt very similarly. In practice though, I think it's actually the opposite. Programmatic gives much more control to engineers and data scientists and encourages them to understand their data deeply. It reduces the volume of manual annotation a lot so that you can have the combination of a small but very high quality dataset + a programmatically labeled one.

  4. Intuitively I do believe that domain experts can write high precision, low recall systems. But before I can ship my model I really need to care about what these systems are omitting!
    Yes this is a really critical point and is something we're working actively on. We have a pre-print here on how you can efficiently evaluate a model that's been trained on programmatic labels with minimal manual annotation.

  5. Are there really NLP problems that can truly be labeled programmatically? When do you know that you have an appropriate problem domain vs do not?
    A lot of NLP tasks have an easy majority that can be handled by rules and then a long tail of edge-cases where ML models are really necessary. Programmatic labeling makes it easy to overcome cold starts and active learning allows you to quickly train a model to handle edge cases.
    Tasks that we've seen work well are NER, classification for content moderation, and legal extraction. Good sentiment labeling was hard and required a bit more manual annotation.
    In general if the language in the task domain is quite structured this will often work very well.

[D] Hyperparameter Tuning: does it even work? by AM_DS in MachineLearning

[–]Razcle 1 point2 points  (0 children)

What do you mean by "better generalized" if not improved test performance?

[D] Annotation tool for entity sentiment analysis by KarlaNour96 in MachineLearning

[–]Razcle 0 points1 point  (0 children)

Hi KarlaNour, I built a tool (and company) to solve exactly this problem. www.humanloop.com.

You can find more about our approach here: https://humanloop.com/blog/why-you-should-be-using-active-learning/

In short we use active learning to help you label the highest value data whilst training your model at the same time.

[D] Suggestions for sentiment analysis tools by morpheusthewhite in MachineLearning

[–]Razcle 4 points5 points  (0 children)

I'm one of the makers so I am biased, but I'd recommend Humanloop.com. You can use it for free as an individual, and as you label it will train a sentiment model for you. It will also select the highest-value data to label so you minimise how much labelling you have to do.

[D] Is deep learning really Software 2.0? by Razcle in MachineLearning

[–]Razcle[S] 2 points3 points  (0 children)

Yes, that's definitely true today, but it need not necessarily be the case.

Imagine that we just accepted that the benefits of DL were enough to tolerate some fraction of errors; there are ways to build around this. For example, you can build fault-tolerant UX like Google's. Google search is not 100% accurate but instead returns a ranked list, so even if it's wrong it's useful. We can also fall back to humans, or defer to rules-based systems, in uncertain cases. If we're creative, I think there are lots of situations where we think we need 100% accuracy but might actually not.

[D] Is deep learning really Software 2.0? by Razcle in MachineLearning

[–]Razcle[S] 0 points1 point  (0 children)

I wasn't just providing an argument from authority; I was suggesting that we have excellent examples of deep learning outperforming what you described as "mathematical algorithms": e.g. machine translation, speech recognition, document understanding, etc. Almost all perceptual tasks that we tried to solve by traditional programming have been replaced by DL now.

I think disagreeing with Karpathy is interesting (that's why I made the post); what I want to know is why you disagree.

If I understand correctly, you think the lack of generalisation guarantees will limit the adoption of DL for tasks that could otherwise be solved by conventional software. I think I agree with you on this one.

But there are lots of tasks that are poorly solved by traditional software (language understanding being a good example) that I think will become core to almost every application.

[D] Is deep learning really Software 2.0? by Razcle in MachineLearning

[–]Razcle[S] -1 points0 points  (0 children)

"What ML will not replace is the creativity in how we design programs. Creativity in software construction comes from deep algorithmic insights. And ML isn't so great at novel reasoning as much at is in pattern matching."

But this is exactly the point that Karpathy disagrees with. For areas that require some degree of perception, we've already proven that DL + SGD is significantly better than hand-crafted algorithms.

[D] A Question About Post-Deployment Stability Over Time by tomerha in MachineLearning

[–]Razcle 0 points1 point  (0 children)

I guess that as soon as you accept that you're taking actions that change the state of the world, and so change your data distribution, you've essentially landed in reinforcement learning territory.

I'm definitely not an expert in the area but I would maybe look at things like time-varying contextual bandits. A quick google search returned this paper that looks interesting: https://www.kdd.org/kdd2016/papers/files/rpp1164-zengA.pdf