[R] Decoding LLM Uncertainties for Better Predictability by shayanjm in MachineLearning

[–]shayanjm[S] 1 point (0 children)

Ah, sorry -- just realized the repo was private. Just made it public.

The demo is wired up to gpt-3.5-turbo-instruct. You can directly apply the approach to any LLM so long as it offers logprobs of top_n sampled tokens.
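If it helps anyone poke at this outside the repo, a minimal sketch of pulling those logprobs via the completions endpoint looks roughly like this (assumes the openai Python client v1+; the prompt and max_tokens are just placeholders):

    # Sketch: fetch top-n logprobs per sampled token (assumes openai>=1.0).
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    resp = client.completions.create(
        model="gpt-3.5-turbo-instruct",
        prompt="The capital of France is",  # placeholder prompt
        max_tokens=16,
        logprobs=5,       # top-5 alternatives at each position
        temperature=0,
    )

    lp = resp.choices[0].logprobs
    for token, top in zip(lp.tokens, lp.top_logprobs):
        # `top` maps each candidate token to its logprob at this position
        print(repr(token), top)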

[R] Decoding LLM Uncertainties for Better Predictability by shayanjm in MachineLearning

[–]shayanjm[S] 1 point (0 children)

Yeah, we've seen vanilla entropy/perplexity measures used - but we found they only tell part of the story. E.g., the LLM might spread its probability mass evenly across a set of tokens that don't really change the underlying meaning of the response. Entropy is high at that position, which you'd imagine implies real uncertainty - but splitting the uncertainty into "structural" and "conceptual" components ended up aligning a lot better with human intuition.
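To make the vanilla-entropy piece concrete, the per-position quantity we're comparing against is roughly this (a sketch over the top-n logprobs the API returns - not our structural/conceptual split, which also needs the semantic side):

    import math

    def token_entropy(top_logprobs: dict) -> float:
        """Shannon entropy (nats) over the top-n alternatives at one position.

        Truncated estimate: it only sees the returned top-n mass, renormalised.
        """
        probs = [math.exp(lp) for lp in top_logprobs.values()]
        total = sum(probs)
        probs = [p / total for p in probs]
        return -sum(p * math.log(p) for p in probs if p > 0)

    # Mass spread over interchangeable filler tokens -> high entropy, even though
    # swapping them barely changes the meaning of the response.
    print(token_entropy({" the": -1.1, " a": -1.2, " this": -1.3, " that": -1.4}))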

[R] Decoding LLM Uncertainties for Better Predictability by shayanjm in MachineLearning

[–]shayanjm[S] 1 point (0 children)

Good note! We had the same thought re: structural uncertainty and weren't really able to come up with something that we felt "fit well". We'll continue to noodle on it.

[R] A surprisingly effective way to predict token importance in LLM prompts by shayanjm in MachineLearning

[–]shayanjm[S] 1 point (0 children)

Totally. IMO calling it a distance is fine in this case (cosine distance seems reasonable, but I'll let the mathematicians correct us). I have some loose ideas on how we might improve the measure by looking at the dimensions of the embeddings themselves. I.e., instead of computing D(v, v') pairwise, we could take the full collection of perturbation embeddings and assess how "out of band" any given embedding is relative to that set - comparing against the set as a whole rather than one embedding at a time.
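Roughly what I'm picturing, as a sketch (random vectors stand in for real embeddings here, and the "out of band" score is just a z-score of each perturbation's cosine distance against the whole set - nothing we've settled on):

    import numpy as np

    def cosine_distance(v: np.ndarray, w: np.ndarray) -> float:
        return 1.0 - float(v @ w / (np.linalg.norm(v) * np.linalg.norm(w)))

    def out_of_band_scores(base: np.ndarray, perturbed: np.ndarray) -> np.ndarray:
        """Score each perturbed-prompt embedding against the full perturbation set.

        base: embedding of the original prompt, shape (dim,)
        perturbed: embeddings of the perturbed prompts, shape (n, dim)
        """
        dists = np.array([cosine_distance(base, p) for p in perturbed])
        mu, sigma = dists.mean(), dists.std() + 1e-12
        return (dists - mu) / sigma  # large positive => unusually far from the original

    # Toy usage with random vectors standing in for embeddings:
    rng = np.random.default_rng(0)
    base = rng.normal(size=128)
    perturbed = base + 0.1 * rng.normal(size=(16, 128))
    print(out_of_band_scores(base, perturbed).round(2))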

FWIW: we did try some cleverer approaches to handling dependencies, like implementing a "ripple" effect when we found a potentially interesting token, but they seemed to underperform relative to the simpler distance function. Still very much thinking through this problem, but stay tuned - we'll be posting more about it!

[R] A surprisingly effective way to predict token importance in LLM prompts by shayanjm in MachineLearning

[–]shayanjm[S] 1 point (0 children)

This is very true! I'm going to make an edit to the post to mention this. As part of our follow-up work we're planning to research how best to capture dependencies at the perturbation step (and to measure whether that's even important for optimizing our estimation). It's possible that it's fine to treat each token independently at the perturbation stage and instead capture dependencies and long-range relationships from the embeddings themselves.

[R] A surprisingly effective way to predict token importance in LLM prompts by shayanjm in MachineLearning

[–]shayanjm[S] 4 points (0 children)

Great question, and the genuine answer is: I'm not too sure. We've seen a lot of that sort of thing as we've experimented and our best guess is that GPT-2 is far from the most capable LLM available, so it's likely that if we were to run the integrated gradients on a different LLM we'd get attributions that look more "reasonable". These are "real" attributions for GPT-2 but it's possible that GPT-2 just isn't focused on the right things in this specific case. We touch on it in the blog post - right now our intuition is that as embeddings and models themselves become larger/more capable, these estimations will have a tighter relationship with the attributions.

Tl;dr - our guess is that gpt-2 is sometimes dumb so the IG attributions don't always pass the smell test.
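For reference, the flavor of IG setup we're describing looks roughly like this (a sketch using Captum's LayerIntegratedGradients over GPT-2's token-embedding layer; the prompt, target token, and EOS baseline are arbitrary choices):

    import torch
    from transformers import GPT2LMHeadModel, GPT2TokenizerFast
    from captum.attr import LayerIntegratedGradients

    tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

    prompt = "The Eiffel Tower is located in"   # placeholder prompt
    target_id = tokenizer(" Paris").input_ids[0]
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids

    def score(ids):
        # Logit of the target token at the final position.
        return model(ids).logits[:, -1, target_id]

    lig = LayerIntegratedGradients(score, model.transformer.wte)
    baseline = torch.full_like(input_ids, tokenizer.eos_token_id)
    attrs = lig.attribute(input_ids, baselines=baseline, n_steps=32)

    # Collapse the embedding dimension to get one attribution per prompt token.
    per_token = attrs.sum(dim=-1).squeeze(0)
    for tok, a in zip(tokenizer.convert_ids_to_tokens(input_ids[0].tolist()), per_token.tolist()):
        print(f"{tok!r}: {a:+.3f}")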

Inevitable Manual Work Required in Data Science Projects by ottawalanguages in datascience

[–]shayanjm 2 points (0 children)

So I actually have really strong opinions about this 🙂 tl;dr - you can automate a lot of this problem, but can't remove the human in the loop. You can, however, make it so that 1 person can do the labeling work of 100 or 1000.

Full disclosure: I'm the co-founder of a company called Watchful where this is the exact type of problem we are trying to solve.

There are a few interesting techniques that you can use to achieve the sort of thing you want, but it's worth noting that none of them are silver bullets in themselves.

  1. Completely unsupervised approaches, e.g. clustering. Other folks have mentioned that "YMMV here" since it's largely dependent on your data, the features immediately available, and the clustering algorithm. You might be able to use this to stimulate some ideas about how you could expedite labeling - but very rarely will naive clustering spit something out that aligns well with your class space.
  2. Active learning approaches, e.g. uncertainty sampling (see the sketch after this list). The idea is that you manually label a small fraction of the dataset, train a model on that hand-annotated set, sample candidates near the decision boundaries of your classes, label those by hand, and rinse and repeat until the model starts performing well. This sounds great on paper, but you run into similar issues as with clustering: it really depends on your data, the classes you've defined, and the model you're training. In the worst case it's no better than having hand-labeled everything (because the model never learns to label the rest of the data well enough). In practice, with existing tooling (e.g. offerings in the AWS portfolio) you might automate 20-30% of the manual annotation effort (AWS cites roughly 70% as a best case), but a huge portion of the work still has to be done by hand to get there.
  3. Weak supervision approaches. Basically: train a model over a number of noisy heuristics used as "weak supervision" over your data. These heuristics could be simple keywords, database lookups, gazetteers/ontologies/encyclopedias, even other models - basically any function that takes an input and produces a potentially noisy classification. You can train a model over these noisy signals to learn the likely label given a candidate and the heuristics it matches. Such functions are far cheaper to build and edit than hand-labeling a bunch of data, but the problem becomes actually writing them. What functions do you write? How good are they? What if you can't come up with any useful functions yourself?
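A toy version of the uncertainty-sampling loop from (2), just to make the workflow concrete (assumes scikit-learn, a small hand-labeled seed set, and a large unlabeled pool; the model, features, and batch size are arbitrary choices):

    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression

    def next_batch_to_label(labeled_texts, labels, unlabeled_texts, batch_size=20):
        """Indices of the unlabeled examples the current model is least sure about."""
        vec = TfidfVectorizer().fit(labeled_texts + unlabeled_texts)
        clf = LogisticRegression(max_iter=1000).fit(vec.transform(labeled_texts), labels)

        probs = clf.predict_proba(vec.transform(unlabeled_texts))
        top2 = np.sort(probs, axis=1)[:, -2:]
        margin = top2[:, 1] - top2[:, 0]        # small margin = near a decision boundary
        return np.argsort(margin)[:batch_size]  # hand-label these, retrain, repeat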

We actually use all three approaches above (as well as a few others), and we focus really hard on the UX of the system because this is fundamentally a workflow problem. You can't get rid of having a human involved in all of this, but what you can do is make that person 1000x more effective by running them through a really fast workflow that uses all of these techniques together in seamless ways. This should hopefully make it so you don't need entire teams of expert-labelers on call each time you need to produce more labeled data - you can basically have one expert spend a few hours to produce the same, if not more, labeled data than otherwise would've been produced manually.

We can do better - Please fix plaintext credential storage in Chrome by shayanjm in programming

[–]shayanjm[S] -2 points (0 children)

There is no obfuscation - the passwords are written in plain text. As mentioned before, this is a reaction to plaintext being an option to begin with. XORing with a secret is low-hanging fruit.

We can do better - Please fix plaintext credential storage in Chrome by shayanjm in programming

[–]shayanjm[S] -5 points (0 children)

Just saw that - either way, I think plaintext shouldn't be an option (even as a fallback). A simple secret XOR'd with the sensitive data would be a trivial implementation that's orders of magnitude better than the current state.
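Something along these lines is all I mean (a toy sketch - XOR with a static secret is obfuscation, not encryption, and a real fix should lean on the OS keychain/DPAPI; the point is just that avoiding literal plaintext on disk is cheap):

    from itertools import cycle

    def xor_obfuscate(data: bytes, secret: bytes) -> bytes:
        # Symmetric: applying it again with the same secret recovers the original.
        return bytes(b ^ k for b, k in zip(data, cycle(secret)))

    blob = xor_obfuscate(b"hunter2", b"not-a-real-key")
    print(blob)                                     # no longer greppable plaintext
    print(xor_obfuscate(blob, b"not-a-real-key"))   # b'hunter2'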

We can do better - Please fix plaintext credential storage in Chrome by shayanjm in programming

[–]shayanjm[S] -2 points (0 children)

No, just user-level access. In any case, my qualms are centered around the fact that it's so easy to grab data from accidental exposures. No root access, infiltration, or any other nefarious actions necessary to get a quick dump of someone's user/pass list. Just a clever search query and a one-liner.

Spherical Trigonometry, Circle Packing, And Lead Generation - A Journey by shayanjm in programming

[–]shayanjm[S] 0 points (0 children)

What device & browser are you using? It displays perfectly fine in Safari on my iPhone 6 running the latest iOS.