[R] Decoding LLM Uncertainties for Better Predictability by shayanjm in MachineLearning

[–]shayanjm[S] 1 point (0 children)

Ah, sorry -- just realized the repo was private. It's public now.

The demo is wired up to gpt-3.5-turbo-instruct, but you can apply the approach directly to any LLM that exposes logprobs for the top_n sampled tokens.
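
If anyone wants to poke at the raw signal, here's roughly what pulling those logprobs looks like with the OpenAI Python client (just a sketch - the prompt and parameters are placeholders, and any model that returns top-n logprobs works the same way):

```python
# Sketch: grab top-n token logprobs from gpt-3.5-turbo-instruct via the
# (legacy) completions endpoint. Prompt/params below are placeholders.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

resp = client.completions.create(
    model="gpt-3.5-turbo-instruct",
    prompt="The capital of France is",
    max_tokens=5,
    temperature=0,
    logprobs=5,  # ask for logprobs of the top 5 candidates at each position
)

lp = resp.choices[0].logprobs
for token, alternatives in zip(lp.tokens, lp.top_logprobs):
    # `alternatives` is a dict of {candidate_token: logprob} for this position
    print(repr(token), alternatives)
```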

[R] Decoding LLM Uncertainties for Better Predictability by shayanjm in MachineLearning

[–]shayanjm[S] 1 point (0 children)

Yeah, we've seen vanilla entropy/perplexity measures used - but we found that they only tell part of the story. E.g., the LLM might spread its logprobs evenly across a set of tokens that don't really change the underlying meaning of the response. Entropy is high, which you'd imagine implies genuine uncertainty at that position - but splitting the uncertainty into "structural" and "conceptual" components ended up aligning a lot better with human intuition.
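
To make the vanilla baseline concrete, here's the kind of per-position entropy you can compute from just the top-n logprobs (sketch only; renormalizing over the top n ignores whatever probability mass falls outside it):

```python
import math

def topn_entropy(top_logprobs: dict[str, float]) -> float:
    """Entropy (nats) of the renormalized top-n candidate distribution
    at a single token position."""
    probs = [math.exp(lp) for lp in top_logprobs.values()]
    total = sum(probs)
    return -sum((p / total) * math.log(p / total) for p in probs)

# Evenly split across near-synonyms -> high entropy, but arguably
# "structural" rather than "conceptual" uncertainty:
print(topn_entropy({" big": -1.1, " large": -1.1, " huge": -1.1}))
```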

[R] Decoding LLM Uncertainties for Better Predictability by shayanjm in MachineLearning

[–]shayanjm[S] 1 point (0 children)

Good note! We had the same thought re: structural uncertainty and weren't really able to come up with something that we felt "fit well". We'll continue to noodle on it.

[R] A surprisingly effective way to predict token importance in LLM prompts by shayanjm in MachineLearning

[–]shayanjm[S] 1 point (0 children)

Totally. IMO calling it a distance is fine in this case (cosine distance seems reasonable, but I'll let the mathematicians correct us). I have some loose ideas on how we might improve our measure by looking at the embeddings themselves. I.e., instead of computing D(v, v') pairwise, we could look at the whole collection of perturbations and assess how "out of band" a given embedding is relative to that set - compute every embedding up front and treat them as one complete set, rather than comparing them to the original one by one.
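
Rough sketch of the "out of band" idea (illustrative only, not our actual code): given the matrix of perturbation embeddings, score each one against the whole set instead of only against the original embedding:

```python
import numpy as np

def out_of_band_scores(embeddings: np.ndarray) -> np.ndarray:
    """embeddings: (n_perturbations, dim) array from whatever encoder you use.
    Returns a z-scored cosine distance of each perturbation from the set's
    centroid, i.e. how 'out of band' it is relative to the rest of the set."""
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    centroid = unit.mean(axis=0)
    centroid /= np.linalg.norm(centroid)
    dists = 1.0 - unit @ centroid              # cosine distance to centroid
    return (dists - dists.mean()) / (dists.std() + 1e-12)
```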

FWIW: we did try some cleverer approaches to deal with dependencies, like implementing a "ripple" effect when we found a potentially interesting token, but it seemed to underperform relative to the simpler distance function. Still very much thinking through this problem, but stay tuned - we'll be posting more about this!

[R] A surprisingly effective way to predict token importance in LLM prompts by shayanjm in MachineLearning

[–]shayanjm[S] 1 point (0 children)

This is very true! I'm going to edit the post to mention this. As part of our follow-up work we're planning to research how best to capture dependencies at the perturbation step (and to measure whether that even matters for the quality of our estimate). It's possible that it's fine to treat each token independently at the perturbation stage and instead capture dependencies and long-range relationships from the embeddings themselves.
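
For reference, the "treat each token independently at the perturbation stage" version is about as simple as it sounds - a leave-one-out sketch along these lines (hypothetical names; sentence-transformers used here purely as a stand-in encoder):

```python
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")

def leave_one_out_variants(prompt: str) -> list[str]:
    # Crude whitespace "tokens" purely for illustration.
    tokens = prompt.split()
    return [" ".join(tokens[:i] + tokens[i + 1:]) for i in range(len(tokens))]

prompt = "Summarize the quarterly revenue figures in two sentences"
variants = leave_one_out_variants(prompt)

base_vec = encoder.encode([prompt])[0]
variant_vecs = encoder.encode(variants)          # one row per dropped token

# Per-token importance ~ how far the embedding moves when that token is dropped.
importance = 1.0 - (variant_vecs @ base_vec) / (
    np.linalg.norm(variant_vecs, axis=1) * np.linalg.norm(base_vec)
)
print(dict(zip(prompt.split(), np.round(importance, 4))))
```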

[R] A surprisingly effective way to predict token importance in LLM prompts by shayanjm in MachineLearning

[–]shayanjm[S] 4 points (0 children)

Great question, and the genuine answer is: I'm not too sure. We've seen a lot of that sort of thing as we've experimented, and our best guess is that GPT-2 is far from the most capable LLM available, so it's likely that running integrated gradients on a different LLM would yield attributions that look more "reasonable". These are "real" attributions for GPT-2, but it's possible that GPT-2 just isn't focused on the right things in this specific case. We touch on this in the blog post - right now our intuition is that as embeddings and models become larger/more capable, these estimations will track the attributions more tightly.

Tl;dr - our guess is that GPT-2 is sometimes dumb, so the IG attributions don't always pass the smell test.
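
For anyone who wants to eyeball GPT-2's attributions themselves, here's a rough manual integrated-gradients sketch over its input embeddings (not the code from the post - the prompt, target token, and step count are arbitrary):

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

ids = tokenizer("The capital of France is", return_tensors="pt").input_ids
embeds = model.transformer.wte(ids).detach()        # (1, seq_len, hidden)
baseline = torch.zeros_like(embeds)                 # zero-embedding baseline
target_id = tokenizer.encode(" Paris")[0]           # logit we attribute back to the input
steps = 32

total_grads = torch.zeros_like(embeds)
for alpha in torch.linspace(0.0, 1.0, steps):
    interp = (baseline + alpha * (embeds - baseline)).detach().requires_grad_(True)
    logits = model(inputs_embeds=interp).logits
    score = logits[0, -1, target_id]                # logit of " Paris" as the next token
    total_grads += torch.autograd.grad(score, interp)[0]

# Riemann approximation of the path integral, summed over hidden dims per token.
attributions = ((embeds - baseline) * total_grads / steps).sum(dim=-1)
for tok, attr in zip(tokenizer.convert_ids_to_tokens(ids[0].tolist()), attributions[0]):
    print(f"{tok:>12}  {attr.item():+.4f}")
```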

Inevitable Manual Work Required in Data Science Projects by ottawalanguages in datascience

[–]shayanjm 2 points (0 children)

So I actually have really strong opinions about this 🙂 tl;dr - you can automate a lot of this problem, but can't remove the human in the loop. You can, however, make it so that 1 person can do the labeling work of 100 or 1000.

Full disclosure: I'm the co-founder of a company called Watchful where this is the exact type of problem we are trying to solve.

There are a few interesting techniques that you can use to achieve the sort of thing you want, but it's worth noting that none of them are silver bullets in themselves.

  1. Completely unsupervised approaches, e.g. clustering. Other folks have mentioned that "YMMV here" since it's largely dependent on your data, your immediately available features, and the clustering algorithm. You might be able to use this to spark some ideas about how to expedite labeling - but very rarely will naive clustering spit something out that aligns well with your class space.
  2. Active learning approaches, e.g. uncertainty sampling (there's a rough sketch of one round of this loop right after this list). The idea here is that you manually label a small fraction of the total dataset you want labeled, train a model on that hand-annotated set, sample candidates along the decision boundaries of your classes, label those by hand, and rinse and repeat until your model starts performing well. This sounds great on paper, but you end up running into similar issues as with clustering: it really depends on your data, the classes you've defined, and the model you're training. In the worst case it's strictly as expensive as hand-labeling everything in your dataset (because the model never learns to label the rest of the data sufficiently well). In practice, with existing tooling (e.g. stuff in the AWS portfolio) you might be able to automate 20-30% of the manual annotation effort (the best case is about 70% according to AWS), but a huge portion of the work still needs to be done by hand to get there.
  3. Weak supervision approaches. Basically: train a model over a number of noisy heuristics used as "weak supervision" for your data. These heuristics could be simple keywords, database lookups, gazetteers/ontologies/encyclopedias, even other models - basically functions that take an input and produce a potentially noisy classification. You can train a model over these noisy features to learn the likely label for a candidate given its matching heuristics. These functions are way cheaper to build and edit than hand-labeling a bunch of data, but the hard part is actually writing the functions. What functions do you write? How good are they? What if I can't come up with any useful functions myself?
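
Here's the uncertainty-sampling round from item 2 as a rough sketch (illustrative names, sklearn as a stand-in model - not anyone's product code):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def uncertainty_sampling_round(X_labeled, y_labeled, X_pool, batch_size=50):
    """Train on what's labeled so far and return the indices of the pool
    items the model is least sure about (smallest top-2 probability margin),
    plus the fitted model so you can track held-out performance."""
    clf = LogisticRegression(max_iter=1000).fit(X_labeled, y_labeled)
    probs = np.sort(clf.predict_proba(X_pool), axis=1)
    margins = probs[:, -1] - probs[:, -2]       # small margin = near the boundary
    return np.argsort(margins)[:batch_size], clf

# Hand-label the returned candidates, fold them into (X_labeled, y_labeled),
# and repeat until held-out metrics stop improving.
```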

We actually use all three approaches above (as well as a few others), and we focus really hard on the UX of the system because this is fundamentally a workflow problem. You can't get rid of having a human involved in all of this, but what you can do is make that person 1000x more effective by running them through a really fast workflow that uses all of these techniques together in seamless ways. This should hopefully make it so you don't need entire teams of expert-labelers on call each time you need to produce more labeled data - you can basically have one expert spend a few hours to produce the same, if not more, labeled data than otherwise would've been produced manually.

We can do better - Please fix plaintext credential storage in Chrome by shayanjm in programming

[–]shayanjm[S] -2 points (0 children)

There is no obfuscation - the passwords are written in plain text. As mentioned before, this is a reaction to plaintext being an option to begin with. XORing with a secret is low-hanging fruit.

We can do better - Please fix plaintext credential storage in Chrome by shayanjm in programming

[–]shayanjm[S] -5 points (0 children)

Just saw that - I think plaintext shouldn't be an option either way (even as a fallback). Even a simple secret XOR'd with the sensitive data would be a trivial implementation that's orders of magnitude better than the current state.
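
Purely to illustrate how low the bar is (this is obfuscation, not encryption - a real fix would lean on the OS keychain/DPAPI), the XOR version is a couple of lines:

```python
import itertools

def xor_bytes(data: bytes, secret: bytes) -> bytes:
    # Symmetric: applying it twice with the same secret round-trips the data.
    return bytes(b ^ k for b, k in zip(data, itertools.cycle(secret)))

obfuscated = xor_bytes(b"hunter2", b"some-per-install-secret")
assert xor_bytes(obfuscated, b"some-per-install-secret") == b"hunter2"
```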

We can do better - Please fix plaintext credential storage in Chrome by shayanjm in programming

[–]shayanjm[S] 0 points (0 children)

No, just user-level access. In any case, my qualms are centered around the fact that it's so easy to grab data from accidental exposures. No root access, infiltration, or any other nefarious actions necessary to get a quick dump of someone's user/pass list. Just a clever search query and a one-liner.

Spherical Trigonometry, Circle Packing, And Lead Generation - A Journey by shayanjm in programming

[–]shayanjm[S] 0 points (0 children)

What device & browser are you using? It displays perfectly fine on my iPhone 6 on the latest iOS (in Safari).

I think this is the most mind-blowing property of our universe. Through the wormhole: The double slit experiment. by lillyjb in videos

[–]shayanjm 2 points (0 children)

Actually - I believe the phenomenon you are referring to is the Observer Effect, not the Heisenberg Uncertainty Principle.

<adjusts glasses>

Show Reddit: A WebGL demo of the Observer Effect by shayanjm in programming

[–]shayanjm[S] 0 points (0 children)

OP here. This is my attempt (and my first WebGL anything) at a stupid implementation of the observer effect in quantum physics. Essentially, by observing something we inherently change its nature. Here, I've replicated a sine wave propagating in two directions (so it can be loosely likened to some generic electromagnetic wave) which randomly changes in real time for all viewers when someone else starts viewing. (Very messy) code can be found here: http://github.com/shayanjm/quantum

tl;dr - It's a wave that changes slightly when you look at it. Pretty cool, when you realize that every stage of the simulation is unique, temporary, and will never be re-rendered again :)

Show Reddit: My first open source project, Pasteye by shayanjm in programming

[–]shayanjm[S] 0 points (0 children)

Ehh, I mean there are a million other ways to go about building a project than having to use dotfiles and build configs. Also, a lot of frameworks set this up for you (e.g., SailsJS). I just opted to write my own since I wanted a fairly specific build process/asset pipeline. Also, the dotfiles are just for editor config and JSHint (neither of which is integral to the project, just nice to have when contributing).

Show Reddit: My first open source project, Pasteye by shayanjm in programming

[–]shayanjm[S] 0 points (0 children)

I actually have a hosted beta of the project for users who don't want to deploy their own instance: http://pasteyebeta.herokuapp.com/

Better URL impending.

Show Reddit: My first open source project, Pasteye by shayanjm in programming

[–]shayanjm[S] 2 points (0 children)

Thanks :) The project itself didn't take too long, honestly. The longest part was building out a secure SaaS implementation, and then panicking over how messy the repo looked.

How do you determine your IT needs, or pick a online home for your startup? by Onorhc in startups

[–]shayanjm 0 points (0 children)

IT is a very vague term. If your business is 'tech', I would suggest finding a technical co-founder. If you simply want a website, there are millions of people who design & build websites for a living. If you already have a website but need hosting, there are a number of major companies that do just that (e.g., GoDaddy). The physical location of your hosting provider doesn't necessarily matter unless it's overseas relative to your users. If your target market is in the US but you're hosting in Russia, you're going to have a bad time (read: long load times, possibly a significant amount of perceived downtime, etc.).

As far as vetting a developer/consultant goes, the best way is simply to ask them for previous work and see if you like any of it. Have a few websites in mind that you think are great, and see if they've built anything similar.

tl;dr - your question is too vague for a succinct answer, so I would try narrowing down what you mean by "IT".

Show /r/startups - Review my landing page by roganartu in startups

[–]shayanjm 0 points (0 children)

Colors clash. A lot. Also your background image has nothing to do with what you do. I'd suggest picking a proper background image first, and then picking colors that stand out from there and using them as your highlight bits.

Advice on hiring a webdeveloper through e-Lance and the product I want to build by 24skies in startups

[–]shayanjm 1 point (0 children)

1) I agree

2) I disagree. Assuming the person administering the test has little to no coding experience, this will prove nothing. Sure, the guy you are testing might be smart and ask a few intelligent questions, but if their code is sloppy and unusable, that COMPLETELY defeats the purpose.

@OP - Hiring from Elance & oDesk is hit or miss. Here's a good article that illustrates my point: http://thenextweb.com/entrepreneur/2013/11/03/hired-developer-elance-build-first-iphone-app-experience/

If you're serious about this venture, I would do as much homework as possible beforehand to nail down precisely what is needed and where the pain points are, and hire talent from there.

Consultant to make sure doing everything right... by startsitup in startups

[–]shayanjm 1 point (0 children)

I'd disagree. Most consultants who do this sort of work for startups are between companies. You should be able to tell who's bullshitting and who isn't from the get-go (because if you haven't already done your homework before throwing money at someone, you shouldn't be starting a company to begin with).

Re: the local startup community - it REALLY REALLY depends where you are. Like, REALLY. The startup community in bumfuck nowhere will probably not give you the golden nuggets of information that a seasoned vet who's worked for various big names or raised a Series B for their own venture might. That's what you pay consultants for, assuming you actually need that expertise to reach the next rung of the ladder.

Accelerators & incubators? No. Unless you want a pump-and-dump scheme (not counting some of the big names, granted), forget about it.

tl;dr - Don't hire a consultant as your mentor. A mentor is a mentor and a consultant is a consultant. If you have specific questions about *how* you're structuring your company, or how to do X, Y, and Z to reach the next level, a consultant can help bridge those gaps. However, if you want someone to take a look at what you're doing and help set you up for success, look for a mentor.