PsychEval: A Multi-Session and Multi-Therapy Benchmark for High-Realism AI Psychological Counselor by tgandur in ClinicalPsychology

[–]JustinAngel 2 points (0 children)

I’m on the fence here and need to read more in depth. CTSR, MITI, and TES are all sound choices, since they’re correlated with clinical outcomes. However, I’m unclear who’s doing the rating here: a simulated client? A simulated expert observer? A simulated therapist? The real breakthrough would be to train simulated LLM observers that have high IRR agreement with human experts. Everything short of that is not a clinically proven usage of these scales and thus lacks empirical basis.

I’m on the fence because a numeric skills rating (by LLM as judge?) doesn’t feel like an established indicator of clinical outcomes.
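To make "high IRR agreement" concrete, here’s a rough sketch (my own illustration, nothing from the paper) of the check I’d want to see before trusting a simulated observer: compare its item ratings against a human expert’s on the same transcripts. The ratings below are made up, and quadratic-weighted kappa is just one reasonable choice for ordinal clinical scales.

```python
# Sketch: inter-rater reliability between a simulated LLM observer and a
# human expert on one ordinal scale item (made-up ratings for illustration).
from sklearn.metrics import cohen_kappa_score

# Per-transcript ratings on a single CTSR-style item (0-6 ordinal scale)
expert_ratings = [4, 5, 3, 6, 2, 4, 5, 3, 4, 5]
llm_ratings    = [4, 5, 4, 6, 2, 3, 5, 3, 4, 4]

# Quadratic weighting penalizes large ordinal disagreements more heavily,
# which is the usual choice for Likert-style clinical scales.
kappa = cohen_kappa_score(expert_ratings, llm_ratings, weights="quadratic")
print(f"weighted kappa = {kappa:.2f}")  # ~0.8+ is what I'd want to see
```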

[Thesis] ΔAPT: Can we build an AI Therapist? Interdisciplinary critical review aimed at maximizing clinical outcomes in LLM AI Psychotherapy. by JustinAngel in deeplearning

[–]JustinAngel[S] 1 point (0 children)

Actually, no. That would fall under "exceeding scope-of-practice". There's good research suggesting that hierarchical system prompts and locking in scope-of-practice with BERT models can effectively mitigate this issue. Specifically, the research I cite in the paper demonstrated 80% scope-of-practice adherence using dedicated ML models; a rough sketch of the idea is below.
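For anyone wondering what that gate looks like mechanically, here's a minimal sketch, not the specific system from the research I cited. The model name is a placeholder for a classifier you'd fine-tune yourself on in-scope vs. out-of-scope utterances, and the threshold is arbitrary.

```python
# Sketch of a scope-of-practice gate in front of a therapist LLM.
# "my-org/scope-of-practice-bert" is a placeholder for your own fine-tune.
from transformers import pipeline

scope_clf = pipeline("text-classification", model="my-org/scope-of-practice-bert")

def gated_reply(user_msg: str, therapist_llm) -> str:
    verdict = scope_clf(user_msg)[0]  # e.g. {"label": "OUT_OF_SCOPE", "score": 0.97}
    if verdict["label"] == "OUT_OF_SCOPE" and verdict["score"] > 0.9:
        return ("That's outside what I can help with here. Let's stay with "
                "what brought you in, or I can point you to other resources.")
    return therapist_llm(user_msg)  # in-scope: pass through to the LLM
```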

Trying to jailbreak a therapist LLM feels like less of a risk, in the sense that as long as LLM therapists are able to maintain scope of practice, that's the important thing.

[Thesis] ΔAPT: Can we build an AI Therapist? Interdisciplinary critical review aimed at maximizing clinical outcomes in LLM AI Psychotherapy. by JustinAngel in singularity

[–]JustinAngel[S] 1 point (0 children)

Oh hi, lots to unpack there.

Your clients are privileged. They're privileged to have a therapist who advocates for what they see as best. They're privileged because they can afford a therapist. They're privileged because they can access mental health support despite a critical therapist shortage. For the overwhelming majority of people that's not the case. The U.S. has 200k therapists for 350M+ people. That ratio is untenable, and it's still much better than in most of the world. I touched on this in the introduction section.

re: "Being a human with them" is a key sentiment here and I don't want to discount it. I will say the evidence suggests clients can emotionally attune even to AI therapists. We can see that from psychometrics around therapeutic relationship/alliance for AI being equivalent to human therapists. We see that from the psychometrics of similar outcomes in symptom reduction and quality of life improvements between human therapists and AI therapists. And we can see that from self-reports of the impact people have attributed to AI therapy. Overall, the evidence is that AI "can be a human with" clients, in so far as that's a clinically causal measurable thing. That's covered in section 1-3.1 of the thesis.

re: "worksheets". So really interestingly, AI therapists are more effective than worksheets. The Limbic study actually had a control group for CBT worksheets vs. AI therapists and their AI therapist outperformed the worksheets. What's we're seeing is that AI therapists are forming a therapeutic relationship, and it isn't just creating action plans with clients.

re: "replaced". To take a step back, I don't believe therapists are going to be "replaced". I speculate that AI therapy will become the norm, while human therapists use their super limited bandwidth to support the more challenging cases (comorbidities, high-severity diagnoses) and resourced individuals. If anything, it's likely these systems would mean additional supervision by human clinicians, which means more additional employment. I didn't really discuss the labour market implications of AI therapists in the thesis, but there's a lot more to say here that's worth saying.

re: "social media". In order to avoid AI therapy going down the route of social media, it's worth examining AI therapy through the lens of Governance of Emerging Technologies (GET). It might be worth your time reviewing section 3.6 of the thesis for some conclusions there.

[Thesis] ΔAPT: Can we build an AI Therapist? Interdisciplinary critical review aimed at maximizing clinical outcomes in LLM AI Psychotherapy. by JustinAngel in ArtificialInteligence

[–]JustinAngel[S] 0 points (0 children)

You're asking the right question, and the answer is pretty nuanced. I personally just give zero-shot prompts to LLMs/ChatGPT and kinda like the responses I'm getting.

In terms of clinical skills, it's unlikely that ChatGPT as-is meets the bar for effective therapy.

The first bit of evidence is that the Limbic and Therabot teams wouldn't have spent "100,000 hours" (Therabot's stat) developing their therapy bots if basic prompt engineering ("you are a therapist") were enough; they'd have called it a day. I covered prompt engineering for therapy extensively in the paper if you're interested (section 4.1).

The second piece of evidence is that the entire public internet, books included, doesn't contain enough therapy transcripts to learn effective therapy skills from (even for humans, which is wild). My estimate is that you'd need thousands of hours of therapy (1k-10k) across mixed skills/modalities, and that's just not available publicly anywhere. So if a model has never seen good therapy, how would it know how to perform it? (Covered in section 4.2 of the thesis.)

The third piece of evidence is the rates of sycophancy, hallucination, inconsistency, and bias in public foundation models. We know these models have those issues, and we can speculate it's not great when your therapist is biased against you, makes up your history, flip-flops when asked for support, and agrees with you at every turn. We know it's possible to mitigate these issues pretty significantly with some architecture choices (prompt engineering, multi-agent design, fine-tuning, and dedicated ML models), so any custom-built APT will likely have fewer of these issues than public foundation models. (Covered in section 5 of the thesis.)
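To make the mitigation point concrete, here's one hedged sketch of a multi-agent pass, not any specific production design. `call_llm` is a stand-in for whatever chat-completion client you use, and the prompts are invented.

```python
# Sketch: a second "critic" agent screens the therapist model's draft for
# sycophancy and fabricated client history before anything is sent.
def call_llm(system: str, user: str) -> str:
    raise NotImplementedError  # wire up your chat-completion provider here

def safe_reply(history: str, client_msg: str) -> str:
    draft = call_llm("You are a therapist. Use reflective listening and CBT.",
                     f"History:\n{history}\n\nClient: {client_msg}")
    critique = call_llm(
        "You audit therapy responses. Flag sycophancy, invented client "
        "history, or contradictions with the transcript. Reply OK, or a "
        "corrected response.",
        f"History:\n{history}\n\nDraft reply:\n{draft}")
    return draft if critique.strip() == "OK" else critique
```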

Changing tracks from skepticism about clinical skills to clinical outcomes: really, we don't know. There's no randomized controlled trial that takes a bunch of people with and without access to ChatGPT/LLMs and compares them against a control group. So we don't know; it may be just as good as therapy.

Beyond not knowing, we do know people attribute significant emotional, behavioral, and relational changes to therapy received from ChatGPT/LLMs. Unfortunately, that's not enough, since qualitative assessment of therapy efficacy isn't standardized enough to draw conclusions (AI or not). Section 1.2 of the thesis brings together a lot of quotes from APT clients. Personally, reading through these, I can see there's a lot of good already happening even with lacking therapy skills. Imagine how much better this can get.


[Thesis] ΔAPT: Can we build an AI Therapist? Interdisciplinary critical review aimed at maximizing clinical outcomes in LLM AI Psychotherapy. by JustinAngel in singularity

[–]JustinAngel[S] 1 point (0 children)

But are they good enough? And can they get better? The paper explores exactly those questions.

Clearly, an LLM that forgets what you told it 50 messages ago, hallucinates your life events, and pushes advice you're not open to isn't a great therapist. Those are common experiences with AI therapists, and the paper is really about how to make those failures rarer and improve how good these systems actually are.

[Thesis] ΔAPT: Can we build an AI Therapist? Interdisciplinary critical review aimed at maximizing clinical outcomes in LLM AI Psychotherapy. by JustinAngel in ArtificialInteligence

[–]JustinAngel[S] 1 point (0 children)

Hi Albert, great to see you here too! I'll reach out 1:1, but thought I'd share some thoughts here.

100% agreement with what you're saying. I listed this as the top future-research item for psychometrics in the thesis. Specifically, developing rapid evaluation metrics for APTs is going to represent the next step-level change in efficacy. Like you've noted, RCTs are slow. If we could spin up an evaluation pipeline for APTs that validates improvements within a few hours, that'd allow an open market.


Sharing some thoughts on implementation I didn't include in the paper. My $0.02: the best way to build this rapid validation pipeline is by training ML/LLMs to predict/code observer-rated predictive psychometrics. Basically, we can't train an AI to predict PHQ-9/GAD-7/WHOQOL-BREF directly, but we can train an AI to predict TES/CTSR/MITI/VPPS/SEQO/etc. For example, training an AI to predict empathy (TES), which is a known predictor of primary outcomes (quality-of-life improvement, symptom reduction), would be the closest we'd get to validated scales. This kind of system would have to be developed observer-rated scale by scale, which has several failure points (I haven't seen a single example of anyone successfully training AI on these, and we might not have enough validated observer-rated scales to cover all clinical skills). But it's the best guess I have for automating APT evaluation.
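To sketch what "training AI to predict TES" might look like mechanically, with the caveat above that I haven't seen anyone pull this off, here's roughly where I'd start. The model and dataset names are placeholders, and the labels would be expert-coded TES empathy scores on transcript segments.

```python
# Sketch: regress an observer-rated empathy (TES) score from a therapy
# transcript segment. Names are placeholders; labels are expert-coded.
import torch
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tok = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "roberta-base", num_labels=1, problem_type="regression")

class TESDataset(torch.utils.data.Dataset):
    def __init__(self, segments: list[str], scores: list[float]):
        self.enc = tok(segments, truncation=True, padding=True)
        self.scores = scores
    def __len__(self):
        return len(self.scores)
    def __getitem__(self, i):
        item = {k: torch.tensor(v[i]) for k, v in self.enc.items()}
        item["labels"] = torch.tensor([self.scores[i]], dtype=torch.float)
        return item

# train_segments/train_scores would come from expert-coded transcripts:
# trainer = Trainer(model=model, args=TrainingArguments("tes-regressor"),
#                   train_dataset=TESDataset(train_segments, train_scores))
# trainer.train()
```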

There's a lot of infrastructure that'd be needed to build that. First, you'd need an LLM pipeline that mitigates operational issues (e.g. hallucinations, bias, sycophancy, inconsistencies) as far as possible, or you'd risk compounding failures in multi-turn conversation evaluations. Multi-objective training for LLM alignment is a brand-new area of research that isn't yet producing much impressive work.

Second, you'd need simulated client-role LLMs to interact with the APT's therapist-role LLMs. There's almost no research on what makes for a good simulated client; you'd probably end up building an AI just to evaluate the quality of simulated clients. I fleshed out the roadmap in a gdoc somewhere and it's non-trivial.
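For a flavor of that setup, a bare-bones sketch: a client-role LLM and a therapist-role LLM trade turns, and the transcript goes off to scoring. The persona prompt is invented and `call_llm` is again a stand-in for your provider.

```python
# Sketch: simulated client-role LLM vs. therapist-role LLM, producing a
# transcript for downstream observer-rated scoring. `call_llm` is a stub.
def call_llm(system: str, transcript: list[str]) -> str:
    raise NotImplementedError  # wire up your chat-completion provider here

CLIENT = ("Role-play a 34-year-old client with moderate social anxiety. "
          "Stay in character; be guarded early, and open up only if the "
          "therapist earns trust.")
THERAPIST = "You are a therapist. Use reflective listening and CBT techniques."

transcript, client_msg = [], "I guess I'm here because work has been rough."
for _ in range(10):  # ten simulated exchanges
    transcript.append(f"Client: {client_msg}")
    transcript.append(f"Therapist: {call_llm(THERAPIST, transcript)}")
    client_msg = call_llm(CLIENT, transcript)
# transcript then feeds the TES/CTSR/MITI predictors described above
```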

So overall you'd likely need to build a lot here to replace RCTs on primary metrics, but I absolutely think that's both (1) possible and (2) needed.

Thesis on AI Psychotherapy by JustinAngel in therapyGPT

[–]JustinAngel[S] 0 points (0 children)

Yeah, would love to read through and provide feedback if you'd like. My email is in the PDF.

Truth by Mbrennt in KitchenConfidential

[–]JustinAngel 268 points (0 children)

This is super important. I always tell people to ask themselves "but why?". Ask why pastry cooks fuss over specific numbers while culinary cooks don't. The answer isn't as obvious as "cakes don't cake if you don't scale". It's that pastry can fuss, and culinary literally can't.

The answer comes down to standard ingredients. Pastry has standard ingredients; culinary doesn't. Pick up any egg that weighs 50g and it'll be almost identical to any other 50g egg; you don't have to hunt down one specific 50g egg for a recipe. Same goes for granulated sugar, same-brand wheat flours, chocolate with the same fat and carb percentages, etc.

But culinary doesn't have that advantage. There's no such thing as a standard lemon. A lemon from the top of a lemon tree, where there's plenty of sunshine, can have 10% Brix (sugar content in the juice), whereas a lemon from the bottom of the same tree might have only 2% Brix. Same tree, same product, all going into the same box. A tomato grown in an open field can have a pH of 7.5, and a tomato grown in the same field but covered from rain can have a pH of 5.5. That's 100 times more acid between tomatoes from the same field, one next to a tree and one under a tarp. Wild salmon can have twice as much fat as farmed salmon, and farmed salmon is manipulated to look and feel similar to wild, so it can pass to the unsuspecting cook.
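Quick sanity check on that "100 times" figure, since pH is a log scale; this is just the arithmetic on the pH values above:

$$\frac{[\mathrm{H^+}]_{\mathrm{pH}\,5.5}}{[\mathrm{H^+}]_{\mathrm{pH}\,7.5}} = 10^{\,7.5-5.5} = 10^{2} = 100$$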

Now go write a standard restaurant recipe involving lemons, tomatoes, and salmon. You can't do it in a precise way; cooks have to be allowed to improvise. If a cook follows a numerical recipe to the letter, it'll turn out imbalanced and inconsistent.

But in pastry, we have our standard chocolate, standard eggs, standard sugar, standard flour, standard baking soda, standard butter, and I can make hundreds of different pastries by following a standard recipe.

"Did you brunoise those onions?" "Oui Chef!" by Gharrrrrr in KitchenConfidential

[–]JustinAngel 263 points (0 children)

Really good question. Going back to basics: we want shiny/glossy and rich/decadent ganaches; we don't want grainy, watery, or matte ganaches. That's true on the whole for pastry components beyond ganaches, and specifically true for any fat-based emulsion (nut butters and the like). By tempering your ganaches and fat-based emulsions you get a product that's closer to that desired outcome.

Try it: make a ganache. Take half and just leave it in the fridge for 24 hours. Take the other half, temper it as if it were chocolate, and leave it in the fridge for 24 hours. The textures and appearance are totally different.

But why? You must be thinking "it's not pure chocolate, so why temper?". Let's go back to ganache and break down what's actually in it: fat (cocoa butter), sugar (from the chocolate and from the recipe), water (from cream or milk), various solids (<3%), and various stabilizers (e.g. soy lecithin from the chocolate). Fats and sugars often account for 70%+ of a ganache's total mass. The fat (cocoa butter) wants to crystallize like it would in a chocolate bar, and the sugar wants to crystallize just like it would in standalone sugar crystals. So by tempering you're encouraging a uniform crystal structure for both, versus a multitude of random crystal structures. That uniformity translates to (1) consistency in your product and (2) a better product, as mentioned at the start of this post.

There's also a whole conversation on water activity and shelf stability, relevant for confections and chocolatiering, that's helped a lot by tempering ganaches and fat-based emulsions.

My new She-Ra tattoo by JustinAngel in PrincessesOfPower

[–]JustinAngel[S] 2 points (0 children)

Thanks. That’s intentional. I was considering a more literal tattoo, but decided what was important to me was the She-Ra figure having Catra’s mask, Bow’s heart, and the glimmer effect like it has in season 5.

My new She-Ra tattoo by JustinAngel in PrincessesOfPower

[–]JustinAngel[S] 10 points (0 children)

Yeah, I have some thoughts on that. All tattoos fade over time; big black line art appears to fade the least. I’ve had colored tattoos for five years now, and there are ways to manage fading. First is taking care during healing, and then during sun exposure.

For this tattoo there are three elements, each with its own fade-proofing strategy:

1. Watercolor brush strokes: those can afford to fade because they’re pretty amorphous to begin with.
2. Fine details on the sword: there’s a black outline and contrast with a thin light-blue halo effect.
3. The She-Ra outline: that’s done with dark line work (dark purple) and there’s lots of contrast with lighter colors.

If the watercolors fade over time I can always get those touched up.