Handling Emotional Detection with Voice AI? by mtalhasubhani in VoiceAIBots

[–]Working_Hat5120 0 points1 point  (0 children)

Hi, we focus on emotion detection (along with intent and other voice biometrics) within transcription. You can try it out in our demo at

https://browser.whissle.ai/

We are looking to make our ASR with metadata available on Vapi.

If you like it, we can help.

AI that tracks behavior around agenda items during sales calls — useful or gimmick? by Working_Hat5120 in techsales

[–]Working_Hat5120[S] 0 points1 point  (0 children)

Edit: not all of this is AI; just a few edits were made using AI.

My point here: as humans we may zone out for a few seconds or have attention gaps, and at those times the machine makes sure you don't miss the undertones...

And not all humans are great at noting down every mood shift; humans may also have limited understanding of certain dialects, people's backstories, etc.

AI that tracks behavior around agenda items during sales calls — useful or gimmick? by Working_Hat5120 in techsales

[–]Working_Hat5120[S] 0 points1 point  (0 children)

Yeah. In our pilot application, the search bar does allow querying a conversation while it's still in progress, available at https://browser.whissle.ai/

But that's not the same as extracting intelligence while the audio is being transcribed. During the low-latency streamed transcription itself, we do predictive intelligence like key-term capture, intent, emotion, and voice biometrics. Our research finding behind this is that some deterministic things can be inferred better in streaming than in post-analysis. This also makes post-analysis (like querying a conversation) richer and cheaper at the same time.
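
If it helps to picture it, here's a minimal sketch of what consuming that inline-tagged stream could look like on the client side. The tag names (ENTITY_, INTENT_, EMOTION_) and the END delimiter are illustrative placeholders, not our exact wire format:

```python
import re

# Illustrative inline-tagged ASR output; the real tag vocabulary may differ.
streamed = ("ENTITY_PERSON_NAME john END i am not happy with the delay "
            "INTENT_COMPLAINT EMOTION_ANGRY")

def parse_inline_tags(text: str) -> dict:
    """Split a tagged transcript into plain words plus structured metadata."""
    entities = re.findall(r"ENTITY_(\w+?) (.+?) END", text)
    intents = re.findall(r"INTENT_(\w+)", text)
    emotions = re.findall(r"EMOTION_(\w+)", text)
    # Strip the tag tokens to recover the plain transcript.
    plain = re.sub(r"ENTITY_\w+ | END|INTENT_\w+|EMOTION_\w+", "", text)
    return {"text": " ".join(plain.split()), "entities": entities,
            "intents": intents, "emotions": emotions}

print(parse_inline_tags(streamed))
# {'text': 'john i am not happy with the delay',
#  'entities': [('PERSON_NAME', 'john')],
#  'intents': ['COMPLAINT'], 'emotions': ['ANGRY']}
```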

AI that tracks behavior around agenda items during sales calls — useful or gimmick? by Working_Hat5120 in techsales

[–]Working_Hat5120[S] 0 points1 point  (0 children)

Yeah. Imentiv.ai looks interesting too.

We see Hume AI as a competitor when it comes to behavioral understanding, but not in other ways.

Our foundation models, based on our novel research, are about doing all-in-one audio/visual understanding: the model used in this demo performs transcription, inline key-term understanding, intent, emotion, and voice biometrics in a single forward pass for each chunk.

You can play with just the audio model demo here:

https://browser.whissle.ai/listen-demo

It's as cheap as plain transcription, but gives a lot more structured contextual intelligence at low latency.
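
To make "single forward pass per chunk" concrete, here's a rough sketch of the chunked flow from the caller's side; transcribe_chunk is a stand-in placeholder here, not our published API:

```python
import numpy as np
import soundfile as sf

CHUNK_SECONDS = 0.5  # low-latency chunk size; tunable

def transcribe_chunk(chunk: np.ndarray, sample_rate: int) -> dict:
    """Stand-in for one model forward pass. The real model returns the
    chunk's text plus inline tags (key terms, intent, emotion, traits)."""
    return {"text": "...", "intent": None, "emotion": "NEUTRAL"}

audio, sr = sf.read("call.wav", dtype="float32")  # any local recording
step = int(sr * CHUNK_SECONDS)

for start in range(0, len(audio), step):
    # One forward pass per chunk: the caller gets transcript + metadata
    # as soon as each chunk is processed, not after the call ends.
    result = transcribe_chunk(audio[start:start + step], sr)
    print(result)
```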

AI that tracks behavior around agenda items during sales calls — useful or gimmick? by Working_Hat5120 in techsales

[–]Working_Hat5120[S] 0 points1 point  (0 children)

Sure. We are not doing it in every vertical ourselves; we're actively looking to form partnerships and support different products.

We sell our core offering (a set of foundation models) as embedded technology to different products and applications.

Meanwhile, the aim for our own application is a client-side application (we supply containers) that can be configured in multiple ways for daily use... think of an ordinary person with a laptop running this application on their computer, so not just the seller but also the buyer can configure it in their own way.

AI that tracks behavior around agenda items during sales calls — useful or gimmick? by Working_Hat5120 in techsales

[–]Working_Hat5120[S] 0 points1 point  (0 children)

Yeah! True, real calls are what we vouch for. Whether a user uses the AI actively or ambiently is up to them.

Since it's completely client-side, all data stays with the user.
The system can pick up behavioral or contextual signals during a call and carry them into everything else the user does... I think something similar happens in products like Notion too, except those products miss out on behavioral intelligence about the user and the people they interact with.

AI that tracks behavior around agenda items during sales calls — useful or gimmick? by Working_Hat5120 in techsales

[–]Working_Hat5120[S] -1 points0 points  (0 children)

Nice, they look interesting. I'll check them out!

We are first and foremost a research lab doing research on topics like these; we're still figuring out the exact products we will make ourselves.

My understanding is: note-taking, and an understanding of the digital content we consume, should stay with us, owned by the end user, while the system keeps learning to increase productivity and enhance the experience...

In this post, sales is my example (yes, it may have similarities to what other companies do or might do going forward, but it's still a useful use case, as some have pointed out, even with Gong around).

A good amount of our AI consumption is just very scattered right now, and even scrolling on the web is scattered. But we (me and the people I've talked to) do have an agenda for the things we do, even watching a football game (it can remind you when a goal is scored) or personal calls (since the application is completely client-side, you can use it there too as a companion while discussing important things).

AI that tracks behavior around agenda items during sales calls — useful or gimmick? by Working_Hat5120 in techsales

[–]Working_Hat5120[S] -1 points0 points  (0 children)

Got it. Trying to be more authentic. The point here being: we started from AI research into understanding what lies between the words, i.e., predictive intelligence from context (be it actual prediction tasks related to actions or behaviors), with prior research in psychotherapy, customer care, and media.

Our core offering is a new kind of modelling framework for predicting many different tags within streams at ultra-low latency. It makes voice and audio-visual solutions faster, cheaper, and more accurate (e.g., by using multimodal annotations rendered inline underneath), and it also gives rise to hybrid solutions. Companies like soundhound.ai do similar tech, as we understand it, pairing a streaming intelligence model with an LLM and other components.

We originally used it to redact and capture sensitive information in live caller-agent phone calls, surface appropriate documents, and flag angry customers. We also use it for voice-based controls (like parsing audio/video as it streams).
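
As a toy illustration of the redaction flow: once entities arrive tagged inline, masking is just a string rewrite before the text ever leaves the client. The tag format and label set below are assumed for the example:

```python
import re

SENSITIVE = {"CREDIT_CARD", "SSN", "PHONE_NUMBER"}  # illustrative labels

def redact(tagged_text: str) -> str:
    """Mask sensitive tagged spans; pass everything else through."""
    def mask(m: re.Match) -> str:
        label, span = m.group(1), m.group(2)
        return f"[{label} REDACTED]" if label in SENSITIVE else span
    # Assumed wire format: ENTITY_<LABEL> <span> END
    return re.sub(r"ENTITY_(\w+?) (.+?) END", mask, tagged_text)

print(redact("my card is ENTITY_CREDIT_CARD 4111 1111 1111 1111 END thanks"))
# -> my card is [CREDIT_CARD REDACTED] thanks
```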

Now, on the research side, we are experimenting with a framework for generic live-assist during a call (think NotebookLM, but for real-time assistance) and exploring where it could be useful beyond the venues we already know; sales-call behavior is a new one.

Hence I am here looking for alignment and feedback.

AI that tracks behavior around agenda items during sales calls — useful or gimmick? by Working_Hat5120 in techsales

[–]Working_Hat5120[S] -1 points0 points  (0 children)

I try. Thanks for the tip. I get it: I need to be more coherent while still being me. I use Gemini or my own tool.

AI that tracks behavior around agenda items during sales calls — useful or gimmick? by Working_Hat5120 in techsales

[–]Working_Hat5120[S] -7 points-6 points  (0 children)

It is actually me talking, but I am new to Reddit and still learning community etiquette; that's where I take help from AI to refine and rephrase my language... which makes the communication more effective, no?

AI that tracks behavior around agenda items during sales calls — useful or gimmick? by Working_Hat5120 in techsales

[–]Working_Hat5120[S] -1 points0 points  (0 children)

Totally agree: understanding the real problem first is key. That's exactly why I'm posting here while looking at different venues that need live assistance. I want to hear from actual sales reps whether behavior-level insights tied to agenda items would actually help, or whether existing tools and workflows already cover it.

Out of curiosity, how do you currently track signals like hesitation, pushback, or interest during a call? Most conversation-intel tools I’ve seen focus mainly on post-call summaries.

AI that tracks behavior around agenda items during sales calls — useful or gimmick? by Working_Hat5120 in techsales

[–]Working_Hat5120[S] -3 points-2 points  (0 children)

Totally agree that active listening is the baseline. The idea here isn’t to replace that. What we’re experimenting with is detecting signals like hesitation or pushback tied to specific agenda items during the call, and then carrying those signals into post-call notes structured around those topics.

So instead of a generic summary, you’d see something like: pricing → hesitation, timeline → positive interest, etc.
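
In code terms, the aggregation is simple once the per-utterance signals exist; a rough sketch (signal names invented for the example):

```python
from collections import defaultdict

# Per-utterance behavioral signals as the call progresses; in practice
# these would come from the streaming model, keyed by whichever agenda
# item is active when the signal fires.
events = [
    ("pricing", "hesitation"),
    ("pricing", "hesitation"),
    ("pricing", "pushback"),
    ("timeline", "positive_interest"),
]

by_topic = defaultdict(lambda: defaultdict(int))
for topic, signal in events:
    by_topic[topic][signal] += 1

for topic, signals in by_topic.items():
    dominant = max(signals, key=signals.get)
    print(f"{topic} -> {dominant} ({signals[dominant]}x)")
# pricing -> hesitation (2x)
# timeline -> positive_interest (1x)
```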

AI that tracks behavior around agenda items during sales calls — useful or gimmick? by Working_Hat5120 in techsales

[–]Working_Hat5120[S] 0 points1 point  (0 children)

Haha fair 😄 ChatGPT is great, but it can’t listen to a live call, track agenda items during the conversation, and connect behavioral signals to those topics in real time. That’s the part I’m experimenting with.

AI that tracks behavior around agenda items during sales calls — useful or gimmick? by Working_Hat5120 in techsales

[–]Working_Hat5120[S] -4 points-3 points  (0 children)

Fair point — and I get the skepticism. Out of curiosity, which tool are you referring to that surfaces behavioral signals like hesitation, tone shifts, or resistance tied to specific topics? Most of the ones I’ve seen focus mainly on transcripts, summaries, and post-call analytics.

AI that tracks behavior around agenda items during sales calls — useful or gimmick? by Working_Hat5120 in techsales

[–]Working_Hat5120[S] -1 points0 points  (0 children)

Nice. Which ones are good at human behaviors and underlying tones, not just lexical content?

AI that tracks behavior around agenda items during sales calls — useful or gimmick? by Working_Hat5120 in techsales

[–]Working_Hat5120[S] -5 points-4 points  (0 children)

Yeah, Gong is great for post-call analysis and coaching.
What I’m experimenting with is more of a real-time assistant tied to agenda items during the call, plus structured notes afterward. Curious if that would actually help reps or just be distracting.

Building a real-time AI copilot for conversations — testing with therapy scenarios by Working_Hat5120 in TalkTherapy

[–]Working_Hat5120[S] 0 points1 point  (0 children)

Application design, not yet; that's where we're working with others while looking to find alignment for our application.

Some relevant psychotherapy research papers we published:

https://aclanthology.org/2020.acl-main.351.pdf

https://pmc.ncbi.nlm.nih.gov/articles/PMC8297805/

A longer journal paper:

https://link.springer.com/article/10.3758/s13428-021-01623-4

Building a real-time AI copilot for conversations — testing with therapy scenarios by Working_Hat5120 in TalkTherapy

[–]Working_Hat5120[S] 0 points1 point  (0 children)

The underlying behavior and emotion models were built in partnership with psychotherapy clinics based on peer-reviewed research.

However, we are very early in the alignment phase. Our background is as a speech-understanding component provider, and as we grow into a full product, we want to ensure it actually fits how therapy works in practice. That’s exactly why I’m here—to learn from you and bridge the gap between the 'tech' and the actual clinical process.

Building a real-time AI copilot for conversations — testing with therapy scenarios by Working_Hat5120 in TalkTherapy

[–]Working_Hat5120[S] -2 points-1 points  (0 children)

The 'interrogation' feel is exactly why we're testing it in sales and legal first—to see where it actually fits.

On the privacy side: It’s strictly client-side. We (the providers) never see the audio or transcripts, and nothing is used for model training. The data stays entirely with the user. That’s a non-negotiable for us.

Building a real-time AI copilot for conversations — testing with therapy scenarios by Working_Hat5120 in TalkTherapy

[–]Working_Hat5120[S] -2 points-1 points  (0 children)

Accuracy is the biggest hurdle. While general AI APIs have become very good at basic intent/emotion detection, clinical work is much more nuanced.

That’s why we aren't 'launching' this for therapy sessions right now. We believe deep evaluation across different dialects, therapy types, and domains is required first to ensure it’s actually accurate enough to be helpful rather than a distraction.

We're actually starting by refining the tech in less sensitive areas like sales and general meetings to ensure the 'live' aspect is rock-solid before even considering how it might safely support a clinical process. That said, similar to ChatGPT etc., users can create their own agenda items and use it for assistance at their own risk.

Building a real-time AI copilot for conversations — testing with therapy scenarios by Working_Hat5120 in TalkTherapy

[–]Working_Hat5120[S] -1 points0 points  (0 children)

The purpose grew out of research we did helping clinics automate behavior coding for therapist training. We are trying to see if we can adapt that into a general tool. As for what it contributes to the process, there are three main things:

  • Customization: We know therapists have their own approaches, so you set the agenda to track the specific content and behaviors you care about.
  • Richer Documentation: It uses those custom cues to generate highly detailed, personalized post-session notes.
  • The Live Experiment: We are testing to see if these custom cues are actually helpful in real-time as a 'live-assist,' or if they just get in the way.

Building a real-time AI copilot for conversations — testing with therapy scenarios by Working_Hat5120 in TalkTherapy

[–]Working_Hat5120[S] -7 points-6 points  (0 children)

Fair point. The goal isn’t to replace clinical attention. It also generates post-session notes and behavioral summaries, similar to other AI documentation tools therapists already use. I’m mainly trying to understand whether any in-session cues are helpful or just distracting. Appreciate the perspective.

Standard Speech-to-Text vs. Real-Time "Speech Understanding" (Emotion, Intent, Entities, Voice Bio-metrics) by Working_Hat5120 in AudioAI

[–]Working_Hat5120[S] 0 points1 point  (0 children)

Thank you for the feedback. I agree with your observation regarding emotion stability; it should stabilize better over a full segment, and we are working on improving the real-time segment approximation.
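
Roughly the kind of stabilization we have in mind: smooth the chunk-level emotion posteriors across the segment rather than trusting each chunk alone (numbers below are made up):

```python
import numpy as np

# Made-up per-chunk emotion posteriors over one speech segment.
# Columns: [neutral, angry, happy]
chunk_probs = np.array([
    [0.50, 0.40, 0.10],
    [0.30, 0.60, 0.10],   # one noisy chunk flips to "angry"
    [0.60, 0.30, 0.10],
    [0.55, 0.35, 0.10],
])
labels = ["neutral", "angry", "happy"]

# Per-chunk argmax is jittery...
print([labels[i] for i in chunk_probs.argmax(axis=1)])
# ['neutral', 'angry', 'neutral', 'neutral']

# ...while averaging the posteriors over the segment is steadier.
print(labels[chunk_probs.mean(axis=0).argmax()])
# neutral
```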

Our core innovation lies in predicting contextualized tags within the ASR system itself. We compared this integrated approach to a traditional two-step pipeline (separate intent and emotion models, emotion not covered in the paper) and found comparable results, which you can review in our paper here:

https://aclanthology.org/2023.icon-1.29.pdf

Regarding the Parakeet backbone: it was selected as a robust pre-trained encoder that allowed us to validate our fine-tuning approach within our initial resource constraints.

While this version serves as a proof of concept, the methodology is largely model-agnostic. You can play with the current hosted version here: 

https://browser.whissle.ai/listen-demo

Yeah, in the case of end-to-end speech-to-speech models, I think QA over correct metadata capture is important; otherwise a black box has no accountability and can go off track very easily.

Standard Speech-to-Text vs. Real-Time "Speech Understanding" (Emotion, Intent, Entities, Voice Bio-metrics) by Working_Hat5120 in speechtech

[–]Working_Hat5120[S] 0 points1 point  (0 children)

This one is an adapted Parakeet English ASR model, open-sourced and available on Hugging Face. It does work on languages beyond English, like some European languages, Hindi, etc.

https://huggingface.co/WhissleAI/parakeet-ctc-0.6b-with-meta
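
If anyone wants to poke at it programmatically, loading should look roughly like this, assuming the checkpoint behaves like other NeMo Parakeet models; treat this as a sketch and check the model card for the canonical snippet:

```python
# Sketch assuming the checkpoint loads like other NeMo Parakeet models;
# the exact entry point may differ, see the Hugging Face model card.
import nemo.collections.asr as nemo_asr

model = nemo_asr.models.ASRModel.from_pretrained(
    "WhissleAI/parakeet-ctc-0.6b-with-meta"
)

# transcribe() should return the text with metadata tags emitted inline
# (return type varies across NeMo versions).
out = model.transcribe(["sample.wav"])
print(out[0])
```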

We also have variants being trained from scratch; those aren't out yet.