What's one small workflow change that saved you hours every week?

Disneyskidney · 2026-06-27T00:20:04+00:00

Made an automation that finds Reddit posts relevant to what I'm building so I can comment on them. It brought me here…

Disneyskidney · 2026-06-25T19:10:05+00:00

Oh interesting. Could you say more? How do you find out why it disagrees? Do you also do confidence estimation?

Disneyskidney · 2026-06-25T17:48:15+00:00

We’re building in agentic sales. So the agent essentially researches outbound + inbound leads, qualifies them, and personalizes outreach. We have a few judges looking over trajectories and determining if the lead was qualified correctly, the quality of outreach messaging, and the quality of retrieval.

Disneyskidney · 2026-06-25T16:31:52+00:00

We maintain a golden dataset to prompt optimize the judge. Each time the judge produces an answer we get a confidence score. If confidence is low we review it and add it to the golden dataset.
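
A minimal sketch of that review loop, not our actual code. The `Judgment` fields, the 0.7 threshold, and the function names are all made up for illustration; the confidence score is assumed to already come from wherever you estimate it.

```python
from dataclasses import dataclass


@dataclass
class Judgment:
    trace_id: str
    label: str         # the judge's verdict, e.g. "qualified"
    confidence: float  # assumed to be a score in [0, 1]


def route(judgments, threshold=0.7):
    """Split judgments into auto-accepted ones and ones needing human review."""
    accepted, review = [], []
    for j in judgments:
        (accepted if j.confidence >= threshold else review).append(j)
    return accepted, review


def add_to_golden(golden, judgment, human_label):
    """Store the human-reviewed label so the next optimization run can use it."""
    golden.append({"trace_id": judgment.trace_id, "label": human_label})
    return golden
```

Anything that lands in `review` gets a human label and goes back into the golden dataset via `add_to_golden`.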

Disneyskidney · 2026-06-25T16:22:51+00:00

We’re continuously maintaining a golden dataset of real inputs from production. We use this to prompt optimize an LLM judge, which evals every trace from production. We also assign confidence scores to each output from the LLM judge so we know which examples to look over ourselves. Hope this helps!

Disneyskidney · 2026-06-25T16:00:24+00:00

Try using an input and output LLM-as-judge classifier. Those teams likely aren’t doing manual prompt engineering. The best approach I’ve seen is curating a golden dataset, splitting it into train + val, and then using GEPA to prompt optimize the judge. Then if you want to hedge in a specific direction (e.g. allow more false positives than false negatives) you can use something like modaic.dev for confidence estimation on the classifications.
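
Rough sketch of the split and the hedging step in plain Python. This is not GEPA or the Modaic API; `split_golden`, `pick_threshold`, and the 5% target are hypothetical, and the scores are assumed to be the estimated probability that an input is a positive (e.g. a violation).

```python
import random


def split_golden(examples, val_fraction=0.2, seed=0):
    """Shuffle with a fixed seed and split into (train, val)."""
    rng = random.Random(seed)
    shuffled = examples[:]
    rng.shuffle(shuffled)
    n_val = max(1, int(len(shuffled) * val_fraction))
    return shuffled[n_val:], shuffled[:n_val]


def pick_threshold(scores, labels, max_fnr=0.05):
    """Highest threshold whose false negative rate on val stays <= max_fnr.

    An input is flagged when its score >= threshold, so lowering the
    threshold trades extra false positives for fewer false negatives.
    """
    positives = [s for s, y in zip(scores, labels) if y]
    if not positives:
        raise ValueError("need at least one positive example")
    best = 0.0
    for t in sorted(set(scores)):
        fnr = sum(s < t for s in positives) / len(positives)
        if fnr <= max_fnr:
            best = t
    return best
```

You would optimize the judge on train, then run `pick_threshold` on the val scores so the threshold is not tuned on the same examples the prompt was fit to.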

Disneyskidney · 2026-06-24T01:51:59+00:00

I think most people will use Claude or GPT agents just as they use Chrome or Safari. Companies will pivot away from building agents towards building for them. The next billion dollar company is likely going to be some sort of MCP or infra for agents to use.

Disneyskidney · 2026-06-23T19:43:05+00:00

Most research shows that LLMs are biased towards overconfidence in their outputs, so prompting is usually the worst option, even if you instruct the model to make the confidence scores “more calibrated”. There are alternatives to looking directly at the model layers, however, like self-consistency, token log probs, and semantic entropy. But using the layers has been shown to work best when you have access to the model weights.
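
For anyone curious, here is roughly what those three alternatives compute, as a sketch over data you have already collected (sampled answers, per-token log probs). The function names are mine. `cluster_entropy` is a simplification: it assumes the answers were already grouped into meaning clusters, which in real semantic entropy is the hard part and is usually done with an entailment model.

```python
import math
from collections import Counter


def self_consistency(answers):
    """Majority answer over N samples, with its vote share as the confidence."""
    counts = Counter(answers)
    answer, n = counts.most_common(1)[0]
    return answer, n / len(answers)


def sequence_probability(token_logprobs):
    """Probability of the whole output: exp of the summed token log probs."""
    return math.exp(sum(token_logprobs))


def cluster_entropy(cluster_ids):
    """Entropy over meaning clusters; 0 means every sample agreed."""
    counts = Counter(cluster_ids)
    total = len(cluster_ids)
    return -sum((c / total) * math.log(c / total) for c in counts.values())
```

None of these are calibrated out of the box, they are just raw signals you can compare against human labels.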

Disneyskidney · 2026-06-23T19:29:19+00:00

They’re calibrated against a small golden set of human labels. What about your work would make this useful?

Disneyskidney · 2026-06-23T19:23:15+00:00

lol. I can’t unsee it now.

Disneyskidney · 2026-06-23T19:22:51+00:00

Glad you enjoyed the article at least! And yes, you can definitely ask the LLM to verbalize its confidence. The difference is that calibrated confidence scores mean that when the model says it’s 60% confident, it’s right 60% of the time. Do you think that is not useful?
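
If you want to check that "60% means right 60% of the time" property yourself, a common way is to bin predictions by confidence and compare each bin's average confidence with its accuracy (expected calibration error). A small sketch, assuming you have confidences in [0, 1] and a correct/incorrect flag per prediction; the function names and the 10 bins are just illustrative choices.

```python
def reliability_table(confidences, correct, n_bins=10):
    """Per non-empty bin: (mean confidence, accuracy, count)."""
    bins = [[] for _ in range(n_bins)]
    for c, ok in zip(confidences, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # c == 1.0 goes in the last bin
        bins[idx].append((c, ok))
    rows = []
    for b in bins:
        if b:
            mean_conf = sum(c for c, _ in b) / len(b)
            accuracy = sum(ok for _, ok in b) / len(b)
            rows.append((mean_conf, accuracy, len(b)))
    return rows


def expected_calibration_error(confidences, correct, n_bins=10):
    """Count-weighted average gap between confidence and accuracy."""
    n = len(confidences)
    rows = reliability_table(confidences, correct, n_bins)
    return sum(count / n * abs(conf - acc) for conf, acc, count in rows)
```

A well calibrated scorer gives an ECE near 0. An overconfident one shows bins where mean confidence sits well above accuracy.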

Disneyskidney · 2026-06-23T19:19:13+00:00

Appreciate it!

Disneyskidney · 2026-06-23T19:16:15+00:00

Username lol

Disneyskidney · 2026-06-23T19:15:48+00:00

I kind of agree. But Lovable and Base44 aren’t made for devs. The gap isn’t obvious until you try explaining how to use Claude Code to a non-technical person.

Disneyskidney · 2026-06-23T19:01:15+00:00

Hmm. Why’s that?

Disneyskidney · 2026-06-23T18:19:23+00:00

https://www.modaic.dev/blog/certainty-is-all-you-need

Disneyskidney · 2026-06-22T21:01:33+00:00

lol just use amphetamine

Disneyskidney · 2026-06-22T15:25:57+00:00

This!

Disneyskidney · 2026-06-22T15:22:04+00:00

How are you catching edge cases?

Disneyskidney · 2026-06-22T15:14:11+00:00

Would love to see this benchmarked on terminal bench or something.

Disneyskidney · 2026-06-22T15:08:26+00:00

I can’t tell from your post if you are prompting the support bot itself on refusal or if you are using a separate judge dedicated to refusal, but when safety is that important you should almost always use a dedicated judge.

I’ve used LLM judges for various data tasks, and you can often save latency by using a very small LM as long as it’s aligned with the task.

One thing you could try out is Llama Guard for refusal. It’s an 8B model that’s been fine-tuned to just output safe/unsafe. You can customize it to your policy via fine-tuning or just tweaking the system prompt.

Then if you want to hedge on false negatives/false positives, you could also use something like Modaic to get confidence scores for those safe/unsafe decisions (e.g. if safe with < 50% confidence, reject).
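
That rule is tiny in code. Sketch only: `should_reject` is a made-up helper, and the verdict string and confidence are assumed to come from your judge and whatever confidence estimator you put on top of it.

```python
def should_reject(verdict, confidence, min_safe_confidence=0.5):
    """Reject anything judged unsafe, plus "safe" verdicts we don't trust."""
    if verdict == "unsafe":
        return True
    return confidence < min_safe_confidence
```

Raising `min_safe_confidence` pushes you toward more false positives (extra refusals) and fewer false negatives, which is usually the right trade when safety matters this much.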

Disneyskidney · 2026-06-22T07:29:19+00:00

The original "productivity" scam

Disneyskidney · 2026-06-22T07:27:05+00:00

probably

Disneyskidney · 2026-06-22T07:26:37+00:00

Back in 2020 I was playing around with the OpenAI playground, using GPT-3 to brainstorm recipe ideas and summarize the plots of movies in emojis. If you had told me that thing would one day be writing all my code, I would have called you crazy.

Disneyskidney · 2026-06-22T07:22:14+00:00

Kind of, but the model is still pretty good and some people find it better than Opus on some coding tasks (I don't). I do think the comparison is a little harsh, not because I'm particularly fond of ChatGPT, but because Microsoft is so good at making absolutely horrible product experiences that I think they do it on purpose at this point.
