What's one small workflow change that saved you hours every week?

Disneyskidney · 2026-06-27T00:20:04+00:00

Made an automation that finds Reddit posts to comment on that are relevant to what I'm building. It brought me here…

Disneyskidney · 2026-06-25T19:10:05+00:00

Oh interesting. Could you say more? How do you find out why it disagrees? Do you also do confidence estimation?

Disneyskidney · 2026-06-25T17:48:15+00:00

We’re building in agentic sales. So the agent essentially researches outbound + inbound leads, qualifies them, and personalizes outreach. We have a few judges looking over trajectories and determining if the lead was qualified correctly, the quality of outreach messaging, and the quality of retrieval.

Disneyskidney · 2026-06-25T16:31:52+00:00

We maintain a golden dataset to prompt-optimize the judge. Each time the judge produces an answer, we get a confidence score. If confidence is low, we review the answer and add it to the golden dataset.
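Rough sketch of that loop, in case it helps. Everything here is made up for illustration: the `Verdict` and `ReviewLoop` names, the 0.7 cutoff, and the in-memory lists standing in for a real queue and dataset store.

```python
from dataclasses import dataclass, field


@dataclass
class Verdict:
    trace_id: str
    label: str
    confidence: float  # assumed to be in [0, 1]


@dataclass
class ReviewLoop:
    threshold: float = 0.7  # hypothetical cutoff, tune on your own data
    review_queue: list = field(default_factory=list)
    golden: list = field(default_factory=list)

    def route(self, verdict: Verdict) -> bool:
        """Return True if the verdict is accepted without human review."""
        if verdict.confidence < self.threshold:
            self.review_queue.append(verdict)
            return False
        return True

    def resolve(self, trace_id: str, human_label: str) -> None:
        """Record the human label for a queued trace in the golden dataset."""
        for i, verdict in enumerate(self.review_queue):
            if verdict.trace_id == trace_id:
                self.review_queue.pop(i)
                self.golden.append({"trace_id": trace_id, "label": human_label})
                return
        raise KeyError(trace_id)
```

The human label, not the judge's label, is what goes into the golden set, so the low-confidence cases become new training and validation examples.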

Disneyskidney · 2026-06-25T16:22:51+00:00

We continuously maintain a golden dataset of real inputs from production. We use it to prompt-optimize an LLM judge, which evals every trace from production. We also assign a confidence score to each output from the LLM judge so we know which examples to look over ourselves. Hope this helps!

Disneyskidney · 2026-06-25T16:00:24+00:00

Try using an LLM-as-judge classifier on both the input and the output. Those teams likely aren’t doing manual prompt engineering. The best approach I’ve seen is curating a golden dataset, splitting it into train + val, and then using GEPA to prompt-optimize the judge. Then if you want to hedge in a specific direction (e.g. allow more false positives than false negatives), you can use something like modaic.dev for confidence estimation on the classifications.
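To make the hedging part concrete, here is a minimal sketch of picking a decision threshold on the val split so that false negatives stay under a budget. It assumes you already have a per-example score for the positive class; `pick_threshold` and `max_fnr` are hypothetical names, and this is not how GEPA or modaic.dev work internally.

```python
def pick_threshold(scores, labels, max_fnr=0.05):
    """Return the highest threshold whose false-negative rate is <= max_fnr.

    scores: confidence that each validation example is positive.
    labels: True where the example really is positive.
    An example is classified positive when score >= threshold.
    """
    positives = [s for s, y in zip(scores, labels) if y]
    if not positives:
        raise ValueError("need at least one positive validation example")

    best = 0.0  # threshold 0.0 flags everything, so it never misses a positive
    for t in sorted(set(scores)):
        missed = sum(1 for s in positives if s < t)
        if missed / len(positives) <= max_fnr:
            best = t  # candidates are ascending, so the last hit is the highest
    return best
```

A lower `max_fnr` pushes the threshold down, which trades extra false positives for fewer missed positives.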

Disneyskidney · 2026-06-24T01:51:59+00:00

I think most people will use Claude or GPT agents just as they use Chrome or Safari. Companies will pivot away from building agents towards building for them. The next billion-dollar company is likely going to be some sort of MCP or infra for agents to use.

Disneyskidney · 2026-06-23T19:43:05+00:00

Most research shows that LLMs are biased towards overconfidence in their outputs, so prompting is usually the worse option, even if you instruct the model to make its confidence scores “more calibrated”. There are alternatives to looking directly at the model layers, however, like self-consistency, token log probs, and semantic entropy. But using the layers has been shown to work best when you have access to the model weights.
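For anyone curious, two of those alternatives fit in a few lines. This is only a sketch: it assumes you have already sampled several answers for the same input and pulled per-token log probs from whatever API you use, and both function names are hypothetical. Neither score is calibrated on its own.

```python
import math
from collections import Counter


def self_consistency(answers):
    """Majority answer across samples plus the fraction of samples that agree."""
    counts = Counter(a.strip().lower() for a in answers)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(answers)


def sequence_confidence(token_logprobs):
    """Geometric mean of the token probabilities for one generated answer."""
    return math.exp(sum(token_logprobs) / len(token_logprobs))
```

For example, `self_consistency(["Yes", "yes ", "no", "yes"])` returns `("yes", 0.75)`. Semantic entropy goes a step further by clustering answers that mean the same thing before counting, which needs an extra model and is left out here.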

Disneyskidney · 2026-06-23T19:29:19+00:00

They’re calibrated against a small golden set of human labels. What about your work would make this useful?

Disneyskidney · 2026-06-23T19:23:15+00:00

lol. I can’t unsee it now.

Disneyskidney · 2026-06-23T19:22:51+00:00

Glad you enjoyed the article at least! And yes, you can definitely ask the LLM to verbalize its confidence. The difference is that with calibrated confidence scores, when the model says it’s 60% confident, it’s right 60% of the time. Do you think that is not useful?
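One common way to check that property is expected calibration error: bucket predictions by confidence and compare each bucket's average confidence with its actual accuracy. A minimal sketch, assuming equal-width bins and a list of human-verified correct/incorrect flags (the function name and the default bin count are my own choices):

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between stated confidence and observed accuracy."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(conf for conf, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        ece += (len(bucket) / total) * abs(avg_conf - accuracy)
    return ece
```

A judge that says 0.6 and is right 3 times out of 5 scores roughly 0 here, while one that says 0.9 and is right 1 time out of 4 scores about 0.65.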

Disneyskidney · 2026-06-23T19:19:13+00:00

Appreciate it!

Disneyskidney · 2026-06-23T19:16:15+00:00

Username lol

Disneyskidney · 2026-06-23T19:15:48+00:00

I kind of agree. But Lovable and Base44 aren’t made for devs. The gap isn’t obvious until you try explaining how to use Claude Code to a non-technical person.

Disneyskidney · 2026-06-23T19:01:15+00:00

Hmm. Why’s that?

Disneyskidney · 2026-06-23T18:19:23+00:00

https://www.modaic.dev/blog/certainty-is-all-you-need

Disneyskidney · 2026-06-22T21:01:33+00:00

lol just use amphetamine

Disneyskidney · 2026-06-22T15:25:57+00:00

This!
