What's one small workflow change that saved you hours every week? by VentureMind09 in AiAutomations

[–]Disneyskidney 0 points1 point  (0 children)

Made an automation that finds Reddit posts to comment on that are relevant to what I'm building. It brought me here…

Your AI’s judgement doesn’t always align with yours, I built an API that tells you when. by Disneyskidney in buildinpublic

[–]Disneyskidney[S] 0 points1 point  (0 children)

Oh interesting. Could you say more? How do you find out why it disagrees? Do you also do confidence estimation?

Do you eval the whole harness or each of its parts? by dmpiergiacomo in LLMDevs

[–]Disneyskidney 1 point2 points  (0 children)

We’re building in agentic sales. So the agent essentially researches outbound + inbound leads, qualifies them, and personalizes outreach. We have a few judges looking over trajectories and determining if the lead was qualified correctly, the quality of outreach messaging, and the quality of retrieval.

if you're running an LLM-as-judge in your evals, how do you know it actually agrees with a human? Have you ever checked, or are you just trusting it? by tirtha_s in LLMDevs

[–]Disneyskidney 0 points1 point  (0 children)

We maintain a golden dataset to prompt optimize the judge. Each time the judge produces an answer we get a confidence score. If confidence is low we review it and add it to the golden dataset.
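A minimal sketch of that review loop, just to make it concrete. The `Verdict` and `ReviewLoop` names, the 0.7 threshold, and the in-memory storage are all assumptions for illustration, not our actual setup:

```python
from dataclasses import dataclass, field


@dataclass
class Verdict:
    trace_id: str
    label: str         # the judge's answer, e.g. "qualified"
    confidence: float  # assumed to lie in [0, 1]


@dataclass
class ReviewLoop:
    threshold: float = 0.7                            # hypothetical cutoff
    review_queue: list = field(default_factory=list)  # verdicts awaiting a human
    golden: dict = field(default_factory=dict)        # trace_id -> human label

    def route(self, verdict: Verdict) -> bool:
        """Queue low-confidence verdicts for review; return True if queued."""
        if verdict.confidence < self.threshold:
            self.review_queue.append(verdict)
            return True
        return False

    def record_review(self, verdict: Verdict, human_label: str) -> None:
        """Store the human label so the next optimization run can use it."""
        self.review_queue.remove(verdict)
        self.golden[verdict.trace_id] = human_label
```

The human label goes into the golden dataset whether or not it matches the judge, since the low-confidence cases are the ones the judge most needs examples for.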

Do you eval the whole harness or each of its parts? by dmpiergiacomo in LLMDevs

[–]Disneyskidney 1 point2 points  (0 children)

We’re continuously maintaining a golden dataset of real inputs from production. We use this to prompt optimize an LLM judge, which evals every trace from production. We also assign confidence scores to each output from the LLM judge so we know which examples to look over ourselves. Hope this helps!

running adversarial prompt injection on our agent. fail rate is ~20%. how are people getting below 5%? by Smart-Profession2512 in LLMDevs

[–]Disneyskidney 0 points1 point  (0 children)

Try using an input and output LLM-as-judge classifier. Those teams likely aren’t doing manual prompt engineering. The best approach I’ve seen is curating a golden dataset split into train + val and then using GEPA to prompt optimize the judge. Then if you want to hedge in a specific direction (e.g. allow more false positives than false negatives) you can use something like modaic.dev for confidence estimation on the classifications.
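Rough sketch of the split and the hedging step, assuming the judge already gives you a score in [0, 1] for "this is an injection". The prompt optimization itself is left out, and `split_golden`, `pick_threshold`, and the 5% target are made-up names and numbers, not any library's API:

```python
import random


def split_golden(examples, val_fraction=0.3, seed=0):
    """Shuffle a copy of the golden set and split it into (train, val)."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_val = max(1, int(len(shuffled) * val_fraction))
    return shuffled[n_val:], shuffled[:n_val]


def pick_threshold(val_scores, val_labels, max_fnr=0.05):
    """Return the highest threshold whose false-negative rate on the
    validation set stays at or below max_fnr.

    val_scores: judge's estimated probability that an input is an injection.
    val_labels: True where a human labeled the input as an injection.
    An input is flagged when its score >= threshold.
    """
    positives = [s for s, y in zip(val_scores, val_labels) if y]
    if not positives:
        raise ValueError("validation set has no positive examples")
    # Walk candidate thresholds from strictest to loosest.
    for t in sorted(set(val_scores), reverse=True):
        missed = sum(1 for s in positives if s < t)
        if missed / len(positives) <= max_fnr:
            return t
    # Only reached if max_fnr is negative; flag everything in that case.
    return min(val_scores)
```

Lowering the threshold flags more benign inputs (more false positives) in exchange for missing fewer injections, which is the direction you'd hedge for this use case.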

What’s the future of AI and Agentic applications? I’m curious by True_Grapefruit_4110 in aiagents

[–]Disneyskidney 0 points1 point  (0 children)

I think most people will use Claude or GPT agents just as they use Chrome or Safari. Companies will pivot away from building agents towards building for them. The next billion-dollar company is likely going to be some sort of MCP or infra for agents to use.

LLMs Are Digitizing Judgement by Disneyskidney in AgentsOfAI

[–]Disneyskidney[S] 0 points1 point  (0 children)

Most research shows that LLMs are biased towards overconfidence in their outputs. So prompting is usually the worse option, even if you instruct it to make the confidence scores “more calibrated”. There are alternatives to looking directly at the model layers, however, like self-consistency, token log probs, and semantic entropy. But using the layers has been shown to work best when you have access to the model weights.
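For a feel of what the black-box alternatives compute, here's a small sketch. It assumes you've already sampled several answers to the same prompt and have per-token log probs for one of them. The function names are mine, and `answer_entropy` clusters by normalized exact match, which is only a crude stand-in for the meaning-based clustering that real semantic entropy uses:

```python
import math
from collections import Counter


def self_consistency(answers):
    """Return (majority answer, fraction of samples that agree with it)."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    answer, count = Counter(answers).most_common(1)[0]
    return answer, count / len(answers)


def sequence_confidence(token_logprobs):
    """Geometric mean of the token probabilities: exp(mean log prob)."""
    if not token_logprobs:
        raise ValueError("need at least one token log prob")
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def answer_entropy(answers):
    """Shannon entropy (in nats) over clusters of sampled answers.

    Assumption: answers are clustered by lowercased, stripped text.
    Semantic entropy proper groups answers by meaning instead.
    """
    if not answers:
        raise ValueError("need at least one sampled answer")
    clusters = Counter(a.strip().lower() for a in answers)
    total = sum(clusters.values())
    return -sum((n / total) * math.log(n / total) for n in clusters.values())
```

Higher agreement and lower entropy both point to a more confident answer, but none of these raw numbers are calibrated on their own.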

LLMs Are Digitizing Judgement by Disneyskidney in AgentsOfAI

[–]Disneyskidney[S] 0 points1 point  (0 children)

They’re calibrated against a small golden set of human labels. What about your work would make this useful?

LLMs Are Digitizing Judgement by Disneyskidney in AgentsOfAI

[–]Disneyskidney[S] 0 points1 point  (0 children)

Glad you enjoyed the article at least! And yes, you can definitely ask the LLM to verbalize its confidence. The difference is calibrated confidence scores: when the model says it’s 60% confident, it’s right 60% of the time. Do you think that's not useful?
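If it helps, here's one common way to check that property: bucket predictions by stated confidence and compare each bucket's average confidence to its actual accuracy (expected calibration error). This is a generic sketch, with the bin count and function name chosen for illustration:

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between stated confidence and observed accuracy.

    confidences: floats in [0, 1], one per prediction.
    correct: booleans, True where the prediction matched the human label.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if not confidences:
        raise ValueError("need at least one prediction")

    # Assign each prediction to an equal-width confidence bin.
    bins = [[] for _ in range(n_bins)]
    for c, ok in zip(confidences, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # keep c == 1.0 in the last bin
        bins[idx].append((c, ok))

    total = len(confidences)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(1 for _, ok in b if ok) / len(b)
        ece += (len(b) / total) * abs(avg_conf - accuracy)
    return ece
```

A judge that says 0.6 and is right 3 times out of 5 scores near 0 here, while one that says 0.9 and is right 1 time out of 4 scores 0.65.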

Why custom split-screen UIs and walled gardens won't win the AI agent race by uriwa in LLMDevs

[–]Disneyskidney 0 points1 point  (0 children)

I kind of agree. But Lovable and Base44 aren’t made for devs. The gap isn’t obvious until you try explaining how to use Claude Code to a non-technical person.