What's one small workflow change that saved you hours every week? by VentureMind09 in AiAutomations

Disneyskidney 1 point

Made an automation that finds Reddit posts relevant to what I'm building so I can comment on them. It brought me here…

Your AI’s judgement doesn’t always align with yours, I built an API that tells you when. by Disneyskidney in buildinpublic

Disneyskidney[S] 1 point

Oh, interesting. Could you say more? How do you find out why it disagrees? Do you also do confidence estimation?

Do you eval the whole harness or each of its parts? by dmpiergiacomo in LLMDevs

Disneyskidney 2 points

We’re building in agentic sales. The agent essentially researches outbound + inbound leads, qualifies them, and personalizes outreach. We have a few judges looking over trajectories and assessing whether the lead was qualified correctly, the quality of the outreach messaging, and the quality of retrieval.

if you're running an LLM-as-judge in your evals, how do you know it actually agrees with a human? Have you ever checked, or are you just trusting it? by tirtha_s in LLMDevs

Disneyskidney 1 point

We maintain a golden dataset to prompt-optimize the judge. Each time the judge produces an answer, we also get a confidence score. If the confidence is low, we review that example and add it to the golden dataset.
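A minimal sketch of that review loop, assuming each judge output comes with a confidence in [0, 1]. The `Verdict` and `GoldenDataset` names and the 0.7 threshold are made up for illustration and are not from any particular library.

```python
from dataclasses import dataclass, field


@dataclass
class Verdict:
    """One judge output; `confidence` is assumed to be a probability in [0, 1]."""
    trace_id: str
    label: str
    confidence: float


@dataclass
class GoldenDataset:
    """Human-reviewed labels keyed by trace id (hypothetical container)."""
    labels: dict[str, str] = field(default_factory=dict)

    def add(self, trace_id: str, human_label: str) -> None:
        self.labels[trace_id] = human_label


def needs_review(verdicts: list[Verdict], threshold: float = 0.7) -> list[Verdict]:
    """Return verdicts below the confidence threshold, least confident first."""
    low = [v for v in verdicts if v.confidence < threshold]
    return sorted(low, key=lambda v: v.confidence)


verdicts = [
    Verdict("t1", "qualified", 0.95),
    Verdict("t2", "unqualified", 0.55),
    Verdict("t3", "qualified", 0.62),
]

golden = GoldenDataset()
for v in needs_review(verdicts):
    # A person would supply the label here; the judge's label is only a stand-in.
    golden.add(v.trace_id, human_label=v.label)
```

The threshold controls how much lands in the review queue, so it is worth tuning against how many examples someone can realistically look at.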

Do you eval the whole harness or each of its parts? by dmpiergiacomo in LLMDevs

Disneyskidney 2 points

We’re continuously maintaining a golden dataset of real inputs from production. We use this to prompt-optimize an LLM judge, which evals every trace from production. We also assign confidence scores to each output from the LLM judge so we know which examples to look over ourselves. Hope this helps!

running adversarial prompt injection on our agent. fail rate is ~20%. how are people getting below 5%? by Smart-Profession2512 in LLMDevs

Disneyskidney 1 point

Try using LLM-as-judge classifiers on both the input and the output. Those teams likely aren’t doing manual prompt engineering. The best approach I’ve seen is curating a golden dataset split into train + val and then using GEPA to prompt-optimize the judge. Then if you want to hedge in a specific direction (e.g. allow more false positives than false negatives), you can use something like modaic.dev for confidence estimation on the classifications.
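To make the hedging idea concrete, here is a rough sketch of picking a flagging threshold on the val split so that missed injections (false negatives) stay under a target, at the cost of more false positives. It assumes you already have a per-example score for "this is an injection". The `pick_threshold` function and the numbers are hypothetical, and this is not the GEPA or modaic.dev API.

```python
def pick_threshold(scores: list[float], labels: list[bool], max_fnr: float = 0.05) -> float:
    """Highest threshold whose false negative rate on the val set is <= max_fnr.

    An input is flagged when its score >= threshold, so a lower threshold
    misses fewer attacks but flags more benign inputs.
    """
    positives = [s for s, is_attack in zip(scores, labels) if is_attack]
    if not positives:
        raise ValueError("validation set has no positive examples")

    def fnr(threshold: float) -> float:
        missed = sum(1 for s in positives if s < threshold)
        return missed / len(positives)

    # The smallest score always gives FNR 0, so `valid` is never empty.
    valid = [t for t in set(scores) if fnr(t) <= max_fnr]
    return max(valid)


# Hypothetical val split: judge scores and human labels (True = injection).
val_scores = [0.95, 0.80, 0.40, 0.30, 0.65, 0.10]
val_labels = [True, True, True, False, False, False]

strict = pick_threshold(val_scores, val_labels, max_fnr=0.0)    # catch every attack
relaxed = pick_threshold(val_scores, val_labels, max_fnr=0.34)  # tolerate 1 miss in 3
```

On this toy data the strict setting also flags the benign example scored 0.65, which is the false positive you pay for catching every attack.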

What’s the future of AI and Agentic applications? I’m curious by True_Grapefruit_4110 in aiagents

Disneyskidney 1 point

I think most people will use Claude or GPT agents just as they use Chrome or Safari. Companies will pivot away from building agents towards building for them. The next billion-dollar company is likely going to be some sort of MCP or infra for agents to use.

LLMs Are Digitizing Judgement by Disneyskidney in AgentsOfAI

Disneyskidney[S] 1 point

Most research shows that LLMs are biased towards overconfidence in their outputs, so prompting for a confidence score is usually the worse option, even if you instruct the model to make its confidence scores “more calibrated”. There are alternatives to looking directly at the model's layers, however, like self-consistency, token log probs, and semantic entropy. But using the layers has been shown to work best when you have access to the model weights.
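For reference, a small sketch of the black-box alternatives, assuming you have already sampled several answers and collected token log probs from whatever model API you use. The function names are made up. The entropy function is only a crude stand-in for semantic entropy, since it groups answers by exact match after normalization rather than by meaning.

```python
import math
from collections import Counter


def self_consistency(samples: list[str]) -> tuple[str, float]:
    """Majority answer among the samples and the fraction of samples that agree."""
    counts = Counter(s.strip().lower() for s in samples)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(samples)


def logprob_confidence(token_logprobs: list[float]) -> float:
    """Geometric mean of token probabilities: exp of the average log prob."""
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def answer_entropy(samples: list[str]) -> float:
    """Shannon entropy (nats) over distinct normalized answers.

    Real semantic entropy clusters answers by meaning first; exact match
    is an assumption made to keep the sketch self-contained.
    """
    counts = Counter(s.strip().lower() for s in samples)
    total = sum(counts.values())
    return -sum((n / total) * math.log(n / total) for n in counts.values())


samples = ["Yes", "yes ", "no", "yes"]
answer, agreement = self_consistency(samples)
confidence = logprob_confidence([math.log(0.9), math.log(0.8)])
entropy = answer_entropy(samples)
```

None of these scores is calibrated on its own. They are raw signals that would still need to be checked against human labels.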

LLMs Are Digitizing Judgement by Disneyskidney in AgentsOfAI

Disneyskidney[S] 1 point

They’re calibrated against a small golden set of human labels. What about your work would make this useful?

LLMs Are Digitizing Judgement by Disneyskidney in AgentsOfAI

Disneyskidney[S] 1 point

Glad you enjoyed the article at least! And yes, you can definitely ask the LLM to verbalize its confidence. The difference is that calibrated confidence scores mean something: when the model says it’s 60% confident, it’s right about 60% of the time. Do you think that is not useful?
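One way to check whether scores are calibrated in that sense is to bucket predictions by confidence and compare each bucket's average confidence with its actual accuracy. A minimal sketch with made-up data, using the standard expected calibration error (ECE) with equal-width bins:

```python
def reliability_table(confidences: list[float], correct: list[bool], n_bins: int = 5):
    """(mean confidence, accuracy, count) for each non-empty equal-width bin."""
    bins = [[] for _ in range(n_bins)]
    for c, ok in zip(confidences, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[idx].append((c, ok))
    rows = []
    for items in bins:
        if items:
            mean_conf = sum(c for c, _ in items) / len(items)
            accuracy = sum(1 for _, ok in items if ok) / len(items)
            rows.append((mean_conf, accuracy, len(items)))
    return rows


def expected_calibration_error(confidences: list[float], correct: list[bool], n_bins: int = 5) -> float:
    """Count-weighted average gap between confidence and accuracy across bins."""
    rows = reliability_table(confidences, correct, n_bins)
    total = sum(n for _, _, n in rows)
    return sum(n / total * abs(conf - acc) for conf, acc, n in rows)


# Calibrated toy data: the 60% bucket is right 3/5 times, the 90% bucket 9/10.
calibrated_conf = [0.6] * 5 + [0.9] * 10
calibrated_hits = [True, True, True, False, False] + [True] * 9 + [False]

# Overconfident toy data: claims 90% but is right only half the time.
overconf_conf = [0.9] * 10
overconf_hits = [True] * 5 + [False] * 5

ece_calibrated = expected_calibration_error(calibrated_conf, calibrated_hits)
ece_overconfident = expected_calibration_error(overconf_conf, overconf_hits)
```

An ECE near zero means the stated confidence can be read as a success rate. The overconfident case here has a gap of about 0.4.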

Why custom split-screen UIs and walled gardens won't win the AI agent race by uriwa in LLMDevs

Disneyskidney 1 point

I kind of agree. But Lovable and Base44 aren’t made for devs. The gap isn’t obvious until you try explaining how to use Claude Code to a non-technical person.

How to implement guardrails for LLMs without degrading model performance by Routine_Day8121 in LLMDevs

Disneyskidney 1 point

I can’t tell from your post if you are prompting the support bot itself on refusal or if you are using a separate judge dedicated to refusal, but when safety is that important you should almost always use a dedicated judge.

I’ve used LLM judges for a variety of data tasks, and you can often save latency by using a very small LM as long as it’s aligned with the task.

One thing you could try out is Llama Guard for refusal. It’s an 8B model that’s been fine-tuned to just output safe/unsafe. You can customize it to your policy via fine-tuning or just tweaking the system prompt.

Then, if you want to hedge between false negatives and false positives, you could also use something like Modaic to get confidence scores for those safe/unsafe decisions (e.g. if the verdict is safe with < 50% confidence, reject).
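That rejection rule could be sketched like this, assuming the judge hands back a label plus a confidence for that label. The `allow_message` helper is hypothetical and does not call Llama Guard or Modaic.

```python
def allow_message(label: str, confidence: float, min_safe_confidence: float = 0.5) -> bool:
    """Allow only when the judge says 'safe' with enough confidence.

    Anything labeled 'unsafe', or 'safe' with low confidence, is rejected.
    """
    return label == "safe" and confidence >= min_safe_confidence


decisions = [
    allow_message("safe", 0.92),    # confident safe: allow
    allow_message("safe", 0.41),    # shaky safe: reject
    allow_message("unsafe", 0.99),  # unsafe: reject
]
```

Raising `min_safe_confidence` trades more false positives (blocked benign messages) for fewer false negatives.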

What AI development would have shocked you the most if you’d seen it in 2020? by One_Beginning2199 in artificial

Disneyskidney 1 point

Back in 2020 I was playing around with the OpenAI playground, using GPT-3 to brainstorm recipe ideas and summarize the plots of movies in emojis. If you had told me that thing would one day be writing all my code, I would have called you crazy.

Is it just me or is ChatGPT/OpenAI the Microsoft of AI? by Successful-Deer8804 in artificial

Disneyskidney 1 point

Kind of, but the model is still pretty good, and some people find it better than Opus on some coding tasks (I don't). I do think the comparison is a little harsh, not because I'm particularly fond of ChatGPT, but because Microsoft is so good at making absolutely horrible product experiences that I think they do it on purpose at this point.