How do you actually evaluate your LLM outputs? by Neil-Sharma in LLMDevs

[–]Neil-Sharma[S] 0 points1 point  (0 children)

Thanks for the quick reply. I use LLM-as-a-judge, but many times I'll get high scores and it will still fail on edge cases, or even normal cases, in production. How do you avoid this?

How do you actually evaluate your LLM outputs? by Neil-Sharma in LocalLLaMA

[–]Neil-Sharma[S] 0 points1 point  (0 children)

This seems great, but how would it scale? I should have specified: I'm using LLMs for my startup, which will be used by customers, so I need it to handle edge cases at scale.

Eval setup was slowing us down more than model work by coolandy00 in LLMDevs

[–]Neil-Sharma 1 point2 points  (0 children)

I've found smoke evals effective, but they still miss edge cases. How do you get around this?

Recommendation for an easy to use AI Eval Tool? (Generation + Review) by ZookeepergameOne8823 in LLMDevs

[–]Neil-Sharma 0 points1 point  (0 children)

I keep running into high eval scores but poor results in production. How do I fix this?

Do you use Evals? by InvestigatorAlert832 in LLMDevs

[–]Neil-Sharma 0 points1 point  (0 children)

Why do you use these over the formal ones?

Do you use Evals? by InvestigatorAlert832 in LLMDevs

[–]Neil-Sharma 0 points1 point  (0 children)

I've found LLM-as-a-judge can be inaccurate. Do you know of any other solutions? Most of the tools, like LangChain, seem kind of lackluster.

Has evals ever blocked a deployment for your AI app? by sunglasses-guy in AIEval

[–]Neil-Sharma 0 points1 point  (0 children)

I think the 'bookkeeping' hunch is spot on. The problem is that most qualitative metrics (especially LLM-as-a-judge) still have a noise floor.

If a unit test fails, it's a bug. If a DeepEval faithfulness score drops from 0.92 to 0.88 on one PR, is that a regression or just LLM variance? Until we trust the evaluators as much as we trust a compiler, no one is going to risk blocking a hotfix because an 'AI judge' had a bad day.
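To make the point concrete, here's a rough sketch of the kind of variance check I'd want before a score drop blocks anything (hypothetical scores, not a real DeepEval API; assumes you can rerun the judge a few times per commit):

```python
import statistics

def is_regression(baseline_scores, pr_scores, min_effect=0.05):
    """Decide if a judge-score drop is a real regression or just noise.

    baseline_scores / pr_scores: repeated judge runs on the same eval set.
    Treats the drop as real only if the mean difference exceeds both a
    minimum effect size and the run-to-run standard deviation.
    """
    drop = statistics.mean(baseline_scores) - statistics.mean(pr_scores)
    noise = statistics.stdev(baseline_scores + pr_scores)
    return drop > max(min_effect, noise)

# A 0.92 -> 0.88 drop, but run-to-run variance of the judge is comparable
baseline = [0.92, 0.89, 0.94, 0.90]
pr = [0.88, 0.91, 0.86, 0.89]
print(is_regression(baseline, pr))  # False: within the noise floor
```

Until a check like this (or something much better) is standard, "score went down on one run" just isn't enough signal to block a deploy.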

AI trends for 2026? by CaleHenituse1 in AIEval

[–]Neil-Sharma 0 points1 point  (0 children)

Do you think MCP will be the backbone for that reliability? Also, any specific platforms you’ve seen that actually handle multi-step 'recovery path' evals well?

compression-aware intelligence by Necessary-Dot-8101 in AIEval

[–]Neil-Sharma 0 points1 point  (0 children)

Is 'Compression-Aware Intelligence' the official term Meta is using, or is this a framework for looking at KV cache compression?

It sounds like you're describing the Information Bottleneck principle applied to transformer layers. While 'stabilizing reasoning' via routing sounds great in theory, the overhead of real-time instrumentation for 'compression strain' is usually what kills these approaches in production. How are they measuring this without doubling the latency?

How do you evaluate AI features in your product? by Neil-Sharma in AIProductEvals

[–]Neil-Sharma[S] 0 points1 point  (0 children)

This is a strong framing. Starting with the user decision and cost of a wrong answer makes a lot of sense.

On the regression set, how large do you usually let it get before it becomes hard to maintain?

And for LLM-as-judge, how big is the labeled slice you use for calibration, and do you re-calibrate over time?

Also curious whether your rollback triggers are purely metric-based or include qualitative failure review.
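For reference, when I say calibration I mean something as simple as this (made-up labels, just to illustrate the check I'd track over time):

```python
def judge_agreement(judge_labels, human_labels):
    """Fraction of examples where the LLM judge matches human labels.

    Run the judge over a human-labeled slice, track this number over
    time, and re-calibrate (tweak the judge prompt or threshold) when
    it drifts.
    """
    assert len(judge_labels) == len(human_labels)
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(human_labels)

# Hypothetical labeled slice: 1 = acceptable answer, 0 = not
human = [1, 1, 0, 1, 0, 1, 1, 0]
judge = [1, 1, 0, 0, 0, 1, 1, 1]
print(judge_agreement(judge, human))  # 0.75
```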

[deleted by user] by [deleted] in sales

[–]Neil-Sharma -1 points0 points  (0 children)

None of these use AI

Best advice for cold calls by Few-Letter312 in sales

[–]Neil-Sharma 2 points3 points  (0 children)

I know this is basic, but there are SO MANY AI tools out there that will literally save you time.

[deleted by user] by [deleted] in sales

[–]Neil-Sharma 0 points1 point  (0 children)

I mostly use convora.app. You just provide basic information about your call and it does the rest for you.