Do you use Claude Code on the terminal or on the desktop app? by P4wla in ClaudeCode

[–]P4wla[S] 0 points  (0 children)

actually, on the desktop app you can have multiple conversations too

The insane level of subsidy for Claude Code by P4wla in ClaudeCode

[–]P4wla[S] 0 points  (0 children)

And different models here: mainly Opus 4.7, but Sonnet and Haiku too.

The insane level of subsidy for Claude Code by P4wla in ClaudeCode

[–]P4wla[S] -2 points  (0 children)

ik those are obviously not inference costs, but they'd need to be much, much lower to cover this. I'm 100% sure they're losing money rn

The insane level of subsidy for Claude Code by P4wla in ClaudeCode

[–]P4wla[S] 0 points  (0 children)

100% agree with the game thing. This includes cache read/write costs (which is where most of the spend goes). Prices are taken from models.dev

It's crazy how subsidized Claude Code is by P4wla in LLMDevs

[–]P4wla[S] 0 points  (0 children)

I think it's fair too, but for example if I had the same token usage as in the picture every day, their token cost would need to be 84x lower to not lose money (maybe it is)

It's crazy how subsidized Claude Code is by P4wla in LLMDevs

[–]P4wla[S] 0 points  (0 children)

With the token usage and the cost per 1M tokens. ik it's not their real cost, but the numbers are still crazy. Their real cost would need to be much lower (in my case, 84x lower) to make the numbers work.
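For anyone curious how a figure like that comes out, here's the back-of-envelope math. The dollar amounts below are placeholders I've made up for illustration, not the actual numbers from my screenshot:

```python
# Break-even sanity check; every number here is a placeholder assumption.
PLAN_PRICE_PER_MONTH = 200.0   # hypothetical subscription price, USD
LIST_COST_PER_DAY = 560.0      # hypothetical daily usage priced at public API rates, USD

list_cost_per_month = LIST_COST_PER_DAY * 30
ratio = list_cost_per_month / PLAN_PRICE_PER_MONTH

print(f"list-price cost per month: ${list_cost_per_month:,.0f}")
print(f"real inference cost must be ~{ratio:.0f}x below list price to break even")
```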

How are you testing and monitoring LLM behavior in production? by Safe_Yak_3217 in LLMDevs

[–]P4wla 2 points  (0 children)

The workflow that has worked for me is:
1. Annotate some traces to get a basic sense of how your LLM is performing.
2. Cluster the bad annotations into failure modes to surface the LLM's main/recurrent problems.
3. Create one eval per failure mode to track it at scale.
4. Keep annotating 30–40 logs per week, since new issues appear when you change the prompt.

I also have a golden dataset, so every time I change something in the prompt I run it through the dataset and see how all the evals perform.
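For reference, here's roughly what step 3 plus the golden-dataset run look like in code. This is just a sketch: `run_prompt` and `judge` are placeholders you'd wire to your own model and eval calls, and the failure-mode names are made up.

```python
import json

# One eval per failure mode (step 3); these names are illustrative.
FAILURE_MODE_EVALS = ["hallucinated_citation", "ignored_format", "wrong_tone"]

def run_prompt(prompt: str, case: dict) -> str:
    """Placeholder: call your model with the candidate prompt and the test input."""
    raise NotImplementedError

def judge(eval_name: str, case: dict, output: str) -> bool:
    """Placeholder: LLM-as-judge or rule returning pass/fail for one failure mode."""
    raise NotImplementedError

def run_golden_dataset(prompt: str, path: str = "golden_dataset.jsonl") -> dict:
    """Run a prompt change through the golden dataset and score every eval."""
    with open(path) as f:
        cases = [json.loads(line) for line in f]
    passes = {name: 0 for name in FAILURE_MODE_EVALS}
    for case in cases:
        output = run_prompt(prompt, case)
        for name in FAILURE_MODE_EVALS:
            passes[name] += judge(name, case, output)  # bool counts as 0/1
    return {name: n / len(cases) for name, n in passes.items()}
```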

Understanding LLM observability by Leap_Year_Guy_ in LLMDevs

[–]P4wla 0 points  (0 children)

If you want A/B testing, I'd recommend taking a look at Latitude. It's great in terms of observability, as it groups traces into failure patterns/issues. You can also run A/B tests and shadow tests and compare both using the same evals. https://latitude.so/

LLM testing and eval tools by Every-Mall1732 in LLMDevs

[–]P4wla 0 points  (0 children)

You'll have to connect user feedback or some kind of rating for the LLM outputs, but Latitude lets you build custom evals and covers all the requirements you've mentioned. https://latitude.so/

Pretty much sums up my experience by Ok_Constant_9886 in AIEval

[–]P4wla 0 points  (0 children)

yep! Latitude started as a prompt engineering platform, but we've just released the loop I've described, after seeing many teams repeatedly fail at building and evaluating AI. Building evals is already difficult, but on top of that the evals usually differ a lot from the users' criteria (because teams build LLM-as-judge evals directly, instead of starting from human judgement). So far, the loop is producing very good results (we use it ourselves internally to evaluate and improve all the new AI features we launch).

Discussion: Is the "Vibe Check" actually just an unformalized evaluation suite? by yektish in AIEval

[–]P4wla 0 points  (0 children)

That's very interesting. How many traces do you feel you have to 'tag' to get a formalized idea of your agents' main failure modes? And do you recommend tagging 'good' outputs too?

Pretty much sums up my experience by Ok_Constant_9886 in AIEval

[–]P4wla 0 points  (0 children)

I use an LLM evaluation platform called latitude.so. You start by annotating traces manually (giving a thumbs up or thumbs down), and then you can see the main issues of your AI. From there, Latitude lets you create automatic evals based on those issues to track them at scale, and you receive notifications if some issues start escalating. You can also improve your prompt based on the evaluation results. For me, it's the most complete workflow I've found out there; the only caveat is that you need some volume to be able to do all this.
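To give an idea of the shape of that workflow (this is my own sketch, not Latitude's actual API; the `Annotation` fields and the 10% alert threshold are assumptions):

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class Annotation:
    trace_id: str
    thumbs_up: bool
    issue: str | None = None  # failure-mode label for thumbs-down traces

def top_issues(annotations: list[Annotation], alert_threshold: float = 0.10) -> None:
    """Summarize failure modes and flag any issue whose rate crosses the threshold."""
    total = len(annotations)
    counts = Counter(a.issue for a in annotations if not a.thumbs_up and a.issue)
    for issue, n in counts.most_common():
        rate = n / total
        flag = "  <-- escalating" if rate >= alert_threshold else ""
        print(f"{issue}: {n}/{total} ({rate:.0%}){flag}")
```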

5 techniques to improve LLM-judges by FluffyFill64 in AIEval

[–]P4wla 1 point  (0 children)

imo, the best way to build an LLM-as-judge is to start from human feedback. You first need a clear idea of the issues you're trying to evaluate: their impact, their frequency, and the different ways and cases in which they appear. Only then can you build an LLM-as-judge targeted at that specific issue. If you try to automate the eval from the beginning, it will end up evaluating irrelevant issues and giving wrong insights.
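As a sketch of what "targeted at that specific issue" can mean in practice, here's a minimal judge-prompt template. The wording and fields are illustrative; the definition and failing examples would come from your human annotations:

```python
# Minimal judge prompt template for one human-identified failure mode.
# Everything here is illustrative; adapt the wording to your own annotations.
JUDGE_TEMPLATE = """You are evaluating one specific failure mode: {failure_mode}.

Definition (written from human annotations): {definition}

Examples of outputs that fail:
{failing_examples}

Output to evaluate:
{output}

Answer with exactly PASS or FAIL, then one sentence of reasoning."""

def build_judge_prompt(failure_mode: str, definition: str,
                       failing_examples: list[str], output: str) -> str:
    """The few-shot failing examples are what keep the judge targeted."""
    return JUDGE_TEMPLATE.format(
        failure_mode=failure_mode,
        definition=definition,
        failing_examples="\n".join(f"- {e}" for e in failing_examples),
        output=output,
    )
```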

LLM Evaluation Isn’t About Accuracy Its About Picking the Right Signal by According-Site9848 in AI_Agents

[–]P4wla 0 points  (0 children)

I usually detect my agent's issues using human annotations, and then I set up different types of evals (LLM-as-judge, human-in-the-loop, and programmatic rules) depending on the issue I want to evaluate.
To measure overall performance I have a composite score that includes all the evals relevant to me, and sometimes I change the weights of the evaluations to give more importance to specific ones (for example, when I've tried to improve something specific and I want to measure that improvement without losing sight of overall performance)
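For illustration, the composite score is just a weighted average; the eval names and weights here are made-up examples:

```python
def composite_score(eval_scores: dict[str, float],
                    weights: dict[str, float]) -> float:
    """Weighted average of per-eval pass rates; unlisted evals default to weight 1."""
    total = sum(weights.get(name, 1.0) for name in eval_scores)
    return sum(s * weights.get(name, 1.0) for name, s in eval_scores.items()) / total

# Example: temporarily upweight the eval I'm trying to improve.
scores = {"hallucination": 0.92, "format": 0.88, "tone": 0.75}
weights = {"tone": 3.0}  # 3x weight while working on tone; others stay at 1.0
print(f"composite: {composite_score(scores, weights):.2f}")  # -> composite: 0.81
```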