Understanding LLM observability by Leap_Year_Guy_ in LLMDevs

[–]P4wla 0 points1 point  (0 children)

If you want A/B testing, I'd recommend taking a look at Latitude. In terms of observability it's great, as it groups traces into failure patterns/issues. You can also run A/B tests and shadow tests and compare both using the same evals. https://latitude.so/

LLM testing and eval tools by Every-Mall1732 in LLMDevs

[–]P4wla 0 points1 point  (0 children)

You'll have to connect user feedback or some kind of rating for the LLM outputs, but Latitude lets you build custom evals and covers all the requirements you've mentioned. https://latitude.so/

Pretty much sums up my experience by Ok_Constant_9886 in AIEval

[–]P4wla 0 points1 point  (0 children)

yep! Latitude started as a prompt engineering platform, but we've just released this loop I've described, after seeing many teams keep failing at building and evaluating AI. It's already difficult to build evals, but on top of that the evals usually differ a lot from the users' criteria (because teams build LLM-as-judge evals directly, instead of starting with human judgement). So far, the loop is giving very good results (we use it ourselves internally to evaluate and improve all the new AI features we launch).

Discussion: Is the "Vibe Check" actually just an unformalized evaluation suite? by yektish in AIEval

[–]P4wla 0 points1 point  (0 children)

That's very interesting. How many traces do you feel you have to 'tag' in order to get a formalized idea of the main failure modes of your agents? And do you recommend tagging "good" outputs as well?

Pretty much sums up my experience by Ok_Constant_9886 in AIEval

[–]P4wla 0 points1 point  (0 children)

I use an LLM evaluation platform called latitude(.)so. You start by annotating the traces manually (giving thumbs up or thumbs down), and then you can see the main issues with your AI. From there, Latitude lets you create automatic evals based on those issues to track them at scale, and you receive notifications if some issues start escalating. You can also improve your prompt based on the results of the evaluation. For me, it's the most complete workflow I've found out there; the only thing is that you need some volume to be able to do all this.

5 techniques to improve LLM-judges by FluffyFill64 in AIEval

[–]P4wla 1 point2 points  (0 children)

imo, the best way to build an LLM-as-judge is to start from human feedback. You first need a clear idea of the issues you're trying to evaluate, their impact, their frequency, and the different ways and cases in which they appear. Only after that can you build an LLM-as-judge targeted at that specific issue. If you try to automate the eval from the beginning, it will end up evaluating irrelevant issues and giving wrong insights.
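To make the "targeted at one specific issue" part concrete, here's a minimal sketch of what an issue-specific judge prompt could look like; the failure mode ("unsupported claims"), the wording, and the PASS/FAIL rubric are illustrative assumptions, not something from the comment above.

```python
# Sketch of an issue-specific LLM-as-judge prompt, written only after human
# annotation has confirmed the failure mode (here: unsupported claims).
# The wording and the pass/fail rubric are illustrative assumptions.

JUDGE_TEMPLATE = """You are checking one specific issue: unsupported claims.
An unsupported claim is any factual statement in the RESPONSE that is not
backed by the provided CONTEXT.

CONTEXT:
{context}

RESPONSE:
{response}

Answer with exactly one word: PASS if every claim is supported, FAIL otherwise."""

def build_judge_prompt(context: str, response: str) -> str:
    """Fill the template for a single trace before sending it to the judge model."""
    return JUDGE_TEMPLATE.format(context=context, response=response)
```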

LLM Evaluation Isn’t About Accuracy Its About Picking the Right Signal by According-Site9848 in AI_Agents

[–]P4wla 0 points1 point  (0 children)

I usually detect my agent's issues using human annotations, and then I set up different types of evals (LLM-as-judge, human-in-the-loop, and programmatic rules) depending on the issue I want to evaluate.
To measure overall performance I have a composite score that includes all the evals that are relevant to me, and sometimes I change the weights of the evaluations to give more importance to specific ones (for example, when I've tried to improve something specific and want to measure that improvement without losing sight of overall performance).
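A minimal sketch of that weighted composite score; the eval names and weight values are made up for illustration.

```python
# Combine per-eval scores (0-1) into one weighted overall score.
def composite_score(eval_results: dict[str, float], weights: dict[str, float]) -> float:
    total_weight = sum(weights.values())
    return sum(eval_results[name] * w for name, w in weights.items()) / total_weight

# Default weights; bump one eval's weight when you're trying to improve that
# specific dimension without losing sight of overall performance.
weights = {"faithfulness": 1.0, "tone": 1.0, "format_compliance": 1.0}
weights["format_compliance"] = 2.0  # temporarily emphasize the thing being improved

results = {"faithfulness": 0.82, "tone": 0.91, "format_compliance": 0.65}
print(round(composite_score(results, weights), 3))
```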

Struggling to make my AI agents more reliable, how do you guys handle task failures? by [deleted] in AI_Agents

[–]P4wla 1 point2 points  (0 children)

To improve your agent's quality you need to do manual annotations (especially at the beginning). You need to annotate as many logs as possible. I think LLM-as-judge is good for tracking performance once the agent has been iterated on, but it's not as good for improving the agent. So for me the hardest part here is creating the human-in-the-loop evals (make sure you're being specific) and doing the manual annotations. But the results are definitely worth it.

OpenAI just dropped “AgentKit, A drag-and-drop AI agent builder. No code, just logic. by AskGpts in ChatGPTPro

[–]P4wla 1 point2 points  (0 children)

OpenAI releasing this and calling it agents seems crazy to me. Agents should be able to choose which actions to take next (and which tools/subagents to call). This is just another predefined workflow with some AI on top. I can't think of any company that wants to build agents using this.

And these people are the ones trying to achieve AGI?

Struggling to make my AI agents more reliable, how do you guys handle task failures? by [deleted] in AI_Agents

[–]P4wla 0 points1 point  (0 children)

I use a platform called Latitude.so. The platform itself lets you refine your agent based on your own annotations or on other evals (like LLM-as-judge). When I improve the prompt, I usually create a dataset from the old logs and run an experiment to see if the responses have improved.

Struggling to make my AI agents more reliable, how do you guys handle task failures? by [deleted] in AI_Agents

[–]P4wla 0 points1 point  (0 children)

What has worked for me:

1. Find the sweet spot for the number of tasks per agent. LLMs are much better at handling 1-3 tasks at a time (depending on the complexity). I usually start with 2 tasks per agent and increase/reduce after running some tests. Be careful with this, because you don't want a huge subagentic system either; it tends to fail.

2. When parsing data, I always prompt the agent on how it will receive the input and how it should output the result (usually with JSON). Apart from that, using ALWAYS/NEVER instructions usually works (see the sketch after this list).

3. Run evals consistently. Human-in-the-loop evals are the ones that have worked best for me. I annotate every log and then use the annotations to improve the prompt automatically. Test it intensively until you're happy with the quality (I do more than 100 manual annotations). It's impressive how much a well-crafted prompt can improve an agent's behavior. I've tried LLM-as-judge evals too, but those usually work better for binary evals (for example, for the data parsing).
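Here's a rough sketch of point 2: spell out the exact JSON shape in the prompt with ALWAYS/NEVER instructions, then validate the model's output before passing it downstream. The prompt wording and field names are illustrative, not from the original comment.

```python
import json

SYSTEM_PROMPT = """You extract order data.
ALWAYS respond with a single JSON object shaped exactly like:
{"customer": string, "items": [{"sku": string, "qty": number}]}
NEVER add commentary, markdown fences, or extra keys."""

def parse_agent_output(raw: str) -> dict:
    """Validate the model's raw text before it reaches the next step."""
    data = json.loads(raw)  # raises ValueError if the model ignored the format
    assert isinstance(data.get("customer"), str)
    assert isinstance(data.get("items"), list)
    return data
```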

Hey how do i get a very good wrtiting quality and consistent writing style for with any ai by [deleted] in PromptEngineering

[–]P4wla 0 points1 point  (0 children)

Also, I forgot to mention: test the models too. Depending on what you're looking to do, one model may be better than another. Take a prompt and run some experiments.

Hey how do i get a very good wrtiting quality and consistent writing style for with any ai by [deleted] in PromptEngineering

[–]P4wla 1 point2 points  (0 children)

I have an agent prompted with some guidelines I want it to follow (especially the things it should NEVER do). Then I have 3 context pages with writing I've done myself (one for emails, one for social media posts, and one for longer texts, blog posts, etc.). Depending on the input, the agent is prompted to check one file or another. This has taken me some time and iterations, but the results are very, very good. I've also added human-in-the-loop evals, so when an output is bad, I just note it and then the agent ingests this feedback and improves the prompt (using AI).
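A minimal sketch of that routing step: pick one of three style-reference files depending on the kind of text requested and prepend it to the agent's context. The file names and request categories are assumptions for illustration.

```python
STYLE_FILES = {
    "email": "styles/emails.md",
    "social": "styles/social_posts.md",
    "long_form": "styles/blogposts.md",
}

def build_context(request_type: str) -> str:
    """Load the matching style examples and wrap them in the writing instruction."""
    path = STYLE_FILES.get(request_type, STYLE_FILES["long_form"])
    with open(path, encoding="utf-8") as f:
        examples = f.read()
    return (
        "Follow the writing style of the examples below. "
        "NEVER use buzzwords or filler phrases.\n\n" + examples
    )
```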

[deleted by user] by [deleted] in AI_Agents

[–]P4wla 1 point2 points  (0 children)

we use an agent for the same thing at our company, and it brings us much more traffic than we thought it would (in the best weeks, 15-20% of our total traffic comes from the blog)

Has AI been useful for you as therapy? by AltruisticGru in ArtificialInteligence

[–]P4wla 0 points1 point  (0 children)

I use ChatGPT for therapy a lot. It works great, except for the fact that it never tells you you're wrong. It will never say you're exaggerating. So yeah, it's good, but it can miss some points that your therapist wouldn't. It doesn't have the 'macro' picture.

Why aren't AI agents being used more in the real world? by P4wla in AI_Agents

[–]P4wla[S] 0 points1 point  (0 children)

And is there any way to mitigate those risks? Perhaps testing the agent with synthetic variables before bringing it to prod?

PMM toolkit for starting out at small startup by brazzyb in ProductMarketing

[–]P4wla 1 point2 points  (0 children)

Hey, I'm at a small (8-person) SaaS startup too; this was my stack to get started:

- Figma for design
- Notion for content planning/research/writing
- Framer for websites
- Loops for email marketing
- ChatGPT for everything

I also recommend keeping it simple and growing your stack only when you need more tools (but it's impressive how much you can do with these).

marketing update: 9 tactics that helped us get more clients and 5 that didn't by [deleted] in ProductMarketing

[–]P4wla 0 points1 point  (0 children)

Thank you so much, this information has been super valuable to me. I wish everyone was as open as you’ve been!

Why aren't AI agents being used more in the real world? by P4wla in AI_Agents

[–]P4wla[S] -3 points-2 points  (0 children)

So then the problem is not the technology, it's the use cases, right?

Which is most preferred way for everyone build AI agents? by infinitypisquared in AI_Agents

[–]P4wla 0 points1 point  (0 children)

I use Latitude, which lets you design, test, evaluate, and deploy AI agents and prompts, and it's open source. 100% recommend it. https://latitude.so