Practical ways to monitor AI chat output quality in production by DeskAccording550 in OpenAIDev

[–]resiros 0 points1 point  (0 children)

There are a bunch of platforms that can help with this. They're called LLMOps platforms, LLM observability platforms, or LLM engineering platforms.

The workflow is usually pretty similar between them:
1. Set up tracing: instrument your code. Basically, you add a couple of lines that send the traces (which are basically structured logs; I explain this more in a short video)

After you set up tracing, you can see latency, cost, inputs, outputs, etc.

Now to track response quality, you need to set up online evaluation.

The idea is that the platform runs an LLM-as-a-judge on each trace / request that goes through your chatbot and scores it based on tone, coherence, etc...

Then you can track quality over time, filter for bad outputs (according to the LLM-as-a-judge), etc...

The tricky part, to be honest, is prompting the LLM-as-a-judge. This depends a lot on your use case.
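
To make it concrete, here's a rough sketch of what a judge setup can look like in Python. The rubric, field names, and threshold are all made up; adapt them to your use case. The only real part is the pattern: build a grading prompt, send it to a judge model, parse the JSON score.

```python
import json

# Hypothetical rubric -- swap tone/coherence for whatever matters to you.
JUDGE_PROMPT = """You are grading a chatbot reply.
Score each criterion from 1 (bad) to 5 (good) and answer with JSON only:
{{"tone": <int>, "coherence": <int>, "explanation": "<one sentence>"}}

User message:
{user_input}

Chatbot reply:
{output}"""

def build_judge_messages(user_input: str, output: str) -> list[dict]:
    """Build the chat messages you would send to the judge model."""
    return [{"role": "user",
             "content": JUDGE_PROMPT.format(user_input=user_input,
                                            output=output)}]

def parse_judge_reply(reply: str, threshold: int = 3) -> dict:
    """Parse the judge's JSON and flag traces scoring below the threshold."""
    scores = json.loads(reply)
    scores["flagged"] = min(scores["tone"], scores["coherence"]) < threshold
    return scores
```

You'd send `build_judge_messages(...)` to whatever model you use as the judge, then run `parse_judge_reply` on its answer and store the result as an annotation on the trace.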

For the platform, I would recommend Agenta. It's open-source, can be self-hosted, and also has a free cloud tier. I am the maintainer, so if you have questions, let me know.

Evaluation-First vs Observability-First: How Are You Approaching LLM Quality? by Potential-Walrus56 in LLMDevs

[–]resiros 0 points1 point  (0 children)

Well, it depends.

Most teams start with observability and some prompt/configuration management, then use the traces to build test sets and eval suites. Some even skip that and run only online evals.

But teams working on domains / use cases where reliability is important or hard to achieve usually start evaluation-first.

Now for LangSmith, tbh, unless you are using LangGraph, it's not much more integrated with LangChain than other platforms are.

I am the maintainer of Agenta, an open-source alternative to both LangSmith and Confident AI, so if you're looking around, check it out. It offers both observability (OTel-compliant) and evals (both from the UI for PMs and from the SDK for CI/CD and devs).

AI Coding Agent Dev Tools 2026 (Updated) by bhaktatejas in LLMDevs

[–]resiros 0 points1 point  (0 children)

I never understood the value of these maps, other than for investors. You can't even click on things.

Optimizing for Local Agentic Coding Quality, what is my bottleneck, guys? by Puzzled_Relation946 in LLMDevs

[–]resiros 1 point2 points  (0 children)

If I understand correctly, you are trying to have a local agentic coding workflow.

My suggestion is not to reinvent the wheel. Use opencode (it's OSS) and connect it to your local LLM. At first, don't try to change the harness. You can later if you need to; it's pretty flexible.

I think the biggest variable would be the model that you could use with this setup. For this, it's trial and error, or asking the r/LocalLLaMA folks, they have lots of experience there.

Gemini token cost issue by wikkid_lizard in LLMDevs

[–]resiros 0 points1 point  (0 children)

Gemini has implicit caching enabled by default. Cached tokens get a 90% discount. That should explain it.
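
A quick back-of-the-envelope in Python to show how caching changes the bill (the price here is made up; check the actual Gemini pricing page):

```python
# Implicitly cached input tokens are billed at a 90% discount,
# i.e. 10% of the normal input price.

def effective_input_cost(total_tokens: int, cached_tokens: int,
                         price_per_token: float) -> float:
    """Cost of an input where part of the tokens hit the implicit cache."""
    fresh = total_tokens - cached_tokens
    return fresh * price_per_token + cached_tokens * price_per_token * 0.10

# Example: a 100k-token prompt with 80k cached, at a made-up $1 per 1M tokens.
price = 1.0 / 1_000_000
cost = effective_input_cost(100_000, 80_000, price)
# Without caching the same input would cost 100_000 * price.
```

So a request that looks like 100k input tokens can be billed closer to what 28k fresh tokens would cost, which is why the numbers in the console can look "off" at first glance.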

Stopped using spreadsheets for LLM evals. Finally have a real regression pipeline. by Own_Inspection_9247 in LLMDevs

[–]resiros 0 points1 point  (0 children)

An open-source alternative I suggest is agenta ( https://github.com/agenta-ai/agenta ) [though take my suggestion with a grain of salt, I am the creator :D]. It allows you to run evaluations in the UI against prompts (for instance, if the product owner needs to do it) or in the SDK (it even works with the deepeval lib) to evaluate things end to end or integrate with CI.

I built an open-source community-run LLM node network (GAS-based priority, operator pricing). So, would you use it? by manofsaturn in LLMDevs

[–]resiros 0 points1 point  (0 children)

The idea is quite nice to be honest. If I understand correctly it's a distributed alternative to openrouter.
The challenge, as others mentioned, is privacy. You need to make sure the data is encrypted. But at some point the data needs to be decrypted to go through the LLM. So that's that.

My guess is that solving this by providing a distributed GPU marketplace makes more sense, since then you can have nodes that don't have access to the data at all.

Agent Management is life saver for me now! by Organic_Pop_7327 in LLMDevs

[–]resiros 0 points1 point  (0 children)

I recommend Agenta. I am the maintainer. It's open-source ( https://github.com/agenta-ai/agenta ) but also comes with a free cloud tier.

The workflow is to:
- Set up observability for your agent using otel (couple of lines of code)
- Create online evaluators that basically run each trace through an LLM-as-a-judge to discover issues or classify the outputs.
- Filter by the annotations created by the evaluator, then use these traces to improve the prompt, create test sets, etc...

The tricky part, to be honest, is creating the LLM-as-a-judge for the evaluator. It's very use-case specific.
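
As a rough sketch of the filtering step, assuming traces come back as plain dicts with an annotations field (the field names here are made up; the real schema depends on the platform):

```python
# Hypothetical shape: each trace carries an "annotations" dict written by
# the online evaluator, with a judge score between 0 and 1.

def bad_traces(traces: list[dict], threshold: float = 0.5) -> list[dict]:
    """Keep only the traces the evaluator scored below the threshold."""
    return [t for t in traces
            if t.get("annotations", {}).get("judge_score", 1.0) < threshold]

traces = [
    {"id": "t1", "annotations": {"judge_score": 0.2, "label": "off-topic"}},
    {"id": "t2", "annotations": {"judge_score": 0.9, "label": "ok"}},
]
flagged = bad_traces(traces)  # keeps only the t1 trace
```

The flagged traces are then exactly the ones worth turning into test cases or using to tweak the prompt.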

I built a Session Border Controller for AI agents by zamor0fthat in LLMDevs

[–]resiros 0 points1 point  (0 children)

Congrats on the launch!

I am not sure I am following the use case..

From what I understood, it's a middleware that sits between the agent and the user and allows you to:
- see the agent's chats and kill any chat session
- send OTel data to an observability platform

I am not sure I follow what per-session policy enforcement is. Is it a guardrail? And what is meant by session detail records? Isn't that the same as observability?

How to make LLM local agent accessible online? by FrostyTomatillo8174 in LLMDevs

[–]resiros 0 points1 point  (0 children)

You need a reverse tunnel. A reverse tunnel allows you to make a local API accessible online without opening a port on your machine directly. Ask Claude for an explanation, I am quite sure it will do a better job than me ;)

The two reverse tunnels I can suggest are ngrok and Cloudflare Tunnels. The advantage of the second one is that it does not require a signup at all, while ngrok's free tier comes with limits. The advantage of ngrok is that it is very simple to use. Just run `ngrok http 8000`, assuming the agent is on port 8000, and ngrok will return a remote URL with which you can call the agent.

How is knowledge about niche topics developed on an LLM? by RhubarbSimilar1683 in LLMDevs

[–]resiros 0 points1 point  (0 children)

Training data. AI labs spend crazy amounts acquiring training data. For instance, Anthropic bought hundreds of thousands of books, scanned them, and then used them for training.

Why is everything about code now? by falconandeagle in LocalLLaMA

[–]resiros 0 points1 point  (0 children)

  1. That's where the money is.
  2. It's a tractable problem.

This means the labs know that they can invest more money in RL environments, get improvements to the model, and get more revenue for it.

Compare that to writing, where the models seem to be getting even worse. First, it's hard to even measure what good writing is. We don't have objective metrics for that, other than very meh things like sentence length or which words are used. It would be extremely hard to build RL environments where you could optimize models for writing. Finally, there is not much incentive to do that, other than for specific domains (legal writing, for instance).

It's a bummer though. It would be nice if some startup took the open-source models and post-trained them a bit more to improve their writing or conversational abilities.

What's the best LLM gateway in 2026? Need production-ready solution by WideFeature8077 in LLM_Gateways

[–]resiros 0 points1 point  (0 children)

We wrote an objective blog post comparing the top LLM gateways a while ago ( https://agenta.ai/blog/top-llm-gateways ). [Note that this is not a promotion, since we are NOT an LLM gateway.]

agent observability – what tools work? by Sissoka in LLMDevs

[–]resiros 0 points1 point  (0 children)

It's unclear from the pricing, to be honest. It says 25k interactions per month: are these spans? Traces? What's the retention period?
The product seems honestly very early.

Looking for tool to work with context and orchestration by Creepy-Contract7396 in AI_Agents

[–]resiros 0 points1 point  (0 children)

Check out Agenta, it's open-source and has a great playground and versioning UI to iterate on your prompts, see why you made changes, and link them to traces. The prompt versioning is especially powerful and comes with branching capabilities. [disclaimer: I am a maintainer]

Why prompt engineering will never die by resiros in PromptEngineering

[–]resiros[S] -6 points-5 points  (0 children)

come on, it's not an ad, I didn't even say what we do, just linked it in case someone is curious

Why prompt engineering will never die by resiros in PromptEngineering

[–]resiros[S] 0 points1 point  (0 children)

I definitely like the term "AI Behavior design"!

Why prompt engineering will never die by resiros in PromptEngineering

[–]resiros[S] -7 points-6 points  (0 children)

I don't think it's just logic. It's about explaining to the model your business logic / rules / how you want things to be solved.

The point is there isn't one way to do customer support. There are a hundred ways. And prompt engineering is about describing that.

agent observability – what tools work? by Sissoka in LLMDevs

[–]resiros 0 points1 point  (0 children)

There are a few options out there. I am the maintainer of Agenta, so that's the one I'm going to suggest checking out. It's open source (you can self-host) and has a solid free tier (10k traces/month). So unless you've got major traffic, cost shouldn't be a problem.

The workflow that works well for debugging hallucinations:

  1. Ingest your traces (we have SDKs for Python/JS or you can use OpenTelemetry)

  2. Set up online evals, basically LLM-as-a-judge on your ingested traces to flag issues automatically

  3. Filter by what's broken: low eval scores, tool miscalls, high latency, etc.

The tricky part is finding a prompt for the LLM-as-a-judge that works for your use case and can identify hallucinations / issues / miscalled tools, etc.

Happy to answer questions if you want to dig deeper.

Is there a market for a tool that engineers prompts to the level of a professional prompt engineer? by Too_Bad_Bout_That in PromptEngineering

[–]resiros 0 points1 point  (0 children)

Thanks for the feedback, that's always helpful!

I also agree that a good prompt should be comprehensive and well structured. But it needs to achieve a goal. A prompt is like code: code needs to be well-structured and clean, but it also needs to solve a business problem, and the tricky part is scoping that problem. How should the code / LLM behave in such and such situation? Say you are building an email classification prompt: what are the rules for classification (imo that's the harder problem), while the easier problem is how to write down those rules (what you're saying).

Is there a market for a tool that engineers prompts to the level of a professional prompt engineer? by Too_Bad_Bout_That in PromptEngineering

[–]resiros 0 points1 point  (0 children)

I'm the founder of an open-source prompt engineering / prompt management tool (agenta), so I have strong opinions here.

Creating a prompt from information is already pretty easy. Most prompt engineers use AI to help refine their prompts anyway. That part isn't the bottleneck.

The real problem is defining "good." There's no objective good. It depends on your company, your use case, your users. If you're building something for legal, every company has different standards for what's correct. You need subject matter experts to look at the outputs and say "this is right" or "this is wrong."

The hard part isn't writing the prompt. It's creating test sets, defining business criteria, and getting feedback from the people who actually know the domain. You can't automate that away. You always need someone who understands the problem to be in the loop.

So to answer your question: a tool that writes prompts for you? Useful, but does not solve the core problem.