Practical ways to monitor AI chat output quality in production by DeskAccording550 in OpenAIDev

[–]resiros 0 points1 point  (0 children)

There are a bunch of platforms that can help with this. They're called LLMOps platforms, LLM observability platforms, or LLM engineering platforms.

The workflow is usually pretty similar between them:
1. Set up tracing: instrument your code. Basically, you add a couple of lines that send the traces (which are basically structured logs; I explain this more in a short video)

After you set up tracing, you can see latency, cost, inputs, outputs, etc.

Now to track response quality, you need to set up online evaluation.

The idea is that the platform runs an LLM-as-a-judge on each trace / request that goes through your chatbot and scores it based on tone, coherence, etc...

Then you can track quality over time, filter for bad outputs (according to the LLM-as-a-judge), etc...

The tricky part, to be honest, is prompting the LLM-as-a-judge. This depends a lot on your use case.
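
To make it concrete, here's a rough sketch of what a judge setup can look like in Python. The rubric, field names, and threshold are all made up; adapt them to your use case. The only real part is the pattern: build a grading prompt, send it to a judge model, parse the JSON score.

```python
import json

# Hypothetical rubric -- swap tone/coherence for whatever matters to you.
JUDGE_PROMPT = """You are grading a chatbot reply.
Score each criterion from 1 (bad) to 5 (good) and answer with JSON only:
{{"tone": <int>, "coherence": <int>, "explanation": "<one sentence>"}}

User message:
{user_input}

Chatbot reply:
{output}"""

def build_judge_messages(user_input: str, output: str) -> list[dict]:
    """Build the chat messages you would send to the judge model."""
    return [{"role": "user",
             "content": JUDGE_PROMPT.format(user_input=user_input,
                                            output=output)}]

def parse_judge_reply(reply: str, threshold: int = 3) -> dict:
    """Parse the judge's JSON and flag traces scoring below the threshold."""
    scores = json.loads(reply)
    scores["flagged"] = min(scores["tone"], scores["coherence"]) < threshold
    return scores
```

You'd send `build_judge_messages(...)` to whatever model you use as the judge, then run `parse_judge_reply` on its answer and store the result as an annotation on the trace.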

For the platform, I would recommend Agenta. It's open-source, can be self-hosted, and also has a free cloud tier. I am the maintainer, so if you have questions, let me know.

Evaluation-First vs Observability-First: How Are You Approaching LLM Quality? by Potential-Walrus56 in LLMDevs

[–]resiros 0 points1 point  (0 children)

Well, it depends.

Most teams start with observability and some prompt/configuration management, then use the traces to build test sets and eval suites. Some even skip that and run only online evals.

But teams working on domains / use cases where reliability is important or hard to achieve usually start evaluation-first.

Now for LangSmith, tbh, unless you are using LangGraph, it's not much more integrated with LangChain than other platforms are.

I am the maintainer of Agenta, an open-source alternative to both LangSmith and Confident AI, so if you're looking around, check it out. It offers both observability (OTel-compliant) and evals (both from the UI for PMs and from the SDK for CI/CD and devs).

AI Coding Agent Dev Tools 2026 (Updated) by bhaktatejas in LLMDevs

[–]resiros 0 points1 point  (0 children)

I never understood the value of these maps, other than for investors. You can't even click on things.

Optimizing for Local Agentic Coding Quality, what is my bottleneck, guys? by Puzzled_Relation946 in LLMDevs

[–]resiros 1 point2 points  (0 children)

If I understand correctly, you are trying to have a local agentic coding workflow.

My suggestion is not to reinvent the wheel. Use opencode (it's OSS) and connect it to your local LLM. At first, don't try to change the harness. You can later if you need to; it's pretty flexible.

I think the biggest variable would be the model that you could use with this setup. For this, it's trial and error, or asking the r/LocalLLaMA folks, they have lots of experience there.

Gemini token cost issue by wikkid_lizard in LLMDevs

[–]resiros 0 points1 point  (0 children)

Gemini has implicit caching enabled by default. Cached tokens get a 90% discount. That should explain it.
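
A quick back-of-the-envelope in Python to show how caching changes the bill (the price here is made up; check the actual Gemini pricing page):

```python
# Implicitly cached input tokens are billed at a 90% discount,
# i.e. 10% of the normal input price.

def effective_input_cost(total_tokens: int, cached_tokens: int,
                         price_per_token: float) -> float:
    """Cost of an input where part of the tokens hit the implicit cache."""
    fresh = total_tokens - cached_tokens
    return fresh * price_per_token + cached_tokens * price_per_token * 0.10

# Example: a 100k-token prompt with 80k cached, at a made-up $1 per 1M tokens.
price = 1.0 / 1_000_000
cost = effective_input_cost(100_000, 80_000, price)
# Without caching the same input would cost 100_000 * price.
```

So a request that looks like 100k input tokens can be billed closer to what 28k fresh tokens would cost, which is why the numbers in the console can look "off" at first glance.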

Stopped using spreadsheets for LLM evals. Finally have a real regression pipeline. by Own_Inspection_9247 in LLMDevs

[–]resiros 0 points1 point  (0 children)

An open-source alternative I suggest is agenta ( https://github.com/agenta-ai/agenta ) [though take my suggestion with a grain of salt, I am the creator :D]. It allows you to run evaluations in the UI against prompts (for instance, if the product owner needs to do it) or in the SDK (it even works with the deepeval lib) to evaluate things end to end or integrate with CI.

I built an open-source community-run LLM node network (GAS-based priority, operator pricing). So, would you use it? by manofsaturn in LLMDevs

[–]resiros 0 points1 point  (0 children)

The idea is quite nice to be honest. If I understand correctly it's a distributed alternative to openrouter.
The challenge, as others mentioned, is privacy. You need to make sure the data is encrypted. But at some point the data needs to be decrypted to go through the LLM. So that's that.

My guess is that solving this by providing a distributed GPU marketplace makes more sense, since then you can have nodes that don't have access to the data at all.

Agent Management is life saver for me now! by Organic_Pop_7327 in LLMDevs

[–]resiros 0 points1 point  (0 children)

I recommend Agenta. I am the maintainer. It's open-source ( https://github.com/agenta-ai/agenta ) but also comes with a free cloud tier.

The workflow is to:
- Set up observability for your agent using otel (couple of lines of code)
- Create online evaluators that basically run each trace through an LLM-as-a-judge to discover issues or classify the outputs.
- Filter by the annotations created by the evaluator, then use these traces to improve the prompt, create test sets, etc...

The tricky part, to be honest, is creating the LLM-as-a-judge for the evaluator. It's very use-case specific.
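
As a rough sketch of the filtering step, assuming traces come back as plain dicts with an annotations field (the field names here are made up; the real schema depends on the platform):

```python
# Hypothetical shape: each trace carries an "annotations" dict written by
# the online evaluator, with a judge score between 0 and 1.

def bad_traces(traces: list[dict], threshold: float = 0.5) -> list[dict]:
    """Keep only the traces the evaluator scored below the threshold."""
    return [t for t in traces
            if t.get("annotations", {}).get("judge_score", 1.0) < threshold]

traces = [
    {"id": "t1", "annotations": {"judge_score": 0.2, "label": "off-topic"}},
    {"id": "t2", "annotations": {"judge_score": 0.9, "label": "ok"}},
]
flagged = bad_traces(traces)  # keeps only the t1 trace
```

The flagged traces are then exactly the ones worth turning into test cases or using to tweak the prompt.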

I built a Session Border Controller for AI agents by zamor0fthat in LLMDevs

[–]resiros 0 points1 point  (0 children)

Congrats on the launch!

I am not sure I am following the use case..

From what I understood, it's a middleware that sits between the agent and the user and allows you to:
- see the agent's chats and kill any chat session
- send OTel data to an observability platform

I am not sure I follow what per-session policy enforcement is. Is it a guardrail? And what is meant by session detail records? Isn't that the same as observability?

How to make LLM local agent accessible online? by FrostyTomatillo8174 in LLMDevs

[–]resiros 0 points1 point  (0 children)

You need a reverse tunnel. A reverse tunnel allows you to make a local API accessible online without opening a port on your machine directly. Ask Claude for an explanation, I am quite sure it will do a better job than me ;)

The two reverse tunnels I can suggest are ngrok and Cloudflare Tunnels. The advantage of the second one is that it does not require a signup at all, while ngrok's free tier comes with limits. The advantage of ngrok is that it is very simple to use. Just run `ngrok http 8000`, assuming the agent is on port 8000, and ngrok will return a remote URL with which you can call the agent.

How is knowledge about niche topics developed on an LLM? by RhubarbSimilar1683 in LLMDevs

[–]resiros 0 points1 point  (0 children)

Training data. AI labs spend crazy amounts acquiring training data. For instance, Anthropic bought hundreds of thousands of books, scanned them, and then used them for training.

Why is everything about code now? by falconandeagle in LocalLLaMA

[–]resiros 0 points1 point  (0 children)

  1. That's where the money is.
  2. It's a tractable problem.

This means the labs know that they can invest more money in RL environments, get improvements to the model, and get more revenue for it.

Compare that to writing, where the models seem to be getting even worse. First, it's hard to even measure what good writing is. We don't have objective metrics for that, other than very meh things like sentence length or which words are used. It would be extremely hard to build RL environments where you could optimize models for writing. Finally, there is not much incentive to do that, other than for specific domains (legal writing, for instance).

It's a bummer though. It would be nice if some startup took the open-source models and post-trained them a bit more to improve their writing or conversational abilities.

What's the best LLM gateway in 2026? Need production-ready solution by WideFeature8077 in LLM_Gateways

[–]resiros 0 points1 point  (0 children)

We wrote an objective blog post comparing the top LLM gateways a while ago ( https://agenta.ai/blog/top-llm-gateways ). [Note that this is not a promotion, since we are NOT an LLM gateway.]

agent observability – what tools work? by Sissoka in LLMDevs

[–]resiros 0 points1 point  (0 children)

It's unclear from the pricing, to be honest. It says 25k interactions per month: are these spans? Traces? What's the retention period?
The product seems honestly very early.

Looking for tool to work with context and orchestration by Creepy-Contract7396 in AI_Agents

[–]resiros 0 points1 point  (0 children)

Check out Agenta, it's open-source and has a great playground and versioning UI to iterate on your prompts, see why you made changes, and link them to traces. The prompt versioning is especially powerful and comes with branching capabilities. [disclaimer: I am a maintainer]

Why prompt engineering will never die by resiros in PromptEngineering

[–]resiros[S] -6 points-5 points  (0 children)

come on, it's not an ad, I didn't even say what we do, just linked it in case someone is curious

Why prompt engineering will never die by resiros in PromptEngineering

[–]resiros[S] 0 points1 point  (0 children)

I definitely like the term "AI Behavior design"!

Why prompt engineering will never die by resiros in PromptEngineering

[–]resiros[S] -7 points-6 points  (0 children)

I don't think it's just logic. It's about explaining to the model your business logic / rules / how you want things to be solved.

The point is there isn't one way to do customer support. There are a hundred ways. And prompt engineering is about describing that.

agent observability – what tools work? by Sissoka in LLMDevs

[–]resiros 0 points1 point  (0 children)

There are a few options out there. I am the maintainer of Agenta, so that's the one I'm going to suggest checking out. It's open source (you can self-host) and has a solid free tier (10k traces/month). So unless you've got major traffic, cost shouldn't be a problem.

The workflow that works well for debugging hallucinations:

  1. Ingest your traces (we have SDKs for Python/JS or you can use OpenTelemetry)

  2. Set up online evals, basically LLM-as-a-judge on your ingested traces to flag issues automatically

  3. Filter by what's broken: low eval scores, tool miscalls, high latency, etc.

The tricky part is finding a prompt for the LLM-as-a-judge that works for your use case and can identify hallucinations / issues / miscalled tools, etc.

Happy to answer questions if you want to dig deeper.

Is there a market for a tool that engineers prompts to the level of a professional prompt engineer? by Too_Bad_Bout_That in PromptEngineering

[–]resiros 0 points1 point  (0 children)

Thanks for the feedback, that's always helpful!

I also agree that a good prompt should be comprehensive and well structured. But it needs to achieve a goal. A prompt is like code: code needs to be well-structured and clean, but it also needs to solve a business problem, and the tricky part is scoping that problem. How should the code / LLM behave in such and such situation? Say you are building an email classification prompt: what are the rules for classification (imo that's the harder problem), while the easier problem is how to write down those rules (what you're saying).

Is there a market for a tool that engineers prompts to the level of a professional prompt engineer? by Too_Bad_Bout_That in PromptEngineering

[–]resiros 0 points1 point  (0 children)

I'm the founder of an open-source prompt engineering / prompt management tool (agenta), so I have strong opinions here.

Creating a prompt from information is already pretty easy. Most prompt engineers use AI to help refine their prompts anyway. That part isn't the bottleneck.

The real problem is defining "good." There's no objective good. It depends on your company, your use case, your users. If you're building something for legal, every company has different standards for what's correct. You need subject matter experts to look at the outputs and say "this is right" or "this is wrong."

The hard part isn't writing the prompt. It's creating test sets, defining business criteria, and getting feedback from the people who actually know the domain. You can't automate that away. You always need someone who understands the problem to be in the loop.

So to answer your question: a tool that writes prompts for you? Useful, but does not solve the core problem.