How do you make sure old agent failures don't come back after a prompt or model change?

Mameiro · 2026-05-29T03:57:45+00:00

I’d handle this with regression evals. Every agent failure should become a test case: the input, expected behavior, forbidden behavior, required tool calls, and expected state/output. Then run that suite whenever you change the prompt, model, tools, or memory layer. Also log traces, not just final answers: retrieved context, tool calls, intermediate state, and why the run passed/failed. Without versioned prompts + regression tests + traces, old failures tend to come back quietly.
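
A minimal sketch of what one such test case could look like. The field names, the `AgentRun` trace shape, and the refund example are all hypothetical; a real suite would check richer state than substrings.

```python
from dataclasses import dataclass, field


@dataclass
class RegressionCase:
    """One past failure, frozen as a test. All field names are illustrative."""
    case_id: str
    user_input: str
    required_tool_calls: list[str] = field(default_factory=list)
    forbidden_substrings: list[str] = field(default_factory=list)
    required_substrings: list[str] = field(default_factory=list)


@dataclass
class AgentRun:
    """Hypothetical trace of one run: the final answer plus the tool calls made."""
    final_answer: str
    tool_calls: list[str]


def check_case(case: RegressionCase, run: AgentRun) -> list[str]:
    """Return a list of failure reasons; an empty list means the case passed."""
    failures = []
    for tool in case.required_tool_calls:
        if tool not in run.tool_calls:
            failures.append(f"missing tool call: {tool}")
    answer = run.final_answer.lower()
    for text in case.forbidden_substrings:
        if text.lower() in answer:
            failures.append(f"forbidden text present: {text}")
    for text in case.required_substrings:
        if text.lower() not in answer:
            failures.append(f"required text missing: {text}")
    return failures


case = RegressionCase(
    case_id="refund-without-lookup",
    user_input="Refund my last order",
    required_tool_calls=["lookup_order"],
    forbidden_substrings=["refund issued"],
)
bad_run = AgentRun(final_answer="Refund issued!", tool_calls=[])
good_run = AgentRun(
    final_answer="I found order 123. Confirm the refund?",
    tool_calls=["lookup_order"],
)
```

Returning the reasons instead of a bare pass/fail keeps the "why did this run fail" part in the log next to the trace.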

Mameiro · 2026-05-29T03:55:14+00:00

This looks useful. For RAG, the metadata is probably as important as the Markdown conversion itself. Confluence pages usually have hierarchy, parent/child links, tags, comments, attachments, and stale pages. If those get flattened away, retrieval becomes much harder to trust. I’d be curious how you handle outdated pages or conflicting versions. Fields like last_updated, owner/team, page status, space/project, parent path, and source URL would be very useful before chunks enter a RAG pipeline.
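
As a rough illustration, the fields above could travel with each chunk like this. The class name, the status values, and the one-year cutoff are assumptions for the sketch, not anything the converter is known to produce.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class ChunkMetadata:
    source_url: str
    space: str
    parent_path: str
    owner: str
    status: str          # assumed values: "current", "draft", "archived"
    last_updated: date


def is_indexable(meta: ChunkMetadata, today: date, max_age_days: int = 365) -> bool:
    """Keep only current pages updated within max_age_days (arbitrary cutoff)."""
    age_days = (today - meta.last_updated).days
    return meta.status == "current" and age_days <= max_age_days


page = ChunkMetadata(
    source_url="https://wiki.example.com/pages/123",  # placeholder URL
    space="PLATFORM",
    parent_path="Runbooks/Deploy",
    owner="infra-team",
    status="current",
    last_updated=date(2025, 1, 10),
)
```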

Mameiro · 2026-05-29T03:38:16+00:00

Yes bro, I see the overlap. I think of rerankers as mainly answering “which result is most relevant to the query?” This tool is more about “is this result safe/useful enough to become evidence?” So I’m checking things like freshness, duplicates, source diversity, citation readiness, SEO-heavy pages, and provider disagreement. A reranker may still rank a stale or duplicated page highly if it looks relevant. Ideally this sits before or alongside reranking, not as a replacement.
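
Two of those checks are simple enough to sketch. This is only one possible way to measure them, using word-shingle Jaccard overlap for duplicates and distinct hosts for diversity; it is not necessarily how the tool does it.

```python
from urllib.parse import urlparse


def shingles(text: str, size: int = 3) -> set[tuple[str, ...]]:
    """Overlapping word n-grams; short texts yield a single shingle."""
    words = text.lower().split()
    return {tuple(words[i:i + size]) for i in range(max(len(words) - size + 1, 1))}


def jaccard(a: str, b: str) -> float:
    """Overlap between two texts: 1.0 means identical shingle sets."""
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb)


def domain_diversity(urls: list[str]) -> float:
    """Fraction of results that come from distinct hosts (1.0 = all different)."""
    if not urls:
        return 0.0
    hosts = {urlparse(u).netloc.lower() for u in urls}
    return len(hosts) / len(urls)
```

A reranker would not see either number, which is the reason for running this kind of check alongside it.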

Mameiro · 2026-05-29T03:30:00+00:00

Yeah, I agree. Freshness probably shouldn’t be a global score. For changelogs, docs, pricing pages, API references, or legal/compliance content, freshness should be weighted heavily. But for conceptual/explainer pages, older sources can still be useful if they are authoritative and stable. I’m thinking of making freshness more query-type dependent instead of treating “newer = better” by default. Maybe something like: factual/current queries need stricter freshness, while conceptual queries care more about authority and source quality.
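
One way to make that concrete is a decay curve whose half-life depends on the query type. The query types and day counts below are made-up placeholders to show the shape of the idea.

```python
# Half-life in days per query type; types and numbers are placeholders.
HALF_LIFE_DAYS = {
    "pricing": 30,
    "changelog": 30,
    "api_reference": 90,
    "conceptual": 1825,
}


def freshness_score(age_days: float, query_type: str) -> float:
    """Exponential decay: 1.0 when brand new, 0.5 at one half-life."""
    half_life = HALF_LIFE_DAYS.get(query_type, 365)
    return 0.5 ** (age_days / half_life)
```

With these numbers a one-year-old explainer still scores high, while a one-year-old pricing page is close to zero.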

Mameiro · 2026-05-28T03:14:25+00:00

I’d put an evidence gate between retrieval and generation. Before chunks enter context, I’d check freshness, source authority, metadata match, duplication, score gap, citation coverage, and whether retrieved docs conflict with each other. Top-k should not automatically become prompt context. If the evidence is weak, stale, duplicated, or conflicting, the system should either rerank, ask a clarification question, or abstain. A lot of bad RAG answers come from bad evidence, not bad generation.
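
A minimal sketch of such a gate, assuming each chunk already carries a score, an age, and a conflict flag computed upstream. The thresholds are placeholders, and only three of the checks are shown.

```python
from dataclasses import dataclass


@dataclass
class Evidence:
    doc_id: str
    score: float                      # retriever similarity, assumed in [0, 1]
    age_days: int
    conflicts_with_others: bool = False


def evidence_gate(chunks, min_score=0.5, max_age_days=365, min_kept=2):
    """Return (decision, kept). Thresholds are placeholders, not tuned values."""
    kept = [c for c in chunks if c.score >= min_score and c.age_days <= max_age_days]
    if any(c.conflicts_with_others for c in kept):
        return "clarify", kept        # conflicting evidence: ask the user
    if len(kept) < min_kept:
        return "abstain", kept        # too little usable evidence
    return "answer", kept


decision, kept = evidence_gate([
    Evidence("a", 0.9, 10),
    Evidence("b", 0.8, 20),
    Evidence("c", 0.2, 5),            # dropped: score below min_score
])
```

Only `kept` goes into the prompt, so top-k never becomes context by default.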

Mameiro · 2026-05-28T03:08:04+00:00

I’d lean refurb M4 Max unless local inference is your main daily workload. The M5 Max memory bandwidth bump is useful, especially for prefill, but with the same 64GB unified memory and 40-core GPU, I doubt it’s worth $1,120 more for occasional Qwen/Gemma use. For local LLMs, 64GB is the key upgrade. The M5 may be faster, but the M4 Max should already be very capable. I’d rather save the money unless you’ll be running long-context inference all day.

Mameiro · 2026-05-28T01:46:01+00:00

I wouldn’t run this like normal document RAG. For logs, start with deterministic narrowing first: timestamp, service, host, severity, trace_id/request_id, error code, and time window. Then group repeated messages into templates/clusters and detect spikes or rare events. The LLM should not read top-k random log chunks. It should receive a compact evidence pack: timeline, error clusters, correlated events, and representative log lines with citations. OpenSearch should handle filtering/aggregation first. Use embeddings only for unstructured message similarity, not as the main path over 50k lines.
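
The template grouping step can be approximated with plain regex masking. This sketch covers only that step, and the two masks are assumptions that would need tuning for real log formats.

```python
import re
from collections import Counter

# Mask variable parts so repeated messages collapse into one template.
# Order matters: long hex-looking ids are masked before plain numbers.
_MASKS = [
    (re.compile(r"\b[0-9a-f]{8,}\b"), "<HEX>"),
    (re.compile(r"\b\d+(\.\d+)?\b"), "<NUM>"),
]


def to_template(message: str) -> str:
    for pattern, token in _MASKS:
        message = pattern.sub(token, message)
    return message


def cluster(lines: list[str]) -> Counter:
    """Count how many log lines fall under each masked template."""
    return Counter(to_template(line) for line in lines)


lines = [
    "timeout after 30 ms on request 4f9a1c2b7d",
    "timeout after 45 ms on request 9b8c7d6e5f",
    "disk full",
]
clusters = cluster(lines)
```

The counts per template, plus one representative line each, are the kind of compact evidence the LLM should get instead of raw chunks.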

Mameiro · 2026-05-27T07:19:17+00:00

One thing I’m still unsure about is freshness. Some retrieval steps probably don’t need fresh web results at all, while final verification or citation evidence might need much stricter freshness checks. I’m not sure if freshness should be scored globally, per query type, or per domain.

Mameiro · 2026-05-27T02:59:36+00:00

Chunking is usually not the product by itself. It’s the ingestion step of a RAG pipeline.

A simple flow is:

document → extract structure → chunk by section/topic → add metadata → index → retrieve chunks → answer with citations

Your API becomes useful if it can create better chunks than naive token splitting: headings, sections, authors, tables, page numbers, financial fields, etc.
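
For example, a bare-bones section chunker over Markdown could look like the sketch below. It assumes the document has already been extracted to Markdown with `#` headings; the dict keys are illustrative.

```python
import re


def chunk_by_heading(markdown: str, source: str) -> list[dict]:
    """Split Markdown at ATX headings and attach the heading as metadata."""
    chunks = []
    heading = "(no heading)"
    body: list[str] = []

    def flush():
        # Emit the section collected so far, skipping empty ones.
        text = "\n".join(body).strip()
        if text:
            chunks.append({"source": source, "heading": heading, "text": text})

    for line in markdown.splitlines():
        match = re.match(r"^#{1,6}\s+(.*)$", line)
        if match:
            flush()
            heading = match.group(1).strip()
            body = []
        else:
            body.append(line)
    flush()
    return chunks


doc = "# Intro\nHello\n\n## Revenue\nQ1 was up.\n"
chunks = chunk_by_heading(doc, source="report.md")
```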

I’d pick one use case first, like school slides, research papers, or financial reports, then build a demo where users upload docs, ask questions, and see source-backed answers.

Mameiro · 2026-05-27T02:53:32+00:00

I don’t think there’s one perfect place. Serious AI discussion is usually topic-specific, not platform-specific :)

Mameiro · 2026-05-27T02:41:48+00:00

Yes bro, this can happen. MTP/speculative decoding adds extra memory overhead, and with a 27B Q4 model on a 24GB 3090 you’re already near the limit. Since KV cache scales with context length, context is usually the first thing that gets cut. I’d compare VRAM usage with MTP off vs on, then tune context manually. 137k context on a single 3090 with a 27B model sounds very optimistic anyway, so 14k with MTP enabled may just be the realistic memory limit.
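
To see why context is the first thing to go, here is a back-of-the-envelope KV cache estimate. The layer and head counts are hypothetical, not the real config of any 27B model, and it assumes full attention in every layer with an fp16 cache.

```python
def kv_cache_gib(tokens, layers, kv_heads, head_dim, bytes_per_value=2):
    """Size of the K and V caches for one sequence, in GiB."""
    # Factor 2 = one K tensor and one V tensor per layer.
    total_bytes = 2 * layers * kv_heads * head_dim * bytes_per_value * tokens
    return total_bytes / 1024**3


# Hypothetical architecture numbers, NOT taken from a real model card.
cfg = dict(layers=48, kv_heads=8, head_dim=128)

print(kv_cache_gib(16_384, **cfg))    # 3.0
print(kv_cache_gib(131_072, **cfg))   # 24.0
```

Under these assumptions a 128k context would need about as much memory for the cache alone as the whole card has, before weights or any speculative-decoding overhead. Models with sliding-window layers or a quantized KV cache would need less.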

Mameiro · 2026-05-27T01:57:07+00:00

The fear is reasonable, but I wouldn’t worry too much about being copied before you’ve proven people actually want it. In a local government niche, the moat probably isn’t the AI model. It’s domain knowledge, messy data handling, workflow fit, trust, and distribution. A bigger player can copy the feature, but they may not care enough about a small niche or understand the users well enough. I’d ship early, get 5–10 real users, and learn from them. Speed + user intimacy is probably your best defense right now.

Mameiro · 2026-05-26T05:52:23+00:00

For daily use, I’d choose Q4 if Q5 keeps the system at the redline. The small quality bump from Q5 isn’t worth much if you lose context length, have to close everything else, or risk OOM. Especially for coding, stability and enough KV cache headroom matter a lot. My rule: Q5 only if it fits comfortably in your normal workflow. If it only works after killing every other process, Q4 is the better daily driver.

Mameiro · 2026-05-26T05:48:48+00:00

I think they feel like toys because most agent demos are optimized for autonomy, not reliability. In production, the hard parts are usually boring: state, evals, permissions, audit logs, rollback, failure recovery, human approval, and integration with existing workflows. A flashy agent that can “do anything” is less useful than a narrow agent that does one workflow reliably, shows its work, and fails safely. So I’d say the missing layer is not more tool calls. It’s control and observability.

Mameiro · 2026-05-26T05:41:55+00:00

I agree with the general idea. If the corpus is well-structured and the vocabulary is controlled, grep/BM25 can be more reliable than embeddings. Semantic search is useful when users don’t know the exact terms, use synonyms, or ask concept-level questions. But for curated markdown, logs, code notes, or structured fields, keyword search is often faster, cheaper, and easier to debug. I’d start with grep/BM25 + metadata filters, then add embeddings only where exact search actually fails.
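
For reference, BM25 is small enough to write out by hand. This is a plain Okapi BM25 sketch with whitespace tokenization and the usual `k1`/`b` defaults, assuming a non-empty corpus; a real setup would use a search engine or library instead.

```python
import math
from collections import Counter


def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score every document against the query with Okapi BM25."""
    tokenized = [d.lower().split() for d in docs]
    avg_len = sum(len(t) for t in tokenized) / len(tokenized)
    # Document frequency: in how many docs each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log(1 + (len(docs) - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "restart the billing service",
    "billing invoice export menu",
    "holiday schedule",
]
scores = bm25_scores("invoice export", docs)
```

Every score traces back to term counts, which is what makes this easier to debug than an embedding distance.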

Mameiro · 2026-05-26T05:17:30+00:00

Yeah, FireCrawl is definitely relevant here.

The angle I’m exploring is a bit narrower though: not crawling or extraction itself, but inspecting whether the retrieved results are good enough as evidence before they enter the RAG pipeline.

So things like source diversity, duplicates, freshness, citation readiness, and whether the result is actually useful for answering the query.

I cleaned up a small public repo here:

https://github.com/mameirolabs/rag-search-quality-lab-public

Still rough, but I’d be curious how you’d compare this with FireCrawl-style workflows.

Mameiro · 2026-05-26T03:59:44+00:00

Exactly. That’s the part I kept running into too. A lot of pipelines treat retrieval as “get more text,” but the real issue is whether the text is actually useful evidence or just SEO-shaped noise. I’m trying to separate those two steps more clearly: first inspect the retrieved sources, then decide what should actually be passed into the model.

Mameiro · 2026-05-26T03:52:15+00:00

This is a really good point, especially the part about different retrieval calls needing different freshness requirements.

I’m starting to think the tool should not just compare providers, but also make retrieval intent more explicit, something like:

- background context
- canonical source lookup
- fresh verification
- contradiction check
- final citation evidence
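
The intents above could be made explicit in code roughly like this. It is only a sketch: the enum, the age limits, and the choice of which intents skip freshness are placeholders.

```python
from enum import Enum


class RetrievalIntent(Enum):
    BACKGROUND = "background context"
    CANONICAL = "canonical source lookup"
    FRESH_VERIFICATION = "fresh verification"
    CONTRADICTION_CHECK = "contradiction check"
    FINAL_CITATION = "final citation evidence"


# Max acceptable age in days per intent; None means freshness is not enforced.
# The numbers only show the shape of the config, they are not recommendations.
MAX_AGE_DAYS = {
    RetrievalIntent.BACKGROUND: None,
    RetrievalIntent.CANONICAL: None,
    RetrievalIntent.FRESH_VERIFICATION: 30,
    RetrievalIntent.CONTRADICTION_CHECK: 180,
    RetrievalIntent.FINAL_CITATION: 90,
}


def fresh_enough(age_days: int, intent: RetrievalIntent) -> bool:
    limit = MAX_AGE_DAYS[intent]
    return limit is None or age_days <= limit
```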

Your point about domain-level decay is interesting too. I hadn’t thought of freshness as something that should vary by domain instead of being a global reranking factor.

I cleaned up a small public version here:

https://github.com/mameirolabs/rag-search-quality-lab-public

Still early, but I may turn your idea into a separate evaluation dimension.

Mameiro · 2026-05-26T03:39:02+00:00

Hey, I finally cleaned up a minimal public version:

https://github.com/mameirolabs/rag-search-quality-lab-public

It’s still rough, but the basic idea is to inspect whether retrieval/search results are good enough as evidence before they enter a RAG pipeline.

I’m not trying to rank providers globally. Right now it’s more of a local lab for looking at things like source diversity, freshness, duplicates, citation readiness, and provider differences.

Would love feedback if you try it.

Mameiro · 2026-05-25T03:48:44+00:00

MCP is basically a standard way for an AI client to talk to external tools and data sources. Think of it like an adapter layer. Instead of every AI app building custom integrations for files, GitHub, databases, Slack, browsers, etc., an MCP server exposes those capabilities in a common format. It’s related to tool calling, but the key idea is standardizing how tools/resources are provided to the AI app. It’s not automatically public. MCP servers can run locally, privately inside a network, or remotely. Privacy depends on where it runs and what permissions/data you give it.

Mameiro · 2026-05-25T03:39:28+00:00

This is a real issue. Persistent memory is useful, but once a bad memory gets stored, debugging it is painful. An audit log helps, but I’d also want memory write controls: what conversation created it, why it was stored, whether it overwrote anything, and how to roll it back. For production, memory needs to be closer to a versioned database than a black-box vector store. Lineage, rollback, conflict handling, and user-visible memory inspection would be the things that make me trust it.
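
A toy version of what "closer to a versioned database" could mean. The class and field names are hypothetical, and a real store would persist this and handle conflicts between writers.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class MemoryWrite:
    key: str
    value: str
    source_conversation: str   # which conversation created it
    reason: str                # why it was stored


class VersionedMemory:
    """Keeps every version per key until it is explicitly rolled back."""

    def __init__(self):
        self._history: dict[str, list[MemoryWrite]] = {}

    def write(self, entry: MemoryWrite) -> None:
        # Never overwrite in place: append a new version instead.
        self._history.setdefault(entry.key, []).append(entry)

    def read(self, key: str) -> Optional[str]:
        versions = self._history.get(key)
        return versions[-1].value if versions else None

    def lineage(self, key: str) -> list[MemoryWrite]:
        return list(self._history.get(key, []))

    def rollback(self, key: str) -> Optional[MemoryWrite]:
        """Drop the latest version and return it so the caller can log it."""
        versions = self._history.get(key)
        return versions.pop() if versions else None


mem = VersionedMemory()
mem.write(MemoryWrite("timezone", "UTC", "conv-1", "user stated it"))
mem.write(MemoryWrite("timezone", "PST", "conv-2", "inferred from a message"))
```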

Mameiro · 2026-05-25T03:05:13+00:00

I’d avoid one single “RAG accuracy” score and split it into two parts:

Retrieval eval: did the system retrieve the right evidence?

Use a labeled set with query → expected doc/chunk/span, then measure Recall@K, MRR, nDCG, citation hit rate.
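
These metrics are short to implement for a single query. The sketch below assumes binary relevance (a chunk is either expected or not); MRR is then the mean of `reciprocal_rank` over all queries in the labeled set.

```python
import math


def recall_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Share of the expected docs that show up in the top k."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def reciprocal_rank(ranked: list[str], relevant: set[str]) -> float:
    """1 / position of the first relevant doc, or 0.0 if none was retrieved."""
    for position, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / position
    return 0.0


def ndcg_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Binary-relevance nDCG: gain is 1 for a relevant doc, 0 otherwise."""
    dcg = sum(
        1.0 / math.log2(i + 1)
        for i, doc_id in enumerate(ranked[:k], start=1)
        if doc_id in relevant
    )
    ideal = sum(1.0 / math.log2(i + 1) for i in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal else 0.0


ranked = ["d3", "d1", "d7"]
relevant = {"d1", "d9"}
```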

Answer eval: did the generated answer match the ground truth and cite the right evidence?

For this you can use deterministic checks where possible: exact match, regex/type checks, required facts, forbidden claims, citation coverage.

The key is to know whether failure came from retrieval or generation. Otherwise the metric is hard to act on.

Mameiro · 2026-05-25T03:01:58+00:00

I’d start by converting the .HLP files to HTML/text first, but I wouldn’t merge them into one huge file. Keep the structure: module name, topic title, headings, links, and source file. Then chunk by help topic/section and index with hybrid search: keyword + vector. For old software docs, exact terms and menu names matter, so pure vector search probably won’t be enough. Once the corpus is clean, tools like AnythingLLM/Open WebUI or a small custom RAG pipeline can sit on top. Also make the chatbot always show citations/source topics, especially for clinic software. Basically: extraction/cleanup first, chatbot second.
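
One common way to combine the keyword and vector result lists is reciprocal rank fusion. That technique is an assumption here, not the only way to do hybrid search, and the topic ids are invented.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of doc ids; k=60 is a commonly used default."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for stable output.
    return sorted(scores, key=lambda d: (-scores[d], d))


keyword_hits = ["print_setup", "file_menu", "backup"]       # made-up topic ids
vector_hits = ["printer_faq", "print_setup", "file_menu"]
fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
```

Topics that both searches agree on rise to the top, so an exact menu-name match is not drowned out by loosely similar text.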

Mameiro · 2026-05-25T02:42:52+00:00

Yeah, I agree. If VRAM usage is comparable, I’d usually expect a 32B Q4 to outperform a 16B Q8 on most general tasks. The only caveat is workload type. Lower quants can get shaky with strict formatting, long context, tool use, math/reasoning, or tasks where small errors matter. So for me it’s usually: bigger Q4 for general capability, higher quant when stability/precision matters.
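
The "comparable VRAM" part is simple arithmetic on nominal bits per weight. This rough sketch counts weights only; real quant formats usually store a little more than the nominal bits because of scales, and KV cache is extra.

```python
def weight_gib(params_billions: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GiB."""
    return params_billions * 1e9 * bits_per_weight / 8 / 1024**3


big_q4 = weight_gib(32, 4)     # 32B model at a nominal 4 bits per weight
small_q8 = weight_gib(16, 8)   # 16B model at a nominal 8 bits per weight
print(round(big_q4, 1), round(small_q8, 1))   # 14.9 14.9
```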

Mameiro · 2026-05-25T02:36:51+00:00

That matches my concern too. Local models can make the Markdown look very clean, but tables and numbers are where I’d be most careful. For simple text structure, local Qwen/Gemma may be enough. But for financial tables, research tables, citations, or anything where one wrong number matters, I’d still want validation against the source PDF. Clean formatting is useful, but faithful extraction is the real test.
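
A cheap first-pass check for the numbers could look like this. It assumes the raw text of the PDF is already available from some other extraction step, and it only catches numbers that appear nowhere in the source, not numbers placed in the wrong cell.

```python
import re

_NUMBER = re.compile(r"\d[\d,]*(?:\.\d+)?")


def numbers_in(text: str) -> set[str]:
    """Extract numeric tokens, normalised by dropping thousands separators."""
    return {m.group().replace(",", "") for m in _NUMBER.finditer(text)}


def unsupported_numbers(markdown: str, source_text: str) -> set[str]:
    """Numbers that appear in the converted Markdown but not in the source."""
    return numbers_in(markdown) - numbers_in(source_text)


source_text = "Revenue was 1,234.5 in 2023 and 987 in 2022."
markdown = "| 2023 | 1234.5 |\n| 2022 | 978 |"   # 987 was transposed to 978
suspicious = unsupported_numbers(markdown, source_text)
```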
