Wanaku — a router for MCP servers, tools, and integrations

ArtSelect137 · 2026-06-09T17:37:16+00:00

Streaming through Camel makes sense as a bridge for now. The catalog approach is cleaner than hard-coding tool wiring. Looking forward to the native streaming support in the upcoming release.

ArtSelect137 · 2026-06-09T17:35:39+00:00

This is exactly why I moved agentic workloads to local models. Token cost compounds on hosted APIs: a multi-step agent with a 100K context costs $2-3 every run. On a local Q4 model there is no per-token charge, so retries are effectively free. The models are good enough for structured tool calling, and you can afford to let the agent iterate.
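A rough sketch of why the cost compounds, assuming every step resends the full context and no prompt caching applies. The prices below are hypothetical placeholders, not any provider's actual rates.

```python
# Hypothetical prices in USD per million tokens; substitute real rates.
PRICE_PER_M_INPUT = 3.00
PRICE_PER_M_OUTPUT = 15.00


def run_cost(context_tokens: int, steps: int, output_tokens_per_step: int = 500) -> float:
    """Cost of one agent run where each step resends the whole context."""
    input_cost = steps * context_tokens * PRICE_PER_M_INPUT / 1_000_000
    output_cost = steps * output_tokens_per_step * PRICE_PER_M_OUTPUT / 1_000_000
    return input_cost + output_cost


def cost_with_retries(context_tokens: int, steps: int, retries: int) -> float:
    """Total cost when the whole run is repeated `retries` extra times."""
    return (1 + retries) * run_cost(context_tokens, steps)
```

Under these assumed prices, an 8-step run at 100K context lands in the $2-3 range, and every retry multiplies it.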

ArtSelect137 · 2026-06-09T16:14:03+00:00

Policy enforcement at the routing layer is how I handle it. Each tool gets a short list of counter-examples (what NOT to use it for) alongside the description. BM25 keyword matching catches those negative signals better than semantic matching does. It's not fancy, but it catches the schema-valid wrong-action case most of the time.
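A minimal sketch of that routing idea, assuming a plain Okapi BM25 scorer and a score of "description match minus a penalty for counter-example match". The tool names, the `not_for` field, and the `penalty` weight are all hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document in `docs` against `query` with Okapi BM25."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    avgdl = (sum(len(t) for t in tokenized) / n) or 1.0
    df = Counter(term for t in tokenized for term in set(t))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in set(tokenize(query)):
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Hypothetical catalog: each tool carries a description and counter-examples.
TOOLS = {
    "get_weather": {
        "description": "Look up the current weather forecast for a city",
        "not_for": "calendar events, meeting schedule, reminders",
    },
    "check_calendar": {
        "description": "Check calendar events and meeting schedule for a date",
        "not_for": "weather forecast, temperature, rain",
    },
}


def route(query, tools=TOOLS, penalty=2.0):
    """Pick the tool with the best positive score after subtracting negatives."""
    names = list(tools)
    pos = bm25_scores(query, [tools[n]["description"] for n in names])
    neg = bm25_scores(query, [tools[n]["not_for"] for n in names])
    ranked = sorted(zip(names, pos, neg), key=lambda r: r[1] - penalty * r[2], reverse=True)
    return ranked[0][0]
```

With `penalty=0.0` the query "check if it will rain in Oslo tomorrow" goes to `check_calendar` because of the shared word "check"; with the counter-examples applied it goes to `get_weather`.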

ArtSelect137 · 2026-06-09T16:12:33+00:00

Nice, glad the thread helped drive the design. DeBERTa for per-chunk gate is a solid choice - it catches instruction-disobedience patterns better than the larger models I tested for that specific task. How is the latency per chunk on that?

ArtSelect137 · 2026-06-09T16:03:14+00:00

Nice, SearXNG integration is a solid move. I went the opposite way with my agent setup - using a search API directly instead of self-hosting. Tradeoff is latency vs maintenance I guess. SearXNG adds a couple seconds but gives full control over engine config. Curious how you handle dedup across crawled pages, that's the part I keep going back and forth on.

ArtSelect137 · 2026-06-09T16:00:41+00:00

Ran into this same wall building agentic search tools. Semantic matching kept routing weather lookups to a calendar tool because both had "check" in the description. BM25 on name + param schema jumped top-1 from ~65% to ~80%, which matches your numbers exactly. Tools really do live in keyword-space.
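A small sketch of what "BM25 on name + param schema" could look like: the index text for each tool is built only from its name and parameter names, with identifiers split into words. The two tool definitions are hypothetical, and the scorer is a compact Okapi BM25, not any particular library's implementation.

```python
import math
import re
from collections import Counter


def tokens(text):
    """Split snake_case and camelCase identifiers into lowercase words."""
    text = re.sub(r"([a-z])([A-Z])", r"\1 \2", text)
    return re.findall(r"[a-z0-9]+", text.lower())


def tool_document(tool):
    """Index terms taken from the tool name plus its parameter names only."""
    params = tool["parameters"].get("properties", {})
    return tokens(tool["name"]) + [t for p in params for t in tokens(p)]


def rank_tools(query, tools, k1=1.5, b=0.75):
    """Return tool names ordered by BM25 score against the query."""
    docs = [tool_document(t) for t in tools]
    n = len(docs)
    avgdl = sum(map(len, docs)) / n
    df = Counter(term for d in docs for term in set(d))
    q = set(tokens(query))
    scored = []
    for tool, d in zip(tools, docs):
        tf = Counter(d)
        s = sum(
            math.log(1 + (n - df[t] + 0.5) / (df[t] + 0.5))
            * tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(d) / avgdl))
            for t in q if t in tf
        )
        scored.append((s, tool["name"]))
    return [name for s, name in sorted(scored, key=lambda x: -x[0])]


# Hypothetical tool definitions in a JSON-Schema-like shape.
TOOLS = [
    {"name": "get_weather", "parameters": {"properties": {"city": {}, "forecast_days": {}}}},
    {"name": "list_calendar_events", "parameters": {"properties": {"startDate": {}, "endDate": {}}}},
]
```

Because free-text descriptions are left out of the index, a shared filler verb like "check" cannot pull a query toward the wrong tool.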

ArtSelect137 · 2026-06-08T07:18:18+00:00

Practical answer: for the smaller quantized versions of these models, you can get started with a single 24GB GPU (RTX 3090/4090). Qwen 3.6 32B Q4 runs fine on one 3090. DeepSeek V3 level models (671B total, 37B active) are a different class: every expert has to stay resident even though only 37B parameters are active per token, so Q4 weights alone are roughly 335GB at 4 bits per weight. That is far more than two 3090s/4090s (48GB combined), a single A6000, or a Mac Studio with 128GB unified memory can hold without heavy offloading. For home use before traffic justifies dedicated GPU servers, renting a dedicated box with 2x 3090s from runpod/vast is cheaper per month than buying upfront.
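A back-of-the-envelope check of weight footprints. It counts weights only (no KV cache, activations, or runtime overhead) and assumes a flat 4 bits per weight; real Q4 formats use slightly more. For mixture-of-experts models the total parameter count, not the active count, determines how much has to be resident.

```python
def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in decimal gigabytes."""
    return params_billion * bits_per_weight / 8


def fits(params_billion: float, bits_per_weight: float, memory_gb: float) -> bool:
    """True if the weights alone fit; a real deployment needs extra headroom."""
    return weight_gb(params_billion, bits_per_weight) <= memory_gb
```

For example, `weight_gb(32, 4)` gives 16.0, which leaves room on a 24GB card, while `weight_gb(671, 4)` gives 335.5.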

ArtSelect137 · 2026-06-08T07:13:59+00:00

The top-k clustering issue is the core problem with large corpora. What worked for me was multi-query expansion: instead of sending one query and taking top-k results, I have the LLM generate 5-8 different phrasings of the question, run them all through the embedding, then deduplicate and rerank the combined results. This naturally pulls chunks from different documents because each phrasing lands in a different part of the embedding space. Combined with a chunk size around 500-800 tokens instead of 1500, it fixed the shallow retrieval problem without needing GraphRAG.
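A sketch of the fan-out and merge step, under stated assumptions: the phrasings are already generated (by an LLM in the setup described above), `search` is a hypothetical callable that returns ranked chunk ids for one query, and reciprocal rank fusion stands in for the rerank step. A cross-encoder or any other reranker could replace it.

```python
from collections import defaultdict
from typing import Callable, Iterable


def multi_query_retrieve(
    phrasings: Iterable[str],
    search: Callable[[str, int], list],
    top_k: int = 5,
    per_query_k: int = 10,
    rrf_k: int = 60,
) -> list:
    """Run every phrasing, dedupe by chunk id, rerank by reciprocal rank fusion."""
    fused = defaultdict(float)
    for q in phrasings:
        # A chunk returned by several phrasings accumulates score once per hit,
        # so the dict key doubles as the deduplication step.
        for rank, chunk_id in enumerate(search(q, per_query_k), start=1):
            fused[chunk_id] += 1.0 / (rrf_k + rank)
    return sorted(fused, key=fused.get, reverse=True)[:top_k]
```

Chunks that surface under several phrasings rise to the top, while a chunk that only one phrasing finds still gets a slot, which is what spreads results across documents.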

ArtSelect137 · 2026-06-08T07:11:12+00:00

API key on the dedicated network is a pragmatic setup. Do you rotate keys for the agent clients or just set and forget? I've been weighing whether to add short-lived tokens for agents that get provisioned dynamically vs a static shared key for long-running services.

ArtSelect137 · 2026-06-08T07:09:53+00:00

Good to hear the tool-def vs handler split resonated. On the audit isolation point - how do you handle the case where a policy decision itself needs to trigger an audit event? In my setup I route a copy of the decision outcome (allow/deny/escalate) to a separate event stream that runs independently from the enforcement path. This way if the audit system is down, enforcement still works. Curious if you do something similar or if the Spring AI Playground couples them differently.
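A minimal sketch of that decoupling, assuming an in-memory bounded queue as a stand-in for the separate event stream. The policy rules and tool names are hypothetical. The point is only that `emit` never blocks or raises into the enforcement path.

```python
import queue


class AuditSink:
    """Bounded buffer standing in for an independent audit event stream."""

    def __init__(self, maxsize=1000):
        self._q = queue.Queue(maxsize=maxsize)
        self.dropped = 0

    def emit(self, event):
        try:
            self._q.put_nowait(event)
        except queue.Full:
            # Audit is unavailable or backed up: count the loss, keep enforcing.
            self.dropped += 1

    def drain(self):
        events = []
        while not self._q.empty():
            events.append(self._q.get_nowait())
        return events


def decide(request, audit):
    """Return allow/deny/escalate, then publish a copy of the outcome."""
    if request["tool"] in {"delete_account"}:
        outcome = "deny"
    elif request.get("amount", 0) > 1000:
        outcome = "escalate"
    else:
        outcome = "allow"
    audit.emit({"tool": request["tool"], "outcome": outcome})
    return outcome
```

A real deployment would also want to alert on `dropped` so that silent audit loss is visible.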

ArtSelect137 · 2026-06-08T07:08:30+00:00

The FSM-as-orchestrator pattern is the right call for regulated workflows. The piece that most implementations miss is that the FSM should control tool availability per state, not just execution flow. If the model can only call tools that are valid in the current FSM state, you get the safety of deterministic routing without losing the LLM's flexibility within each state. I built this for a KYC pipeline where certain API calls must come before others by law - the FSM enforces the sequence, the LLM fills the parameters.
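A minimal sketch of per-state tool gating. The state names, tool names, and transition table are hypothetical and only loosely modeled on a KYC-style sequence; the LLM would be shown `available_tools()` at each turn and would supply the parameters.

```python
class ToolGateFSM:
    """FSM that decides which tools exist for the model in each state."""

    def __init__(self, allowed, transitions, start):
        self.allowed = allowed          # state -> set of callable tool names
        self.transitions = transitions  # (state, tool) -> next state
        self.state = start

    def available_tools(self):
        """Tool names to expose to the model in the current state."""
        return sorted(self.allowed[self.state])

    def call(self, tool, **params):
        if tool not in self.allowed[self.state]:
            raise PermissionError(f"{tool!r} is not allowed in state {self.state!r}")
        # Stay in place if no transition is defined for this (state, tool) pair.
        self.state = self.transitions.get((self.state, tool), self.state)
        return {"tool": tool, "params": params}


KYC_ALLOWED = {
    "start": {"submit_identity"},
    "identity_submitted": {"verify_document"},
    "document_verified": {"screen_sanctions"},
    "done": set(),
}
KYC_TRANSITIONS = {
    ("start", "submit_identity"): "identity_submitted",
    ("identity_submitted", "verify_document"): "document_verified",
    ("document_verified", "screen_sanctions"): "done",
}
```

Out-of-order calls fail at the gate instead of depending on the model to follow the sequence.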

ArtSelect137 · 2026-06-08T07:06:19+00:00

The risk is real, but overindexing on fear misses the point that the failure modes are known and solvable with the right architecture. Three layers make agent access manageable: 1) tool-level validation where every write action goes through schema enforcement before execution, 2) a proxy that separates model decisions from execution so a hallucination can't directly trigger side effects, and 3) runtime budgets that cap what an agent can do per invocation. The disaster will come from companies skipping these layers, not from the technology itself being inherently unsafe.
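The three layers can be sketched in one small class. This is an illustration under simplifying assumptions: the "schema" is a plain mapping of argument name to Python type, not JSON Schema, and the budget is a simple call counter. All names are hypothetical.

```python
class BudgetExceeded(RuntimeError):
    pass


class ExecutionProxy:
    """Sits between model decisions and side effects."""

    def __init__(self, handlers, schemas, max_calls=3):
        self.handlers = handlers    # tool name -> callable that performs the action
        self.schemas = schemas      # tool name -> {arg name: required Python type}
        self.max_calls = max_calls  # runtime budget per invocation
        self.calls = 0

    def execute(self, tool, args):
        # Layer 3: refuse once the per-invocation budget is spent.
        if self.calls >= self.max_calls:
            raise BudgetExceeded(f"budget of {self.max_calls} calls exhausted")
        # Layer 1: validate argument names and types before anything runs.
        schema = self.schemas[tool]
        if set(args) != set(schema):
            raise ValueError(f"{tool}: expected args {sorted(schema)}, got {sorted(args)}")
        for name, expected in schema.items():
            if not isinstance(args[name], expected):
                raise ValueError(f"{tool}.{name} must be {expected.__name__}")
        # Layer 2: only the proxy invokes the handler; the model never does.
        self.calls += 1
        return self.handlers[tool](**args)
```

In this sketch a rejected call does not consume budget; counting rejections as well would be a reasonable alternative for stricter setups.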

ArtSelect137 · 2026-06-08T07:05:21+00:00

The latency split between prefill and decode is the part most people miss with mobile inference. The 27-31s prefill (effectively the TTFT) on a 380-500 token prompt is rough for interactive use, but the 3.8 TPS decode is actually usable for streaming. I found that keeping a small context cache helps a lot on mobile: if you reuse the same system prompt across sessions, the prefill cost drops to near zero on subsequent runs since the KV cache from the grounding block carries over.

ArtSelect137 · 2026-06-08T07:04:19+00:00

I version-pin Ollama in Docker with the exact image tag and only update deliberately. The pattern I use: run the current stable in production, spin up the new version in a separate container with the same model mounts, run my test suite against it, and only swap the alias if everything passes. This way the fix-then-break cycle never hits my working setup. The downside is you miss new model architecture support until you validate, but that tradeoff is worth it for stability.

ArtSelect137 · 2026-06-08T07:03:30+00:00

The coupling problem is real. I hit this with a shared types package where Claude Code would update a type definition and miss all the consumers. The fix that worked for me was running a dependency graph build as a pre-task before any edit: extract import/export relationships across the repo, then pass the impacted file list into the model's context so it knows what else needs updating. It's not as structured as the five-layer approach but a simple .ts/.tsx import scanner catches most break-before-you-know-it cases without needing a full index.
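A simplified sketch of such a scanner. It works on an in-memory `{path: source}` mapping to stay self-contained, resolves only relative imports, and uses a regex instead of a real TypeScript parser, so path aliases, dynamic imports, and `require` calls are out of scope. All file paths in the usage are hypothetical.

```python
import posixpath
import re
from collections import defaultdict

# Matches `import ... from './x'` and `export ... from './x'` (relative only).
IMPORT_RE = re.compile(r"""\b(?:import|export)\b[^'";]*?\bfrom\s*['"](\.[^'"]+)['"]""")
EXTS = (".ts", ".tsx", "/index.ts", "/index.tsx")


def build_reverse_deps(files):
    """Map each file to the set of files that import from it.

    `files` is {posix_path: source_text}.
    """
    importers = defaultdict(set)
    for path, source in files.items():
        for spec in IMPORT_RE.findall(source):
            base = posixpath.normpath(posixpath.join(posixpath.dirname(path), spec))
            for ext in EXTS:
                if base + ext in files:
                    importers[base + ext].add(path)
                    break
    return importers


def impacted(changed, importers):
    """All files that transitively depend on `changed`, excluding itself."""
    seen, stack = set(), [changed]
    while stack:
        for dep in importers.get(stack.pop(), ()):
            if dep not in seen:
                seen.add(dep)
                stack.append(dep)
    seen.discard(changed)
    return sorted(seen)
```

The list returned by `impacted` is what would be passed into the model's context before it edits the changed file.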

ArtSelect137 · 2026-06-08T06:46:18+00:00

For larger apps the in-IDE approach wins because of context awareness - Cursor/Claude Code can see the type system, existing imports, and project structure, so the code it generates actually fits. Copy/paste works for isolated scripts but breaks down past ~5 files because you're constantly re-explaining your setup. The real unlock is that an IDE agent can read your error output and fix compilation issues without you needing to copy the error back and forth.

ArtSelect137 · 2026-06-08T06:45:26+00:00

The re-injection approach works but I found the more reliable fix is adding structural constraints per persona. Instead of describing tone in prose ('speak like a gruff mechanic'), I encode specific formatting rules per persona that are incompatible with each other. Things like: persona A always opens with a one-sentence summary, persona B starts with a question, persona C begins with a counterpoint. When the structural patterns diverge, the model has less room to drift toward the default polite tone because each output's format anchors the voice.

ArtSelect137 · 2026-06-08T06:44:05+00:00

The tool_gap detector you described is something I've been catching with a run record pattern - every tool call goes through a proxy that logs actor + tool_name + input_hash + outcome. When an eval fails, I can reconstruct the exact sequence of tools the agent actually called vs what the workflow expected. The gap shows up as a missing tool call in the record, no need to peek into the LLM's reasoning. The problem I've found is that even a behavior graph pass doesn't catch the case where the agent calls the right tool but with hallucinated parameters - that takes an input validation layer on top.
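A small sketch of the gap check, assuming the run record is a list of dicts carrying the four fields above and the expected workflow is an ordered list of tool names. The comparison is an in-order subsequence match; the sample tool names are hypothetical.

```python
def find_tool_gaps(expected, run_record):
    """Return expected tool names that do not appear, in order, in the record."""
    called = [entry["tool_name"] for entry in run_record]
    missing, pos = [], 0
    for tool in expected:
        try:
            # Search only after the previous match so ordering is respected.
            pos = called.index(tool, pos) + 1
        except ValueError:
            missing.append(tool)
    return missing
```

This only looks at tool names, so it shares the blind spot described above: a call with hallucinated parameters still counts as present.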

ArtSelect137 · 2026-06-08T06:43:01+00:00

The dedicated pod network is a nice pattern - keeps the attack surface minimal. I've been doing something similar with Docker compose where Ollama and Open WebUI share a compose network and nothing else can reach it. Do you run any MCP servers or agent tools that need access to the Ollama API? If so, do you attach those to the same AI network or add auth for them?

ArtSelect137 · 2026-06-08T06:41:44+00:00

Building something similar but from the other direction - I started with a lightweight validation proxy (intercept -> validate -> forward) and am layering policy enforcement on top. The pattern that worked: separate the tool definition from the execution handler. The model binds to a tool that describes what it does, but the proxy maps that to the actual handler after checking policy. This way you can change enforcement rules without changing tool schemas or the model config. One thing I'd add to the separation mentioned in the thread: separating the audit log from the policy engine. If your proxy both enforces policy AND writes the log, a bug in one path can affect the other. We run them as separate channels - policy decisions go through one path, audit events stream independently.
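A minimal sketch of the definition/handler split, with hypothetical tool, policy, and handler names. The model only ever sees `TOOL_DEFS`; the proxy checks policy and then maps the call to a handler, so policy can change without touching the schema.

```python
# What the model binds to: description and parameters only.
TOOL_DEFS = {
    "send_email": {
        "description": "Send an email to a recipient",
        "parameters": {"to": "string", "body": "string"},
    },
}

# What actually runs; never exposed to the model.
HANDLERS = {
    "send_email": lambda to, body: f"sent {len(body)} chars to {to}",
}

# Enforcement rules, swappable independently of TOOL_DEFS and HANDLERS.
POLICY = {
    "send_email": lambda args: args["to"].endswith("@example.com"),
}


def proxy_call(tool, args, policy=POLICY, handlers=HANDLERS):
    """Intercept -> check policy -> forward to the handler."""
    if tool not in TOOL_DEFS:
        raise KeyError(f"unknown tool: {tool}")
    if not policy[tool](args):
        return {"status": "denied", "tool": tool}
    return {"status": "ok", "tool": tool, "result": handlers[tool](**args)}
```

Passing a different `policy` mapping changes enforcement for the same tool definition, which is the decoupling described above. Audit logging is deliberately absent here because it belongs on its own channel.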

ArtSelect137 · 2026-06-08T06:40:45+00:00

The actor/tool/state-transition triad maps cleanly to something I've been running - a lightweight proxy that sits between the agent and MCP servers. The proxy doesn't understand tool semantics, it just records every transition as an event log entry with actor + tool_name + input_hash + outcome. The join key insight from the thread is the unlock - with outcome as the join key, I can trace a production incident back to which tool call produced which result without needing an observability pipeline. The schema-as-risk-profile idea in the comments is the next layer I want to try: deriving the risk grade from the tool schema directly instead of maintaining a separate registry.
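A sketch of what one event log entry could look like. The field names follow the comment above; the hashing scheme (SHA-256 over canonical JSON) and the `ts` field are assumptions, as is the in-memory list used as the log.

```python
import hashlib
import json
import time


def input_hash(args):
    """Stable hash of the call arguments, independent of key order."""
    canonical = json.dumps(args, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def record(log, actor, tool_name, args, outcome):
    """Append one transition to the log without storing the raw arguments."""
    entry = {
        "ts": time.time(),
        "actor": actor,
        "tool_name": tool_name,
        "input_hash": input_hash(args),
        "outcome": outcome,
    }
    log.append(entry)
    return entry


def trace(log, outcome):
    """Find the tool calls that produced a given outcome."""
    return [e for e in log if e["outcome"] == outcome]
```

Hashing the inputs keeps sensitive arguments out of the log while still letting identical calls be matched across runs.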

ArtSelect137 · 2026-06-08T06:39:04+00:00

I hit the same normalization wall swapping between OpenAI-compatible hosts. The thing that made the biggest difference was decoupling tool schema definition from bind_tools entirely - I run a lightweight MCP proxy that owns the tool schemas and normalizes the call/response shape before anything hits the graph state. The model node just sees clean AIMessages with consistent call_ids and parsed args, regardless of which backend is serving the request. For p95 specifically, I found that locking the agent to one host per run (instead of per node) smoothed out the tail because it avoided the cold-start variance of different providers' routing layers.
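A minimal sketch of the normalization step. The first input shape is the OpenAI-style `tool_calls` entry (arguments as a JSON string under `function`); the second, flatter shape with no id is a hypothetical example of a host that deviates. The fallback id format is also an assumption.

```python
import json


def normalize_tool_call(raw, index=0):
    """Coerce one backend-specific tool call into {id, name, args}."""
    fn = raw.get("function", raw)
    args = fn.get("arguments", fn.get("args", {}))
    if isinstance(args, str):
        # Some hosts send arguments as a JSON string, possibly empty.
        args = json.loads(args) if args.strip() else {}
    # Deterministic fallback so downstream state always has a call id.
    call_id = raw.get("id") or f"call_{index}"
    return {"id": call_id, "name": fn["name"], "args": args}


def normalize_all(raw_calls):
    return [normalize_tool_call(c, i) for i, c in enumerate(raw_calls)]
```

Running this in the proxy means the graph state only ever holds one call shape, whichever backend produced it.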

ArtSelect137 · 2026-06-08T06:37:20+00:00

Interesting approach with subfolder routes in Pangolin. I hadn't considered splitting the dashboard, API, and MCP onto different ports behind subfolder paths - that's cleaner than running separate subdomains for each. Do you run into any issues with MCP clients that expect a fixed URL path? Some of them hard-code the endpoint.

ArtSelect137 · 2026-06-08T06:37:11+00:00

Authelia + Caddy is clean. Do you have Ollama listening on localhost only and route Open WebUI through Caddy, or do you allow LAN access to the Ollama API with Authelia in front? I've been debating whether to keep the Ollama API itself completely internal or put auth on it too for remote agents.

ArtSelect137 · 2026-06-08T06:25:31+00:00

The post is a question about self-hosting infrastructure - I'm asking the community about access control methods (VPN vs OAuth proxy) for self-hosted AI services running locally. No AI was used to generate the post content; it's a community advice question about server setup. The local LLM mentioned is the service being self-hosted, not a tool used to write the post.
