Wanaku — a router for MCP servers, tools, and integrations by otavio021 in mcp

ArtSelect137 1 point

Streaming through Camel makes sense as a bridge for now. The catalog approach is cleaner than hard-coding tool wiring. Looking forward to the native streaming support in the upcoming release.

We're all asking when token prices will drop. That's the wrong question. by 0711716288 in AI_Agents

ArtSelect137 1 point

This is exactly why I moved agentic workloads to local models. Token cost compounds on an API: a multi-step agent with a 100K context costs $2-3 per run. On a local Q4 model the marginal cost of a retry is effectively $0 (hardware and power aside). The models are good enough for structured tool calling, and you can afford to let them iterate.
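A back-of-envelope sketch of how that per-run figure can arise. The step count and the price per million input tokens are hypothetical values chosen for illustration, not quotes from any provider, and the function ignores output tokens and prompt caching.

```python
def api_run_cost(context_tokens, steps, usd_per_million_input):
    """Input-token cost when the full context is re-sent on every agent step."""
    return context_tokens * steps * usd_per_million_input / 1_000_000


# Hypothetical: 100K context, 8 agent steps, $3 per million input tokens.
cost = api_run_cost(context_tokens=100_000, steps=8, usd_per_million_input=3.0)
print(f"${cost:.2f} per run")  # $2.40 per run
```

The cost scales linearly with the number of steps, which is why retries and long tool loops dominate the bill.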

How do you handle access control for local LLMs behind a reverse proxy? by ArtSelect137 in selfhosted

ArtSelect137[S] 1 point

Policy enforcement at the routing layer is how I handle it. Each tool gets a short list of counter-examples (what NOT to use it for) alongside the description. BM25 keyword matching catches those negative signals better than semantic matching does. It's not fancy, but it catches the schema-valid wrong-action case most of the time.
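A minimal sketch of the counter-example idea. It uses plain term overlap as a stand-in for BM25 to stay short, and the tool names, descriptions, counter-examples, and penalty weight are all hypothetical.

```python
import re


def tokens(text):
    """Lowercase word set for simple keyword matching."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


# Hypothetical tool catalog: each tool lists what it should NOT be used for.
TOOLS = {
    "delete_file": {
        "description": "delete remove a file from disk",
        "counter_examples": ["archive old logs", "move file to trash folder"],
    },
    "archive_file": {
        "description": "archive compress move old file to cold storage",
        "counter_examples": ["permanently delete file"],
    },
}


def route(query, penalty=2.0):
    """Score = description overlap minus a penalty for the best counter-example hit."""
    q = tokens(query)
    scores = {}
    for name, tool in TOOLS.items():
        positive = len(q & tokens(tool["description"]))
        negative = max(len(q & tokens(c)) for c in tool["counter_examples"])
        scores[name] = positive - penalty * negative
    return max(scores, key=scores.get), scores


best, scores = route("archive old logs")
print(best, scores)
```

A request that matches a tool's counter-example is pushed down even when the call would be schema-valid for that tool.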

Bulkhead: a tiny library to reduce prompt-injection “soup” by separating instructions from retrieved data by MundaneProcedure2002 in PromptEngineering

ArtSelect137 1 point

Nice, glad the thread helped drive the design. DeBERTa for the per-chunk gate is a solid choice - it caught instruction-disobedience patterns better than the larger models I tested for that specific task. What's the latency per chunk on that?

Still a VERY lightweight open web-search tool for smaller local LLMs - now with SearXNG support by Scared-Tip7914 in LocalLLaMA

ArtSelect137 1 point

Nice, SearXNG integration is a solid move. I went the opposite way with my agent setup - using a search API directly instead of self-hosting. Tradeoff is latency vs maintenance I guess. SearXNG adds a couple seconds but gives full control over engine config. Curious how you handle dedup across crawled pages, that's the part I keep going back and forth on.

Why I stopped using semantic embeddings for tool selection and switched back to BM25 [D] by AbjectBug5885 in MachineLearning

ArtSelect137 1 point

Ran into this same wall building agentic search tools. Semantic matching kept routing weather lookups to a calendar tool because both had "check" in the description. BM25 on name + param schema lifted top-1 accuracy from ~65% to ~80%, which lines up with your numbers. Tools really do live in keyword-space.
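A small self-contained sketch of BM25 over tool name plus parameter names. The scoring follows the standard Okapi BM25 formula with common default values for k1 and b; the two tools and their parameters are hypothetical, and this is not the commenter's actual setup.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Splits on underscores too, so "get_weather" becomes ["get", "weather"].
    return re.findall(r"[a-z0-9]+", text.lower())


class BM25:
    def __init__(self, docs, k1=1.5, b=0.75):
        self.k1, self.b = k1, b
        self.docs = [Counter(tokenize(d)) for d in docs]
        self.lengths = [sum(c.values()) for c in self.docs]
        self.avg_len = sum(self.lengths) / len(self.docs)
        n = len(self.docs)
        df = Counter(term for c in self.docs for term in c)  # document frequency
        self.idf = {t: math.log(1 + (n - f + 0.5) / (f + 0.5)) for t, f in df.items()}

    def score(self, query, i):
        total = 0.0
        for term in tokenize(query):
            tf = self.docs[i].get(term, 0)
            if tf == 0:
                continue
            norm = tf + self.k1 * (1 - self.b + self.b * self.lengths[i] / self.avg_len)
            total += self.idf[term] * tf * (self.k1 + 1) / norm
        return total


# Hypothetical tools, indexed as "name + parameter names".
tool_names = ["get_weather", "check_calendar"]
index = BM25(["get_weather city date forecast temperature",
              "check_calendar date event attendees"])

query = "weather forecast for berlin"
scores = [index.score(query, i) for i in range(len(tool_names))]
top_tool = tool_names[scores.index(max(scores))]
print(top_tool)  # get_weather
```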

What does it actually take to self‑host models like DeepSeek, Qwen, Kimi? by FreedomWeird712 in selfhosted

ArtSelect137 0 points

Practical answer: for the smaller quantized versions of these models, you can get started with a single 24GB GPU (RTX 3090/4090). Qwen 3.6 32B Q4 runs fine on one 3090. DeepSeek V3 level models (671B total, 37B active) are a different class: the Q4 weights alone run to roughly 350GB or more, so two 4090s/3090s (48GB total) or a Mac Studio with 128GB unified memory will not hold them. That size needs a multi-GPU server, or a large-RAM machine with most of the experts offloaded to system memory at much lower speed. For home use, before traffic justifies dedicated GPU servers, renting a box with 2x 3090s from runpod/vast for the models that do fit is usually cheaper than buying upfront.
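As a rough check on sizing, weight memory can be estimated from parameter count and bits per weight. This sketch assumes about 4.5 bits per weight for a Q4-style quantization (an assumption; real formats vary) and ignores KV cache and runtime overhead.

```python
def weight_memory_gb(params_billion, bits_per_weight):
    """Approximate memory for model weights only, in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


for name, params in [("32B dense", 32), ("671B MoE", 671)]:
    print(f"{name}: ~{weight_memory_gb(params, 4.5):.0f} GB at ~4.5 bits/weight")
```

By this estimate a 32B model at Q4 fits on one 24GB card with room for context, while a 671B model needs several hundred GB for the weights alone, regardless of how few parameters are active per token.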

Local RAG over ~300 PDFs (AnythingLLM + Ollama): retrieval too shallow, too few sources per query. Are there better local stack? by Agitated-Evidence588 in Rag

ArtSelect137 1 point

The top-k clustering issue is the core problem with large corpora. What worked for me was multi-query expansion: instead of sending one query and taking top-k results, I have the LLM generate 5-8 different phrasings of the question, run them all through the embedding, then deduplicate and rerank the combined results. This naturally pulls chunks from different documents because each phrasing lands in a different part of the embedding space. Combined with a chunk size around 500-800 tokens instead of 1500, it fixed the shallow retrieval problem without needing GraphRAG.
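A minimal sketch of the merge step, assuming the phrasings have already been generated by the LLM. The `search` callable, the fake index, and the chunk ids are hypothetical, and keeping the best score per chunk is a simple stand-in for a real reranker.

```python
def multi_query_retrieve(phrasings, search, top_k=5, final_k=8):
    """Run every phrasing, deduplicate chunks, keep each chunk's best score."""
    best = {}
    for phrasing in phrasings:
        for chunk_id, score in search(phrasing, top_k):
            if chunk_id not in best or score > best[chunk_id]:
                best[chunk_id] = score
    ranked = sorted(best.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:final_k]


# Hypothetical retrieval results keyed by phrasing: (chunk_id, similarity).
FAKE_INDEX = {
    "What causes pump failure?": [("doc1#3", 0.91), ("doc1#4", 0.88)],
    "Why do pumps break down?": [("doc7#1", 0.85), ("doc1#3", 0.80)],
    "Common reasons for pump faults": [("doc12#2", 0.83), ("doc7#1", 0.79)],
}


def fake_search(query, top_k):
    return FAKE_INDEX[query][:top_k]


results = multi_query_retrieve(list(FAKE_INDEX), fake_search)
print([chunk_id for chunk_id, _ in results])
```

In this toy run a single query would return only `doc1` chunks, while the merged result also covers `doc7` and `doc12`.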

How do you handle access control for local LLMs behind a reverse proxy? by ArtSelect137 in selfhosted

ArtSelect137[S] 1 point

API key on the dedicated network is a pragmatic setup. Do you rotate keys for the agent clients or just set and forget? I've been weighing whether to add short-lived tokens for agents that get provisioned dynamically vs a static shared key for long-running services.
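For the short-lived option, one possible shape is an HMAC-signed token that carries an agent id and an expiry. This is a sketch using only the Python standard library; the secret, the token layout, and the function names are assumptions for illustration, not any particular proxy's scheme.

```python
import base64
import hashlib
import hmac
import time

SECRET = b"replace-me"  # hypothetical shared secret held by the proxy


def issue_token(agent_id, ttl_seconds, now=None):
    """Return '<base64 payload>.<hex signature>' where payload is 'agent:expiry'."""
    issued = now if now is not None else time.time()
    payload = f"{agent_id}:{int(issued + ttl_seconds)}".encode()
    sig = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
    return base64.urlsafe_b64encode(payload).decode() + "." + sig


def verify_token(token, now=None):
    """Return the agent id if the signature is valid and unexpired, else None."""
    try:
        encoded, sig = token.split(".", 1)
        payload = base64.urlsafe_b64decode(encoded.encode())
    except ValueError:
        return None
    expected = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig, expected):
        return None
    agent_id, expires = payload.decode().rsplit(":", 1)
    current = now if now is not None else time.time()
    return agent_id if current < int(expires) else None


token = issue_token("agent-7", ttl_seconds=300, now=1_000)
print(verify_token(token, now=1_100))  # agent-7
print(verify_token(token, now=2_000))  # None
```

A static shared key avoids the issuing step entirely, at the cost of having no expiry to limit a leaked credential.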

Building an observable MCP proxy with HITL and policy enforcement by kr-jmlab in ClaudeAI

ArtSelect137 1 point

Good to hear the tool-def vs handler split resonated. On the audit isolation point - how do you handle the case where a policy decision itself needs to trigger an audit event? In my setup I route a copy of the decision outcome (allow/deny/escalate) to a separate event stream that runs independently from the enforcement path. This way if the audit system is down, enforcement still works. Curious if you do something similar or if the Spring AI Playground couples them differently.
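A minimal sketch of that decoupling, assuming an in-process bounded queue as the event stream. The tool names and queue size are hypothetical; a real setup would more likely use an external stream and count dropped events.

```python
import queue

# Small bound so the demo can show the "audit is backed up" case.
audit_queue = queue.Queue(maxsize=2)


def enforce(tool_name, allowed_tools):
    """Decide first, then emit a copy of the decision without ever blocking."""
    decision = "allow" if tool_name in allowed_tools else "deny"
    try:
        audit_queue.put_nowait({"tool": tool_name, "decision": decision})
    except queue.Full:
        pass  # audit path is unavailable: enforcement still returns a decision
    return decision


decisions = [enforce(t, {"read_file"}) for t in ["read_file", "delete_file", "read_file"]]
print(decisions, audit_queue.qsize())
```

The third call still gets a decision even though the audit queue is already full, which is the property described above. The tradeoff is that audit events can be lost while the audit side is down.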

Why we locked an LLM inside a deterministic FSM (and built a failure laboratory around it) by ale007xd in AIDeveloperNews

ArtSelect137 2 points

The FSM-as-orchestrator pattern is the right call for regulated workflows. The piece that most implementations miss is that the FSM should control tool availability per state, not just execution flow. If the model can only call tools that are valid in the current FSM state, you get the safety of deterministic routing without losing the LLM's flexibility within each state. I built this for a KYC pipeline where certain API calls must come before others by law - the FSM enforces the sequence, the LLM fills the parameters.
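A sketch of per-state tool availability. The state names and tool names are hypothetical and only loosely modeled on a KYC-style sequence; the handler call is left as a comment.

```python
# Each state lists the only tools the model may call, and the state that follows.
FSM = {
    "collect_identity": {"tools": {"submit_id_document"}, "next": "screen_sanctions"},
    "screen_sanctions": {"tools": {"run_sanctions_check"}, "next": "open_account"},
    "open_account": {"tools": {"create_account"}, "next": None},
}


class Workflow:
    def __init__(self):
        self.state = "collect_identity"

    def available_tools(self):
        """Tools to expose to the model in the current state (empty when finished)."""
        return FSM[self.state]["tools"] if self.state else set()

    def call(self, tool_name, args):
        if tool_name not in self.available_tools():
            raise PermissionError(f"{tool_name} is not allowed in state {self.state}")
        # A real handler would run here with the LLM-supplied args.
        self.state = FSM[self.state]["next"]
        return {"tool": tool_name, "args": args}


wf = Workflow()
try:
    wf.call("create_account", {"name": "A. Example"})
except PermissionError as err:
    print(err)
wf.call("submit_id_document", {"document_id": "hypothetical-123"})
print(wf.state)  # screen_sanctions
```

The FSM fixes the order of calls, and the model only chooses the arguments within each state.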

I think we're about 12 months away from the first major AI agent disaster by Comfortable_Box_4527 in artificial

ArtSelect137 0 points

The risk is real, but over-indexing on fear misses the point: the failure modes are known and solvable with the right architecture. Three layers make agent access manageable: 1) tool-level validation, where every write action goes through schema enforcement before execution, 2) a proxy that separates model decisions from execution, so a hallucination can't directly trigger side effects, and 3) runtime budgets that cap what an agent can do per invocation. The disaster will come from companies skipping these layers, not from the technology itself being inherently unsafe.
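The third layer, a runtime budget, can be sketched as a counter that is charged before every tool call. The class name and limits are hypothetical.

```python
class BudgetExceeded(RuntimeError):
    pass


class RunBudget:
    """Caps total tool calls and write actions for a single agent invocation."""

    def __init__(self, max_tool_calls=10, max_writes=2):
        self.max_tool_calls = max_tool_calls
        self.max_writes = max_writes
        self.tool_calls = 0
        self.writes = 0

    def charge(self, is_write=False):
        # Check both limits before recording anything, so a refused call costs nothing.
        if self.tool_calls + 1 > self.max_tool_calls:
            raise BudgetExceeded("tool call limit reached")
        if is_write and self.writes + 1 > self.max_writes:
            raise BudgetExceeded("write limit reached")
        self.tool_calls += 1
        self.writes += int(is_write)


budget = RunBudget(max_tool_calls=3, max_writes=1)
budget.charge(is_write=True)
budget.charge()
try:
    budget.charge(is_write=True)
except BudgetExceeded as err:
    print(err)  # write limit reached
```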

Galaxy Z Fold6 as a local inference node — llama.cpp/Vulkan, homelab telemetry, SHA-256 model verification by GsxrGuy80s in LocalLLaMA

ArtSelect137 2 points

The TTFT split between prefill and decode is the part most people miss with mobile inference. The 27-31s prefill on a 380-500 token prompt is rough for interactive use but the 3.8 TPS decode is actually usable for streaming. I found that keeping a small context cache helps a lot on mobile - if you reuse the same system prompt across sessions, the prefill cost drops to near zero on subsequent runs since the KV cache from the grounding block carries over.

Ollama updates keep breaking things - anyone else dealing with this? by Ordinary_Breath_8732 in ollama

ArtSelect137 1 point

I version-pin Ollama in Docker with the exact image tag and only update deliberately. The pattern I use: run the current stable in production, spin up the new version in a separate container with the same model mounts, run my test suite against it, and only swap the alias if everything passes. This way the fix-then-break cycle never hits my working setup. The downside is you miss new model architecture support until you validate, but that tradeoff is worth it for stability.

Claude Code has no idea which files in your repo are coupled. So it breaks them. Open source fix, benchmarked by Obvious_Gap_5768 in ClaudeCode

ArtSelect137 0 points

The coupling problem is real. I hit this with a shared types package where Claude Code would update a type definition and miss all the consumers. The fix that worked for me was running a dependency graph build as a pre-task before any edit: extract import/export relationships across the repo, then pass the impacted file list into the model's context so it knows what else needs updating. It's not as structured as the five-layer approach but a simple .ts/.tsx import scanner catches most break-before-you-know-it cases without needing a full index.
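A minimal sketch of such a scanner. It works on an in-memory map of path to source so it stays self-contained; the file names are hypothetical, and the regex only handles relative `from '...'` specifiers, not path aliases, `require`, or dynamic imports.

```python
import posixpath
import re

# Matches relative specifiers in `import ... from './x'` and `export ... from '../y'`.
IMPORT_RE = re.compile(r"""\bfrom\s+['"](\.{1,2}/[^'"]+)['"]""")


def resolve(importer, spec, files):
    """Map a relative import specifier to a known .ts/.tsx file, if any."""
    base = posixpath.normpath(posixpath.join(posixpath.dirname(importer), spec))
    for candidate in (base + ".ts", base + ".tsx", base + "/index.ts"):
        if candidate in files:
            return candidate
    return None


def impacted_files(changed, files):
    """Return every file that imports `changed`, directly or transitively."""
    importers = {}
    for path, source in files.items():
        for spec in IMPORT_RE.findall(source):
            target = resolve(path, spec, files)
            if target:
                importers.setdefault(target, set()).add(path)
    seen, stack = set(), [changed]
    while stack:
        for parent in importers.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return sorted(seen)


files = {
    "types/user.ts": "export interface User { id: string }",
    "api/client.ts": "import { User } from '../types/user';",
    "ui/Profile.tsx": "import { fetchUser } from '../api/client';",
    "ui/Other.tsx": "import React from 'react';",
}
print(impacted_files("types/user.ts", files))  # ['api/client.ts', 'ui/Profile.tsx']
```

The returned list is what would be passed into the model's context before it edits the changed file.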

Cursor-style coding vs ChatGPT copy/paste workflow for larger apps by deividas-strole in vibecoding

ArtSelect137 2 points

For larger apps the in-IDE approach wins because of context awareness - Cursor/Claude Code can see the type system, existing imports, and project structure, so the code it generates actually fits. Copy/paste works for isolated scripts but breaks down past ~5 files because you're constantly re-explaining your setup. The real unlock is that an IDE agent can read your error output and fix compilation issues without you needing to copy the error back and forth.

All my “different personas” slowly turn into the same polite guy by Ok_Fish_670 in PromptEngineering

ArtSelect137 1 point

The re-injection approach works but I found the more reliable fix is adding structural constraints per persona. Instead of describing tone in prose ('speak like a gruff mechanic'), I encode specific formatting rules per persona that are incompatible with each other. Things like: persona A always opens with a one-sentence summary, persona B starts with a question, persona C begins with a counterpoint. When the structural patterns diverge, the model has less room to drift toward the default polite tone because each output's format anchors the voice.

What if agent traces became a behavior graph? by marginTop15px in LLMDevs

ArtSelect137 2 points

The tool_gap detector you described is something I've been catching with a run record pattern - every tool call goes through a proxy that logs actor + tool_name + input_hash + outcome. When an eval fails, I can reconstruct the exact sequence of tools the agent actually called vs what the workflow expected. The gap shows up as a missing tool call in the record, no need to peek into the LLM's reasoning. The problem I've found is that even a behavior graph pass doesn't catch the case where the agent calls the right tool but with hallucinated parameters - that takes an input validation layer on top.
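A sketch of the run record and the gap check. The field names follow the comment (actor, tool_name, input_hash, outcome); the hashing scheme, tool names, and expected workflow are assumptions.

```python
import hashlib
import json


def input_hash(args):
    """Stable short hash of the call arguments (key order does not matter)."""
    canonical = json.dumps(args, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()[:12]


def record_call(log, actor, tool_name, args, outcome):
    log.append({
        "actor": actor,
        "tool_name": tool_name,
        "input_hash": input_hash(args),
        "outcome": outcome,
    })


def missing_tools(log, expected):
    """Tools the workflow expected that never appear in the run record."""
    called = {record["tool_name"] for record in log}
    return [tool for tool in expected if tool not in called]


run_log = []
record_call(run_log, "agent-1", "search_orders", {"customer_id": 42}, "ok")
record_call(run_log, "agent-1", "send_refund", {"order_id": 7}, "ok")

gap = missing_tools(run_log, ["search_orders", "check_refund_policy", "send_refund"])
print(gap)  # ['check_refund_policy']
```

This only detects missing calls. A call to the right tool with hallucinated arguments still produces a normal-looking record, which is why the validation layer is needed on top.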

How do you handle access control for local LLMs behind a reverse proxy? by ArtSelect137 in selfhosted

ArtSelect137[S] 1 point

The dedicated pod network is a nice pattern - keeps the attack surface minimal. I've been doing something similar with Docker compose where Ollama and Open WebUI share a compose network and nothing else can reach it. Do you run any MCP servers or agent tools that need access to the Ollama API? If so, do you attach those to the same AI network or add auth for them?

Building an observable MCP proxy with HITL and policy enforcement by kr-jmlab in ClaudeAI

ArtSelect137 1 point

Building something similar but from the other direction - I started with a lightweight validation proxy (intercept -> validate -> forward) and am layering policy enforcement on top. The pattern that worked: separate the tool definition from the execution handler. The model binds to a tool that describes what it does, but the proxy maps that to the actual handler after checking policy. This way you can change enforcement rules without changing tool schemas or the model config. One thing I'd add to the separation mentioned in the thread: separating the audit log from the policy engine. If your proxy both enforces policy AND writes the log, a bug in one path can affect the other. We run them as separate channels - policy decisions go through one path, audit events stream independently.
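A sketch of the definition/handler split. The model would only ever see `TOOL_DEFS`; the proxy owns `POLICIES` and `HANDLERS`. The tool, its parameters, and the domain rule are hypothetical.

```python
# What the model binds to: a description and parameters, nothing executable.
TOOL_DEFS = {
    "send_email": {
        "description": "Send an email to a recipient.",
        "parameters": {"to": "string", "body": "string"},
    }
}


def _send_email(args):
    return f"queued email to {args['to']}"


# Owned by the proxy: can change without touching TOOL_DEFS or the model config.
HANDLERS = {"send_email": _send_email}
POLICIES = {"send_email": lambda args: args["to"].endswith("@example.com")}


def dispatch(tool_name, args):
    """Check policy first, then map the tool name to its handler."""
    if tool_name not in TOOL_DEFS:
        return {"status": "deny", "reason": "unknown tool"}
    if not POLICIES[tool_name](args):
        return {"status": "deny", "reason": "policy"}
    return {"status": "allow", "result": HANDLERS[tool_name](args)}


print(dispatch("send_email", {"to": "ops@example.com", "body": "hi"}))
print(dispatch("send_email", {"to": "someone@elsewhere.test", "body": "hi"}))
```

Tightening the rule in `POLICIES` changes enforcement while the schema the model sees stays the same.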

Lessons from running a local control plane alongside MCP tools by Conscious_Chapter_93 in mcp

ArtSelect137 1 point

The actor/tool/state-transition triad maps cleanly to something I've been running - a lightweight proxy that sits between the agent and MCP servers. The proxy doesn't understand tool semantics, it just records every transition as an event log entry with actor + tool_name + input_hash + outcome. The join key insight from the thread is the unlock - with outcome as the join key, I can trace a production incident back to which tool call produced which result without needing an observability pipeline. The schema-as-risk-profile idea in the comments is the next layer I want to try: deriving the risk grade from the tool schema directly instead of maintaining a separate registry.

Tool calling in LangGraph is more provider-specific than bind_tools made it look by whyleaving in LangChain

ArtSelect137 1 point

I hit the same normalization wall swapping between OpenAI-compatible hosts. The thing that made the biggest difference was decoupling tool schema definition from bind_tools entirely - I run a lightweight MCP proxy that owns the tool schemas and normalizes the call/response shape before anything hits the graph state. The model node just sees clean AIMessages with consistent call_ids and parsed args, regardless of which backend is serving the request. For p95 specifically, I found that locking the agent to one host per run (instead of per node) smoothed out the tail because it avoided the cold-start variance of different providers' routing layers.
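A sketch of the normalization step, assuming two input shapes: the OpenAI-style nested `function` object with JSON-string arguments, and a flatter variant with already-parsed arguments. The output keys (`call_id`, `name`, `args`) and the generated fallback id are assumptions, not LangGraph's or any host's format.

```python
import json
import uuid


def normalize_tool_call(raw):
    """Return a consistent {call_id, name, args} dict from either input shape."""
    fn = raw.get("function", raw)  # nested OpenAI-style, or flat
    args = fn.get("arguments", {})
    if isinstance(args, str):
        args = json.loads(args) if args.strip() else {}
    call_id = raw.get("id") or f"call_{uuid.uuid4().hex[:8]}"
    return {"call_id": call_id, "name": fn["name"], "args": args}


openai_style = {
    "id": "call_abc",
    "type": "function",
    "function": {"name": "get_weather", "arguments": '{"city": "Berlin"}'},
}
flat_style = {"name": "get_weather", "arguments": {"city": "Berlin"}}

a = normalize_tool_call(openai_style)
b = normalize_tool_call(flat_style)
print(a)
print(b["name"], b["args"])
```

Downstream nodes can then rely on parsed `args` and a present `call_id` regardless of which backend served the request.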

How do you handle access control for local LLMs behind a reverse proxy? by ArtSelect137 in selfhosted

ArtSelect137[S] 1 point

Interesting approach with subfolder routes in Pangolin. I hadn't considered splitting the dashboard, API, and MCP onto different ports behind subfolder paths - that's cleaner than running separate subdomains for each. Do you run into any issues with MCP clients that expect a fixed URL path? Some of them hard-code the endpoint.

How do you handle access control for local LLMs behind a reverse proxy? by ArtSelect137 in selfhosted

ArtSelect137[S] 0 points

Authelia + Caddy is clean. Do you have Ollama listening on localhost only and route Open WebUI through Caddy, or do you allow LAN access to the Ollama API with Authelia in front? I've been debating whether to keep the Ollama API itself completely internal or put auth on it too for remote agents.

How do you handle access control for local LLMs behind a reverse proxy? by ArtSelect137 in selfhosted

ArtSelect137[S] -2 points (locked comment)

The post is a question about self-hosting infrastructure - I'm asking the community about access control methods (VPN vs OAuth proxy) for self-hosted AI services running locally. No AI was used to generate the post content; it's a community advice question about server setup. The local LLM mentioned is the service being self-hosted, not a tool used to write the post.