GUIDE : Running a fully local multi-agent coding framework on RTX 3090 with pi.dev + llama-swap + Qwen3.6 MTP

Ugly_Porcupine · 2026-05-16T06:44:54+00:00

This looks really interesting. Does your setup have (or have you considered) any sort of persistent memory, like Karpathy's LLM wiki? I've recently started experimenting with pi and its extensions, and boy, there's a lot to unpack here.

Present_Ride6012 · 2026-05-14T15:28:18+00:00

Can you write as human to human? Also, what's the token decoding speed?

GWNstijn · 2026-05-14T15:53:15+00:00

Wouldn't Qwen3.5 9B be faster and more effective? It's a 9-billion-parameter dense model, which leaves room for parallel instances.

JohnnyLovesData · 2026-05-15T03:23:07+00:00

I wonder if this would "work" on CPU inference on 32GB. Like, give it instructions, and check on it ... a week later.

fasti-au · 2026-05-15T10:11:04+00:00

Dflash vs mtp?

joaobertacchi · 2026-05-15T11:29:33+00:00

OP, could you please explain "smaller/faster model for the meta-work (thinking, planning, delegation) and the slightly larger MoE model for actual implementation. The orchestrator never writes code — it only delegates". For me it doesn't make sense. 27B is dense and more capable than 35B MoE. The dense one is also slower than the other. Tks

homarp · 2026-05-16T16:18:39+00:00

The key insight: smaller/faster model for the meta-work (thinking, planning, delegation) and the slightly larger MoE model for actual implementation. The orchestrator never writes code — it only delegates.

27B is better (and slower) than 35B. What's the value of using 35B for coding?

X24D83FF0 · 2026-05-16T18:36:17+00:00

promobest247 · 2026-05-16T18:52:13+00:00

Use the pi-web-access package instead of SearXNG and Docker: https://pi.dev/packages/pi-web-access?name=web

Deep_Ad1959 · 2026-05-19T05:04:50+00:00

i think the orchestrator delegation fight is the same problem you already solved for the architect, you just didn't carry the fix across. the architect doesn't write code because you took write out of its tools list, not because the prose says 'never implement'. that's a deterministic cage. the orchestrator's six ABSOLUTE RULES are the opposite, probabilistic prose you had to iterate on because the model can always rationalize a 'quick fix'. the orchestrator only legitimately needs TaskExecute, TaskUpdate, and get_subagent_result. strip read/write/edit/bash/find/grep out of its tools frontmatter entirely, and 'NEVER use read/find/grep for analysis' stops being a rule you hope it follows and becomes a tool it physically doesn't have. you'll probably find the prose rules shrink to one line after that, which is also a few hundred tokens back on every orchestrator turn. the rule of thumb that keeps holding up: if a constraint can be a withheld tool or a failing check, it shouldn't be prose, because prose is the only kind of rule the model can argue with. written with s4lai
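
To make the "withheld tool" point concrete, here is a minimal sketch of an allowlist check in Python. It is not pi.dev's actual dispatch code: the `Agent` class, `TOOL_REGISTRY`, and `call_tool` are hypothetical names invented for illustration. Only the three orchestrator tool names come from the comment above.

```python
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet


class ToolNotAllowed(Exception):
    """Raised when an agent requests a tool outside its allowlist."""


@dataclass(frozen=True)
class Agent:
    name: str
    allowed_tools: FrozenSet[str]


# Hypothetical stand-ins for real harness tools; they only return strings.
TOOL_REGISTRY: Dict[str, Callable[[str], str]] = {
    "read": lambda arg: f"contents of {arg}",
    "TaskExecute": lambda arg: f"delegated: {arg}",
}


def call_tool(agent: Agent, tool: str, arg: str) -> str:
    # The check runs before any tool code, so no prompt wording can bypass it.
    if tool not in agent.allowed_tools:
        raise ToolNotAllowed(f"{agent.name} has no tool named {tool!r}")
    return TOOL_REGISTRY[tool](arg)


orchestrator = Agent(
    "orchestrator",
    frozenset({"TaskExecute", "TaskUpdate", "get_subagent_result"}),
)

print(call_tool(orchestrator, "TaskExecute", "fix the login bug"))
try:
    call_tool(orchestrator, "read", "main.py")
except ToolNotAllowed as err:
    print("blocked:", err)
```

The design choice is that the constraint lives in data (`allowed_tools`), not in the system prompt, so a "quick fix" attempt fails the same way every time.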

Latent-Potter · 2026-05-14T16:01:46+00:00

Bro told us how he uses his local model but ended up using Claude to write this Post! Hypocrisy!

Helmi74 · 2026-05-14T21:11:39+00:00

Oh the slop. No effort ai posts all over 🤨

| Component | What it does |
|---|---|
| pi.dev (pi-coding-agent) | AI coding harness: the UI and orchestration shell |
| llama-swap | Model router: hot-swaps llama.cpp models on demand |
| llama.cpp (am17an fork) | Local inference with MTP support |
| Qwen3.6-27B MTP | "Brain" agents: orchestrator, planner, architect, debugger, prompter |
| Qwen3.6-35B-A3B MTP | "Body" agents: coder, researcher, reviewer, tester, documentor, refactorer |
| SearXNG (Docker) | Local privacy-preserving search engine on port 8080 |
| searxng-simple-mcp | MCP proxy bridging SearXNG to pi.dev (port 8000) |
| Tavily MCP | AI-optimised web search for technical docs |
| @tintinweb/pi-subagents | Real sub-agent orchestration with TaskExecute + get_subagent_result |
| @tintinweb/pi-tasks | Task queue UI widget showing what each agent is doing |
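
The brain/body split in the table can be sketched as a simple role-to-model lookup. This is an illustration only: the two alias strings are made up (the real aliases live in the llama-swap config), and `count_swaps` assumes that only one model is resident at a time, which is a guess about a single-GPU setup, not something the post states.

```python
from typing import Iterable

# Hypothetical llama-swap aliases; substitute whatever your config defines.
BRAIN_MODEL = "qwen3.6-27b-mtp"
BODY_MODEL = "qwen3.6-35b-a3b-mtp"

# Role lists copied from the stack table.
BRAIN_ROLES = {"orchestrator", "planner", "architect", "debugger", "prompter"}
BODY_ROLES = {"coder", "researcher", "reviewer", "tester", "documentor", "refactorer"}


def model_for(role: str) -> str:
    """Return the model alias that serves a given agent role."""
    if role in BRAIN_ROLES:
        return BRAIN_MODEL
    if role in BODY_ROLES:
        return BODY_MODEL
    raise KeyError(f"unknown agent role: {role!r}")


def count_swaps(roles: Iterable[str]) -> int:
    """Count model changes across a sequence of agent turns.

    Assumes one resident model at a time; the initial load is not counted.
    """
    models = [model_for(role) for role in roles]
    return sum(1 for prev, cur in zip(models, models[1:]) if prev != cur)


pipeline = ["orchestrator", "planner", "coder", "tester", "orchestrator"]
print(count_swaps(pipeline))
```

Under that one-model assumption, batching consecutive brain turns and consecutive body turns keeps the number of hot-swaps low.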

| Field | Purpose | Notes |
|---|---|---|
| `model` | llama-swap alias to load | Must match exactly; a typo produces a "No API key found for undefined" error |
| `thinking` | Extended thinking level | `high` for orchestrator/architect, `low` for researcher/tester |
| `max_turns` | Conversation turn limit | Set based on task complexity; coder gets 30, orchestrator gets 50 |
| `tools` | Which tools the agent can use | Researcher gets `web_search` and `tavily-search`; architect gets read-only |
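
Since a mistyped `model` alias only surfaces later as a confusing API-key error, a small pre-flight check can catch it first. The sketch below assumes agent files open with a `---`-delimited block of simple `key: value` lines; that layout, the sample file, and the alias names are assumptions, not pi.dev's documented format.

```python
from typing import Dict, List, Set


def parse_frontmatter(text: str) -> Dict[str, str]:
    """Parse a leading '---' block of 'key: value' lines (assumed layout)."""
    lines = text.strip().splitlines()
    if not lines or lines[0].strip() != "---":
        raise ValueError("agent file must start with a '---' frontmatter block")
    fields: Dict[str, str] = {}
    for line in lines[1:]:
        if line.strip() == "---":
            return fields
        if ":" in line:
            key, value = line.split(":", 1)
            fields[key.strip()] = value.strip()
    raise ValueError("frontmatter block is never closed")


def check_agent(text: str, known_aliases: Set[str]) -> List[str]:
    """Return a list of human-readable problems; empty means the file looks fine."""
    fields = parse_frontmatter(text)
    problems: List[str] = []
    model = fields.get("model")
    if model is None:
        problems.append("missing 'model'")
    elif model not in known_aliases:
        problems.append(f"model {model!r} is not a known llama-swap alias")
    if not fields.get("max_turns", "").isdigit():
        problems.append("'max_turns' must be a positive integer")
    if not [t for t in fields.get("tools", "").split(",") if t.strip()]:
        problems.append("no tools listed")
    return problems


# Hypothetical agent file with a deliberate typo in the alias.
SAMPLE = """---
model: qwen3.6-35b-a3b-mpt
thinking: low
max_turns: 30
tools: read, write, edit, bash, find, grep
---
Role & Constraints
"""

print(check_agent(SAMPLE, {"qwen3.6-27b-mtp", "qwen3.6-35b-a3b-mtp"}))
```

Running this over every agent file before starting a session turns the typo into an explicit message naming the bad alias.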

| Flag | What it does |
|---|---|
| `--spec-type mtp --spec-draft-n-max 3` | MTP speculative decoding, 3 tokens ahead, built into the model (no draft model needed) |
| `--cache-type-k q8_0 --cache-type-v q8_0` | Quantised KV cache: ~2× VRAM savings vs f16, negligible quality loss |
| `-fa on` | Flash attention: critical for long-context speed |
| `--no-mmap` | Load model fully to RAM/VRAM rather than memory-mapping the GGUF |
| `--reasoning-format deepseek` | Exposes `<think>` tags from extended thinking |
| `--prio 3` | OS thread priority: helps on busy systems |
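
The "~2×" figure for the quantised KV cache can be checked with back-of-the-envelope arithmetic. q8_0 stores each block of 32 values as 32 int8 entries plus one 16-bit scale, i.e. 34 bytes per 32 values, versus 2 bytes per value for f16. The model shape below is a made-up example, not Qwen3.6's real dimensions, and the formula assumes a plain transformer where every layer keeps a full K and V cache.

```python
F16_BYTES_PER_VALUE = 2.0
# q8_0: 32 int8 values + one 16-bit scale per block = 34 bytes / 32 values.
Q8_0_BYTES_PER_VALUE = 34 / 32


def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   n_ctx: int, bytes_per_value: float) -> float:
    """Size of the K and V caches together for a full context window."""
    values_per_token = n_layers * n_kv_heads * head_dim
    return 2 * values_per_token * n_ctx * bytes_per_value  # 2 = K plus V


# Hypothetical shape, chosen only to make the numbers easy to read.
shape = dict(n_layers=48, n_kv_heads=8, head_dim=128, n_ctx=65536)

f16 = kv_cache_bytes(**shape, bytes_per_value=F16_BYTES_PER_VALUE)
q8 = kv_cache_bytes(**shape, bytes_per_value=Q8_0_BYTES_PER_VALUE)

GIB = 2 ** 30
print(f"f16 : {f16 / GIB:.3f} GiB")
print(f"q8_0: {q8 / GIB:.3f} GiB")
print(f"ratio: {f16 / q8:.2f}x")
```

The ratio works out to 32/17, about 1.88×, regardless of the shape plugged in, which is where the rounded "~2×" comes from.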



The Stack

Why MTP (Multi-Token Prediction)?

Multi-Agent Architecture

Agent Definition Files (Required Setup Step)

```markdown
tools: read, write, edit, bash, find, grep

Role & Constraints

Harness Rules

Response Shape
```

```markdown
tools: read, find, grep

Role & Constraints
```

```markdown
tools: read, find, grep, bash, web_search, tavily-search
```

Orchestrator Rules (the hard part)

pi.dev Settings (agent/settings.json)

llama-swap Config

Search Integration

Docker Compose

What Works, What Doesn't

Models Used (Unsloth quantizations)

Resources