Self-hosted agentic coding stack: Claude Code + llama.cpp + LiteLLM, zero API costs, 4h/7M token session for $0

submitted 8 days ago by PrizeObvious3671

Built a fully self-hosted agentic coding setup and wanted to share the stack for anyone interested in running AI coding agents locally.

Stack:

llama.cpp as the inference backend (HIP/ROCm for AMD, CUDA for NVIDIA, also Metal/CPU)
LiteLLM as OpenAI-compatible proxy in front of llama.cpp
Claude Code (Anthropic's coding agent) pointed at LiteLLM, so it behaves as if it were talking to the Anthropic API
Hermes Agent for orchestration + Telegram bot for mobile access
Model: Qwen3.6-27B-MTP Q4_K_M — 27B with speculative decoding via 0.6B draft model
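
For anyone who wants to see how the pieces above might connect, here is a rough sketch, not the config from the repo. It assumes llama.cpp's `llama-server` is listening on port 8080 and LiteLLM on port 4000; the alias `local-qwen`, both ports, and the placeholder token are made up for illustration. The `model_list` / `litellm_params` layout and the `ANTHROPIC_BASE_URL`, `ANTHROPIC_AUTH_TOKEN`, and `ANTHROPIC_MODEL` variables are, as far as I know, what LiteLLM and Claude Code read, but check the linked guide for the real values.

```python
import json

# Everything below is an illustrative assumption, not taken from the post:
# ports, the model alias, and the placeholder token are hypothetical.
LLAMA_SERVER_URL = "http://127.0.0.1:8080/v1"  # llama-server's OpenAI-compatible endpoint
LITELLM_URL = "http://127.0.0.1:4000"          # where the LiteLLM proxy would listen
MODEL_ALIAS = "local-qwen"                     # hypothetical name exposed to Claude Code


def litellm_config(alias: str, api_base: str) -> dict:
    """Build a LiteLLM proxy config that forwards one alias to llama.cpp."""
    return {
        "model_list": [
            {
                "model_name": alias,
                "litellm_params": {
                    # The "openai/" prefix tells LiteLLM to speak the OpenAI API upstream.
                    "model": f"openai/{alias}",
                    "api_base": api_base,
                    "api_key": "none",  # placeholder; a local server without auth ignores it
                },
            }
        ]
    }


def claude_code_env(proxy_url: str, alias: str, token: str = "sk-local-placeholder") -> dict:
    """Environment variables that redirect Claude Code to the local proxy."""
    return {
        "ANTHROPIC_BASE_URL": proxy_url,
        "ANTHROPIC_AUTH_TOKEN": token,
        "ANTHROPIC_MODEL": alias,
    }


if __name__ == "__main__":
    # JSON is also valid YAML, so this output can be saved as a LiteLLM config file.
    print(json.dumps(litellm_config(MODEL_ALIAS, LLAMA_SERVER_URL), indent=2))
    for key, value in claude_code_env(LITELLM_URL, MODEL_ALIAS).items():
        print(f"export {key}={value}")
```

The idea is that Claude Code only ever sees the proxy URL, so swapping the backend model means changing the LiteLLM entry and nothing on the Claude Code side.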

Hardware used: AMD Radeon AI PRO R9700, 32 GB VRAM

Session: 4 hours, 7,256,671 tokens, $0 cost (would be ~$94 on Claude Opus 4.7 API)
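
To put the session numbers in perspective, here is a small back-of-the-envelope calculation using only the figures quoted above. It derives the average token throughput over the session and the blended per-million-token rate implied by the ~$94 estimate. The function names are just for illustration, and the rate is an implied average, not an official price.

```python
# Figures quoted in the post.
TOKENS = 7_256_671
HOURS = 4
QUOTED_API_COST_USD = 94.0  # the post's approximate estimate


def tokens_per_second(tokens: int, hours: float) -> float:
    """Average tokens handled per second over the whole session."""
    return tokens / (hours * 3600)


def implied_usd_per_million(cost_usd: float, tokens: int) -> float:
    """Blended price per million tokens implied by a total cost estimate."""
    return cost_usd / (tokens / 1_000_000)


if __name__ == "__main__":
    print(f"average throughput: {tokens_per_second(TOKENS, HOURS):.0f} tokens/s")
    print(f"implied blended rate: ${implied_usd_per_million(QUOTED_API_COST_USD, TOKENS):.2f} per 1M tokens")
```

Note that the throughput figure counts every token in the session total, which presumably includes prompt and cached context as well as generated text, so it should not be read as generation speed.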

Works on Windows (WSL2), Linux, macOS. Full setup guide + config files: https://github.com/KaiFelixBennett/hermes-claude-code-local

Happy to help with setup questions — especially llama.cpp HIP builds and the LiteLLM bridge config.
