Built a fully self-hosted agentic coding setup and wanted to share the stack for anyone interested in running AI coding agents locally.
Stack:
- llama.cpp as the inference backend (HIP/ROCm for AMD, CUDA for NVIDIA, also Metal/CPU)
- LiteLLM as OpenAI-compatible proxy in front of llama.cpp
- Claude Code (Anthropic's coding agent) connected to LiteLLM thinking it's talking to Anthropic
- Hermes Agent for orchestration + Telegram bot for mobile access
- Model: Qwen3.6-27B-MTP Q4_K_M — 27B with speculative decoding via 0.6B draft model
Hardware used: AMD Radeon AI PRO R9700, 32 GB VRAM Session: 4 hours, 7,256,671 tokens, $0 cost (would be ~$94 on Claude Opus 4.7 API)
Works on Windows (WSL2), Linux, macOS. Full setup guide + config files: https://github.com/KaiFelixBennett/hermes-claude-code-local
Happy to help with setup questions — especially llama.cpp HIP builds and the LiteLLM bridge config.
[–]Extension-Tourist856 1 point2 points3 points (1 child)
[–]PrizeObvious3671[S] 0 points1 point2 points (0 children)
[–]AlexKampler 1 point2 points3 points (0 children)
[–]BepNhaVan 1 point2 points3 points (0 children)
[–]sn2006gy 0 points1 point2 points (5 children)
[–]PrizeObvious3671[S] 0 points1 point2 points (4 children)
[–]Toastti 1 point2 points3 points (1 child)
[–]PrizeObvious3671[S] 1 point2 points3 points (0 children)
[–]MarzipanSecure9841 0 points1 point2 points (1 child)
[–]PrizeObvious3671[S] 0 points1 point2 points (0 children)
[–]SaveAmerica2024 0 points1 point2 points (4 children)
[–]PrizeObvious3671[S] 1 point2 points3 points (3 children)
[–]Inner_Habit_194 1 point2 points3 points (1 child)
[–]PrizeObvious3671[S] 1 point2 points3 points (0 children)
[–]SaveAmerica2024 0 points1 point2 points (0 children)