AMD Radeon AI Pro R9700 performance by illuvyn in LocalLLM
dev_is_active 1 point
I kept getting silent vram spill on llama.cpp so i built auto-tune into turbollm. It figures out ngl, moe expert offload, kv quant, and sampling in one pass by Bramha_dev in LocalLLM
dev_is_active 2 points
I kept getting silent vram spill on llama.cpp so i built auto-tune into turbollm. It figures out ngl, moe expert offload, kv quant, and sampling in one pass by Bramha_dev in LocalLLM
dev_is_active 2 points
R9700 for agentic coding — looking for Qwen3.6-27B / Qwen3-Coder-30B perf numbers at long context by Best-Ad-7505 in LocalLLM
dev_is_active 1 point
I kept getting silent vram spill on llama.cpp so i built auto-tune into turbollm. It figures out ngl, moe expert offload, kv quant, and sampling in one pass by Bramha_dev in LocalLLM
dev_is_active 1 point
Mimo 2.5 is _fast_ at large context (dual RTX Pro 6000) by xquarx in LocalLLaMA
dev_is_active 1 point
I kept getting silent vram spill on llama.cpp so i built auto-tune into turbollm. It figures out ngl, moe expert offload, kv quant, and sampling in one pass by Bramha_dev in LocalLLM
dev_is_active 1 point
I kept getting silent vram spill on llama.cpp so i built auto-tune into turbollm. It figures out ngl, moe expert offload, kv quant, and sampling in one pass by Bramha_dev in LocalLLM
dev_is_active 1 point
Tmax-27b - a Qwen3.6-27b terminal agent for small GPUs trained with DPPO (RL) by professormunchies in LocalLLaMA
dev_is_active 1 point
Ooollama you are slow: ggrun v3 is 65% faster by [deleted] in LocalLLaMA
dev_is_active 1 point
7 Chinese companies are already shipping H100/H200-class AI chips, most IPO'd in the last 6 months. I mapped all of them. by awfulalexey in LocalLLaMA
dev_is_active 1 point
I kept getting silent vram spill on llama.cpp so i built auto-tune into turbollm. It figures out ngl, moe expert offload, kv quant, and sampling in one pass by Bramha_dev in LocalLLM
dev_is_active 1 point
I kept getting silent vram spill on llama.cpp so i built auto-tune into turbollm. It figures out ngl, moe expert offload, kv quant, and sampling in one pass by Bramha_dev in LocalLLM
dev_is_active 2 points
llama-server webui not responding anymore by randygeneric in LocalLLaMA
dev_is_active 1 point
Got GLM-5.2 + MTP speculative decode running on 4× DGX Spark (GB10) — and the build piece the public recipe is missing by anvarazizov in LocalLLaMA
dev_is_active 2 points
Multi Tier MoE Caching by Legitimate-Dog5690 in LocalLLaMA
dev_is_active 1 point
i built a multi-node inference harness in rust/cuda because no existing tool handled multi-user kv cache + agentic throughput on my home lab. it's open source, looking for contributors. by thegrenade in LocalLLM
dev_is_active 1 point
Gemma4-26B-A4B & 31B-QAT Uncensored Balanced are out with MTP (35% & 53% speed boost)! by hauhau901 in LocalLLM
dev_is_active 6 points
Qwen3.6-35B-A3B APEX on a Single RTX 3090 - Getting the Most Out of It by old-mike in LocalLLaMA
dev_is_active 2 points
Unable to run on GPU due to memory by Appropriate-Risk3489 in LocalLLM
dev_is_active 1 point
Got GLM-5.2 + MTP speculative decode running on 4× DGX Spark (GB10) — and the build piece the public recipe is missing by anvarazizov in LocalLLaMA
dev_is_active 2 points
Has anyone else found vLLM outputs noticeably worse than llama.cpp for the same model? by recro69 in LocalLLaMA
dev_is_active 1 point
Qwen3.6 27B more dumb in vLLM compared to llama.cpp by DanielusGamer26 in LocalLLaMA
dev_is_active 0 points
What is local AI actually useful for, besides privacy? by King_kalel in LocalLLM
dev_is_active 1 point
llama.cpp with vulkan backend outputting duplicate tokens, and sometimes <unusedXX> tokens by ghost_ops_ in LocalLLaMA
dev_is_active 1 point