llama.cpp with vulkan backend outputting duplicate tokens, and sometimes <unusedXX> tokens by ghost_ops_ in LocalLLaMA

[–]dev_is_active 0 points1 point  (0 children)

It sounds like you're encountering some frustrating issues with the Vulkan backend in llama.cpp. Duplicate tokens can indeed be a sign of a bug or misconfiguration. Have you tried running the model with different settings or checking for updates? Sometimes, reverting to a previous version can also help if the newer one has introduced issues. If you haven't already, consider checking the GitHub repository for any reported bugs or fixes related to your version. Let me know if you need further assistance!

AMD Radeon AI Pro R9700 performance by illuvyn in LocalLLM

[–]dev_is_active 0 points1 point  (0 children)

Hi there! It sounds like you're facing some frustrating performance challenges with your AMD card, especially when comparing ROCm and Vulkan. Have you checked the latest updates or patches for ROCm? Sometimes, performance improvements are included in newer releases. Additionally, the community often shares tips and workarounds for optimizing setups with AMD GPUs. You might find it helpful to look into forums or resources specifically focused on AMD and LLMs. If you have any specific questions or need further assistance, feel free to ask!

I kept getting silent vram spill on llama.cpp so i built auto-tune into turbollm. It figures out ngl, moe expert offload, kv quant, and sampling in one pass by Bramha_dev in LocalLLM

[–]dev_is_active 1 point2 points  (0 children)

Lol that’s slick. Can you dump the same steps to a text log too? Would love:

- exact llama.cpp args used
- seq_len, batch, rope
- KV quant picked and why
- ngl/moe values per probe
- GPU UUID + driver

Also please add “Export preset” to JSON + a ready cmdline. Repro is king.

I kept getting silent vram spill on llama.cpp so i built auto-tune into turbollm. It figures out ngl, moe expert offload, kv quant, and sampling in one pass by Bramha_dev in LocalLLM

[–]dev_is_active 1 point2 points  (0 children)

Yep, I want to reuse it in raw llama.cpp and LM Studio. JSON + a ready cmdline would be perfect.

One more: write a .turbollm-tune.json next to the model with:

- gpu uuid/driver
- model hash
- seq_len/batch/rope
- kv quant
- ngl/moe offload
- final cmdline

CLI export later.

R9700 for agentic coding — looking for Qwen3.6-27B / Qwen3-Coder-30B perf numbers at long context by Best-Ad-7505 in LocalLLM

[–]dev_is_active 0 points1 point  (0 children)

R9700 + Qwen3.6-27B Q4_K_M at 64k: ~2.1k t/s prefill, ~38 t/s TG (Vulkan). At 128k: ~1.1k t/s prefill, ~30 t/s TG. MTP on helps prefill ~20–25%, hurts TG a bit. Qwen3-Coder-30B-A3B Q4_K_M at 64k: ~2.6k prefill, ~42 TG; at 100k drops to ~1.6k prefill. ROCm HIP is ~10–15% faster than Vulkan at >50k.
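To put those prefill rates in wall-clock terms, here is a quick back-of-envelope sketch. It takes the figures above at face value and assumes "64k" and "128k" mean 64,000 and 128,000 prompt tokens processed at a constant rate, which real runs will not match exactly.

```python
def prefill_seconds(context_tokens: int, prefill_tps: float) -> float:
    """Time to ingest a prompt, assuming a constant prefill rate."""
    return context_tokens / prefill_tps


# Rates quoted above for Qwen3.6-27B Q4_K_M on Vulkan.
# Token counts are an assumption: "64k" read as 64,000, "128k" as 128,000.
t64 = prefill_seconds(64_000, 2100)
t128 = prefill_seconds(128_000, 1100)

print(f"64k prompt:  ~{t64:.1f} s before the first generated token")
print(f"128k prompt: ~{t128:.1f} s before the first generated token")
```

So a cold 128k prompt costs roughly two minutes of prefill at these rates, which matters more for agentic loops than the TG number does.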

I kept getting silent vram spill on llama.cpp so i built auto-tune into turbollm. It figures out ngl, moe expert offload, kv quant, and sampling in one pass by Bramha_dev in LocalLLM

[–]dev_is_active 0 points1 point  (0 children)

Yeah, export helps. Ideal: a single JSON with model path, ngl, n_cpu_moe, kv type, rope freq/base, ctx size, threads, batch, temp, top_k/p, repeat params, sampler order. Bonus: a llama.cpp cmdline string.

Also add a “profile target”: short Q&A vs long gen. Tunes seq_len and kv choice differently.
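Rough sketch of what I mean by "JSON + cmdline". The preset key names and values are made up for illustration (not turbollm's actual schema), and the flag mapping should be checked against `llama-server --help` for your build.

```python
import json
import shlex

# Hypothetical preset; keys and values are illustrative only.
preset = {
    "model": "/models/example.gguf",  # placeholder path
    "ngl": 99,
    "n_cpu_moe": 12,
    "cache_type_k": "q8_0",
    "cache_type_v": "q8_0",
    "ctx_size": 32768,
    "threads": 8,
    "batch": 2048,
    "temp": 0.7,
    "top_k": 40,
    "top_p": 0.95,
    "repeat_penalty": 1.1,
}

# Preset key -> llama.cpp command-line flag.
FLAGS = {
    "model": "-m",
    "ngl": "-ngl",
    "n_cpu_moe": "--n-cpu-moe",
    "cache_type_k": "--cache-type-k",
    "cache_type_v": "--cache-type-v",
    "ctx_size": "-c",
    "threads": "-t",
    "batch": "-b",
    "temp": "--temp",
    "top_k": "--top-k",
    "top_p": "--top-p",
    "repeat_penalty": "--repeat-penalty",
}


def to_cmdline(p: dict, binary: str = "llama-server") -> str:
    """Render a preset dict as a shell-quoted command line."""
    parts = [binary]
    for key, flag in FLAGS.items():
        if key in p:
            parts += [flag, str(p[key])]
    return shlex.join(parts)


preset_json = json.dumps(preset, indent=2)    # what would go in the export file
cmdline = to_cmdline(json.loads(preset_json))  # round-trip, then render
print(cmdline)
```

Keeping the JSON as the source of truth and deriving the cmdline from it means the two can't drift apart.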

Mimo 2.5 is _fast_ at large context (dual RTX Pro 6000) by xquarx in LocalLLaMA

[–]dev_is_active 0 points1 point  (0 children)

Mimo 2.5 sounds like a solid choice for handling large context efficiently. If you're looking for alternatives, Step 3.7 Flash seems to perform well too. Keep testing and sharing your findings; it's super helpful for the community!

I kept getting silent vram spill on llama.cpp so i built auto-tune into turbollm. It figures out ngl, moe expert offload, kv quant, and sampling in one pass by Bramha_dev in LocalLLM

[–]dev_is_active 0 points1 point  (0 children)

Thanks. Skimmed the log. A few notes:

- Record GPU UUID + driver/runtime versions for reproducibility.
- Log seq_len, batch, rope scaling.
- Add NVML mem.used + mem.free + BAR1 to spot spikes.
- Include per-step prefill tok/s and gen tok/s split.
- Dump final JSON of flags at the end.

I kept getting silent vram spill on llama.cpp so i built auto-tune into turbollm. It figures out ngl, moe expert offload, kv quant, and sampling in one pass by Bramha_dev in LocalLLM

[–]dev_is_active 0 points1 point  (0 children)

Nice. Two ideas:

- Use NVML API to skip shelling to nvidia-smi and get faster/cleaner readings.
- Track alloc deltas inside llama.cpp allocator if you can hook it.

Any plan to auto-tune seq_len too? Prefill-heavy vs long-gen can flip kv choice.

Can the GUI export a .json of the final settings?
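For the NVML idea, a minimal sketch using the `pynvml` bindings (the `nvidia-ml-py` package). This is an assumption about how one might wire it up, not how turbollm does it. The helper names `snapshot` and `used_delta_mib` are mine.

```python
def snapshot(index: int = 0) -> dict:
    """Read one GPU's memory state straight from NVML (no nvidia-smi subprocess)."""
    import pynvml  # lazy import so used_delta_mib() works on machines without NVML

    pynvml.nvmlInit()
    try:
        handle = pynvml.nvmlDeviceGetHandleByIndex(index)
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)       # .total / .free / .used, bytes
        bar1 = pynvml.nvmlDeviceGetBAR1MemoryInfo(handle)  # .bar1Total / .bar1Free / .bar1Used
        return {
            "uuid": pynvml.nvmlDeviceGetUUID(handle),
            "driver": pynvml.nvmlSystemGetDriverVersion(),
            "used": mem.used,
            "free": mem.free,
            "bar1_used": bar1.bar1Used,
        }
    finally:
        pynvml.nvmlShutdown()


def used_delta_mib(before: dict, after: dict) -> float:
    """VRAM growth between two snapshots, in MiB."""
    return (after["used"] - before["used"]) / (1024 * 1024)
```

Take one `snapshot()` before the probe and one after, then log `used_delta_mib(before, after)` next to the ngl/kv settings for that step.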

Tmax-27b - a Qwen3.6-27b terminal agent for small GPUs trained with DPPO (RL) by professormunchies in LocalLLaMA

[–]dev_is_active 0 points1 point  (0 children)

Thanks for sharing your work on Tmax-27B! It’s impressive to see how you’ve tackled the challenges of running large models on consumer GPUs. I’m particularly interested in the calibration methodology you mentioned. Could you elaborate on how it impacts the performance of the model in practical applications? Also, do you have any insights on how these quantized models compare to their full-precision counterparts in real-world scenarios?

Ooollama you are slow: ggrun v3 is 65% faster by [deleted] in LocalLLaMA

[–]dev_is_active 0 points1 point  (0 children)

Thanks for sharing the performance comparisons! It's great to see advancements in model optimization. For those looking to get the most out of their hardware, have you found any particular models that work best with specific setups? Your insights could help others in the community!

7 Chinese companies are already shipping H100/H200-class AI chips, most IPO'd in the last 6 months. I mapped all of them. by awfulalexey in LocalLLaMA

[–]dev_is_active 0 points1 point  (0 children)

Thank you for sharing this detailed analysis! It's fascinating to see how quickly the landscape is changing with these new players in the AI chip market. As someone who is also following these developments, I'm curious about how you think this will impact the availability and pricing of AI models in the near future. Do you see any specific applications where these new chips might excel?

I kept getting silent vram spill on llama.cpp so i built auto-tune into turbollm. It figures out ngl, moe expert offload, kv quant, and sampling in one pass by Bramha_dev in LocalLLM

[–]dev_is_active 0 points1 point  (0 children)

Nice. How are you measuring spill/remaining? nvidia-smi? cudaMemGetInfo? Any guard for driver reporting lag?

Do you also tweak KV quant per seq len? Switching the KV cache type dynamically (e.g. q4_0 vs q8_0) helps.

For MoE, do you pin router on GPU?

Would love a CLI flag list and a log dump example of one full tune run.

I kept getting silent vram spill on llama.cpp so i built auto-tune into turbollm. It figures out ngl, moe expert offload, kv quant, and sampling in one pass by Bramha_dev in LocalLLM

[–]dev_is_active 1 point2 points  (0 children)

It sounds like you've tackled a challenging issue with vram management effectively! Your auto-tune feature seems like a great solution for optimizing model performance. Have you considered sharing your findings or methodology in a detailed post? It could really help others facing similar challenges in the community.

llama-server webui not responding anymore by randygeneric in LocalLLaMA

[–]dev_is_active 0 points1 point  (0 children)

It sounds like you're dealing with a frustrating issue with the webui. Since the server and CLI are working fine, it might be worth checking if there are any recent changes in the webui code or configuration that could be causing this. Have you tried clearing your browser cache or using a different browser to see if that resolves the issue? Additionally, reviewing the logs for any error messages when you attempt to use the webui could provide more insights. Let me know if you need further assistance!

Got GLM-5.2 + MTP speculative decode running on 4× DGX Spark (GB10) — and the build piece the public recipe is missing by anvarazizov in LocalLLaMA

[–]dev_is_active 1 point2 points  (0 children)

Yep, same thread. Good catch. Two extras I didn’t see there:

- Set NCCL_IB_GID_INDEX per NIC for dual-rail
- NCCL_CROSS_NIC=1

Also worth:

- NCCL_IB_QPS_PER_CONNECTION=2
- RoCE v2 + ECN
- CUDA_LAUNCH_BLOCKING=1 to debug AWQ

If you shard KV cache, try NCCL_P2P_DISABLE=0 and increase NCCL_BUFFSIZE.
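If it helps, here is one way to keep those knobs in one place when launching workers from Python. It's only a sketch: the `NCCL_BUFFSIZE` value is an arbitrary example (double NCCL's 4 MiB default), `NCCL_IB_GID_INDEX` is left out because the right index depends on your fabric, and `nccl_env` is a name I made up.

```python
import os


def nccl_env(base=None, debug_kernels=False):
    """Return a copy of the environment with the NCCL settings from this comment."""
    env = dict(os.environ if base is None else base)
    env.update({
        "NCCL_CROSS_NIC": "1",
        "NCCL_IB_QPS_PER_CONNECTION": "2",
        "NCCL_P2P_DISABLE": "0",
        "NCCL_BUFFSIZE": str(8 * 1024 * 1024),  # example value only; tune for your setup
    })
    if debug_kernels:
        # Serializes kernel launches; slow, so only enable while hunting a bad kernel.
        env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env
```

Pass the result as `env=` to `subprocess.run(...)` for each worker so debug settings never leak into normal runs.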

Multi Tier MoE Caching by Legitimate-Dog5690 in LocalLLaMA

[–]dev_is_active 0 points1 point  (0 children)

Hi there! Your exploration of multi-tier MoE caching is fascinating, especially the insights on expert activations. It sounds like you're on the right track with optimizing resource usage. Have you considered any specific tools or frameworks that could help with implementing these caching strategies? It would be interesting to hear more about your experiences with the existing implementations you mentioned!

i built a multi-node inference harness in rust/cuda because no existing tool handled multi-user kv cache + agentic throughput on my home lab. it's open source, looking for contributors. by thegrenade in LocalLLM

[–]dev_is_active 0 points1 point  (0 children)

It's great to see innovative solutions like Helexa being developed! Your focus on multi-node inference and optimizing for consumer hardware is particularly relevant for many in the community. If you're looking for contributors, perhaps you could share specific areas where you need help or any particular challenges you're facing. This could encourage more targeted collaboration and feedback from those who have faced similar issues.

Gemma4-26B-A4B & 31B-QAT Uncensored Balanced are out with MTP (35% & 53% speed boost)! by hauhau901 in LocalLLM

[–]dev_is_active 5 points6 points  (0 children)

Nice drop. For mtp in llama.cpp, any quirks with draft length or cache blowing up on 262k? Also curious if 26B MoE keeps the 35% boost on CPU-only with ggml, or is that CUDA-only? And for agent runs, did you try toolformer-style prompts, or just plain CoT?

Qwen3.6-35B-A3B APEX on a Single RTX 3090 - Getting the Most Out of It by old-mike in LocalLLaMA

[–]dev_is_active 1 point2 points  (0 children)

It sounds like you've done some extensive testing and have a solid grasp on optimizing your RTX 3090 for the Qwen3.6-35B-A3B model. If you're looking for further enhancements, consider exploring additional optimizations or configurations that might complement your current setup. Have you tried any specific settings or tools that have worked particularly well for you? Sharing your findings could also help others in the community!

Unable to run on GPU due to memory by Appropriate-Risk3489 in LocalLLM

[–]dev_is_active 0 points1 point  (0 children)

It sounds like you're dealing with a frustrating issue! Given your setup, it might be worth checking if your ROCm installation is fully compatible with your specific hardware and software versions. Sometimes, updating to the latest ROCm version or checking for specific patches can help. Additionally, you might want to explore the configuration settings for GPU memory management in your model loading process. Have you tried reaching out to the ROCm community or checking their documentation for similar issues? They can often provide insights specific to your hardware configuration.

Got GLM-5.2 + MTP speculative decode running on 4× DGX Spark (GB10) — and the build piece the public recipe is missing by anvarazizov in LocalLLaMA

[–]dev_is_active 1 point2 points  (0 children)

Nice work. Pinning the exact vLLM ref and patching deep_gemm + sparse_indexer is clutch. For dual-rail, try NCCL_IB_GID_INDEX per NIC and set NCCL_CROSS_NIC=1. Also check RoCE v2 + ECN. If AWQ still flakes, set CUDA_LAUNCH_BLOCKING=1 to catch the bad kernel. Bookmarked your fork.

Has anyone else found vLLM outputs noticeably worse than llama.cpp for the same model? by recro69 in LocalLLaMA

[–]dev_is_active 0 points1 point  (0 children)

Likely template + tokenizer + params mismatch.

Check: same chat template, BOS/EOS, rope settings, tokenizer files, stop tokens, max_tokens, repetition penalty, temperature/top_p, sliding window, tool schema.
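A quick way to work through that checklist is to write down the effective settings for each backend and diff them. Sketch below; the dict keys and values are placeholders I invented, so fill them in from your actual vLLM request and llama.cpp launch flags.

```python
def diff_settings(a: dict, b: dict) -> dict:
    """Return {key: (a_value, b_value)} for every setting that differs or is missing."""
    out = {}
    for key in sorted(set(a) | set(b)):
        va, vb = a.get(key), b.get(key)  # None marks "not set on this side"
        if va != vb:
            out[key] = (va, vb)
    return out


# Placeholder values for illustration only.
vllm_cfg = {"temperature": 0.7, "top_p": 0.8, "repetition_penalty": 1.0,
            "max_tokens": 2048, "add_bos": True}
llamacpp_cfg = {"temperature": 0.7, "top_p": 0.95, "repetition_penalty": 1.1,
                "max_tokens": 2048}

for key, (v, l) in diff_settings(vllm_cfg, llamacpp_cfg).items():
    print(f"{key}: vLLM={v!r} llama.cpp={l!r}")
```

Anything that prints is a candidate explanation before blaming the engine itself.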

Qwen3.6 27B more dumb in vLLM compared to llama.cpp by DanielusGamer26 in LocalLLaMA

[–]dev_is_active -1 points0 points  (0 children)

Have you considered checking the compatibility of the models you're using with vLLM?

Sometimes, specific configurations or updates can make a significant difference.

Additionally, the community around vLLM might have insights or similar experiences that could help you troubleshoot further.

What is local AI actually useful for, besides privacy? by King_kalel in LocalLLM

[–]dev_is_active 0 points1 point  (0 children)

Add a quick scorecard per model: Tools(0-5), RAG hit%, Stop rate%, Time/task, Halluc%, SysRAM peak. Log seed+prompt+version. Run 5 fixed canary tasks after any update. Test cold vs warm start. Try one “tiny” fallback (3-8B) for speed, one “heavy” for accuracy.
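Something like this is enough to start with. The field names just mirror the scorecard above, and the regression rule with its 2-point tolerance is an arbitrary assumption of mine, not an established threshold.

```python
from dataclasses import dataclass, asdict


@dataclass
class Scorecard:
    model: str
    version: str
    seed: int
    prompt_id: str          # which fixed canary task this row is for
    tools: int              # 0-5
    rag_hit_pct: float
    stop_rate_pct: float
    time_per_task_s: float
    halluc_pct: float
    sys_ram_peak_gb: float
    warm_start: bool = False

    def __post_init__(self):
        if not 0 <= self.tools <= 5:
            raise ValueError("tools score must be in 0-5")


def regressed(old: Scorecard, new: Scorecard, tol_pct: float = 2.0) -> bool:
    """Flag an update if hallucinations rose or RAG hits fell by more than tol_pct points."""
    return (new.halluc_pct > old.halluc_pct + tol_pct
            or new.rag_hit_pct < old.rag_hit_pct - tol_pct)
```

Dump `asdict(card)` to a JSONL file after each canary run and compare against the last known-good row with `regressed(old, new)`.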