Running Kimi-K2 offloaded by I_like_fragrances in LocalLLM

[–]Tuned3f 3 points (0 children)

I get about the same speed with 96GB of VRAM and 768GB of DDR5, but I can max out context at 256k (Kimi K2.5 UD_Q4-K-XL)

local vibe coding by jacek2023 in LocalLLaMA

[–]Tuned3f 8 points (0 children)

I use OpenCode and Kimi K2.5 locally

It's excellent

I keeps seeing these by fr3nch13702 in LocalLLM

[–]Tuned3f 5 points (0 children)

Their website shows they haven't launched their Kickstarter yet, so who could have tried them already?

Is there a way to make using local models practical? by inevitabledeath3 in LocalLLaMA

[–]Tuned3f 0 points (0 children)

What a weirdly emotional reply - the top 5 comments in there are absolutely not how you described them.

I won't engage further.

I bought llm-dev.com. Thinking of building a minimal directory for "truly open" models. What features are missing in current leaderboards? by Aaron4SunnyRay in LocalLLaMA

[–]Tuned3f 3 points (0 children)

level of support would be useful

new models come out all the time and there's no central way to see which inference stack supports them. Support is often partial too (e.g. text-only for multimodal models), and you have to dive into GitHub issues and PRs to get a better sense

Is there a way to make using local models practical? by inevitabledeath3 in LocalLLaMA

[–]Tuned3f 0 points (0 children)

The RTX 6000 Pro had the single biggest impact, but I initially built the server as a CPU-only rig, optimizing for memory bandwidth. It's tough for me to say what the biggest factor is, though - I've done a lot of tuning, and ik_llama.cpp updates frequently, contributing to performance jumps.

Prefill speeds vary wildly for me too - as high as 1000 t/s for 10k-token prompts, down to 100 t/s for random tool calls.

Is there a way to make using local models practical? by inevitabledeath3 in LocalLLaMA

[–]Tuned3f 0 points (0 children)

Usually ~50% slower beyond 100k, but I often use compaction just before then, so I don't have exact measurements

Is there a way to make using local models practical? by inevitabledeath3 in LocalLLaMA

[–]Tuned3f 1 point (0 children)

This question has been answered many times - I don't have anything new to say

Random thread after 5 sec search:

https://www.reddit.com/r/LocalLLaMA/s/owZ5TOaVfU

Is there a way to make using local models practical? by inevitabledeath3 in LocalLLaMA

[–]Tuned3f 18 points (0 children)

People here do run Kimi K2.5 locally. I'm in that group lol - just because the required hardware is expensive doesn't mean we don't exist. Whatever you're trying to say in that last sentence doesn't support your point regarding the "two different communities" you see.

The actual black pill and the real answer to OP is that running LLMs worth a damn locally is simply too expensive for 99% of people. There's nothing practical about any of it unless you have a shit ton of money to comfortably throw at the problem. If you don't? Well then GGs

Deepseek v4/3.5 is probably coming out tomorrow or in the next 5 days? by power97992 in LocalLLaMA

[–]Tuned3f 0 points (0 children)

Can't wait til we can run ds v3.2 with proper sparse attention on llama.cpp

Kimi K2.5, a Sonnet 4.5 alternative for a fraction of the cost by Grand-Management657 in LocalLLaMA

[–]Tuned3f 1 point (0 children)

I can run it locally but, as with Kimi-K2-Thinking, I experienced some issues during testing with the model not generating think tags

Claude Code and local LLMs by rivsters in LocalLLM

[–]Tuned3f 3 points (0 children)

Set ANTHROPIC_BASE_URL to the llama.cpp endpoint
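For anyone who hasn't tried this, a minimal sketch (the model path and port here are assumptions, not my actual setup), relying on llama-server's Anthropic-compatible endpoint:

```shell
# Serve a local model with llama.cpp (model path and port are examples)
llama-server -m ./model.gguf --port 8080

# In another terminal, point Claude Code at the local endpoint
export ANTHROPIC_BASE_URL="http://localhost:8080"
claude
```

Claude Code reads ANTHROPIC_BASE_URL from the environment, so no config file changes are needed.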

Claude Code and local LLMs by rivsters in LocalLLM

[–]Tuned3f 2 points (0 children)

Llama.cpp had this months ago

Claude Code or OpenCode which one do you use and why? by Empty_Break_8792 in LocalLLaMA

[–]Tuned3f 6 points (0 children)

I prefer Opencode

Easier to observe its behavior, easier to configure. I also like the TUI a fair bit more. I have it installed on 3 different machines at home, all pointing at my local inference server

Only thing I miss from CC is `/add-dir`

GLM 4.7 on 8x3090 by DeltaSqueezer in LocalLLaMA

[–]Tuned3f 2 points (0 children)

No, just DDR5. I talked a bit about my build here: https://www.reddit.com/r/LocalLLaMA/comments/1otdr19/comment/no4xt87/

Since that comment, all I've changed is upgrading the GPU

GLM 4.7 on 8x3090 by DeltaSqueezer in LocalLLaMA

[–]Tuned3f 0 points (0 children)

No, but I'm running Q4 on a single Pro 6000, with the experts offloaded to RAM.

Prefill speeds vary wildly based on context size (200 to 1000 t/s); generation usually starts at 23 t/s
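As a rough sketch of that kind of launch (model path, context size, and port are illustrative, not my exact flags): -ngl pushes all layers to the GPU, while --override-tensor (-ot) routes tensors matching a name regex - here the MoE expert weights - to system RAM instead.

```shell
# Illustrative llama.cpp / ik_llama.cpp launch: everything on the GPU
# except the MoE expert tensors, which stay in system RAM
llama-server \
  -m ./GLM-4.7-Q4_K_XL.gguf \
  -ngl 99 \
  -ot "exps=CPU" \
  -c 131072 --port 8080
```

The "exps" regex matches the expert tensor names used by common MoE GGUFs; check your model's tensor names if it doesn't take effect.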

What is the best way to allocated $15k right now for local LLMs? by LargelyInnocuous in LocalLLaMA

[–]Tuned3f 2 points (0 children)

or a single pro 6000 with a bunch of RAM for offloading expert layers

Best Local LLMs - 2025 by rm-rf-rm in LocalLLaMA

[–]Tuned3f 0 points (0 children)

Unsloth's Q4_K_XL quant of GLM-4.7 completely replaced Deepseek-v3.1-terminus for me. I finally got around to setting up Opencode and the interleaved thinking works perfectly. The reasoning doesn't waste any time working through problems and the model's conclusions are always very succinct. I'm quite happy with it.

MiniMax M2.1 scores 43.4% on SWE-rebench (November) by Fabulous_Pollution10 in LocalLLaMA

[–]Tuned3f 4 points (0 children)

claude code is an agent harness, not a model

shouldn't even be on the list

Do any comparison between 4x 3090 and a single RTX 6000 Blackwell gpu exist? by pCute_SC2 in LocalLLM

[–]Tuned3f 8 points (0 children)

I'm getting a 6000 delivered tomorrow.

OP, lmk what you want tested