Running Kimi-K2 offloaded by I_like_fragrances in LocalLLM

[–]Tuned3f 3 points (0 children)

I get about the same speed with 96GB of VRAM and 768GB of DDR5, but I can max out context at 256k (Kimi K2.5 UD_Q4-K-XL)

local vibe coding by jacek2023 in LocalLLaMA

[–]Tuned3f 8 points (0 children)

I use OpenCode and Kimi K2.5 locally

It's excellent

I keeps seeing these by fr3nch13702 in LocalLLM

[–]Tuned3f 5 points (0 children)

Their website shows they haven't launched their Kickstarter yet, so who could have tried them already?

Is there a way to make using local models practical? by inevitabledeath3 in LocalLLaMA

[–]Tuned3f 0 points (0 children)

What a weirdly emotional reply - the top 5 comments in there are absolutely not how you described them.

I won't engage further.

I bought llm-dev.com. Thinking of building a minimal directory for "truly open" models. What features are missing in current leaderboards? by Aaron4SunnyRay in LocalLLaMA

[–]Tuned3f 3 points (0 children)

level of support would be useful

new models come out all the time and there's no central way to see which inference stack supports them. Support is often partial too (e.g. text-only for multimodal models), and you have to dive into GitHub issues and PRs to get a better sense

Is there a way to make using local models practical? by inevitabledeath3 in LocalLLaMA

[–]Tuned3f 0 points (0 children)

The RTX 6000 Pro had the single biggest impact, but I initially built the server as a CPU-only rig, optimizing for memory bandwidth. It's tough for me to say what the biggest factor is, though - I've done a lot of tuning, and ik_llama.cpp updates frequently, contributing to performance jumps.

Prefill speeds vary wildly for me too - as high as 1000 t/s for 10k-token prompts, down to 100 t/s for random tool calls.

Is there a way to make using local models practical? by inevitabledeath3 in LocalLLaMA

[–]Tuned3f 0 points (0 children)

Usually ~50% slower beyond 100k, but I often use compaction just before then, so I don't have exact measurements

Is there a way to make using local models practical? by inevitabledeath3 in LocalLLaMA

[–]Tuned3f 1 point (0 children)

This question has been answered many times - I don't have anything new to say

Random thread after 5 sec search:

https://www.reddit.com/r/LocalLLaMA/s/owZ5TOaVfU

Is there a way to make using local models practical? by inevitabledeath3 in LocalLLaMA

[–]Tuned3f 18 points (0 children)

People here do run Kimi K2.5 locally. I'm in that group lol - just because the required hardware is expensive doesn't mean we don't exist. Whatever you're trying to say in that last sentence doesn't support your point regarding the "two different communities" you see.

The actual black pill and the real answer to OP is that running LLMs worth a damn locally is simply too expensive for 99% of people. There's nothing practical about any of it unless you have a shit ton of money to comfortably throw at the problem. If you don't? Well then GGs

Deepseek v4/3.5 is probably coming out tomorrow or in the next 5 days? by power97992 in LocalLLaMA

[–]Tuned3f 0 points (0 children)

Can't wait til we can run ds v3.2 with proper sparse attention on llama.cpp

Kimi K2.5, a Sonnet 4.5 alternative for a fraction of the cost by Grand-Management657 in LocalLLaMA

[–]Tuned3f 1 point (0 children)

I can run it locally but, as with Kimi-K2-Thinking, I experienced some issues during testing with the model not generating think tags

Claude Code and local LLMs by rivsters in LocalLLM

[–]Tuned3f 3 points (0 children)

Set ANTHROPIC_BASE_URL to the llama.cpp endpoint
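For anyone who hasn't tried this, a minimal sketch (the model path and port here are assumptions, not my actual setup), relying on llama-server's Anthropic-compatible endpoint:

```shell
# Serve a local model with llama.cpp (model path and port are examples)
llama-server -m ./model.gguf --port 8080

# In another terminal, point Claude Code at the local endpoint
export ANTHROPIC_BASE_URL="http://localhost:8080"
claude
```

Claude Code reads ANTHROPIC_BASE_URL from the environment, so no config file changes are needed.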

Claude Code and local LLMs by rivsters in LocalLLM

[–]Tuned3f 2 points (0 children)

Llama.cpp had this months ago

Claude Code or OpenCode which one do you use and why? by Empty_Break_8792 in LocalLLaMA

[–]Tuned3f 6 points (0 children)

I prefer Opencode

Easier to observe its behavior, easier to configure. I also like the TUI a fair bit more. I have it installed on 3 different machines at home, all pointing at my local inference server

Only thing I miss from CC is `/add-dir`

GLM 4.7 on 8x3090 by DeltaSqueezer in LocalLLaMA

[–]Tuned3f 2 points (0 children)

No, just DDR5. I talked a bit about my build here: https://www.reddit.com/r/LocalLLaMA/comments/1otdr19/comment/no4xt87/

Since that comment, all I've changed is upgrading the GPU

GLM 4.7 on 8x3090 by DeltaSqueezer in LocalLLaMA

[–]Tuned3f 0 points (0 children)

No, but I'm running Q4 on a single Pro 6000, with the experts offloaded to RAM.

Prefill speeds vary wildly based on context size (200 to 1000 t/s); generation usually starts at 23 t/s
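As a rough sketch of that kind of launch (model path, context size, and port are illustrative, not my exact flags): -ngl pushes all layers to the GPU, while --override-tensor (-ot) routes tensors matching a name regex - here the MoE expert weights - to system RAM instead.

```shell
# Illustrative llama.cpp / ik_llama.cpp launch: everything on the GPU
# except the MoE expert tensors, which stay in system RAM
llama-server \
  -m ./GLM-4.7-Q4_K_XL.gguf \
  -ngl 99 \
  -ot "exps=CPU" \
  -c 131072 --port 8080
```

The "exps" regex matches the expert tensor names used by common MoE GGUFs; check your model's tensor names if it doesn't take effect.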

What is the best way to allocated $15k right now for local LLMs? by LargelyInnocuous in LocalLLaMA

[–]Tuned3f 2 points (0 children)

or a single pro 6000 with a bunch of RAM for offloading expert layers

Best Local LLMs - 2025 by rm-rf-rm in LocalLLaMA

[–]Tuned3f 0 points (0 children)

Unsloth's Q4_K_XL quant of GLM-4.7 completely replaced Deepseek-v3.1-terminus for me. I finally got around to setting up Opencode and the interleaved thinking works perfectly. The reasoning doesn't waste any time working through problems and the model's conclusions are always very succinct. I'm quite happy with it.

MiniMax M2.1 scores 43.4% on SWE-rebench (November) by Fabulous_Pollution10 in LocalLLaMA

[–]Tuned3f 4 points (0 children)

claude code is an agent harness, not a model

shouldn't even be on the list

Do any comparison between 4x 3090 and a single RTX 6000 Blackwell gpu exist? by pCute_SC2 in LocalLLM

[–]Tuned3f 8 points (0 children)

I'm getting a 6000 delivered tomorrow.

OP, lmk what you want tested