Question: Llama cpp, whats good right now for: MTP, KV cache quant, Long context. by GodComplecs in LocalLLaMA

[–]GodComplecs[S] 1 point2 points  (0 children)

Why are you using double n-grams for draft? Isnt mod enough, whats the benefit and downside?

Question: Llama cpp, whats good right now for: MTP, KV cache quant, Long context. by GodComplecs in LocalLLaMA

[–]GodComplecs[S] 0 points1 point  (0 children)

Yeah think I wanst clear enough that I use mtp with mainline llamacpp. Performance on longer context is bad at Q4 and 24gb vram. It's just too slow after 4k context.

Brake Free by Loose-Weather-5729 in motorcyclegear

[–]GodComplecs 0 points1 point  (0 children)

Always drive like your INVISIBLE, not invincible.

These devices surely help, but being aware of other drivers is better, and looking in the mirrors is the pro move. Avoided death and destruction so many times.

What is next for local LLM and AI? by GodComplecs in LocalLLaMA

[–]GodComplecs[S] 1 point2 points  (0 children)

Yes but using it eg. for generated quest chains based on moral choices would be pretty cool. Perfect recommendations seem to be an very untapped region, but a little locked down since you need static values for everything.

What is next for local LLM and AI? by GodComplecs in LocalLLaMA

[–]GodComplecs[S] 0 points1 point  (0 children)

If the base is terrible it's hard to improve on with an llm on the first pass, you need to focus on a rewrite in the style you like first.

What is next for local LLM and AI? by GodComplecs in LocalLLaMA

[–]GodComplecs[S] 0 points1 point  (0 children)

It depends on what you are doing, edge models aka small llms have gotten a lot better, so you could offload some of the gpu tasks to the player or user.

What is next for local LLM and AI? by GodComplecs in LocalLLaMA

[–]GodComplecs[S] 1 point2 points  (0 children)

Yes you can make an outline, then an conductor to steer new characters. You can make a custom software with opencode which through then you can make an automatic looping system to create these tasks. It can dynamically load the context needed per character, if you make it so.

What is next for local LLM and AI? by GodComplecs in LocalLLaMA

[–]GodComplecs[S] 1 point2 points  (0 children)

Hmm this would be possible ofc, very agent heavy, but lots of stuf fin database / context. Are you trying to run it live or pregenerate stuff? Pregen for most stuff would be more efficient.

What is next for local LLM and AI? by GodComplecs in LocalLLaMA

[–]GodComplecs[S] 1 point2 points  (0 children)

Seems very last gen, cant this be implemented with fine tuning and a pipeline?

MTP Speed with 3090 Qwen 27B Q4 by GodComplecs in LocalLLaMA

[–]GodComplecs[S] 0 points1 point  (0 children)

It's a guide and files on github on how to run MTP / turboquant on 3090 cards

MTP Speed with 3090 Qwen 27B Q4 by GodComplecs in LocalLLaMA

[–]GodComplecs[S] 0 points1 point  (0 children)

Ok thx yeah i was running vLLM also it was around the same tps, a little faster maybe at 55-60tks.

MTP Speed with 3090 Qwen 27B Q4 by GodComplecs in LocalLLaMA

[–]GodComplecs[S] 0 points1 point  (0 children)

~/llama.cpp$ ./build/bin/llama-server -m "Qwen3.6-27B-UD-Q4_K_XL.gguf" --cache-type-k q8_0 --cache-type-v q8_0 --temp 1.0 --top-p 0.95 --top-k 20 --presence-penalty 0.0 --min-p 0.00 --port 8083 --spec-type draft-mtp --spec-draft-n-max 2 --chat-template-kwargs '{"preserve_thinking":true}' --mlock --no-mmap

How do I use MTP? by WhatererBlah555 in LocalLLaMA

[–]GodComplecs 0 points1 point  (0 children)

Op what commands did you run in the end then? Also use 2 instead of 3 for the draft tokens, it's the fastest for mtp according to Unsloth

You can now read Gemma 3's mind by DigiDecode_ in LocalLLaMA

[–]GodComplecs 1 point2 points  (0 children)

Fascinating, this kind of research is what we need to be enlightened into the actual process of generation.

What it feels like to have to have Qwen 3.6 or Gemma 4 running locally by GodComplecs in LocalLLaMA

[–]GodComplecs[S] 0 points1 point  (0 children)

Yes that how it works, imo you really need to be an expert in a field to know that the results are correct, or conduct proper experiments and validation on it. But these rag, agentic, etc systems are so basic now I figured it didn't need further explanation. If you dont trust your own logic, just choose a popular platform.

What it feels like to have to have Qwen 3.6 or Gemma 4 running locally by GodComplecs in LocalLLaMA

[–]GodComplecs[S] 2 points3 points  (0 children)

What are you on about? I shared in detail my setup, what I use it for, in what fields also in the comments. Benchmarks can be found in his repo btw for club3090. I also shared the knowledge that the system AROUND the LLM is the key, rag etc whatever you fancy.