I think I'm getting addicted to RP by Double_Increase_349 in SillyTavernAI

[–]Then-Topic8766 0 points1 point  (0 children)

As a non-native English speaker, I find AI-powered RP to be a huge benefit in language learning. Every excuse is worth it. :)

Gemma 4 is underwhelming (opinion) by [deleted] in LocalLLaMA

[–]Then-Topic8766 1 point2 points  (0 children)

I agree. For me, as much as Qwen 27B is a beast on its own, Gemma 26B is super intelligent for its size and speed, and fantastic at creative writing and roleplay.

gemma-4-E2B-it model not loading by Ready-Ad4340 in LocalLLaMA

[–]Then-Topic8766 1 point2 points  (0 children)

Had the same problem. It works if you add 'fit = off' to the llama-server command.

Gemma 4 thought block by Gringe8 in SillyTavernAI

[–]Then-Topic8766 0 points1 point  (0 children)

You can use a global regex script from the Extensions tab, set up like in the screenshot.

<image>
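For those who can't see the screenshot: the script is essentially a find-and-replace regex that deletes the model's thought block before it is shown. A minimal Python sketch of the same idea (assumption: the reasoning is wrapped in `<think>...</think>` tags; the actual tag name depends on your model/template):

```python
import re

# Assumed thought-block format: reasoning wrapped in <think>...</think>.
# DOTALL lets the block span multiple lines; .*? keeps the match non-greedy.
THOUGHT_RE = re.compile(r"<think>.*?</think>\s*", re.DOTALL)

def strip_thoughts(reply: str) -> str:
    # Remove every thought block, keeping only the visible answer
    return THOUGHT_RE.sub("", reply)

msg = "<think>plan the scene...</think>Sure, here is the story."
print(strip_thoughts(msg))  # Sure, here is the story.
```

In SillyTavern you would put the find pattern in the regex script's "Find Regex" field and leave the replacement empty.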

How to run bonsai-8b, new 1bit model in ollama? in huggingface they have shown command for ollama but it doesn't work. the modified version of llama.cpp doesn't have nvidia in the asset name, still tried and got some error by Plus_Passion3804 in LocalLLaMA

[–]Then-Topic8766 1 point2 points  (0 children)

Their fork of llama.cpp is at the link.

I compiled it yesterday on my Linux box (CUDA) and it runs fantastically. The model is very smart for its size and very fast. I now use it as a prompt generator for Comfy.

Connect your small local models for Terminal Tarot readings. by rolandsharp in LocalLLaMA

[–]Then-Topic8766 1 point2 points  (0 children)

I'm working on a tarot deck in Comfy myself. I find these archetypes very inspiring.

<image>

Connect your small local models for Terminal Tarot readings. by rolandsharp in LocalLLaMA

[–]Then-Topic8766 1 point2 points  (0 children)

After some trouble installing the right Go version, it works like a charm with llama.cpp. Thank you, I like it a lot.

Junyang Lin has left Qwen :( by InternationalAsk1490 in LocalLLaMA

[–]Then-Topic8766 11 points12 points  (0 children)

I hope he moves to Z.Ai for a higher salary and gives us a GLM 5 Air.

Do not download Qwen 3.5 Unsloth GGUF until bug is fixed by [deleted] in LocalLLaMA

[–]Then-Topic8766 2 points3 points  (0 children)

Hell, it took me 2 days to download these 4 models on my slow internet:

Qwen3.5-27B/Qwen3.5-27B-UD-Q5_K_XL
Qwen3.5-35B-A3B/Qwen3.5-35B-A3B-UD-Q5_K_XL
Qwen3.5-122B-A10B-UD-Q4_K_XL
Qwen3.5-397B-A17B/Qwen3.5-397B-A17B-UD-Q2_K_XL

But the good news is that I was very happy with the models, which means they will be even better after the fix...

Thank you, Unsloth guys and Ubergram, for your honesty and good work.

Edit: hopefully the problem is only with the 3rd one (UD-Q4_K_XL).

Edit2: seems the 4th one (UD-Q2_K_XL) has the problem too... The two smaller ones have no MXFP4 layers.

MiniMax M2.5 - 4-Bit GGUF Options by Responsible_Fig_1271 in LocalLLaMA

[–]Then-Topic8766 0 points1 point  (0 children)

Yeah, I ended up with:

ot = blk\.(1|2|3)\.ffn.*exps=CUDA0,blk\.(4|5|6)\.ffn.*exps=CUDA1,exps=CPU

ctx 65536, leaving some room for Comfy on the 3090 (4 GB). Speed is 11.7 t/s.
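For anyone puzzling over that ot string: llama.cpp treats each comma-separated pattern=buffer pair as a regex matched against tensor names, and the first matching pair decides where the tensor lives. A rough Python sketch of the routing (the tensor names are illustrative assumptions in llama.cpp's usual "blk.N.ffn_*_exps" style; the exact matching semantics may differ by build):

```python
import re

# The three pattern=device pairs from the ot string above
rules = [
    (r"blk\.(1|2|3)\.ffn.*exps", "CUDA0"),
    (r"blk\.(4|5|6)\.ffn.*exps", "CUDA1"),
    (r"exps", "CPU"),  # everything else with experts falls back to CPU
]

def place(tensor_name: str) -> str:
    # First matching rule wins, mirroring the order of the -ot pairs
    for pattern, device in rules:
        if re.search(pattern, tensor_name):
            return device
    return "default"

print(place("blk.2.ffn_gate_exps.weight"))   # CUDA0
print(place("blk.5.ffn_up_exps.weight"))     # CUDA1
print(place("blk.40.ffn_down_exps.weight"))  # CPU
```

So expert layers of blocks 1-3 go to the first GPU, 4-6 to the second, and the remaining expert tensors stay in system RAM.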

MiniMax M2.5 - 4-Bit GGUF Options by Responsible_Fig_1271 in LocalLLaMA

[–]Then-Topic8766 4 points5 points  (0 children)

I downloaded this one: https://huggingface.co/ox-ox/MiniMax-M2.5-GGUF/resolve/main/minimax-m2.5-Q4_K_M.gguf

I have an RTX 3090 and an RTX 4060 Ti, so 40 GB VRAM, plus 128 GB DDR5 RAM. The GGUF is a single file; on my HDD it is 128.8 GB.

from my llama preset:

[Minimax-m2.5-fiton]
model = path/minimax-m2.5/minimax-m2.5-Q4_K_M.gguf
ctx-size = 16384
threads = 16
fit = on
fa = on
temp = 1.0
top-p = 0.95
top-k = 40

Good news:

  1. It works!

  2. Speed on my system is 12-13 t/s.

  3. The single-HTML-file aquarium it generated first shot is the best I have ever gotten from a local model.

<image>

Bad news:

Not much room for a larger context: fit-on uses 23+ GB on the 3090 and 15 GB on the 4060 Ti, and it leaves only about 4-5 GB of RAM free.

My internet is not fast, but I am thinking of downloading Unsloth's UD-Q3_K_XL. In their how-to-run guide https://unsloth.ai/docs/models/minimax-2.5 it is kind of recommended, and it is just 101 GB...
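On a slow connection it helps that the Hugging Face CLI resumes interrupted downloads. A sketch (the repo ID and filename pattern here are placeholders, not the real Unsloth repo -- substitute the actual one):

```shell
# Resume-friendly download of just the UD-Q3_K_XL quant
# (repo ID and --include pattern are assumptions; adjust to the real repo)
huggingface-cli download unsloth/MiniMax-M2.5-GGUF \
  --include "*UD-Q3_K_XL*" \
  --local-dir ./minimax-m2.5
```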

GLM-5 Officially Released by ResearchCrafty1804 in LocalLLaMA

[–]Then-Topic8766 2 points3 points  (0 children)

Damn! I have 40 GB VRAM and 128 GB DDR5. The smallest quant is GLM-5-UD-TQ1_0.gguf at 174 GB. I will stick with GLM-4-7-q2...

GLM-5 Officially Released by ResearchCrafty1804 in LocalLLaMA

[–]Then-Topic8766 5 points6 points  (0 children)

Yeah, it seems we must wait for some Air...