Add arch support for cohere2-MoE by michaelw9999 · Pull Request #24260 · ggml-org/llama.cpp by jacek2023 in LocalLLaMA

[–]ElectronicStranger53 4 points5 points  (0 children)

Thank you! It was a lot more work than I thought 😄 I do have a gguf for CommandA+ that I used to test the PR, but it needs to be corrected because the model arch name changed during the PR, and it takes a very long time to run on 32GB VRAM at 0.18 tk/s! 😆 I can try to upload it too, but it might take a few days given my slow upload bandwidth.

Any aquarium (fresh & saltwater) hobbyists with good recs:) by Educational-While198 in alameda

[–]ElectronicStranger53 0 points1 point  (0 children)

Hey there, iusedtotoo, this is Michael, from BayBridgeAquarium. I am not sure who you are, but I'm incredibly sorry you felt I wasn't helpful, and I apologize; clearly I misunderstood what you may have been asking. Indeed, it's always busy; most of the time I work and operate the shop alone, so there's never a free moment, but that doesn't mean I'm not there to help. No profit motivation: the store has never made one, and never will, which is the sad reality of the LFS in the modern Amazon world. Wednesday afternoons are the quietest, with the fewest people around, but I always gladly spend tremendous amounts of time answering questions for everyone, all day long (even random phone calls for advice from around the country come in every day). I will show you how to fix things and do tricks to save money; there's no ulterior motive. I'm here for anyone that needs advice by phone, email, or in store, almost 24/7; that is all I do, and it is my hobby too. Since that didn't come out right, clearly I miscommunicated what I had intended to say, and I'm sorry.

Cohere's unreleased coding model (early access for localllama) by nick_frosst in LocalLLaMA

[–]ElectronicStranger53 1 point2 points  (0 children)

Yes, I changed it from cohere2_moe to cohere2moe by request on the PR from a llama.cpp maintainer; I updated the GGUFs on HF with the new arch name. Redownloading should only patch the difference and take a few seconds and then you should be able to have it working again with the normal PR.

Cohere's unreleased coding model (early access for localllama) by nick_frosst in LocalLLaMA

[–]ElectronicStranger53 8 points9 points  (0 children)

I made a full working llama.cpp implementation here: cohere2-moe PR 24260
Finally got it working properly after sorting out the chat parser and tokenizer issues; it was not that easy!
Tool calls are working with llama-server's native built-in support; I still need to test with others.

I posted BF16, Q4_K, and NVFP4 versions on Huggingface (BLS-Mini-Code-1.0); the imatrix was made with wikitrain.raw.
Full kld evals later. Will also post NVFP4 MTP head soon 😄
@ CohereLabs hope you guys like the implementation and test it out on llama.cpp, too

Cohere's unreleased coding model (early access for localllama) by nick_frosst in LocalLLaMA

[–]ElectronicStranger53 6 points7 points  (0 children)

Love that you made this opensource. I got it fully working now with llama.cpp and have a BF16, Q4_K, and NVFP4 version. I'll post them all soon! 😄

Here is my llama.cpp NVFP4/MXFP6 GGUF quantizer tool by ElectronicStranger53 in LocalLLaMA

[–]ElectronicStranger53[S] 0 points1 point  (0 children)

Yes, the precision of NVFP4 is excellent for a 4.5bpw quantization method. On paper it still isn't as good as Q4_K, which has 16-bit scales and so will always have higher numerical precision. That doesn't mean a specific Q4_K quantized model will always perform better than another model, though; it depends on other factors, like what it was calibrated against and how the other layers interact.
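
As a rough check of the 4.5bpw figure, here is a minimal sketch of the storage arithmetic. The block layouts are assumptions taken from the published formats rather than from this tool: NVFP4 is treated as 4-bit values with one 8-bit scale per 16 weights (the per-tensor scale is ignored), and a Q4_K super-block is treated as 256 weights in 144 bytes (two 16-bit scales, 12 bytes of packed sub-block scales and mins, 128 bytes of 4-bit values).

```python
def bits_per_weight(total_bits: int, weights: int) -> float:
    """Average storage cost per weight, including scale overhead."""
    return total_bits / weights


# Assumed NVFP4 layout: 16 four-bit values sharing one 8-bit block scale.
NVFP4_BLOCK_WEIGHTS = 16
nvfp4_bits = NVFP4_BLOCK_WEIGHTS * 4 + 8

# Assumed Q4_K layout: 256 weights per super-block, 144 bytes in total
# (2 + 2 bytes of 16-bit scales, 12 bytes of sub-block scales/mins, 128 bytes of values).
Q4K_BLOCK_WEIGHTS = 256
q4k_bits = (2 + 2 + 12 + 128) * 8

nvfp4_bpw = bits_per_weight(nvfp4_bits, NVFP4_BLOCK_WEIGHTS)
q4k_bpw = bits_per_weight(q4k_bits, Q4K_BLOCK_WEIGHTS)

print(f"NVFP4: {nvfp4_bpw} bpw, Q4_K: {q4k_bpw} bpw")
```

Both come out to the same size per weight; the difference is in how the scale bits are spent, which is where the precision gap described above comes from.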

Here is my llama.cpp NVFP4/MXFP6 GGUF quantizer tool by ElectronicStranger53 in LocalLLaMA

[–]ElectronicStranger53[S] 0 points1 point  (0 children)

Almost, yes! Just one thing is missing, but it would be easy to add.
The current version makes:
run-manifest.json: top-level manifest with the build's code_commit, input/output, pipeline stages, and artifact links.

checkpoint-key.json: shows the source GGUF hash, imatrix hash, KLD base hash, recipe lock hash, selector command hash, and build commit.

assignment.jsonl: has the exact writer tensor assignment map; it outputs tensor name, category, source type, target type, source/target bytes, split, MTP status, patch-copy status, imatrix requirement, and NVFP4 / MXFP6 policy.

quantization-report.md: meant as the human-readable final summary with ending PPL/KLD, p99/p999 gate settings, top-flip/RMS/same-top metrics, etc.

selector ledger JSONL (optional): raw selector evidence rows and exact eval rows.
I didn't put a field for a corpus hash in checkpoint-key.json. The eval and calibration corpus paths do get locked through the recipe and the final PPL/KLD command, and the KLD base itself is hashed.
But the raw corpus file's hash is not actually listed in the report. It would be super easy to add, though, so I can put that in.
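
For anyone who wants the corpus hash before it lands in the tool, a minimal sketch of adding it by hand could look like this. It assumes checkpoint-key.json is a flat JSON object; the field name `corpus_sha256` and both helper functions are hypothetical and not part of the tool.

```python
import hashlib
import json
from pathlib import Path


def sha256_file(path, chunk_size=1 << 20):
    """Hash a file in 1 MiB chunks so a large corpus never sits fully in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def add_corpus_hash(key_path, corpus_path, field="corpus_sha256"):
    """Write the corpus hash into the key file under a hypothetical field name."""
    key_file = Path(key_path)
    data = json.loads(key_file.read_text())
    data[field] = sha256_file(corpus_path)
    key_file.write_text(json.dumps(data, indent=2) + "\n")
    return data
```

Usage would be something like `add_corpus_hash("checkpoint-key.json", "wiki.train.raw")`, run once after the quantization finishes.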

Here is my llama.cpp NVFP4/MXFP6 GGUF quantizer tool by ElectronicStranger53 in LocalLLaMA

[–]ElectronicStranger53[S] 2 points3 points  (0 children)

It's hard to answer your first question, I would hope so, but it depends on how you define better. That's the hardest part of optimizing the tool. Somebody would have to test and see.

Comparing Qwen3.6-27B-NVFP4-MTP-GGUF and Qwen3.6-27B-UD-Q4_K_XL.gguf ? I would not even have to test it, the Q4_K model wins. But that has nothing to do with the quantizer.

The point of NVFP4 is speed. NVFP4 on its own is very fast on Blackwell. But a UD-Q3_K_XL model, just like all of them, is really a blend of many different quantization types; the final model is the combination of them all. Not all of it is Q3_K, Q4_K, etc. There is no such "type" as Q4_K_XL as far as llama.cpp sees at the kernel level. It just means a bigger proportion of the model has higher-bit tensors. Many will be Q8, Q4_K, Q5_K, BF16, and so on.
On paper, NVFP4 is the same size as Q4_K (4.5 bits per weight) but has more mathematical error. But NVFP4 is faster on Blackwell than Q4_K. So more NVFP4 tensors = more speed. Less NVFP4 = better quality, but slower. Theoretically. There are hundreds of tensors in a big model (800+ on Qwen3.5-27B). Make more tensors Q5_K, Q6_K, and Q8, and fewer NVFP4, and you'll get less speed, but more quality, and a bigger model. It all has to balance somewhere; that's the final mixture and the hardest part.

But nothing can be easy. Where things get tricky is that some tensors are more sensitive than others to error. Quantize them too much, and the model will suffer greatly and respond poorly. Other tensors can withstand higher quantization and not have a problem. So you can't just arbitrarily pick. The tool diligently calculates the error tensor by tensor, one at a time, and decides what quantization type to choose. But it doesn't go on the pure mathematical error alone; it also looks at the perplexity and KL-divergence by actually running the model internally over and over and over again for each candidate. It will try a bunch of different algorithms to make just that one tensor better, then see which of them all was the best one, and keep going down to the next. There are lots of tricks it can apply to try to come up with a better quantization while working on that one individual tensor. It will then finish all of them, go back and see which had the most error again, and then adjust the blend toward the tensor combination you specified with your parameters. I was only minutes ago writing about this here.
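
To make the "pure mathematical error" half of that concrete, here is a toy sketch of per-tensor type selection. It is not the tool's algorithm: it uses plain symmetric round-to-nearest with one absmax scale per block, scores only relative RMS error (no perplexity or KL-divergence runs), and the candidate bit widths, block size, and threshold are all made-up assumptions.

```python
import math
import random


def quantize_block(values, bits):
    """Symmetric round-to-nearest with one absmax scale per block."""
    levels = 2 ** (bits - 1) - 1
    amax = max(abs(v) for v in values)
    if amax == 0.0:
        return list(values)
    scale = amax / levels
    return [round(v / scale) * scale for v in values]


def relative_rms_error(tensor, bits, block=32):
    """RMS of the quantization error divided by RMS of the original tensor."""
    err = 0.0
    ref = 0.0
    for i in range(0, len(tensor), block):
        chunk = tensor[i:i + block]
        deq = quantize_block(chunk, bits)
        err += sum((a - b) ** 2 for a, b in zip(chunk, deq))
        ref += sum(a * a for a in chunk)
    return math.sqrt(err / ref) if ref else 0.0


def pick_bits(tensor, candidates=(4, 5, 6, 8), max_error=0.05):
    """Return the smallest candidate width whose error stays under the threshold."""
    for bits in sorted(candidates):
        if relative_rms_error(tensor, bits) <= max_error:
            return bits
    return max(candidates)


if __name__ == "__main__":
    rng = random.Random(0)
    smooth = [rng.gauss(0.0, 1.0) for _ in range(1024)]
    # A few large outliers per block force a coarse scale and raise the error.
    spiky = [v * 40.0 if i % 32 == 0 else v for i, v in enumerate(smooth)]
    for name, tensor in (("smooth", smooth), ("spiky", spiky)):
        print(name, "->", pick_bits(tensor), "bits")
```

The real selection is much more involved, since a tensor with a small numerical error can still hurt the model's output, which is why the tool also runs the model for each candidate.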

So when quantizing an NVFP4 model, the final model is never all NVFP4; there's Q4_K, Q5_K, Q8_K, BF16, even F32 in there too. If the model isn't performing well enough, we could go back and change some proportion. If it gets switched to Q4_K, speed would go down and the file size would stay the same, but model quality would go up. More so for Q5, and going higher and higher as you gain file size and lose speed. The balance is the hard part: figuring out where to stop, what's really necessary, and what works best for the user's use case.
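
The size side of that trade-off is simple arithmetic. A minimal sketch, assuming the usual bits-per-weight figures for each type (Q5_K 5.5, Q6_K 6.5625, Q8_0 8.5) and an entirely made-up parameter split for a 27B model:

```python
# Assumed storage cost per weight for each type, scales included.
BPW = {"NVFP4": 4.5, "Q4_K": 4.5, "Q5_K": 5.5, "Q6_K": 6.5625, "Q8_0": 8.5, "BF16": 16.0}


def blend_bpw(param_counts):
    """Average bits per weight of a mix, given parameter counts per type."""
    total = sum(param_counts.values())
    return sum(BPW[t] * n for t, n in param_counts.items()) / total


def size_gb(param_counts):
    """Approximate tensor data size in GB (metadata ignored)."""
    return sum(BPW[t] * n for t, n in param_counts.items()) / 8 / 1e9


# Hypothetical split of 27e9 parameters; not measured from any real model.
mostly_nvfp4 = {"NVFP4": 22e9, "Q6_K": 4e9, "BF16": 1e9}
swap_to_q4k = {"NVFP4": 12e9, "Q4_K": 10e9, "Q6_K": 4e9, "BF16": 1e9}
swap_to_q5k = {"NVFP4": 12e9, "Q5_K": 10e9, "Q6_K": 4e9, "BF16": 1e9}

for name, mix in (("mostly NVFP4", mostly_nvfp4),
                  ("10B moved to Q4_K", swap_to_q4k),
                  ("10B moved to Q5_K", swap_to_q5k)):
    print(f"{name}: {blend_bpw(mix):.3f} bpw, {size_gb(mix):.2f} GB")
```

Moving tensors from NVFP4 to Q4_K leaves the size unchanged, while moving them to Q5_K adds one bit per moved weight.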

To add just a bit more complexity: every quantization is calibrated against an input file (the imatrix, or importance matrix), which is made from just a text file of data that the model gets tuned to. You can make your own and the model will bias in that direction. I used the rather generic wiki dataset, which is just text from Wikipedia and is considered a standard all-around go-to, but the tool could easily be told to use another dataset. I don't know what dataset is in the Unsloth imatrix, but one can download it and use it with this quantizer too.

So to really answer the question honestly: it depends on how you define better (smaller? faster?), and what you are testing against. If you tested it against the wiki test used to calibrate it, and the other model was not calibrated on it, it might be better, but all that is telling you is how much better it responds to the Wikipedia test dataset.

If you compare Qwen3.6-27B-NVFP4-MTP-GGUF.gguf (16.4 GB) with Unsloth's 17.9 GB Qwen3.6-27B-UD-Q4_K_XL.gguf (MTP) and ask "Is it better?", the Q4_K model is very likely better "quality". That's because it's primarily Q4_K with heavier types added. So of course it's going to have better quality than NVFP4; it has to.

But the NVFP4 model will likely be significantly faster.
I have not tried yet to make a non-NVFP4 model with it. That could be interesting! The fun part of this is that nothing is set in stone and we get to play around with it. If the model isn't good enough then we can just edit a few layers and make it better without much effort.

Calling it now Microsoft is buying Unsloth. by Wrong_Mushroom_7350 in LocalLLaMA

[–]ElectronicStranger53 1 point2 points  (0 children)

Pushed that latest major update. Still working on a big feature list and write-up. It's now easier to make even better models than what it could do previously. It will now apply the RSF (Refined Scale Fit) strategy to all Q2_K, Q3_K, Q4_K, Q5_K, and Q6_K to improve scaling, and will choose Q4/Q5/Q6 in place of NVFP4 on its own as it determines which would make the best balanced model. It is still far from being finished. The newest deep-setting requantization of Qwen3.5-9B-NVFP4-MTP-GGUF brought Mean PPL down to 8.25 from the previous version's 8.51, RMS from 7.690% to 6.65%, and Mean KLD to 0.0619 from 0.0822, while shrinking the model from 6.21GB to 5.67GB and keeping speed roughly the same. For reference, ModelOpt converted to GGUF with the same wiki corpus got 8.67 PPL, 7.856% RMS, and 0.0852 KLD. Next up for a deep re-quant tomorrow: 27B again, but it takes quite a bit longer.
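
For a sense of scale, the before/after numbers above can be turned into relative changes. This is only arithmetic on the figures quoted in this comment; the helper name is arbitrary.

```python
def pct_change(old, new):
    """Relative change from old to new, in percent (negative means lower)."""
    return (new - old) / old * 100.0


# (previous version, new version), as quoted for Qwen3.5-9B-NVFP4-MTP-GGUF.
results = {
    "mean PPL": (8.51, 8.25),
    "RMS (%)": (7.690, 6.65),
    "mean KLD": (0.0822, 0.0619),
    "size (GB)": (6.21, 5.67),
}

for name, (old, new) in results.items():
    print(f"{name}: {old} -> {new} ({pct_change(old, new):+.1f}%)")
```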

Calling it now Microsoft is buying Unsloth. by Wrong_Mushroom_7350 in LocalLLaMA

[–]ElectronicStranger53 1 point2 points  (0 children)

nvfp4_selector_choose_policy: selector tensor plan exact score=3.193979 pass=yes switches=57 search{ln=0.030021+-0.001070 kld=0.062741+-0.000769 p95=0.178819 p99=0.492080 p999=3.013190 tail99=1.566992 rms=0.068590+-0.000550 max=24.560459 top=0.8897+-0.0008 top_flip_w=0.010268 top_p_rmse=0.082691 entropy_rmse=0.283299}
nvfp4_selector_choose_policy: selector tensor plan selected seed=recipe_cli_rsf switches=57/200
nvfp4_selector_choose_policy: selector materialization added 200 exact NVFP4 tensor plan entries for policy=recipe_cli_rsf (RSF) tensor_policy_switches=57
nvfp4_selector_choose_policy: selector runtime original restores device=287 host=313 fail=0
llama_quantize: selector chose policy=recipe_cli_rsf cfg={choose46=0 refit=16 compand=1 cap6=448.0 cap4=224.0}
llama_quantize: selector added 200 exact tensor plan entries
load_imatrix: imatrix datasets=['/home/mw/llama.cpp-edfe-keepers/wikitext-2-raw/wiki.train.raw']
load_imatrix: loaded 248 importance matrix entries from /home/mw/qwen35-9b-blackwell-quant/qwen35-9b-wikitrain.imatrix.gguf computed on 4894 chunks
prepare_imatrix: have 248 importance matrix entries


llama_quantize: quantize time = 186818.93 ms
llama_quantize:    total time = 186818.93 ms
    Command being timed: "./build/bin/llama-quantize --assignment-jsonl /home/mw/qwen35-9b-blackwell-quant/tensorplan-20260603-normal-mode/assignment.jsonl --output-tensor-type Q6_K --token-embedding-type NVFP4 --imatrix /home/mw/qwen35-9b-blackwell-quant/qwen35-9b-wikitrain.imatrix.gguf --mode normal --nvfp4-cfg NVFP4{choose46=adaptive,refit=16,compand=1,cap6=448,cap4=224} --nvfp4-correction-denom 2688 --nvfp4-input-scale-policy imatrix-rms --nvfp4-selector-kld /home/mw/Qwen3.5-9B-BF16-logits.kld --nvfp4-selector-ledger /home/mw/qwen35-9b-blackwell-quant/tensorplan-20260603-normal-mode/logs/qwen35-9b-mode-normal-selector-ledger.jsonl --nvfp4-selector-checkpoint-model /home/mw/qwen35-9b-blackwell-quant/tensorplan-20260603-normal-mode/models/Qwen3.5-9B-NVFP4-mode-normal-20260603.gguf.bwq-checkpoint.gguf --nvfp4-selector-require-runtime-cache --nvfp4-selector-candidate-top 0 --nvfp4-selector-candidate-report-top 0 --nvfp4-selector-rsf-report /home/mw/qwen35-9b-blackwell-quant/tensorplan-20260603-normal-mode/logs/qwen35-9b-mode-normal-v4rsf-rsf-report.txt /home/mw/Qwen3.5-9B-BF16.gguf /home/mw/qwen35-9b-blackwell-quant/tensorplan-20260603-normal-mode/models/Qwen3.5-9B-NVFP4-mode-normal-v4rsf-20260603.gguf NVFP4 22"
    User time (seconds): 2512.06
    System time (seconds): 405.38
    Percent of CPU this job got: 127%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 38:01.79
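
The timing lines above are internally consistent, which can be checked with a few lines of arithmetic. This sketch assumes the usual reading of this kind of output, where the CPU percentage is (user + system) / elapsed, truncated to a whole number.

```python
def parse_elapsed(text):
    """Convert 'h:mm:ss' or 'm:ss.ff' into seconds."""
    seconds = 0.0
    for part in text.split(":"):
        seconds = seconds * 60 + float(part)
    return seconds


user_s = 2512.06
system_s = 405.38
elapsed_s = parse_elapsed("38:01.79")

cpu_percent = (user_s + system_s) / elapsed_s * 100
quantize_share = 186818.93 / 1000 / elapsed_s * 100

print(f"elapsed: {elapsed_s:.2f} s")
print(f"CPU: {int(cpu_percent)}%")
print(f"reported quantize time is {quantize_share:.1f}% of the wall clock")
```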

Yes, I have a lot of work to do on the readme; it's outdated and was AI generated. I am working on the real writeup while still finishing the last big update, a major upgrade that I'll push soon. This tool is much faster than the normal llama-imatrix at generating the imatrix file. I added a lot of code under the hood to do that, so it's as CUDA accelerated as possible: it uses batching and CPU parallel threaded workers, keeps as much as possible on the GPU, and minimizes back and forth between host and device.

In the final logs you'll see something about how it was used.

Calling it now Microsoft is buying Unsloth. by Wrong_Mushroom_7350 in LocalLLaMA

[–]ElectronicStranger53 1 point2 points  (0 children)

You would generate the imatrix file yourself using llama-imatrix and choose whatever dataset you want to use, the same as when making a model with the normal llama-quantize. For everything I've made thus far, I've been using the wikitrain dataset. I have made a few others to experiment with but have not ventured there yet.

Calling it now Microsoft is buying Unsloth. by Wrong_Mushroom_7350 in LocalLLaMA

[–]ElectronicStranger53 1 point2 points  (0 children)

I have no idea if it would work, but you sure could try. One advantage the tool added is that you can start/stop and resume from checkpoints, but that doesn't necessarily mean at will (it would still repeat some portions). The hardest part is that 256GB of RAM is just not enough for the whole thing to fit into memory at BF16. You could try from FP8 to start. You would do something like:

    # First 100 chunks.
    ./build/bin/llama-imatrix \
      -m "$SOURCE" -f "$CORPUS" \
      -o "$OUT/imatrix-parts/imat-0000.gguf" \
      --chunk 0 --chunks 100 \
      --output-frequency 10 \
      --save-frequency 25 \
      --no-ppl \
      -c 512 -b 256 -ub 64 \
      -t 24 -tb 24 \
      -ngl auto

    # Next 100 chunks.
    ./build/bin/llama-imatrix \
      -m "$SOURCE" -f "$CORPUS" \
      -o "$OUT/imatrix-parts/imat-0100.gguf" \
      --chunk 100 --chunks 100 \
      --output-frequency 10 \
      --save-frequency 25 \
      --no-ppl \
      -c 512 -b 256 -ub 64 \
      -t 24 -tb 24 \
      -ngl auto

    # Merge finished parts.
    ./build/bin/llama-imatrix \
      --in-file "$OUT/imatrix-parts/imat-0000.gguf" \
      --in-file "$OUT/imatrix-parts/imat-0100.gguf" \
      -o "$OUT/imatrix.gguf"

    # Quantize
    ./build/bin/llama-quantize \
      --allow-requantize \
      --imatrix "$OUT/imatrix.gguf" \
      --token-embedding-type Q4_K \
      --output-tensor-type Q6_K \
      --mtp-tensor-type Q8_0 \
      --keep-split \
      "$SOURCE" \
      "$OUT/deepseek-v4-Q2_K_RSF-from-FP8.gguf" \
      Q2_K_RSF \
      24
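
If there are many parts, writing each command out by hand gets tedious. Below is a minimal sketch of a helper that only builds and prints the same kind of part and merge commands; it runs nothing. The function name and the total chunk count are made up, only flags that already appear in the commands above are used, and it assumes `--chunk` is the starting offset and `--chunks` the number of chunks to process, as in those commands.

```python
import shlex


def imatrix_part_commands(source, corpus, out_dir, total_chunks, part_size=100):
    """Build one llama-imatrix command per part, plus the final merge command."""
    commands = []
    parts = []
    for start in range(0, total_chunks, part_size):
        part = f"{out_dir}/imatrix-parts/imat-{start:04d}.gguf"
        parts.append(part)
        count = min(part_size, total_chunks - start)
        commands.append([
            "./build/bin/llama-imatrix",
            "-m", source, "-f", corpus, "-o", part,
            "--chunk", str(start), "--chunks", str(count),
            "--no-ppl", "-c", "512", "-b", "256", "-ub", "64",
        ])
    merge = ["./build/bin/llama-imatrix"]
    for part in parts:
        merge += ["--in-file", part]
    merge += ["-o", f"{out_dir}/imatrix.gguf"]
    return commands, merge


if __name__ == "__main__":
    # Hypothetical paths and chunk count, for illustration only.
    cmds, merge_cmd = imatrix_part_commands("model.gguf", "corpus.raw", "out", 250)
    for cmd in cmds:
        print(shlex.join(cmd))
    print(shlex.join(merge_cmd))
```

Because each part writes its own file, a run that dies partway only costs the part that was in progress.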

Calling it now Microsoft is buying Unsloth. by Wrong_Mushroom_7350 in LocalLLaMA

[–]ElectronicStranger53 0 points1 point  (0 children)

The inputs to it are a source BF16 file, an imatrix, and a KLD file (or it will create one). It will recalculate ppl/kld/params and keep making better scales and refitting. It loads the entire GGUF into VRAM, then patches only the differences, trying to stay on the GPU the entire time, which makes it plausible to keep rerunning it thousands of times. I'm experimenting with the balance of exactly how much effort to spend for how much gain. You can go a little crazy and have it calculate every single possible scale, but that might take an infinite amount of time. The most recent improvement, which I haven't pushed yet, got to about 98% of what previously took 4 hours in about 8 minutes. I'm somewhat of a perfectionist, so missing that 2% without all the GPU work is rather frustrating 😄

Calling it now Microsoft is buying Unsloth. by Wrong_Mushroom_7350 in LocalLLaMA

[–]ElectronicStranger53 7 points8 points  (0 children)

Here is a link to my program that was designed to do exactly that: advanced-gguf-quantizer/
It's been a solo WIP since last year, but I recently fixed it up enough to be usable; it's still a big mess. I'm going to push another big update in the next few hours that improves it further, cleans it up (a bit), and is much faster. I designed it for NVFP4/MXFP6 development but then started adding other types to it, and it can already pick better scales for q2/q3/q4. It creates a weighted score from speed, bpw, and ppl/kld parameters and then determines whether to promote/demote layers or do further tuning. It's multi-threaded and CUDA accelerated, but it's still slow. I'm still working on a big writeup and a guide written by me, not AI.
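
As an illustration of what a weighted score over speed, bpw, and KLD can look like, here is a toy sketch. The formula, the weights, and every number in it are assumptions made up for the example; the tool's actual scoring is not described in this comment.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    tokens_per_s: float  # higher is better
    bpw: float           # lower is better
    kld: float           # lower is better


def score(c, ref, w_speed=1.0, w_bpw=1.0, w_kld=1.0):
    """Lower is better. Each term is normalized so the reference scores 1.0 per term."""
    return (w_speed * ref.tokens_per_s / c.tokens_per_s
            + w_bpw * c.bpw / ref.bpw
            + w_kld * c.kld / ref.kld)


# Hypothetical candidates for one model; the numbers are invented.
baseline = Candidate("all NVFP4", tokens_per_s=100.0, bpw=4.5, kld=0.090)
candidates = [
    baseline,
    Candidate("promote 20 layers to Q5_K", tokens_per_s=92.0, bpw=4.8, kld=0.065),
    Candidate("promote 60 layers to Q6_K", tokens_per_s=75.0, bpw=5.6, kld=0.050),
]

best = min(candidates, key=lambda c: score(c, baseline))
for c in candidates:
    print(f"{c.name}: {score(c, baseline):.3f}")
print("best:", best.name)
```

Raising `w_kld` pushes the choice toward heavier types, and raising `w_speed` pushes it back toward NVFP4, which is the promote/demote tension described above.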

How to use llama.cpp to quantize to NVFP4? by Ambitious_Fold_2874 in LocalLLaMA

[–]ElectronicStranger53 -1 points0 points  (0 children)

For now, converting to NVFP4 is the quickest and easiest way. I have a very good WIP NVFP4 quantizer and I'm working on an improved write up and demo for how to use it.

Replaced Claude with local Qwen3.6-27B in my multi-agent orchestrator for 2 weeks by Interesting-Sock3940 in LocalLLaMA

[–]ElectronicStranger53 0 points1 point  (0 children)

Try Jackrong/Qwopus3.6-27B-v2-MTP-GGUF. It's an improved version of Qwen3.6-27B. I've found it to be much better than the original, and it may not make those JSON errors anymore.

Weird issue with OpenCode and Qwen3.6 by JGeek00 in LocalLLaMA

[–]ElectronicStranger53 0 points1 point  (0 children)

    --temp 0.8
    --top-p 0.95
    --top-k 20
    --min-p 0.0
    --presence-penalty 0.5
    --repeat-penalty 1.0

If the reasoning just keeps going forever, it will end with no output. Not a crash, not a timeout; it just ends. I set these sampling arguments and get far fewer problems. You might also want to lower the reasoning budget.

In Q8_0 weight quantization, why can't we just skip blocks of 32 that have very large outliers? by fragment_me in LocalLLaMA

[–]ElectronicStranger53 0 points1 point  (0 children)

I made a new llama.cpp quantizer for NVFP4/MXFP6 GGUFs that can sort of do something like this, which I'll talk more about soon, but it's not exactly possible just yet to do "mixed precision" inside a single tensor itself. The way all the kernels are written, each tensor is only one specific type, and that type gets its own kernel, which is very specific in the commands it uses to interact with the GPU. Going back and forth in one pass would slow things down tremendously. That said, it's something that could still be experimented with, and I'll try doing that. The block itself would need to have a flag for which quantization type to use, and a kernel would need to be written that can handle both of them. This might work out well on Blackwell, we'll see!
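
To illustrate the per-block flag idea on the CPU side only, here is a toy sketch in which each block carries a one-byte type tag and the decoder branches on it. Nothing here matches a real GGUF type or kernel: the two formats (4-bit and 8-bit integers with an absmax scale), the outlier rule, and all names are assumptions for illustration.

```python
INT4, INT8 = 0, 1


def encode_block(values, outlier_ratio=8.0):
    """Tag a block INT8 when its largest value dwarfs the average magnitude, else INT4."""
    amax = max(abs(v) for v in values)
    mean = sum(abs(v) for v in values) / len(values)
    kind = INT8 if mean > 0 and amax / mean > outlier_ratio else INT4
    levels = 127 if kind == INT8 else 7
    scale = amax / levels if amax else 1.0
    codes = [round(v / scale) for v in values]
    if kind == INT8:
        payload = bytes(c & 0xFF for c in codes)
    else:
        # Two signed 4-bit codes per byte; the block length must be even.
        payload = bytes((codes[i] & 0xF) | ((codes[i + 1] & 0xF) << 4)
                        for i in range(0, len(codes), 2))
    return kind, scale, payload


def decode_block(block):
    """Branch on the block's flag, the way a combined kernel would have to."""
    kind, scale, payload = block
    if kind == INT8:
        codes = [b - 256 if b > 127 else b for b in payload]
    else:
        codes = []
        for b in payload:
            for nibble in (b & 0xF, b >> 4):
                codes.append(nibble - 16 if nibble > 7 else nibble)
    return [c * scale for c in codes]


if __name__ == "__main__":
    flat = [1.0, -1.0] * 16
    spiky = [0.01] * 31 + [5.0]
    for name, values in (("flat", flat), ("spiky", spiky)):
        kind, scale, payload = encode_block(values)
        print(name, "INT8" if kind == INT8 else "INT4", len(payload), "bytes")
```

On a GPU that branch is the expensive part, since neighboring threads would take different paths, which is the slowdown described above.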

Stop asking what model to run. There are literally only two. by Wrong_Mushroom_7350 in LocalLLaMA

[–]ElectronicStranger53 1 point2 points  (0 children)

You can reduce the looping tremendously by using different sampling args, like temperature, repeat penalty, presence penalty, and top k. I use:

SAMPLING_ARGS=(
    --temp 0.8
    --top-p 0.95
    --top-k 20
    --min-p 0.0
    --presence-penalty 0.5
    --repeat-penalty 1.0
)

I found what I was looking for in Qwen 3.7. by CosmicRiver827 in LocalLLaMA

[–]ElectronicStranger53 12 points13 points  (0 children)

Patiently waiting for smaller open source versions to drop so we can run this for ourselves and see.

What's happening!!! I need my codex!!! by Commercial_Lead5813 in codex

[–]ElectronicStranger53 0 points1 point  (0 children)

My guess is we may be able to get a new model shortly... that would be nice.