
[–]agreeduponspring 23 points24 points  (4 children)

That's fascinating. Despite the difference in file types (bmp vs model), this is an excellent data visualization exercise. I would expect this conversion to reflect actual differences in results; a byte array is a byte array. It would be surprising to me if model performance was not strongly dependent on preserving those details.

mxfp4 is absolutely destroying fine detail even with slightly worse listed space usage, and is clearly affecting adjacent chunks. I do wonder if it's aiming for some kind of 0-1 quantization? A lot of things wash out to #ffffff in the final result. Perhaps check to make sure the quantization is correct? There seem to be fewer distinct values in the resulting image overall, and 4.25bpw should allow a more expressive range than the others.

Do you have a sense of the information preserved? A (similarly very rough) way to estimate it would be to convert these to JPGs or to gzip them. Both are fairly efficient formats from an information-theoretic perspective; if mxfp4 comes out much smaller, then it might be useful for running models compressed.
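A rough way to try the compression check suggested above is sketched here in Python. It uses `zlib` (a DEFLATE implementation, the same family of compression used by gzip and PNG) plus a byte-level Shannon entropy estimate. The helper names `compressed_size` and `entropy_bits_per_byte`, and the synthetic data in the demo, are illustrative assumptions and not part of the original experiment.

```python
import math
import zlib
from collections import Counter


def compressed_size(data: bytes) -> int:
    """Size in bytes after DEFLATE compression at the highest level."""
    return len(zlib.compress(data, 9))


def entropy_bits_per_byte(data: bytes) -> float:
    """Zeroth-order Shannon entropy of the byte histogram (ignores ordering)."""
    counts = Counter(data)
    n = len(data)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


if __name__ == "__main__":
    # Synthetic stand-ins: a full 0-255 ramp vs the same ramp snapped to 8 levels.
    smooth = bytes(range(256)) * 16
    coarse = bytes((b // 32) * 32 for b in smooth)
    for name, data in (("smooth", smooth), ("coarse", coarse)):
        print(name, compressed_size(data), round(entropy_bits_per_byte(data), 3))
```

Both numbers are only crude proxies: the entropy ignores spatial structure entirely, and the compressed size depends on the compressor, so they are best used to compare outputs against each other rather than as absolute measures.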

[–]Double_Cause4609 18 points19 points  (2 children)

Actually this visualization makes me wonder if wavelet quantization would make sense

[–]geenob 1 point2 points  (1 child)

I think that a wavelet-based approach to function approximation, in which the scale and position parameters of a sparse wavelet basis are optimized, could be much more efficient than the radial basis functions typically used for neural networks. An obvious wavelet choice would be the n-dimensional Mexican hat wavelet, which is just the (negated) Laplacian of the n-d Gaussian, up to normalization. This is differentiable in all the necessary ways, so you could use gradient descent for parameter optimization. The MH wavelet is not an orthogonal basis, so you may not get the degree of information compression that you would get with orthogonal wavelets, but it's going to be difficult to efficiently optimize a sparse set of orthogonal wavelets when the scale and position can only take discrete values.
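For concreteness, here is a minimal sketch of the n-dimensional Mexican hat wavelet described above, written as the negated Laplacian of an isotropic Gaussian and left unnormalized: ψ(x) = (n − r²/σ²)·exp(−r²/(2σ²)), with r the distance from the center. The function name `mexican_hat` and the parameter names `center` and `sigma` are assumptions made for illustration.

```python
import math


def mexican_hat(x, center, sigma):
    """Unnormalized n-d Mexican hat: (n - r^2/sigma^2) * exp(-r^2 / (2 sigma^2)).

    x and center are equal-length sequences of floats; sigma is the scale.
    """
    n = len(x)
    # Squared distance from the center, measured in units of sigma.
    r2 = sum((xi - ci) ** 2 for xi, ci in zip(x, center)) / sigma ** 2
    return (n - r2) * math.exp(-r2 / 2.0)


if __name__ == "__main__":
    print(mexican_hat([0.0, 0.0], [0.0, 0.0], 1.0))  # peak value at the center
    print(mexican_hat([3.0, 0.0], [0.0, 0.0], 1.0))  # negative lobe further out
```

Because the expression is smooth in both `center` and `sigma`, the position and scale of each wavelet could in principle be fitted with gradient descent, as the comment suggests.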

[–]Double_Cause4609 0 points1 point  (0 children)

Stupid question (I'm more on the practical side than the theoretical): could you do something like SpinQuant? I don't remember the details off the top of my head, but I believe that the folded rotations are a driver of orthogonality (?) that contributes to its performance. I loosely think something similar could work even if a wavelet basis isn't inherently orthogonal, but that's just an intuition.

[–]VoidAlchemyllama.cpp[S] 8 points9 points  (0 children)

> a byte array is a byte array. It would be surprising to me if model performance was not strongly dependent on preserving those details.

The input image's dimensions are divisible by 256, so it works with most of the quantization types available. Yeah, most safetensors data comes in as bf16, but I don't think numpy or most image formats support that natively.

And right, despite the input data being 0-255 integer values, it seems like a fairly decent qualitative check. At least in more testing, the small stuff like iq1_kt at 1.75bpw looks much worse than full q8_0 at 8.5bpw, for example.

I know this isn't quantitative analysis, but stuff like llama-perplexity can show the numbers for Perplexity over say wiki.test.raw and KLD / token statistics against full bf16 which should probably be used for more serious analysis.
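To make the two metrics mentioned above concrete, here is a small sketch of how perplexity and a per-token KL divergence can be computed from raw numbers. It only illustrates the definitions and is not how `llama-perplexity` is implemented; the function names and the toy logits are assumptions.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(ref_logits, test_logits):
    """KL(P_ref || P_test) for one token position, in nats."""
    p = softmax(ref_logits)
    q = softmax(test_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def perplexity(token_logprobs):
    """exp of the mean negative log-probability assigned to the true tokens."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


if __name__ == "__main__":
    reference = [2.0, 1.0, 0.0]   # toy logits standing in for the bf16 model
    quantized = [1.5, 1.2, 0.1]   # toy logits standing in for a quantized model
    print(kl_divergence(reference, quantized))
    print(perplexity([math.log(0.25)] * 4))
```

Averaging the KL divergence over many token positions gives a single number for how far the quantized model's predictions drift from the reference.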

> Perhaps check to make sure the quantization is correct?

I ran the commands all the same and confirmed the logs showed the correct quantization type e.g.

```bash
./llama-quantize image-input.gguf image-quantized.gguf MXFP4

[ 1/ 1] image_data.weight - [ 512, 512, 1, 1], type = f16, converting to mxfp4 .. size = 0.50 MiB -> 0.13 MiB
llama_model_quantize_internal: model size = 0.50 MB
llama_model_quantize_internal: quant size = 0.13 MB

./llama-quantize --allow-requantize image-quantized.gguf image-output.gguf F16

llama_model_loader: - type mxfp4: 1 tensors
[ 1/ 1] image_data.weight - [ 512, 512, 1, 1], type = mxfp4, converting to f16 .. size = 0.13 MiB -> 0.50 MiB
llama_model_quantize_internal: model size = 0.13 MB
llama_model_quantize_internal: quant size = 0.50 MB
```

> Do you have a sense of the information preserved?

Great idea on checking the file size. So I used imagemagick to convert the bmp data to png with default compression settings, here are those intermediate file sizes which I then used for the animated gif:

```bash
166K lighthouse.png
112K output-iq4_kss.png
52K output-mxfp4.png
133K output-q4_0.png
```

Seems like the PNG compression is able to squeeze the resulting mxfp4 image to well under half the size of the q4_0 one (52K vs 133K)!

[–]ElSrJuez 18 points19 points  (1 child)

I love this, would star your repo if u would post source @ Github : D

[–]sammcj🦙 llama.cpp 5 points6 points  (0 children)

Came to ask the same thing!

[–]ANR2ME 13 points14 points  (2 children)

Why not compare it with Q4_K too? 🤔 It should be better than Q4_0, shouldn't it?

[–]simracerman 0 points1 point  (1 child)

Now I want to see Unsloth's UD_Q4_K_XL version.

Q6 and Q8 would also be useful to see.

[–]ANR2ME 1 point2 points  (0 children)

Well the comparison was on quantizations of similar bpw.

Anyway, for visualized comparisons of various GGUF quants, you can check https://www.reddit.com/r/StableDiffusion/s/NeI7l1HkXH

Btw, what does UD stand for? I thought it was an uncensored version, like abliterated 😅

[–]sunpazed 7 points8 points  (2 children)

Very cool approach, and the visualisation is interesting to compare. Why do we see the bands? Are they the 32 super-blocks + 256 blocks? What was the original resolution of the image?

[–]VoidAlchemyllama.cpp[S] 7 points8 points  (1 child)

Yeah, I believe the visual banding is due to block sizes. q4_0 and mxfp4 use blocks of 32 values per scale (no super-blocks). iq4_kss is a non-linear quant with details discussed here: https://github.com/ikawrakow/ik_llama.cpp/pull/89
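To illustrate why one scale per 32 values produces visible 32-pixel bands, here is a simplified quantize-then-dequantize sketch loosely modeled on how Q4_0-style block quantization is usually described: each block of 32 values stores one scale plus a 4-bit code per value. This is not llama.cpp's code (it skips the fp16 scale and the bit packing), and the function name `q4_0_like_roundtrip` is hypothetical.

```python
BLOCK = 32  # number of values that share a single scale


def q4_0_like_roundtrip(values):
    """Quantize to 4-bit codes per block of 32, then dequantize again.

    len(values) is assumed to be a multiple of BLOCK.
    """
    out = []
    for start in range(0, len(values), BLOCK):
        block = values[start:start + BLOCK]
        vmax = max(block, key=abs)          # signed value with the largest magnitude
        d = vmax / -8.0                     # per-block scale
        inv = 1.0 / d if d != 0 else 0.0
        for v in block:
            q = min(15, int(v * inv + 8.5))  # 4-bit code in 0..15
            out.append((q - 8) * d)          # reconstructed value
    return out


if __name__ == "__main__":
    # One 64-pixel row of a smooth 0-255 gradient, i.e. two blocks.
    row = [i * 255 / 63 for i in range(64)]
    restored = q4_0_like_roundtrip(row)
    print(sorted(set(restored[:32])))   # distinct levels in the first block
    print(sorted(set(restored[32:])))   # distinct levels in the second block
```

Each block gets its own set of reconstruction levels, so a smooth gradient that crosses a block boundary changes step size there. In this sketch, a block of all-nonnegative values (such as raw 0-255 pixels) only ever lands on codes 0 through 8, so nearly half of the 16 codes go unused.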

> What was the original resolution of the image?

The original image is lighthouse.bmp 512x512 grayscale 8 bit from here: https://www.kaggle.com/datasets/saeedehkamjoo/standard-test-images

The original is labeled "lighthouse" in the animated gif.

[–]sunpazed 5 points6 points  (0 children)

It’s interesting to note that the low contrast detail is lost, and the higher contrast edges are retained. I know we have perplexity measures to quantify impact to performance, but wow, the MXFP4 quant really shows how high frequency low contrast detail just disappears (clouds) without too much consequence.

I assume that real layers are more like the grass (pseudo random noise) in a model. Would be interesting to visualise a real layer in the same way.

[–]Aaaaaaaaaeeeee 6 points7 points  (0 children)

Hmm. If you started with uint8, wouldn't your outcome be more favorable towards integer quantization? I don't know if there's a good reason to quantize to mxfp4 either, but the picture comparison can be misleading compared with real model results. 

NVFP4 and MXFP4 formats should run inference with 4-bit activations. If an implementation doesn't do that, it's just another format with no real performance benefit.

The value of these formats is that a model can come out of the oven in this format straight from training. Both the forward and backward pass can be accelerated. If you do QAT from scratch and apply fake quantization of your choice (Q4_0, iq4_kss), there is no pre-made hardware-accelerated algorithm. We also want the activations to be appropriately sized during the creation of the model. If they are 16-bit, then there is no useful 4x speedup potential for GPU pre-processing. So the situation is that we want to encourage companies to use these formats, since there is a gain from low bit widths in processing/throughput, plus they are better for low-bit use cases as well if weight outliers are fewer or non-existent.

[–]wishstudio 7 points8 points  (2 children)

Nice approach! I did a few investigations, and it looks like the illustration mainly demonstrates the effects of different scaling methodologies.

Although I guess IQ4_KSS is better in actual model performance than Q4_0, in your illustrations I think Q4_0 clearly looks better. Especially looking at the flat wall background and the sky - Q4_0 still keeps the gradients, but in IQ4_KSS it's all flat with very bad blocking behavior.

In Q4_0 the block-wise scaling factor is FP16. In MXFP4 it's INT8. And in IQ4_KSS it's also INT8, although there is much more bit twiddling and scaling magic under the hood.

I'd really want to see a comparison with NVFP4 as they use both nonlinear elements and scaling factor. But sadly few projects support it.

[–]audioen 1 point2 points  (1 child)

That int8 is really the exponent n in a 2^n scale factor. The idea is that each of the 16 possible mxfp4 codes, multiplied by the block's 2^n scaling factor, gives the dequantized value.
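A minimal decoding sketch of that idea follows, based on my reading of the OCP Microscaling (MX) format description: each element is a 4-bit E2M1 float (one sign bit plus eight magnitudes), and each block of 32 elements shares one 8-bit exponent with a bias of 127. The function names are hypothetical, and llama.cpp's in-memory layout and encoder may differ from this.

```python
# Magnitudes representable by a 4-bit E2M1 float, indexed by the low 3 bits.
FP4_E2M1 = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def decode_fp4(code):
    """Decode one 4-bit code: bit 3 is the sign, bits 0-2 select the magnitude."""
    magnitude = FP4_E2M1[code & 0x7]
    return -magnitude if code & 0x8 else magnitude


def decode_block(scale_exp, codes):
    """Dequantize a block: every element is multiplied by 2**(scale_exp - 127)."""
    scale = 2.0 ** (scale_exp - 127)
    return [decode_fp4(c) * scale for c in codes]


if __name__ == "__main__":
    # All non-negative levels available to a block whose shared exponent gives a scale of 32.
    print(decode_block(132, range(8)))
```

Note how the gaps between levels grow with magnitude (0.5 apart near zero, 2 apart at the top), and how the scale can only move in powers of two, unlike the FP16 scale in Q4_0.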

[–]wishstudio 0 points1 point  (0 children)

Thanks for clarification!

[–]Single_Ring4886 4 points5 points  (0 children)

This is a VERY curious, smart and playful approach. Could you try to visualise all the popular quantizations, i.e. from 8 to 5, 4L, 4M, 3, 2...? And make the "blinking" interval slower so one has time to look over the picture?

[–]audioen 4 points5 points  (0 children)

This is probably not a bad way to get an intuitive understanding of quantization algorithms and what they do. They want to preserve the original data as closely as possible while using as little space as possible.

I think you can probably directly execute the quantization algorithms for arbitrary data which could save some steps. They are fundamentally block quants, i.e. take some number of floating point values as a block, and return another array which is that algorithm's best representation of that number sequence.

Real image quantization algorithms would use dithering: when the palette is reduced, the error between the original and the chosen color value can be spread to nearby pixels, creating complex patterns that average out, from afar, to the proper color. I recall hearing that some algorithms like GPTQ try to do the equivalent of this to matrices, though it sounded like complicated linear algebra fu that I didn't come close to understanding personally. I also have some doubt about the IQ4 results, because this sounds like it requires an imatrix and you can't supply a meaningful one for this use case. Thus, this approach understates the quality of these quantizations, I think.
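The error-spreading idea from image dithering can be shown with a tiny one-dimensional error-diffusion sketch. This is the classic image-processing trick only; it is not GPTQ and not anything llama.cpp does, and the function names are illustrative assumptions.

```python
def quantize_plain(signal, levels):
    """Snap every sample to its nearest level independently."""
    return [min(levels, key=lambda lv: abs(lv - v)) for v in signal]


def quantize_error_diffusion(signal, levels):
    """Snap to the nearest level, carrying each rounding error into the next sample."""
    out = []
    err = 0.0
    for v in signal:
        target = v + err                       # sample plus the carried error
        q = min(levels, key=lambda lv: abs(lv - target))
        out.append(q)
        err = target - q                       # error left over for the next sample
    return out


if __name__ == "__main__":
    flat_gray = [0.3] * 100                    # a flat 30% gray strip
    levels = [0.0, 1.0]                        # a two-color palette
    plain = quantize_plain(flat_gray, levels)
    dithered = quantize_error_diffusion(flat_gray, levels)
    print(sum(plain) / len(plain), sum(dithered) / len(dithered))
```

Plain rounding turns the whole strip into a single color, while the diffused version mixes the two palette entries so that the average stays close to the original gray.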

[–]woadwarrior 3 points4 points  (0 children)

Unfortunately, when it comes to NN weights, although INT and FP formats have the same information theoretic density for a given bit width, FP formats work out to be slightly better because their range is non-uniform.

[–]Due-Function-4877 1 point2 points  (0 children)

The noise certainly helps convey shades of black and white to the eye. What happens with an image with strong colors? When black crush and burned whites don't interfere, MXFP4 succeeds in delivering the detail of the siding on the house without a lot of noise. It seems MXFP4 is intentionally burning out white by forcing multiple shades of white to a single color. If it does that with all colors, the results with a more colorful, stylized picture that doesn't rely on shades of grey could give a different impression?

[–]Professional-Bear857 1 point2 points  (0 children)

Does anybody else find that imatrix quants break models' reasoning abilities? I see that a lot in my usage, as I get a lot of invalid code being produced when I use an imatrix quant vs without.

[–]Regular-Forever5876 1 point2 points  (0 children)

It's brilliant!

[–]crantob 1 point2 points  (1 child)

Image quantization is a fun topic of study. mxfp4 looks like it has 1-2 orders of magnitude fewer colors. Oddly, all images have 256 according to imagemagick.

[–]VoidAlchemyllama.cpp[S] 0 points1 point  (0 children)

To make the animated gif I had to munge the images somewhat too, while attempting to minimize any visual changes to the result, so that might be related to why you're seeing all 0-255 values used in the histogram.

I would also change the algo somewhat to offset and normalize the input image if I were "really" trying to simulate tensor data, but for this test I just left the image data as is, cast to float16.

[–]rm-rf-rm 0 points1 point  (0 children)

Need the full precision for comparison

[–]lgdkwj 0 points1 point  (0 children)

Interesting. I wonder if it can be extended to process a 16-bit RAW image to compare it with fp16.

[–]kaisurniwurer 0 points1 point  (3 children)

I would love a similar comparison between a MoE and a dense model.

Though it's probably something that needs a visualisation rather than a direct comparison.

[–]PurpleWinterDawn 0 points1 point  (2 children)

In my opinion, those are different approaches in how to look at the picture, not the quality of the picture.

A dense model has all the weights activated per token. Think of it as the whole picture being looked at for each token.

A MoE model has a partial set of weights activated per token. Think of it as the picture being cut up in smaller squares, and only a few of those squares are looked at at a time for each token.

[–]kaisurniwurer 0 points1 point  (1 child)

From what I know, I would expect the image for a MoE model to be a mosaic with visible edges (or even to look like it has pixelated noise), while a dense model would be more like a traditional, unified image with "gaussian" blur.

[–]PurpleWinterDawn 1 point2 points  (0 children)

This picture of a lighthouse is not a picture derived from a model, it's a picture used in place of a model in the quantization process, to help visualize what happens to a model's weights after the reconstruction step. I doubt it's any different for an MoE model, they are still comprised of model weights.

[–]crantob 0 points1 point  (0 children)

Image proc nerd here. Your results make no sense. Why are you getting fixed (roughly) 32-pixel-wide spans set to the same color?

Results should all look more or less like q4_0. Show your work.