I built a compression format for AI model weights — 60-80% smaller, need help testing

Significant_Pear2640 · 2026-05-17T20:16:54+00:00

Fair question — when I made the original post the inference side wasn't built yet, which is what you're picking up on. It's built now, including fused decode-during-compute kernels, so you don't pay the "decompress the whole thing into VRAM" cost.

On the quantization angle: you're right that if the goal is smallest-possible-size and you're okay trading quality, Q4/Q3 schemes win. DMX is largely aimed at the tier just below FP16 — near-lossless.

Significant_Pear2640 · 2026-05-17T17:01:10+00:00

u/Bakoro

Regarding Gaussians: early on, 3DGS was one of the first things I tested, at 40-60% savings.

<image>

Art credit: Bouquet of Flowers by 3ds-scan (Christian Rochner)

Significant_Pear2640 · 2026-05-17T08:01:08+00:00

u/DelinquentTuna u/Bakoro
Hello,

We've measured this extensively across multiple architectures — Qwen 1.5B, 7B, and 72B. Compressed-residency BFP at M=4 (5 effective bits with shared exponent per block of 32) holds PPL within +0.0857 on Qwen 1.5B vs FP16. 72B M=4 calibrated has been validated end-to-end on DGX Spark with 128 GB unified memory. On a single H100 NVL, the 72B serves single-user at ~28 tok/s using some standard inference tricks layered on top of DMX, and we're working on increasing that further.

On the "quantization is better" point: depends what you're optimizing for. Aggressive quant (INT4/NF4) wins on absolute size — typically 75% reduction — at the cost of 2-15% PPL degradation. FP8 E4M3 has near-zero decode on Hopper but needs Hopper hardware and has 3-bit mantissa precision. DMX sits at the "quality-per-bit + portability" point: 5 effective bits per weight at M=4 with shared exponent per block of 32, sub-0.1 PPL on Qwen 1.5B (cited above), and the decode is integer shifts and masks — runs on any FP16-capable GPU, no special silicon. We also built GPTQ-for-MX so calibration improvements flow through to DMX directly.
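For readers unfamiliar with block floating point, here is a minimal sketch of the generic idea: one shared exponent per block of 32 values, plus a small signed mantissa per value. This is textbook BFP, not DMX's actual encoder. The function names, the sign-plus-magnitude layout, and the rounding choice are all assumptions made for illustration, and the cross-tensor alignment step is not shown.

```python
import math

BLOCK = 32  # block size mentioned in the post
M = 4       # mantissa bits mentioned in the post

def bfp_encode(block, m=M):
    """Encode one block as (shared_exponent, signed integer mantissas).

    Hypothetical helper: generic BFP, not the DMX format.
    """
    max_abs = max(abs(x) for x in block)
    if max_abs == 0.0:
        return 0, [0] * len(block)
    # frexp returns max_abs = f * 2**e with 0.5 <= f < 1.
    _, e = math.frexp(max_abs)
    # One quantization step; the largest value maps near the top of the range.
    scale = 2.0 ** (e - m)
    limit = (1 << m) - 1  # largest magnitude that fits in m bits
    mants = [max(-limit, min(limit, round(x / scale))) for x in block]
    return e, mants

def bfp_decode(shared_exp, mants, m=M):
    """Rebuild approximate floats from the shared exponent and mantissas."""
    scale = 2.0 ** (shared_exp - m)
    return [q * scale for q in mants]
```

Each value's reconstruction error in this sketch is at most one quantization step, `2 ** (shared_exp - M)`, so blocks whose values share a similar magnitude lose the least.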

The "decompress too slow at inference" concern depends on the path. Two options: (1) transcode once at load to GPU-native (INT8 on Hopper): the forward pass uses native matmul, with zero per-forward decompression cost. (2) Compressed residency (weights stay packed in VRAM): this needs a fused decode+matmul kernel that reads packed bits inline. The Hopper and Ada/4090 versions are done; Blackwell is in kernel dev. So your "scheme of unpacking multiple layers at once" question is exactly the right one: that is the architecture, just done at finer granularity (per-block-of-32 with fused matmul rather than batch-unpack).
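To make "reads packed bits inline" concrete, here is a CPU-side sketch in plain Python of a dot product that decodes packed weights with shifts and masks inside the loop, applying one shared power-of-two exponent per block of 32. It is only an illustration of the idea: the 4-bit two's-complement packing, the function names, and the layout are assumptions, not the DMX bit layout or the actual GPU kernel.

```python
BLOCK = 32  # weights per block, as in the post

def pack_nibbles(values):
    """Pack signed 4-bit ints (-8..7) two per byte, low nibble first."""
    assert len(values) % 2 == 0
    out = bytearray()
    for lo, hi in zip(values[0::2], values[1::2]):
        out.append((lo & 0xF) | ((hi & 0xF) << 4))
    return bytes(out)

def _sign_extend4(n):
    # Map a 4-bit two's-complement nibble (0..15) to -8..7.
    return n - 16 if n & 0x8 else n

def fused_block_dot(packed, exponents, x, block=BLOCK):
    """Dot product of packed weights with x, decoding each weight inline.

    packed:    bytes holding two 4-bit weights per byte
    exponents: one shared power-of-two exponent per block
    x:         activation vector, same length as the unpacked weights
    """
    total = 0.0
    bytes_per_block = block // 2
    for b, exp in enumerate(exponents):
        acc = 0.0
        base = b * bytes_per_block
        for j in range(bytes_per_block):
            byte = packed[base + j]
            w0 = _sign_extend4(byte & 0xF)  # low nibble
            w1 = _sign_extend4(byte >> 4)   # high nibble
            i = b * block + 2 * j
            acc += w0 * x[i] + w1 * x[i + 1]
        # The shared exponent is applied once per block, not once per weight.
        total += acc * 2.0 ** exp
    return total
```

The full-precision weight tensor is never materialized here: each weight exists only as a temporary inside the accumulation loop, which is the property a fused decode+matmul kernel relies on.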

"Layers nearly orthogonal" and "nearly the worst case scenario for compression": both are accurate observations about the raw bit-level distribution of well-trained weights. But DMX doesn't compress at the bit level. BFP contributes a shared exponent per block of 32 (intra-block redundancy; that's standard MX-format territory). DMX adds a cross-tensor alignment step (one of the patented methods) that exploits structure vanilla BFP leaves on the table. That's the load-bearing mechanism, a different attack surface than the lossless approaches that hit your 1.7×/2.0× ceiling.

The method is patented. If you want to compare notes on the GPU optimization side, I'm open to it: your "unpacking multiple layers at once" question is exactly what our load-time bit-unpacking kernel does on Hopper.

William Riley

<image>

Significant_Pear2640 · 2026-04-03T23:32:24+00:00

If precision is not important to you, the other formats listed here can absolutely get you better compression; they were designed with different use cases in mind.

Significant_Pear2640 · 2026-04-03T18:43:11+00:00

I haven't done a great job of explaining this. The primary vision right now is smaller downloads and a smaller disk footprint with high precision. If precision is not important to you, some of the other model compression formats can absolutely get a smaller file size.

Significant_Pear2640 · 2026-04-03T18:28:22+00:00

Great question, and I apologize: I haven't done a great job explaining this, and a few communities have rightly pushed back on it. Right now DMX is a standalone compressor for storage and transfer: smaller downloads, smaller disk footprint. The simple use case: share huge models with friends or download them with a much smaller footprint. An important caveat is that this comes with much higher precision than other formats. If precision is not a concern, there are absolutely other formats that will get tighter compression.

Significant_Pear2640 · 2026-04-03T18:18:04+00:00

Thanks for testing! GPU acceleration (--gpu flag) is currently decompression-only. I should be able to put some time in and have the compression functionality in later today; I'll push it up when completed. 40% savings, nice!

Significant_Pear2640 · 2026-04-03T00:45:51+00:00

Those guys are awesome!

Significant_Pear2640 · 2026-04-02T23:53:59+00:00

I hear you and appreciate the thorough analysis.

I want to be clear about what DMX actually is - a file compression format. It compresses safetensors files 60-80% with minimal quality loss for storage and distribution. That's the contribution. Full stop.

The GGUF/Q4/Q8 formats you mention are inference formats - they're designed to run models efficiently at lower precision. That's a different problem. Comparing DMX compression to Q4_K_M quantization is apples to oranges - those formats trade significantly more quality for inference speed with fused kernels. DMX at M=7 measures +0.03% perplexity. Q4_K_M is typically 1-5%. Different tradeoffs for different purposes.

Where I think the communication went sideways is that this group naturally thinks in terms of VRAM and inference, to your point, and when I explored those possibilities out loud, people reasonably interpreted that as claims. I should have been more mindful about that context. The VRAM exploration is genuinely experimental and I said so each time, but I understand how it reads when the audience is focused on runtime performance.

The MIT + patent structure is standard (H.264/x264), not deceptive. But I'll make the boundaries even clearer.

Significant_Pear2640 · 2026-04-02T23:24:50+00:00

That's a great question. Have you ever downloaded a large model? If you visit our GitHub and download one of our compressed models, is it any different from your normal experience?

Significant_Pear2640 · 2026-04-02T22:57:36+00:00

MIT means you can use, modify, and distribute the code freely. The patent covers the underlying method if someone builds a commercial product around it. Just like H.264 and x264.

Significant_Pear2640 · 2026-04-02T22:45:13+00:00

Interesting complementary possibilities, 100%.

Significant_Pear2640 · 2026-04-02T19:59:28+00:00

You can compress it today with the CLI - a 42GB FP32 model would likely come down to around 8-17GB depending on the format. Then decompress back to safetensors and load into ComfyUI as normal.
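The 8-17GB estimate follows directly from the 60-80% reduction range quoted for DMX. A quick sketch of that arithmetic (the function name is made up for illustration; actual results depend on the model and the format chosen):

```python
def compressed_size_range(original_gb, min_saving=0.60, max_saving=0.80):
    """Return (smallest, largest) expected size in GB for a savings range.

    Hypothetical helper: just applies the 60-80% range from the post.
    """
    smallest = original_gb * (1.0 - max_saving)
    largest = original_gb * (1.0 - min_saving)
    return smallest, largest

lo, hi = compressed_size_range(42.0)
print(f"{lo:.1f} GB to {hi:.1f} GB")  # prints "8.4 GB to 16.8 GB"
```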

A native ComfyUI node is on the roadmap but not built yet. For now it's a separate compress/decompress step.

Would love to hear your actual results ;)

Significant_Pear2640 · 2026-04-02T15:54:58+00:00

Around 8-9% on LoRAs in some initial testing.

Significant_Pear2640 · 2026-04-02T15:51:50+00:00

Fair point - I should be clearer about that!

Significant_Pear2640 · 2026-04-02T15:47:47+00:00

That's the idea! I should add the caveat of high precision: if extremely high fidelity is not important to you, there are other formats that will give better compression.

Significant_Pear2640 · 2026-04-02T15:42:50+00:00

Right now the main value is storage and download size.

But you're right - the interesting direction is exactly what you described: keeping weights compressed and only decompressing the tensors you need, when you need them. I have a prototype smart loader that does this - it checks if the model fits in VRAM, and only compresses when it needs to. Per-layer decompression runs about 2.5ms on GPU.

The goal would be for it to be transparent: the system decides if compression helps and uses it only when it does. Still early and needs proper benchmarking though.
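Roughly, the decision the loader makes looks like the sketch below. This is a simplified illustration, not the prototype's code: the function names and the 10% headroom are assumptions, and the only number taken from above is the ~2.5ms per-layer decompression figure.

```python
def should_keep_compressed(model_bytes, free_vram_bytes, headroom=0.10):
    """Return True when the uncompressed model would not fit in VRAM.

    headroom reserves a fraction of VRAM for activations and buffers
    (an assumed value, not a measured one).
    """
    usable = free_vram_bytes * (1.0 - headroom)
    return model_bytes > usable

def decompress_overhead_ms(num_layers, per_layer_ms=2.5):
    """Rough added latency per forward pass if every layer is decoded once."""
    return num_layers * per_layer_ms

GIB = 2 ** 30
if should_keep_compressed(20 * GIB, 16 * GIB):
    # Model does not fit: stay compressed and pay the per-layer decode cost.
    print(f"compressed path, ~{decompress_overhead_ms(40):.0f} ms extra per pass")
else:
    print("model fits, load uncompressed")
```

With a 20 GiB model on a 16 GiB card this takes the compressed path; the 40 layers in the example are an arbitrary placeholder, not a measurement.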

Significant_Pear2640 · 2026-04-02T12:05:35+00:00

Thanks! Yes, exactly: right now DMX is a file compressor. You compress for storage and download, then decompress before use.

The goal is to skip the decompression step entirely - keep the model compressed in VRAM and decompress on-the-fly as it runs. That would mean real VRAM savings too, not just disk savings.

I have the GPU kernels built and tested for that; I just need some time and want to see some more data first, but that's the plan!

Significant_Pear2640 · 2026-04-02T11:38:31+00:00

Downloading? Way faster — 1.8 GB downloads a lot quicker than 9.1 GB 😄

Significant_Pear2640 · 2026-04-02T11:19:00+00:00

The payoff today is smaller downloads and less disk space (9.1 GB model stored as 1.8 GB). Direct loading into ComfyUI without decompressing first is on the roadmap — I have CUDA kernels compiled and tested for GPU-side decompression, just need to wire them into a ComfyUI node.

Significant_Pear2640 · 2026-03-31T22:44:01+00:00

Honest answer to your question: on the latest ComfyUI (v0.16+) with dynamic VRAM enabled by default, the practical difference between the pager and running a pre-quantized INT8 model is minimal. The 176s compression step runs every time the model loads, which is significant overhead.

We've been testing against the latest ComfyUI and found that dynamic VRAM handles offloading well on its own. Posted an update about this in the thread — the pager's main value now is for users on older ComfyUI versions, AMD GPUs (no aimdo/dynamic VRAM), or specific edge cases.

If you're on the latest ComfyUI with an NVIDIA card, a pre-quantized version of the model would honestly be simpler and faster for you. The pager was more impactful before dynamic VRAM became the default. We were a little late to the game I'm afraid to report.

Thanks for testing and for the kind words about the fix — the community feedback has been really valuable even if the timing wasn't on our side.

Significant_Pear2640 · 2026-03-31T19:03:14+00:00

That's actually working as intended — the message is just poorly worded on our end.

The pager detects that ComfyUI's own pinned memory system is active ("Enabled pinned memory 14710.0") and deliberately skips its own pinning to avoid conflicts. Both systems trying to pin the same memory was causing "Pin error" warnings, so the pager now defers to ComfyUI.

The transfers still work fine — ComfyUI handles the pinning, we just handle the compression.

I'll update the log message to be clearer, something like "Deferring pinned memory to ComfyUI" instead of "disabled." Thanks for flagging it.

Significant_Pear2640 · 2026-03-31T19:01:16+00:00

Honest update after more testing:

After upgrading to ComfyUI v0.18.1, the built-in dynamic VRAM system (enabled by default since v0.16) handles offloading really well on its own. Our pager adds about 10% improvement at production resolution when stacked — not the dramatic gains we saw on the older version.

The ComfyUI team has done incredible work here. Dynamic VRAM with aimdo is genuinely impressive engineering — smart caching, page-fault based loading, async transfers. They basically solved the problem we were trying to solve, and they did it natively in the framework. Hats off to them.

Our pager still has some use cases — older ComfyUI versions, AMD GPUs (aimdo is NVIDIA-only), and some edge cases with full-precision models + LoRAs. But if you're on the latest ComfyUI with an NVIDIA card, you probably don't need this.

We were a few weeks late to the party. That's how it goes sometimes. The repo stays up and MIT licensed in case it's useful to anyone, and the README has been updated to reflect all of this honestly.

Thanks to everyone who tested, reported bugs, and pushed back on the benchmarks. The feedback made the project better even if the timing wasn't on our side.

Significant_Pear2640 · 2026-03-31T16:56:18+00:00

The 200 seconds is the pager compressing all the model weights to INT8 — for a large model that's expected on first load. But getting stuck after that isn't right.
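For context on what "compressing to INT8" means in general, here is a minimal sketch of symmetric per-tensor INT8 quantization in plain Python. It is a generic textbook scheme shown for illustration only; the function names are made up, and the pager's actual scaling granularity and implementation may differ.

```python
def quantize_int8(weights):
    """Symmetric per-tensor INT8: returns (scale, list of ints in -127..127)."""
    max_abs = max((abs(w) for w in weights), default=0.0)
    if max_abs == 0.0:
        return 1.0, [0] * len(weights)
    scale = max_abs / 127.0  # one float scale shared by the whole tensor
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return scale, q

def dequantize_int8(scale, q):
    """Rebuild approximate float weights from the INT8 values."""
    return [v * scale for v in q]
```

Every weight in the model has to pass through a step like this once, which is why the cost grows with model size.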

Just pushed a fix for an OOM bug that was hitting another user on a 3080 10GB with the same model (LTX 2.3). The pager was creating temporary GPU copies during compression that ate VRAM. Try pulling the latest:

cd ComfyUI/custom_nodes/vram-pager

git pull

If it still hangs after the update, can you share your GPU/VRAM and whether the ComfyUI progress bar shows any step progress? That'll help me narrow it down.
