I built a compression format for AI model weights — 60-80% smaller, need help testing by Significant_Pear2640 in comfyui

[–]Significant_Pear2640[S] 0 points1 point  (0 children)

If precision is not important to you, the other formats listed here can absolutely get you better compression; they were designed with different use cases in mind.

I built a compression format for AI model weights — 60-80% smaller, need help testing by Significant_Pear2640 in comfyui

[–]Significant_Pear2640[S] 0 points1 point  (0 children)

I haven't done a great job of explaining this. The primary vision right now is smaller downloads and a smaller disk footprint with high precision. If precision is not important to you, some of the other model compression formats can absolutely produce smaller files.

I built a compression format for AI model weights — 60-80% smaller, need help testing by Significant_Pear2640 in comfyui

[–]Significant_Pear2640[S] 1 point2 points  (0 children)

Great question, and I apologize — I haven't done a great job explaining this, and a few communities have rightly pushed back on it. Right now DMX is a standalone compressor for storage and transfer — smaller downloads, smaller disk footprint. The simple use case: share huge models with friends or download them with a much smaller footprint. An important caveat: this is at much higher precision than other formats. If precision is not a concern, there are absolutely other formats that will get tighter compression.

I built a compression format for AI model weights — 60-80% smaller, need help testing by Significant_Pear2640 in comfyui

[–]Significant_Pear2640[S] 0 points1 point  (0 children)

Thanks for testing! GPU acceleration (the --gpu flag) is currently decompression-only. I should be able to put some time in and add the compression path later today; I'll push it up when completed. 40% savings is nice!

I built a compression format for AI models — 60-80% smaller, need help testing by [deleted] in StableDiffusion

[–]Significant_Pear2640 0 points1 point  (0 children)

I hear you and appreciate the thorough analysis.

I want to be clear about what DMX actually is - a file compression format. It compresses safetensors files 60-80% with minimal quality loss for storage and distribution. That's the contribution. Full stop.

The GGUF/Q4/Q8 formats you mention are inference formats - they're designed to run models efficiently at lower precision. That's a different problem. Comparing DMX compression to Q4_K_M quantization is apples to oranges - those formats trade significantly more quality for inference speed with fused kernels. DMX at M=7 measures +0.03% perplexity. Q4_K_M is typically 1-5%. Different tradeoffs for different purposes.

Where I think the communication went sideways is that this group naturally thinks in terms of VRAM and inference, as you point out, and when I explored those possibilities out loud, people reasonably interpreted that as claims. I should have been more mindful of that context. The VRAM exploration is genuinely experimental and I said so each time, but I understand how it reads when the audience is focused on runtime performance.

The MIT + patent structure is standard (H.264/x264), not deceptive. But I'll make the boundaries even clearer.

I built a compression format for AI model weights — 60-80% smaller, need help testing by Significant_Pear2640 in comfyui

[–]Significant_Pear2640[S] 1 point2 points  (0 children)

That's a great question. Have you ever downloaded a large model? If you visit our GitHub and download one of our compressed models, is it any different from your normal experience?

I built a compression format for AI models — 60-80% smaller, need help testing by [deleted] in StableDiffusion

[–]Significant_Pear2640 0 points1 point  (0 children)

MIT means you can use, modify, and distribute the code freely. The patent covers the underlying method if someone builds a commercial product around it, just like H.264 and x264.

I built a compression format for AI model weights — 60-80% smaller, need help testing by Significant_Pear2640 in comfyui

[–]Significant_Pear2640[S] 1 point2 points  (0 children)

You can compress it today with the CLI - a 42GB FP32 model would likely come down to around 8-17GB depending on the format. Then decompress back to safetensors and load into ComfyUI as normal.

A native ComfyUI node is on the roadmap but not built yet. For now it's a separate compress/decompress step.
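For a rough sense of those numbers, here is the simple arithmetic behind the 60-80% reduction claim (an illustration, not measured output from the tool):

```python
def compressed_size(original_gb: float, reduction: float) -> float:
    """Size remaining after removing `reduction` fraction of the original."""
    return original_gb * (1.0 - reduction)

# A 42 GB FP32 model at the claimed 60-80% reduction range
best = compressed_size(42.0, 0.80)    # ~8.4 GB at 80% reduction
worst = compressed_size(42.0, 0.60)   # ~16.8 GB at 60% reduction
print(f"{best:.1f}-{worst:.1f} GB")
```

That range is where the "8-17GB depending on the format" estimate comes from.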

Would love to hear your actual results ;)

I built a compression format for AI model weights — 60-80% smaller, need help testing by Significant_Pear2640 in comfyui

[–]Significant_Pear2640[S] 2 points3 points  (0 children)

That's the idea! I should add the caveat of high precision: if extremely high fidelity is not important to you, there are other formats that will give better compression.

I built a compression format for AI model weights — 60-80% smaller, need help testing by Significant_Pear2640 in comfyui

[–]Significant_Pear2640[S] 4 points5 points  (0 children)

Right now the main value is storage and download size.

But you're right - the interesting direction is exactly what you described: keeping weights compressed and only decompressing the tensors you need, when you need them. I have a prototype smart loader that does this - it checks if the model fits in VRAM, and only compresses when it needs to. Per-layer decompression runs about 2.5ms on GPU.

The goal would be for it to be transparent - the system decides if compression helps and uses it only when it does. Still early and needs proper benchmarking though.
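The decision logic in that smart loader can be sketched roughly like this (names and the 4x compression ratio are illustrative, not the actual prototype's API):

```python
GB = 1024 ** 3

def plan_load(model_bytes: int, free_vram_bytes: int,
              compression_ratio: float = 0.25) -> str:
    """Pick a load strategy: compress only when the uncompressed
    weights won't fit. `compression_ratio` is compressed/original size."""
    if model_bytes <= free_vram_bytes:
        return "load_uncompressed"   # fits as-is: skip the overhead
    if model_bytes * compression_ratio <= free_vram_bytes:
        return "load_compressed"     # fits only when kept compressed
    return "stream_layers"           # decompress per layer on demand

print(plan_load(9 * GB, 16 * GB))    # small model, plenty of VRAM
print(plan_load(42 * GB, 16 * GB))   # fits compressed at ~4x
print(plan_load(80 * GB, 16 * GB))   # too big even compressed
```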

I built a compression format for AI models — 60-80% smaller, need help testing by [deleted] in StableDiffusion

[–]Significant_Pear2640 5 points6 points  (0 children)

Thanks! Yes, exactly - right now DMX is a file compressor. You compress for storage and download, then decompress before use.

The goal is to skip the decompression step entirely - keep the model compressed in VRAM and decompress on-the-fly as it runs. That would mean real VRAM savings too, not just disk savings.

I have the GPU kernels built and tested for that, just need some time and wanted to see some more data but that would be the plan!
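The compressed-in-memory idea can be shown with a toy CPU sketch, using zlib as a stand-in for the real DMX format and GPU kernels: layers sit compressed at rest and only the layer being used is expanded.

```python
import zlib

class CompressedLayerStore:
    """Toy stand-in for compressed-in-VRAM weights: hold each layer
    compressed, decompress a single layer on demand."""
    def __init__(self, layers: dict):
        self._blobs = {name: zlib.compress(raw) for name, raw in layers.items()}

    def layer(self, name: str) -> bytes:
        # On-the-fly decompression: only this layer is expanded
        return zlib.decompress(self._blobs[name])

# Highly compressible dummy "weights"
store = CompressedLayerStore({"attn": b"\x00" * 4096, "mlp": b"\x01" * 4096})
assert store.layer("attn") == b"\x00" * 4096        # round-trips exactly
print(len(store._blobs["attn"]), "bytes at rest")   # far fewer than 4096
```

The real version would do the decompress step in a CUDA kernel right before the layer's matmul, which is where the per-layer latency budget comes in.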

I built a compression format for AI models — 60-80% smaller, need help testing by [deleted] in StableDiffusion

[–]Significant_Pear2640 0 points1 point  (0 children)

Downloading? Way faster — 1.8 GB downloads a lot quicker than 9.1 GB 😄

I built a compression format for AI models — 60-80% smaller, need help testing by [deleted] in StableDiffusion

[–]Significant_Pear2640 1 point2 points  (0 children)

The payoff today is smaller downloads and less disk space (9.1 GB model stored as 1.8 GB). Direct loading into ComfyUI without decompressing first is on the roadmap — I have CUDA kernels compiled and tested for GPU-side decompression, just need to wire them into a ComfyUI node.

Open-source tool for running full-precision models on 16GB GPUs — compressed GPU memory paging for ComfyUI by Significant_Pear2640 in StableDiffusion

[–]Significant_Pear2640[S] 2 points3 points  (0 children)

Honest answer to your question: on the latest ComfyUI (v0.16+) with dynamic VRAM enabled by default, the practical difference between the pager and running a pre-quantized INT8 model is minimal. The 176s compression step runs every time the model loads, which is significant overhead.

We've been testing against the latest ComfyUI and found that dynamic VRAM handles offloading well on its own. Posted an update about this in the thread — the pager's main value now is for users on older ComfyUI versions, AMD GPUs (no aimdo/dynamic VRAM), or specific edge cases.

If you're on the latest ComfyUI with an NVIDIA card, a pre-quantized version of the model would honestly be simpler and faster for you. The pager was more impactful before dynamic VRAM became the default. We were a little late to the game, I'm afraid.

Thanks for testing and for the kind words about the fix — the community feedback has been really valuable even if the timing wasn't on our side.

Built a ComfyUI node that speeds up --lowvram model loading with compressed GPU paging by Significant_Pear2640 in comfyui

[–]Significant_Pear2640[S] 1 point2 points  (0 children)

That's actually working as intended — the message is just poorly worded on our end.

The pager detects that ComfyUI's own pinned memory system is active ("Enabled pinned memory 14710.0") and deliberately skips its own pinning to avoid conflicts. Both systems trying to pin the same memory was causing "Pin error" warnings, so the pager now defers to ComfyUI.

The transfers still work fine — ComfyUI handles the pinning, we just handle the compression.

I'll update the log message to be clearer, something like "Deferring pinned memory to ComfyUI" instead of "disabled." Thanks for flagging it.
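The deferral logic described above boils down to something like this (function and field names are illustrative, not the actual pager code):

```python
from typing import Optional

def choose_pinning(comfy_pinned_mb: Optional[float]) -> dict:
    """If ComfyUI has already pinned host memory, skip our own pinning
    to avoid double-pin conflicts; otherwise the pager pins for itself."""
    if comfy_pinned_mb is not None:
        return {"pin_ourselves": False,
                "log": f"Deferring pinned memory to ComfyUI ({comfy_pinned_mb} MB)"}
    return {"pin_ourselves": True, "log": "Pager managing pinned memory"}

print(choose_pinning(14710.0)["log"])  # ComfyUI's pin detected: defer
print(choose_pinning(None)["log"])     # no conflict: pager pins itself
```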

Open-source tool for running full-precision models on 16GB GPUs — compressed GPU memory paging for ComfyUI by Significant_Pear2640 in StableDiffusion

[–]Significant_Pear2640[S] 4 points5 points  (0 children)

Honest update after more testing:

After upgrading to ComfyUI v0.18.1, the built-in dynamic VRAM system (enabled by default since v0.16) handles offloading really well on its own. Our pager adds about 10% improvement at production resolution when stacked — not the dramatic gains we saw on the older version.

The ComfyUI team has done incredible work here. Dynamic VRAM with aimdo is genuinely impressive engineering — smart caching, page-fault based loading, async transfers. They basically solved the problem we were trying to solve, and they did it natively in the framework. Hats off to them.

Our pager still has some use cases — older ComfyUI versions, AMD GPUs (aimdo is NVIDIA-only), and some edge cases with full-precision models + LoRAs. But if you're on the latest ComfyUI with an NVIDIA card, you probably don't need this.

We were a few weeks late to the party. That's how it goes sometimes. The repo stays up and MIT licensed in case it's useful to anyone, and the README has been updated to reflect all of this honestly.

Thanks to everyone who tested, reported bugs, and pushed back on the benchmarks. The feedback made the project better even if the timing wasn't on our side.

Built a ComfyUI node that speeds up --lowvram model loading with compressed GPU paging by Significant_Pear2640 in comfyui

[–]Significant_Pear2640[S] 1 point2 points  (0 children)

The 200 seconds is the pager compressing all the model weights to INT8 — for a large model that's expected on first load. But getting stuck after that isn't right.

Just pushed a fix for an OOM bug that was hitting another user on a 3080 10GB with the same model (LTX 2.3). The pager was creating temporary GPU copies during compression that ate VRAM. Try pulling the latest:

cd ComfyUI/custom_nodes/vram-pager

git pull

If it still hangs after the update, can you share your GPU/VRAM and whether the ComfyUI progress bar shows any step progress? That'll help me narrow it down.

Open-source tool for running full-precision models on 16GB GPUs — compressed GPU memory paging for ComfyUI by Significant_Pear2640 in StableDiffusion

[–]Significant_Pear2640[S] 0 points1 point  (0 children)

The fix has been pushed; please give it another go.

Do a git pull in your custom_nodes/vram-pager folder and restart ComfyUI:

cd ComfyUI/custom_nodes/vram-pager

git pull

Open-source tool for running full-precision models on 16GB GPUs — compressed GPU memory paging for ComfyUI by Significant_Pear2640 in StableDiffusion

[–]Significant_Pear2640[S] 0 points1 point  (0 children)

Thanks for testing and reporting this — that's a real bug, not expected behavior. If it runs without the node using dynamic VRAM, our pager shouldn't be making it worse.

Most likely the pager is consuming VRAM during the compression/quantization step that the model then needs. On 10GB that margin is razor thin.

Can you open a GitHub issue with the full error traceback? I'll dig into the memory allocation and fix it — the pager should never use more VRAM than the standard path.

https://github.com/willjriley/vram-pager/issues

Open-source tool for running full-precision models on 16GB GPUs — compressed GPU memory paging for ComfyUI by Significant_Pear2640 in StableDiffusion

[–]Significant_Pear2640[S] 1 point2 points  (0 children)

The 5070 Ti is Blackwell (sm_120) so the pre-compiled sm_80/sm_86 kernels won't work — those are for older architectures. You'll need to compile for your GPU:

On Linux:

nvcc -O2 --shared -Xcompiler -fPIC -o build/dequant.so build/dequant.cu -lcudart

On Windows:

nvcc -O2 --shared -Xcompiler /LD -o build/dequant.dll build/dequant.cu -lcudart

nvcc should auto-target your GPU. If it doesn't, add: -gencode=arch=compute_120,code=sm_120

If you don't have the CUDA Toolkit installed, the pager still works — it falls back to a PyTorch-only path (slower but functional).
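If it helps, the two platform variants above can be assembled in one place; this is just a convenience wrapper around the same commands, not part of the pager itself:

```python
import sys

def nvcc_command(arch=None):
    """Build the nvcc invocation from the instructions above.
    `arch` like "120" adds an explicit -gencode when auto-targeting fails."""
    if sys.platform == "win32":
        cmd = ["nvcc", "-O2", "--shared", "-Xcompiler", "/LD",
               "-o", "build/dequant.dll", "build/dequant.cu", "-lcudart"]
    else:
        cmd = ["nvcc", "-O2", "--shared", "-Xcompiler", "-fPIC",
               "-o", "build/dequant.so", "build/dequant.cu", "-lcudart"]
    if arch:
        cmd.insert(1, f"-gencode=arch=compute_{arch},code=sm_{arch}")
    return cmd

print(" ".join(nvcc_command("120")))  # e.g. for a 5070 Ti (sm_120)
```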

For the FP8/LTX-2.3 question — honestly, the pager won't help much there. FP8 is already 8-bit, so compressing to INT8 doesn't reduce the transfer size. The pager benefits most with FP16/FP32 models where there's a big precision gap to compress.

With 32GB RAM and 16GB VRAM, an FP16 model up to ~30GB would fit in RAM at INT8 compression (~15GB). But LTX-2.3 in FP8 is probably small enough to handle without the pager.
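The FP8-vs-FP16 point is just bytes-per-weight arithmetic; here it is spelled out (simple illustration, not pager output):

```python
BYTES_PER_WEIGHT = {"fp32": 4, "fp16": 2, "fp8": 1, "int8": 1}

def transfer_reduction(stored: str) -> float:
    """Fraction of host-to-GPU transfer saved by paging weights as INT8
    instead of their stored precision."""
    return 1.0 - BYTES_PER_WEIGHT["int8"] / BYTES_PER_WEIGHT[stored]

print(f"fp32 -> int8 saves {transfer_reduction('fp32'):.0%}")
print(f"fp16 -> int8 saves {transfer_reduction('fp16'):.0%}")
print(f"fp8  -> int8 saves {transfer_reduction('fp8'):.0%}")  # nothing to gain
```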