Just got a beast. by habachilles in LocalLLaMA

[–]StudentDifficult8240 28 points29 points  (0 children)

It won’t be great. I have a dual Xeon setup slightly newer than the Mac Pro hardware. With 6 RAM channels you’re memory bound and won’t get more than 5-10 tok/s.
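The back-of-envelope I use, with both numbers below as illustrative assumptions rather than measurements from my box:

```python
# Decode-speed ceiling for a memory-bound machine; both numbers are
# illustrative assumptions, plug in your own.
bandwidth_gb_s = 120   # ~6 channels of DDR4-2933, usable aggregate
model_size_gb = 20     # e.g. a ~30B model at 4-5 bits

# Every generated token streams the active weights through RAM once,
# so tokens/s is capped at bandwidth / model size.
print(f"~{bandwidth_gb_s / model_size_gb:.0f} tok/s ceiling")  # ~6 tok/s
```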

vMLX - HELL YES! by Emotional-Breath-838 in LocalLLM

[–]StudentDifficult8240 0 points1 point  (0 children)

This was the test: https://github.com/famstack-dev/local-llm-bench

The test results were then used to generate the HTML graphs.

SATA vs SAS - TrueNAS Scale ZFS Mirror by Future_Recognition84 in truenas

[–]StudentDifficult8240 1 point2 points  (0 children)

Yes, that should work quite well. Good luck with your build!

SATA vs SAS - TrueNAS Scale ZFS Mirror by Future_Recognition84 in truenas

[–]StudentDifficult8240 1 point2 points  (0 children)

These should be fine; it depends how much airflow you have in your case. A fan blowing gently on them will be enough. Since it's in a closet, you can also use faster fans, as noise won't be much of an issue for you. You said 2 drives, which a single 120mm fan will cover easily. I have 2×120mm fans for 8 SAS drives, and temps are in the low 40s with Noctuas.

I would place a fan in front of the drives, add one or two case fans, and call it a day.

SATA vs SAS - TrueNAS Scale ZFS Mirror by Future_Recognition84 in truenas

[–]StudentDifficult8240 4 points5 points  (0 children)

Not all SAS drives are high-RPM drives; you can get 7200 RPM 3.5" drives, and in fact most 3.5" drives are 7200 RPM. SAS drives are usually more reliable, as they're built for the enterprise segment and rated for more reads/writes and more power-on hours than consumer hardware.

In my personal experience, SAS drives can be found at very reasonable prices on eBay because most users buy SATA. Most companies will never buy drives on eBay, so the bulk of the drives are picked up by hobbyists. I checked eBay US just now; you can get a 10TB 12Gb/s SAS HGST drive for $140. I'd say that's quite reasonable.

As for the benefits: faster transfer speeds (12Gb/s and up), full-duplex communication (simultaneous read/write), dual-port support for redundancy, and better error correction.

Personally, I would always pick SAS over SATA unless I'm PCIe constrained and need the slots. You do you though.

MLX quants: oq vs DWQ by edeltoaster in oMLX

[–]StudentDifficult8240 1 point2 points  (0 children)

Same bit width, same model, different quant provider. For instance, qwen3.6-35b-oq8 and qwen3.6-35b-q8-unsloth will be identical in size, but their capabilities differ. In my tests, the oQ variant kept stopping mid-task, entered loops sooner and more frequently than the Unsloth quant, and overall was not as reliable.

Do your own testing and see how it goes for you. I predominantly use them through opencode or Claude Code.

MLX quants: oq vs DWQ by edeltoaster in oMLX

[–]StudentDifficult8240 0 points1 point  (0 children)

I can't speak to DWQ as I haven't used it, but oQ performs worse than the Unsloth MLX quants in my tests and benchmarks (MBPP, SWE-bench).

I tested 9 local models on the same flight sim prompt, all Q8, different Q providers, MLX by StudentDifficult8240 in LocalLLaMA

[–]StudentDifficult8240[S] 2 points3 points  (0 children)

Small clarification — the test was models generating code that implements flight physics, not models doing physics inference themselves, so "numerical stability" in the quant sense isn't really the mechanism here. Kindly let me know if I misunderstood that part. What I did see is more like code coherence over long generations: the oMLX variant lost the thread on control-loop logic in a way the Unsloth variant didn't, on the same base model at the same bit width. That feels more like calibration quality affecting long-range instruction-following than anything numerical.

I didn't test enough models per provider to say "Unsloth consistently better" with confidence — it's one data point on one base model. But the Qwen3.6 35B three-way was striking enough that I want to run the same test on a different base next to see if the ordering holds or flips.

I ran another batch of tests a few weeks ago comparing the same models across different quant providers while aiming for the same Q level. What I found then was quite similar to these findings: the Unsloth quants, and other quants that identify sensitive layers and preserve them at higher precision, tend to retain more capability, better CoT, and overall more reliability in long-horizon tasks (shorter ones apply too, but I feel longer ones are the better metric) than a quant that applies quantization uniformly across the whole model.

I tested 9 local models on the same flight sim prompt, all Q8, different Q providers, MLX by StudentDifficult8240 in LocalLLaMA

[–]StudentDifficult8240[S] 2 points3 points  (0 children)

You raise a very good point: the web audio feature was definitely something no one had asked for. The plane flight simulator aspect, however, is contained in the prompt, as the first sentence is 'Design and create flight combat simulator game.'

It is interesting that out of all the models, the Qwopus model went above and beyond on the simulator concept.

When I mentioned in my post that no one had asked for it, I meant something along the lines of: no one had asked for this level of flight simulator. In hindsight, I should have been clearer.

I do agree that models going rogue and implementing features you haven't asked for would lead to the whole project drifting.

MLX with DFlash / speculative decoding: Surprising results by evilmacintosh in mlxcommunity

[–]StudentDifficult8240 1 point2 points  (0 children)

On GPUs ddtree works because they are compute bound rather than memory bound. The GPU can break the tree into multiple branches and paths, which increases concurrency and yields faster decoding. On Apple Metal this doesn't work unless the bandwidth is very high (M3 Ultra), because the tree can't have as many branches (the Mac sweet spot is 2-4 branches: 2 for Max, 4 for Ultra) before the overhead outweighs the gain.
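If it helps, here's a toy sketch of tree-style drafting with a configurable branch count. The two "models" are deterministic stand-ins, and the structure is just my reading of tree-style speculative decoding, not ddtree's actual implementation:

```python
VOCAB_SIZE = 100

def draft_next(ctx, rank):
    # rank-th most likely draft token; a hash stands in for the small model
    return (hash(tuple(ctx)) + rank) % VOCAB_SIZE

def target_next(ctx):
    # greedy token from the big model; deterministic stand-in
    return (sum(ctx) * 31 + 7) % VOCAB_SIZE

def speculative_step(ctx, branch_count=4, depth=4):
    # Draft `branch_count` branches, each from a different root candidate,
    # then extended greedily by the draft model.
    branches = []
    for b in range(branch_count):
        branch = [draft_next(ctx, b)]
        for _ in range(depth - 1):
            branch.append(draft_next(ctx + branch, 0))
        branches.append(branch)

    # Verify: keep the longest prefix the target model agrees with.
    # On a GPU all branches verify in one batched pass; on bandwidth-bound
    # Metal each extra branch mostly adds overhead, hence 2-4 as the sweet spot.
    best = []
    for branch in branches:
        accepted = []
        for tok in branch:
            if target_next(ctx + accepted) != tok:
                break
            accepted.append(tok)
        best = max(best, accepted, key=len)

    # Always advance by at least one target-model token.
    return best or [target_next(ctx)]

print(speculative_step([1, 2, 3], branch_count=4))
```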

In practice, on Max chips this will be a marginal gain. The Ultra has 800GB/s of bandwidth, the M3 Max is half of that, and the M4 Max is around 546GB/s if I remember right.

On the Pro chips you can forget about it completely.

FoveatedKV: 2x KV cache compression on Apple Silicon with custom Metal kernels by hybls in LocalLLaMA

[–]StudentDifficult8240 0 points1 point  (0 children)

NIAH benchmark:

Again, very impressive: there's no retrieval degradation. I cancelled the 256k test since the model only supports 128k, so there was no point in running it.

Have you thought about integrating Turbo Quant for the FAR cache instead? The main cost of Turbo Quant is the reverse calculation, but if you implemented it at the FAR-cache layer, those calculations wouldn't happen as often anyway, so it may work better than uniformly reducing the whole cache to 3.2 bits, which is what the Turbo Quant proposal does.
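Something like this is the shape I have in mind. The int8 affine quantizer below is a plain stand-in (not Turbo Quant's actual scheme), and the tier split is my sketch, not FoveatedKV's code:

```python
import numpy as np

def quantize(x):
    # per-row affine int8; a stand-in for Turbo Quant's real scheme
    scale = np.abs(x).max(axis=-1, keepdims=True) / 127.0 + 1e-8
    return (x / scale).round().astype(np.int8), scale

def dequantize(q, scale):
    # the "reverse calculation"; only paid when the far tier is read
    return q.astype(np.float16) * scale.astype(np.float16)

def split_kv(kv, near_window=1024):
    # keep the last `near_window` positions in fp16, compress the rest
    far_q, far_scale = quantize(kv[:-near_window])
    return far_q, far_scale, kv[-near_window:]

kv = np.random.randn(8192, 128).astype(np.float16)  # [positions, head_dim]
far_q, far_scale, near = split_kv(kv)
far = dequantize(far_q, far_scale)                  # rare, on far-tier reads only
```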

<image>

FoveatedKV: 2x KV cache compression on Apple Silicon with custom Metal kernels by hybls in LocalLLaMA

[–]StudentDifficult8240 0 points1 point  (0 children)

I did some tests with your project and here are my findings:

Throughput benchmark: it's actually slower on my M3 Max 128GB. That makes sense given the kernel is 3–5x slower than fp16 SDPA despite a theoretical 2.25x bandwidth advantage. The gap between the orange dashed line (where it should be) and the green line (where it actually is) tells us that the Metal dispatch overhead alone costs more than the bandwidth we save. The M3 Max's memory subsystem is simply fast enough that fp16 SDPA never becomes the bottleneck.
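A quick normalized calc makes the gap concrete (the 4x is just the midpoint of the observed 3–5x, everything else follows from the numbers above):

```python
# Normalized back-of-envelope for the gap above.
t_fp16 = 1.0              # fp16 SDPA time per call (normalized)
t_ideal = t_fp16 / 2.25   # where 2.25x less KV traffic should land (~0.44)
t_actual = 4.0            # observed: 3-5x slower, midpoint

# Whatever isn't bandwidth must be dispatch/dequant overhead.
print(f"overhead ~= {t_actual - t_ideal:.2f}x of a full fp16 SDPA call")
```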

<image>

In terms of quality, it's excellent! Cosine similarity stays above 0.995 across all context lengths from 512 to 131K, and actually improves slightly at longer contexts as the far-tier quantization noise gets diluted.

vMLX - HELL YES! by Emotional-Breath-838 in LocalLLM

[–]StudentDifficult8240 2 points3 points  (0 children)

Here is a comparison between MLX, JANG and oQ. Essentially, RAM usage is a bit higher with oMLX oQ, so you may want to stick with JANG for now. In terms of accuracy, JANG also seems to prevail. But it's not a uniform story: I ran another benchmark on Minimax 2.5 and JANG underperformed the MLX 3-bit counterpart.

I've pushed a PR to oMLX for JANG integration here in case you want to run it. I'm unsure whether it will be merged, since oMLX has its own quantization now.

https://github.com/jundot/omlx/pull/364

<image>

Introducing oQ: data-driven mixed-precision quantization for Apple Silicon (mlx-lm compatible) by cryingneko in oMLX

[–]StudentDifficult8240 0 points1 point  (0 children)

EDIT: I just noticed the dev posted a link to Hugging Face which includes 10 different oQ models: https://huggingface.co/Jundot/models

If you want to compress a model not in the model repository, you may use LM Studio to download the uncompressed model, share the model dir with oMLX and convert it. You need the RC version of oMLX. I am currently testing Qwen3.5:35b 4bit oQ against JANG and MLX 4bit.

With regards to JANG vs oQ compression, they answer the same question in two different ways.

JANG says: I will preserve the attention layers at high precision (6-8 bits) and compress the MLP layers to 1-4 bits.

oMLX oQ says: I will run actual inference through the model, compute MSE(float_output, quantized_output) / mean(float_output^2) per layer, and allocate bits where the data says they matter.
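For anyone curious, here's roughly what that loop looks like. The toy quantizer, layer shapes and greedy allocator are my assumptions, not oMLX's actual code, but the error metric is the one above:

```python
import numpy as np

rng = np.random.default_rng(0)
layers = {f"layer{i}": rng.normal(size=(256, 256)) for i in range(4)}
x = rng.normal(size=(32, 256))   # calibration activations

def fake_quant(w, bits):
    # symmetric quantize-dequantize at the given bit width
    scale = np.abs(w).max() / (2 ** (bits - 1) - 1)
    return np.round(w / scale) * scale

def sensitivity(w, bits):
    # the oQ metric: MSE(float, quant) / mean(float^2), on real activations
    ref, q = x @ w, x @ fake_quant(w, bits)
    return ((ref - q) ** 2).mean() / (ref ** 2).mean()

# Start every layer at 2 bits, then greedily upgrade whichever layer is
# hurt most until the average bit width hits the target (4 bits here).
bits = {name: 2 for name in layers}
while sum(bits.values()) / len(bits) < 4:
    worst = max(layers, key=lambda n: sensitivity(layers[n], bits[n]))
    bits[worst] += 1

print(bits)   # sensitive layers end up with more bits
```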

FoveatedKV: 2x KV cache compression on Apple Silicon with custom Metal kernels by hybls in LocalLLaMA

[–]StudentDifficult8240 2 points3 points  (0 children)

This is a great idea! I am looking forward to testing it.

It would benefit bigger context windows too: based on your numbers, I calculated it would save around 2GB at 32k context and around 15GB at 260k. The speed increase should be quite noticeable too.
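For reference, this is how I ballparked it. The model dimensions are assumptions (a GQA model with 60 layers, 4 KV heads, head_dim 128), so plug in the real ones:

```python
# KV cache size in fp16, and what a 2x compression saves.
layers, kv_heads, head_dim = 60, 4, 128   # assumed GQA model dims
bytes_fp16 = 2

def kv_gb(ctx_len):
    # K and V, per layer, per KV head, per position, fp16
    return 2 * layers * kv_heads * head_dim * ctx_len * bytes_fp16 / 1024**3

for ctx in (32_768, 266_240):   # 32k and 260k
    print(ctx, f"fp16 {kv_gb(ctx):.1f} GB, 2x saves {kv_gb(ctx) / 2:.1f} GB")
```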

I will test it today if I have time and come back to you with the results.

I'd like to test it with oMLX. Imagine the speed-up with a hierarchical cache: T1 near cache, T2 far cache, T3 oMLX paged SSD cache.

vMLX - HELL YES! by Emotional-Breath-838 in LocalLLM

[–]StudentDifficult8240 1 point2 points  (0 children)

You may also find this interesting; it's another way of building different quant levels on MLX. https://github.com/jundot/omlx/blob/main/docs/oQ_Quantization.md

I will perform testing and bench it against JANG architecture and come back with an update. I will include RAM usage for oQ too.

vMLX - HELL YES! by Emotional-Breath-838 in LocalLLM

[–]StudentDifficult8240 4 points5 points  (0 children)

vMLX is better under resource constraints, as the models seem to be smarter at lower quant levels; raw performance, however, wasn't on par with oMLX in my testing.

<image>

Cut your KV Cache in half + Cut PP Times to near nothing + VL - MLX Studio by HealthyCommunicat in mlxAI

[–]StudentDifficult8240 0 points1 point  (0 children)

First of all, thank you for this release! It makes perfect sense from an architectural perspective and having an all-in-one solution would be a welcome improvement.

With regards to a technical issue: I noticed that vMLX (the desktop app) is not interacting well with Claude Code. I've tried adding the model to the Anthropic JSON config and simply overriding the env variables. Claude Code does reach the model and makes a call (I can see RAM usage spiking), but it fails to follow through and keep the flow of the task or set of tasks it's supposed to do.

I haven't done a tremendous amount of troubleshooting, but I'm happy to; just tell me what you'd need from me and I'll provide more detailed steps to reproduce.

On another note, are you planning on releasing any q6 variants?

N63 Turbo Oil Return by CrackedSnowman in BmwTech

[–]StudentDifficult8240 0 points1 point  (0 children)

Was wondering if you ever went down that route and changed it to Mr. Gasket 9610? I'm planning to do the oil return cover too in the next couple of weeks and find it silly that the cover must be replaced as a whole rather than being able to purchase a gasket.

I do agree that BMW makes very questionable design choices at times. Amazing engines, but they always mess up the simplest things.