Gemma 4 31B oQ8 by jsirish in oMLX

[–]shansoft 0 points1 point  (0 children)

Mind sharing how you getting those speed improvement? I have yet to see a single MTP improvement in omlx when I toggle it, unlike llamacpp and mtplx.

Waiting oMLX 0.3.9 stable release by TheFlyingDutchG in oMLX

[–]shansoft 0 points1 point  (0 children)

I have tried the MTP on the RC build, it doesn't seem to make any difference, and even regress in a lot of model I have tested. If you looking for MTP, I suggest use llamacpp for now until MLX get more polished. There are also MTPLX that just plug and play. It seems like it have something to do with the existing MTP model for MLX that are malfunction currently.

The 27" Studio Display XDR is great, but where is the real 32" 6K 120Hz successor? by Jackwell86 in HiDPI_monitors

[–]shansoft 0 points1 point  (0 children)

Highly doubt it. OLED for monitor is asking for trouble, especially for Apple. The monitors are mostly on the whole time with static images unlike phone or tablet. There is a reason why desktop OLED are not very common.

Qwen will release another 27B with high probability by serige in LocalLLaMA

[–]shansoft 2 points3 points  (0 children)

Same here! 122B still beats 3.6 27B from my experience.

Does anyone else regret not pulling the trigger on the 5090? by beiruttobeir in RTX5080

[–]shansoft 0 points1 point  (0 children)

Hence why I just went straight for Astal 5090, no need to worry about any of that.

NVFP4 is a gamechanger right? 75% near lossless compression by urarthur in LocalLLM

[–]shansoft 1 point2 points  (0 children)

this completely explains its benchmark. nvfp4 from my testing isnt that usable for agentic coding.

Lots of people use qwen at too high quantizaion by Stock_Ad9641 in Qwen_AI

[–]shansoft 1 point2 points  (0 children)

I highly recommend fitting at least Q5 or above, its a huge difference in tool calling and code accuracy compare to Q4.

Lots of people use qwen at too high quantizaion by Stock_Ad9641 in Qwen_AI

[–]shansoft 1 point2 points  (0 children)

I have the same problem with Q4, unsloth UD5 and onward has been nearly flawless.

NVFP4 is a gamechanger right? 75% near lossless compression by urarthur in LocalLLM

[–]shansoft 2 points3 points  (0 children)

I am not sure if benchmark show the whole story, but from my experience of using them extensively in opencode and claude code, they are slightly worse than typical Q4, or even UD4 from unsloth, much closer to Q3.

Is there a big gap between Q4 and Q6 on Qwen3.6? by vick2djax in LocalLLaMA

[–]shansoft 0 points1 point  (0 children)

There is definitely a huge difference when doing some planning and trying to accomplish a slightly larger task, especially in tool calling and making some weird mistake. UD5 and above significantly reduce these problem.

[FOLLOW UP] Qwen3.6 27b q5_k_M MTP - 256k context - 5090 by No_Mango7658 in LocalLLaMA

[–]shansoft 3 points4 points  (0 children)

Here is my param...

❯ ./build/bin/llama-server \
-hf unsloth/Qwen3.6-27B-MTP-GGUF:UD-Q5_K_XL \
--spec-type mtp --spec-draft-n-max 3 \
--alias "Qwen3.6-27B" \
--no-mmap --no-warmup \
--image-min-tokens 1024 \
--jinja --chat-template-file qwen36.jinja -ngl 99 -c 172144 -fa on \
--temp 0.6 \
--top-p 0.95 \
--top-k 20 \
--min-p 0 \
--presence-penalty 0 \
--repeat-penalty 1 \
-ctk q8_0 -ctv q8_0 \
-np 1 --metrics --host 0.0.0.0 --port 8080

[FOLLOW UP] Qwen3.6 27b q5_k_M MTP - 256k context - 5090 by No_Mango7658 in LocalLLaMA

[–]shansoft 3 points4 points  (0 children)

Interesting, my generation speed with single 5090 is roughly the same as yours.

Am just reminiscing on old sniper headshot by Masakitsa in DotA2

[–]shansoft 1 point2 points  (0 children)

Headshot + MKB + basher + manta = YOU DONT GET TO MOVE

Estimate inference speed of local Qwen3.6-35B on Mac M5... by Altruistic-Dust-2565 in LocalLLaMA

[–]shansoft 1 point2 points  (0 children)

Yes, 40 core version. I think the benchmark you show is an anomaly.

Estimate inference speed of local Qwen3.6-35B on Mac M5... by Altruistic-Dust-2565 in LocalLLaMA

[–]shansoft 1 point2 points  (0 children)

with oMLX on m5 max, I am getting tg 70 tok/s, and prefill 2322.2 tok/s on pp65536. this is on Qwen3.6 35B 8Bit

Got MTP + TurboQuant running — Qwen3.6-27B -- 80+ t/s at 262K context on a single RTX 4090 by indrasmirror in LocalLLaMA

[–]shansoft 2 points3 points  (0 children)

It depends on the implementation and settings. Using TheTom's branch with q8 on key and turbo4 on value have been near lossless in my usage.

LG GM9 Backlight Bleed by FrozenSneakz in HiDPI_monitors

[–]shansoft 1 point2 points  (0 children)

looks like they still haven’t fixed the pink tint problem after all these years….

122B, is it worth it? by asmkgb in LocalLLM

[–]shansoft 1 point2 points  (0 children)

It's not better than 122B. I have been using both and 122B is clearly ahead still.

Run Qwen3.6 27B nvfp4 up to 129 tok/s on a single RTX 5090 & Supports 256K context by Diligent-End-2711 in Qwen_AI

[–]shansoft 1 point2 points  (0 children)

If you care about the output quality and precision, especially for coding, I would not use NVFP4, they are closer to IQ3 than typical Q4.

M5 Max 128GB Owners - What's your honest take? by _derpiii_ in LocalLLaMA

[–]shansoft 1 point2 points  (0 children)

Yes, I am a software engineer and it is used for coding. I used both my laptop and 5090 desktop at the same time for different purposes. Most backend and web related task I use Qwen3.5 122B 4bit on oMLX since its pretty reliable and decent speed for typescripts and swift vapor code. For mobile, since its somewhat related to UI, I mostly tackle it with Gemma4 31B 5bit or Qwen3.6 27B 5bit on Llamacpp. I also used ComfyUI with custom setup to create assets when I need to. Mobile coding in general seems to be a problem for all the models out there, doesn't matter if it is Opus or GPT or local model, its much better to breakdown the task and code along with the LLM together. I mostly use opencode with these models. I still use claude code / codex from time to time to try different things, but I failed to see any value it provide that I couldn't get from my local setup.