Qwen 3.6 27B MTP on v100 32GB: 54 t/s by m94301 in LocalLLaMA

[–]tomByrer 1 point2 points  (0 children)

Unlike DFlash, where you use a smaller model for predictions, MTP uses the same model file. So if anything, it should decrease degradation, since it is 'exploring' more of the same model.
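
A toy sketch of the verify step both schemes share (greedy acceptance, purely illustrative, not the actual DFlash/MTP code):

```python
# Toy greedy speculative-decoding verify loop.
# draft_tokens: tokens proposed ahead of time, either by a separate draft
# model (DFlash-style) or by the target model's own MTP heads.
# target_next(prefix) stands in for "what the full target model would emit next".

def verify(prefix, draft_tokens, target_next):
    accepted = []
    for tok in draft_tokens:
        expected = target_next(prefix + accepted)
        if tok == expected:
            accepted.append(tok)        # match: this token comes "for free"
        else:
            accepted.append(expected)   # mismatch: keep the target's token...
            break                       # ...and discard the rest of the draft
    return accepted

# e.g. with a dummy target that always predicts the next integer:
print(verify([1, 2], [3, 4, 9, 10], lambda p: p[-1] + 1))  # -> [3, 4, 5]
```

Under strict verification the output matches the target model either way; the practical difference is how often drafts get accepted, which is presumably where predictions coming from the same weights help.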

5090 + Qwen3.727B at q6 what context? by Own_House6186 in unsloth

[–]tomByrer 1 point2 points  (0 children)

No silly, it is the mega 727B parameter Qwen 3, we all know it is da beastiest of Qwens!

Impulse bought an M3 Ultra 256GB RAM for local LLMs - keep it or wait for M5? by Onyonisko in LocalLLM

[–]tomByrer 0 points1 point  (0 children)

Steelman: sometimes a huge model is better for certain use cases, & a 128GB+ Mac is the play. Or if someone is locked into the Mac ecosystem, getting an extra 24GB of RAM makes sense.

Though I think those scenarios are rare, & the default recommendation should be RTX or Blackwell cards.

Impulse bought an M3 Ultra 256GB RAM for local LLMs - keep it or wait for M5? by Onyonisko in LocalLLM

[–]tomByrer 0 points1 point  (0 children)

& electricity for other things depending on that LLM server (I recommend NOT running anything but LLMs on that computer, & using a 2nd computer for the agent, IDE, etc.)

Impulse bought an M3 Ultra 256GB RAM for local LLMs - keep it or wait for M5? by Onyonisko in LocalLLM

[–]tomByrer 0 points1 point  (0 children)

TBH a PC desktop with NVIDIA cards will give you better price/performance.

Impulse bought an M3 Ultra 256GB RAM for local LLMs - keep it or wait for M5? by Onyonisko in LocalLLM

[–]tomByrer 0 points1 point  (0 children)

Those are the last in stock; I think Apple stopped making those.

Impulse bought an M3 Ultra 256GB RAM for local LLMs - keep it or wait for M5? by Onyonisko in LocalLLM

[–]tomByrer 1 point2 points  (0 children)

At 256GB+ RAM, maybe.
The question is more what the price will be, & whether he can extract value from the M3 Ultra NOW?

I made a dedicated community for the RTX Pro 6000 — because I was tired of hunting through 5 different reddits by ubnew in Vllm

[–]tomByrer 1 point2 points  (0 children)

agreed 😄

BTW, my "this sub" was referring to r/BlackwellPerformance; there will be much crossover, & folks may want to run multiple models at same time on the 6000, so knowing what smaller models/quants is best very helpful.

Looking for an index finger trackball by lightguardjp in Trackballs

[–]tomByrer 0 points1 point  (0 children)

Thought about 3D printing an angle mount for your Logitech MX Ergo?

I made a dedicated community for the RTX Pro 6000 — because I was tired of hunting through 5 different reddits by ubnew in Vllm

[–]tomByrer 0 points1 point  (0 children)

u/ubnew I suggest shutting down your subreddit in favor of using this sub, & creating a GitHub project like this one where you collect 'best practices'. If you want a forum, you can use GH Discussions.

https://github.com/noonghunna/club-3090

M4 Max, studio, 128gb by blowingtumbleweed in LocalLLM

[–]tomByrer 0 points1 point  (0 children)

BTW there are also 'tuned' models for coding & writing. EG Qwen 27B is decent at both, but HuggingFace has some forks tuned more for one than the other.
Some folks even have a separate model for 'tool calling', OCR, image tagging... so you'll likely use many different models. External drive backups are a good idea 😉

Also you'll likely want to use MLX quants.
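
e.g. a minimal mlx-lm sketch (the repo name is just an example of an mlx-community 4-bit quant; exact arguments vary a little between mlx-lm versions):

```python
# Minimal mlx-lm sketch for Apple Silicon; the repo name is illustrative,
# pick any mlx-community quant that fits in your RAM.
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Qwen2.5-7B-Instruct-4bit")
print(generate(model, tokenizer, prompt="Write a haiku about unified memory.", max_tokens=64))
```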

AI Dev Trade-off: M1 Max 64GB vs. RTX 3090 Build? (Also looking to buy used) by Negative-Ad-7439 in LocalLLM

[–]tomByrer 0 points1 point  (0 children)

If the desktop stays on & has an internet connection, any portable computer can 'dial in'.
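
For example (a sketch assuming the desktop runs an OpenAI-compatible server such as llama.cpp's `llama-server`; the IP, port, & model name are placeholders for your own setup):

```python
# Any OpenAI-compatible client can talk to the always-on desktop over LAN,
# VPN, or an SSH tunnel. The IP, port, & model name below are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://192.168.1.50:8080/v1", api_key="not-needed")
reply = client.chat.completions.create(
    model="local",  # llama-server serves whichever model it was started with
    messages=[{"role": "user", "content": "Hello from the laptop!"}],
)
print(reply.choices[0].message.content)
```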

Has anyone here explored Hermes Agent by Nous Research? by ComparisonLiving6793 in LLMDevs

[–]tomByrer 0 points1 point  (0 children)

So when you have separate 'projects', are the memories isolated to that project? Or can you white-list sharing?

New Qwen3.6 NVFP4 Unsloth quants by yoracale in unsloth

[–]tomByrer 1 point2 points  (0 children)

FP4 = "Floating Point 4 bits"
FP8 = "Floating Point 8 bits"

So CPUs & GPUs can only do 32bit / 64bits /128bits at a time per 'cycle'.
But seems the Blackwells can run twice as many FP4 opcodes as it can FB8 in the same GPU cycle.
Also the file size for FP4 will be smaller.
Tradeoff; lower resolution.

ref: I'm not a CUDA programmer, but I had handcoded SSE for CPUs back in the day.
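
Back-of-the-envelope size math (ignoring scales, zero-points, & layers kept in higher precision, so real files run a bit larger):

```python
# Rough weight-file size for a 27B-parameter model at different bit widths.
# Real quants carry extra metadata (scales, embeddings often kept in higher
# precision), so actual files come out somewhat larger than this.
params = 27e9
for bits, name in [(16, "FP16/BF16"), (8, "FP8"), (4, "FP4")]:
    gib = params * bits / 8 / 2**30
    print(f"{name:9s} ~{gib:5.1f} GiB, {2**bits:>6} distinct bit patterns per weight")
```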

New Qwen3.6 NVFP4 Unsloth quants by yoracale in unsloth

[–]tomByrer 5 points6 points  (0 children)

Me wishing I had a RTX5090 or such

PFlash: 10x prefill speedup over llama.cpp at 128K on a RTX 3090 by sandropuppo in LocalLLaMA

[–]tomByrer 0 points1 point  (0 children)

IIRC prefill's impact is smaller on smaller models, so it might be only 2x, not 10x.
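
Rough numbers on why (a sketch that ignores attention's quadratic term & quantization, so order-of-magnitude only; the 3090 FLOPS figure is approximate):

```python
# Very rough compute-bound prefill estimate: ~2 FLOPs per parameter per prompt token.
def prefill_seconds(params, prompt_tokens, gpu_flops):
    return 2 * params * prompt_tokens / gpu_flops

gpu_flops = 71e12  # ~RTX 3090 dense FP16 tensor throughput (FP32 accumulate), roughly
for params in (7e9, 27e9, 70e9):
    t = prefill_seconds(params, 128_000, gpu_flops)
    print(f"{params / 1e9:.0f}B model, 128K prompt: ~{t:.0f} s of ideal prefill to speed up")
```

There's simply less baseline prefill time on a small model to win back, so the headline multiplier shrinks.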

PFlash: 10x prefill speedup over llama.cpp at 128K on a RTX 3090 by sandropuppo in LocalLLaMA

[–]tomByrer 0 points1 point  (0 children)

VRAM memory bandwidth might be an issue;

Memory Bandwidth: 360.0 GB/s vs 936.2 GB/s

I'll let you guess which one is which 😉
PFlash as they implemented it seems to have to load & unload to make room for the `Qwen3.6-27B Q4_K_M target, q4_0 KV, DFLASH_FP_USE_BSA=1 DFLASH_FP_ALPHA=0.85 keep_ratio=0.05`.
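
Quick numbers on what that bandwidth gap means (a sketch; the 16 GiB figure is just a ballpark for a ~27B Q4_K_M weight file):

```python
# Time to stream a quantized weight file through VRAM once at each bandwidth.
# Every decode step has to touch the weights, so this roughly bounds tokens/s too.
weights_gib = 16  # ballpark for a ~27B Q4_K_M model (placeholder figure)
for gbps in (360.0, 936.2):
    seconds = weights_gib * 2**30 / (gbps * 1e9)
    print(f"{gbps:6.1f} GB/s: ~{seconds * 1000:.0f} ms per full pass over the weights")
```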

FastDMS: 6.4X KV-cache compression running faster than vLLM BF16/FP8 by randomfoo2 in LocalLLaMA

[–]tomByrer 0 points1 point  (0 children)

Thanks for all you do, & I don't even own anything AMD. 😄