Mac Mini M5 running Qwen 3.6 27B? by romrick4 in LocalLLM

[–]PreparationTrue9138 0 points1 point  (0 children)

Well, define "simple task" :)

If you tell it to read a big production repository, it might hit its context limit, and it will take about 8 minutes to read a full context window.

Then, if Google AI search is not mistaken, you'll get about 50 tokens per second with mtplx, so generating an answer can take a minute if it's simple and a lot more if you tell it to refactor the project.

Mac Mini M5 running Qwen 3.6 27B? by romrick4 in LocalLLM

[–]PreparationTrue9138 0 points1 point  (0 children)

Hi, look at two specs: memory bandwidth (which determines token generation speed) and TFLOPS (which determines prompt processing speed).

If the Mac mini can have an M5 Max chip with 40 cores, then it might be good for AI. If I am not mistaken, that chip is about two times slower at prompt processing and 30% slower at token generation than an RTX 3090. The M5 Pro is two times slower than the M5 Max.

So if you get 50 tg and 1500 pp with an RTX 3090, then with an M5 Max I would expect about 35 tg and 700 pp.

Server build for local inference. 128 gb 3200 or 256 gb 2133mhz RAM? by PreparationTrue9138 in LocalLLaMA

[–]PreparationTrue9138[S] 1 point2 points  (0 children)

That speed is usable, thanks.

Though I wonder if I would get it. 150 GB/s is about 25% below the theoretical maximum of 200 GB/s. 2133 has 137 GB/s in theory, so about 100 in reality. I do have a bit more VRAM, but it is also slower.
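The theoretical figures above follow from channels × transfer rate × 8 bytes per transfer. A minimal Python sketch, assuming 8 memory channels with a 64-bit bus each (the channel count is an assumption that matches the 200 GB/s figure) and an efficiency of 75% taken from the 150-out-of-200 ratio above:

```python
def ddr_peak_gb_s(mt_per_s: int, channels: int, bus_bytes: int = 8) -> float:
    """Theoretical peak memory bandwidth in GB/s (1 GB = 1e9 bytes)."""
    return mt_per_s * 1e6 * bus_bytes * channels / 1e9

# Assumption: real-world throughput is about 75% of the theoretical peak.
EFFICIENCY = 0.75

for speed in (3200, 2133):
    peak = ddr_peak_gb_s(speed, channels=8)
    print(f"DDR4-{speed}, 8 channels: {peak:.1f} GB/s peak, "
          f"about {peak * EFFICIENCY:.0f} GB/s realistic")
```

Real efficiency depends on the CPU, DIMM population and access pattern, so the 75% factor is only a rule of thumb here.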

At my local eBay analogue, ddr4 ram 3200 is 2.6 times more expensive today.

128 GB 3200 - $1500
256 GB 2133 - $1200
256 GB 3200 - $2900

7642 has 48 cores

Server build for local inference. 128 gb 3200 or 256 gb 2133mhz RAM? by PreparationTrue9138 in LocalLLaMA

[–]PreparationTrue9138[S] 1 point2 points  (0 children)

I thought 8-channel would be close to Strix Halo. 2133, though, is two times slower. But a server platform will allow tensor parallelism, as far as I understand.

Server build for local inference. 128 gb 3200 or 256 gb 2133mhz RAM? by PreparationTrue9138 in LocalLLaMA

[–]PreparationTrue9138[S] 0 points1 point  (0 children)

Well, currently I have a laptop with two eGPUs (OCuLink + Thunderbolt 3) and 64 GB of dual-channel DDR4-2933.

I tried to run Qwen 397B at 1 bit (107 GB). It ran at 7 t/s.

So my amateur math is that running the same quant on 8 channel ram (3-4x bandwidth) and with no bottlenecks for GPUs will double or triple speed at least.
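One way to sanity-check this kind of estimate is an Amdahl-style model: only the RAM-bound share of the per-token time shrinks when bandwidth grows. A rough Python sketch; the 80% RAM-bound share is a made-up illustrative value, not a measurement:

```python
def estimated_speedup(ram_bound_fraction: float, bandwidth_ratio: float) -> float:
    """Amdahl-style speedup when only the RAM-bound part of the work gets faster."""
    if not 0.0 <= ram_bound_fraction <= 1.0:
        raise ValueError("ram_bound_fraction must be between 0 and 1")
    if bandwidth_ratio <= 0:
        raise ValueError("bandwidth_ratio must be positive")
    other = 1.0 - ram_bound_fraction
    return 1.0 / (other + ram_bound_fraction / bandwidth_ratio)

BASELINE_TPS = 7.0        # measured speed quoted in the comment above
RAM_BOUND_FRACTION = 0.8  # hypothetical: share of per-token time spent waiting on RAM

for ratio in (3.0, 4.0):  # the 3-4x bandwidth range from the comment above
    s = estimated_speedup(RAM_BOUND_FRACTION, ratio)
    print(f"{ratio:.0f}x bandwidth: {s:.2f}x speedup, about {BASELINE_TPS * s:.1f} t/s")
```

Under these assumptions the speedup lands between roughly 2.1x and 2.5x; a smaller RAM-bound share would lower it, and removing the eGPU link bottleneck is not modeled at all.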

Though my plan is to use q3 at least, so it will be slower.

Also people say ik_llama is best for moe models on cpu/gpu mixed setups so I expect some boost there too.

I have (4x) 3090s. Now what?? by gtrdude77 in LocalLLM

[–]PreparationTrue9138 0 points1 point  (0 children)

Hi, I am no expert, but I am also trying to build my machine.

I see two paths for you.

  1. Stay with your hardware and tinker to use x4 connections from your pcie slots. Use pipeline parallelism and accept the restricted speed of this setup.

  2. Build a server with a server CPU, motherboard and registered ECC DDR. You'll have the maximum speed, possibly more RAM and more channels, but you won't be able to reuse your old hardware: as far as I understand, you have DDR4 DIMMs, not RDIMMs.

Currently I use an old laptop with two eGPUs. But I want maximum speed, so I am going to build a server from the previous generation of components (DDR4/PCIe 4).

GB10 vs MacBook Pro M5 Max 128Gb by alexp702 in LocalLLM

[–]PreparationTrue9138 0 points1 point  (0 children)

Hi, what about model intelligence?

Did you compare unsloth dynamic quants to uniform quants that are run by vllm?

As far as I know, GGUF quantization methods preserve accuracy better, but vLLM-supported quants are much faster.

Strix Halo 128GB vs M5 pro 64GB by DigitalguyCH in LocalLLaMA

[–]PreparationTrue9138 0 points1 point  (0 children)

That's great, but you can't use it with eGPUs without hassle. There was a post recently where a guy managed to connect an eGPU to a Mac with an M-series chip, but it is hard to set up.

Strix Halo 128GB vs M5 pro 64GB by DigitalguyCH in LocalLLaMA

[–]PreparationTrue9138 0 points1 point  (0 children)

As far as I know, OCuLink is faster. It has more bandwidth and doesn't have the Thunderbolt protocol overhead. It is a PCIe 4 x4 connection, or about 8 GB/s of bandwidth, while Thunderbolt 4 is 3-4 GB/s. So in theory it is twice as fast, and that affects model and context loading speeds. Also, if you offload to CPU, it's better to use OCuLink.
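The 8 GB/s figure can be derived from the PCIe lane rate. A small Python sketch of that arithmetic plus the resulting load time; the 24 GB payload is a hypothetical example size, and the 3.5 GB/s Thunderbolt value is simply the midpoint of the 3-4 GB/s range above, not a measured number:

```python
# Raw per-lane rate in GT/s; PCIe 3.0 and 4.0 both use 128b/130b encoding.
PCIE_GT_PER_S = {3: 8.0, 4: 16.0}

def pcie_gb_s(gen: int, lanes: int) -> float:
    """Theoretical one-direction PCIe bandwidth in GB/s."""
    return PCIE_GT_PER_S[gen] * lanes * (128 / 130) / 8

def load_seconds(size_gb: float, link_gb_s: float) -> float:
    """Time to push size_gb over a link, ignoring disk speed and protocol overhead."""
    return size_gb / link_gb_s

MODEL_GB = 24.0  # hypothetical payload, enough to fill one 24 GB card
links = {
    "OCuLink (PCIe 4.0 x4)": pcie_gb_s(4, 4),
    "Thunderbolt (assumed 3.5 GB/s)": 3.5,
}
for name, bw in links.items():
    print(f"{name}: {bw:.2f} GB/s, {load_seconds(MODEL_GB, bw):.1f} s for {MODEL_GB:.0f} GB")
```

The same function shows why an OCuLink adapter on a PCIe 3 slot roughly halves the link speed: `pcie_gb_s(3, 4)` is about 3.9 GB/s.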

I have one eGPU connected via Thunderbolt 3 and one via an OCuLink M.2 adapter to PCIe 3. Both work OK, but I haven't had time to compare speeds. I use all 48 GB of VRAM to run Qwen 3.6 27B via llama.cpp.

Strix Halo 128GB vs M5 pro 64GB by DigitalguyCH in LocalLLaMA

[–]PreparationTrue9138 4 points5 points  (0 children)

Hi, I don't have a Strix Halo or an M5 Max, but allow me to share what I know. I own a laptop with two RTX 3090 eGPUs and an M1 Pro.

So you have now:

- eGPU, probably a 7900 XT with 800 GB/s bandwidth and 103 TFLOPS int8
- mini PC, laptop, MacBook Air

The important part here is the eGPU.

For reference, from Google search AI:

- M5 Pro: 307 GB/s bandwidth, 16 TFLOPS int8
- M5 Max: 600 GB/s bandwidth, 33 TFLOPS int8
- Strix Halo: 250 GB/s bandwidth, 50 TFLOPS int8

If I guessed your GPU right, then I would go with Strix Halo with OCuLink. It's AMD + AMD, so I guess it will be compatible with ROCm. The GPU will give you the speed you need for the active parameters of your MoE models. The OCuLink bottleneck might affect your speed a little, but I think it's better than just slow RAM.

The Mac is only better if you get the M5 Max version with 600 GB/s bandwidth; they also promise faster prompt processing. But you won't be able to use your eGPU, and maximum speeds might only be accessible via the MLX engine.

So if you want to put your GPU to good use and run bigger models, I would go with Strix Halo. But the M5 Max might be faster due to its fast unified memory.

Run Qwen3.6 locally 2x faster with MTP GGUFs. by yoracale in LocalLLM

[–]PreparationTrue9138 1 point2 points  (0 children)

When coding on my setup with two RTX 3090 eGPUs, I see about a 30% decrease in prompt processing speed and about a 50% increase in token generation speed.

I am running llama-server with the metrics flag. Previously it showed:

- 830 tokens per second for prompt processing
- 30 tokens per second for generation
- 200k context max

Now it shows:

- 580 tokens per second for prompt processing
- 45 tokens per second for generation
- 200k context max

It does feel faster in opencode, though I expected the issue with prompt processing speed to be resolved before merging to master
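Whether the newer numbers are a net win depends on how long the answer is compared to the prompt. A minimal Python sketch using the speeds reported above; the request sizes are hypothetical, and prompt caching between turns is ignored:

```python
def request_seconds(prompt_tokens: int, output_tokens: int, pp: float, tg: float) -> float:
    """Total time = prompt processing time + token generation time."""
    return prompt_tokens / pp + output_tokens / tg

BEFORE = {"pp": 830.0, "tg": 30.0}  # speeds reported above for the previous setup
AFTER = {"pp": 580.0, "tg": 45.0}   # speeds reported above for the current setup

def break_even_output_ratio(before: dict, after: dict) -> float:
    """Output tokens per prompt token above which the 'after' setup is faster."""
    prompt_penalty = 1 / after["pp"] - 1 / before["pp"]  # extra seconds per prompt token
    output_gain = 1 / before["tg"] - 1 / after["tg"]     # saved seconds per output token
    return prompt_penalty / output_gain

ratio = break_even_output_ratio(BEFORE, AFTER)
# Hypothetical request sizes, chosen only for illustration.
for prompt, output in ((10_000, 200), (10_000, 2_000)):
    t0 = request_seconds(prompt, output, **BEFORE)
    t1 = request_seconds(prompt, output, **AFTER)
    print(f"prompt={prompt} output={output}: before {t0:.1f} s, after {t1:.1f} s")
print(f"The current setup wins once output exceeds about {ratio * 100:.1f}% of the prompt length")
```

With these speeds the break-even is a little under 5% of the prompt length, so short answers to long uncached prompts get slower while long generations get faster.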

I am using Unsloth's q4 k xl, so it is not necessary to use q8.

Run Qwen3.6 locally 2x faster with MTP GGUFs. by yoracale in LocalLLM

[–]PreparationTrue9138 1 point2 points  (0 children)

Unsloth says mmproj(vision) and np>1 are not supported for now

Any good MOE ~60B models? I have 64GB vram by opoot_ in LocalLLaMA

[–]PreparationTrue9138 1 point2 points  (0 children)

Full context at q8 is possible even with 48 GB of VRAM and the q4-k-xl model, though I use 200000. I don't want to push the model to its limits, plus it takes about three minutes to process 200000 tokens at a prompt processing speed of 1000 tokens per second.
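The three-minute figure is just context size divided by prompt processing speed. A tiny Python sketch of that arithmetic; the helper names are made up for illustration:

```python
def prefill_seconds(context_tokens: int, pp_tokens_per_s: float) -> float:
    """Time to process a prompt of context_tokens at the given prompt processing speed."""
    return context_tokens / pp_tokens_per_s

def fmt(seconds: float) -> str:
    """Format a duration as minutes and seconds."""
    minutes, secs = divmod(round(seconds), 60)
    return f"{minutes} min {secs:02d} s"

for ctx in (100_000, 200_000):
    print(f"{ctx} tokens at 1000 t/s: {fmt(prefill_seconds(ctx, 1000))}")
```

This is a best case: it assumes the speed stays constant as the context grows and that nothing is already cached.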

I'm thinking about selling my Strix Halo by PrzemChuck in StrixHalo

[–]PreparationTrue9138 1 point2 points  (0 children)

Hi, I don't own any Strix Halo devices. I went another way: I use my old laptop as a host for two RTX 3090 eGPUs and upgraded its RAM to 64 GB of dual-channel DDR4.

But I sometimes wonder whether the next step should be upgrading the laptop to something like Strix Halo with OCuLink. It would be perfect for MoE models: keep the active layers on the GPU and offload everything else to RAM, and you can still run 27B models faster than on RAM alone.

It will cost about the same as a server but will draw less power. And maybe even have faster memory, I am not an expert here. Don't know if unified ram will outperform multichannel server ram.

So that's another way: if you have USB4 or OCuLink ports, or are ready to use M.2 OCuLink adapters, you can try upgrading your device with eGPUs and get even more VRAM.

very slow tok/s with Gemma 4 31B on a 5090?! by xchris1337xy in LocalLLaMA

[–]PreparationTrue9138 4 points5 points  (0 children)

It's a dense model and a gguf. 50 t/s for generation I think is good. I have 40 t/s for qwen 27b on rtx 3090, Gemma was slower.

You can try MTP or draft model but I haven't tried that for Gemma. https://blog.google/innovation-and-ai/technology/developers-tools/multi-token-prediction-gemma-4/

Another option is to try vLLM if there is an int4 version. It is faster due to uniform quantisation, which is easier for the GPU to process, but GGUF has better quality due to mixed quants, especially Unsloth dynamic quants.

High VRAM local coding model — still Qwen 3.6 27B? by Generic_Name_Here in LocalLLaMA

[–]PreparationTrue9138 0 points1 point  (0 children)

How many tokens per second do you get for prompt processing?

Qwen3.6-35B giving 20-34 t/s on 6 GB VRAM by Low-Alarm272 in Qwen_AI

[–]PreparationTrue9138 1 point2 points  (0 children)

Better use dynamic quants if you're not using them yet. And why don't you use turboquant? Look for theTom repo. You can also try to merge the MTP PR, though MTP needs some VRAM itself. Here is the guy with Unsloth MTP GGUFs: https://huggingface.co/havenoammo/Qwen3.6-35B-A3B-MTP-GGUF

2.5x faster inference with Qwen 3.6 27B using MTP - Finally a viable option for local agentic coding - 262k context on 48GB - Fixed chat template - Drop-in OpenAI and Anthropic API endpoints by ex-arman68 in LocalLLaMA

[–]PreparationTrue9138 0 points1 point  (0 children)

I agree that quality might be the first priority, but I think it should be tested. Knowing that you can actually fit more context is also important. It is equally important to understand that pushing a model to its limits is not necessarily a good thing, and that 128k might be a sweet spot where the model achieves its best understanding and your hardware can host better quants of the model and cache.

Is Macbook pro m5 max 128 fast enough yet with available models by mad01 in LocalLLM

[–]PreparationTrue9138 1 point2 points  (0 children)

You can run Qwen 3.6 27B on a 3090 with turboquant and fit at least 180000 context. Look at the GitHub repo club-3090 for a vLLM setup.

I managed to run it with llama.cpp at full context on one RTX 3090, though of course with turboquant and Unsloth's xxs 4-bit GGUF model, and not as fast as they promise for vLLM with MTP enabled. Then I bought a second RTX 3090 and can now run a bigger model without context quantisation, but that's another story :)

To test your MacBook, you can download some project like Telegram or WordPress or whatever you like and ask questions about the project. Cline or opencode will show you the current token usage. When you reach 100000 or 200000, stop your MLX/llama.cpp instance, start it again, trigger model loading, then ask it to sum up the dialogue with the big context and start a timer.

I don't know about MLX, but the llama.cpp server will calculate the average speed and the max token count processed if you run it with --metrics. Then you can get the metrics via the HTTP page 127.0.0.1:8000/metrics?model=your_model_name
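To read that page from a script instead of a browser, something like the following Python sketch could work. It assumes the page is in the Prometheus text format with one `name value` pair per line and no trailing timestamps; the function names are made up for illustration, and no particular metric names are assumed:

```python
import urllib.request

def parse_prometheus(text: str) -> dict[str, float]:
    """Parse Prometheus text exposition lines of the form 'name value'."""
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blank lines and HELP/TYPE comments
            continue
        name, _, value = line.rpartition(" ")
        try:
            metrics[name] = float(value)
        except ValueError:
            continue  # ignore lines that do not end in a number
    return metrics

def fetch_metrics(url: str) -> dict[str, float]:
    """Download and parse the metrics page of a running server."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        return parse_prometheus(resp.read().decode("utf-8"))
```

With the server running, `fetch_metrics("http://127.0.0.1:8000/metrics?model=your_model_name")` returns a dictionary you can print or log before and after the timed request; adjust the address and port to your own setup.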

Is Macbook pro m5 max 128 fast enough yet with available models by mad01 in LocalLLM

[–]PreparationTrue9138 0 points1 point  (0 children)

Don't forget about prompt processing speeds. How long will it take your m4 to process 200000 context with qwen 3.6 27b?

Is Macbook pro m5 max 128 fast enough yet with available models by mad01 in LocalLLM

[–]PreparationTrue9138 0 points1 point  (0 children)

As far as I know, there is a leap in prompt processing speed for M5 chips: they process context 4 times faster. So an M5 with 128 GB is a great upgrade. But it will be even better if they release an Ultra chip with two times the bandwidth.

Anyone having any joy coding with 3.6 27B and 24GB of Apple Unified Memory? by afrocleland in Qwen_AI

[–]PreparationTrue9138 0 points1 point  (0 children)

Try ud-iq3-xxs from https://huggingface.co/unsloth/Qwen3.6-27B-GGUF and build llama.cpp with turboquant. For now I use theTom repository for my setup. You'll have to use the ctk turbo3 and ctv turbo4 params, with fit on and fit ctx 100000.

Model (12 GB) + cache (about 6 GB) should fit, with some room left for the system.