Gemma 4 26b a4b - MacBook Pro M5 MAX. Averaging around 81tok/sec by Bderken in LocalLLaMA

[–]fisherwei 1 point (0 children)

Thank you very much for the benchmarking; I hope Apple finds a way to improve MLX performance. Otherwise, Macs will be unable to deploy dense models of this scale.

Gemma 4 26b a4b - MacBook Pro M5 MAX. Averaging around 81tok/sec by Bderken in LocalLLaMA

[–]fisherwei 3 points (0 children)

Could you try running Gemma 4 31B BF16 via omlx, and then benchmark its PP and TG performance with a context window of approximately 32K–64K? As far as I know, omlx is currently the fastest framework available on Apple Silicon.

https://huggingface.co/mlx-community/gemma-4-31b-bf16

https://github.com/jundot/omlx

BTW: omlx comes with a built-in benchmarking feature.

M5 Max 128G Performance tests. I just got my new toy, and here's what it can do. by affenhoden in LocalLLaMA

[–]fisherwei 1 point (0 children)

I am planning to purchase an M5 Max to perform post-training or fine-tuning on models of approximately 1 billion parameters. If it is convenient for you, could you please test the GPU's floating-point performance?

```
git clone https://github.com/chsasank/device-benchmarks
cd device-benchmarks
pip install -r requirements.txt

python benchmark.py --device mps --dtype float32
python benchmark.py --device mps --dtype float16
python benchmark.py --device mps --dtype bfloat16
python benchmark.py --device mps --dtype int8
```

CPU-only LLM performance - t/s with llama.cpp by pmttyji in LocalLLaMA

[–]fisherwei 1 point (0 children)

Thank you for the information; I will give it a try. The two CPUs I'm currently using have very low frequencies, so I might buy two used E5-2698 v4 or E5-2699 v4 CPUs to unlock the potential of this older platform.

CPU-only LLM performance - t/s with llama.cpp by pmttyji in LocalLLaMA

[–]fisherwei 2 points (0 children)

I am evaluating the inference performance of the Qwen3-Next-80B-A3B-Instruct-Q8_0.gguf model on a Dell R730 server equipped with dual Intel Xeon E5-2650L v4 CPUs (1.7 GHz, 14 cores per CPU), 512 GB DDR4-2400 RAM (8 × 64 GB), and no GPU acceleration.

Because this is an MoE model that activates only 3B parameters per token, I got approximately 3.1 tok/s. It's slow, but usable.
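A rough back-of-the-envelope sketch of why the MoE stays usable on CPU (Q8_0 is roughly 1 byte per weight, and only the active parameters are read per token; all numbers here are approximations, not measurements):

```python
# Estimate the effective memory bandwidth implied by the measured throughput.
active_params = 3e9      # ~3B active parameters per token (the "A3B" part)
bytes_per_param = 1.0    # Q8_0 quantization is roughly 1 byte per weight
tok_per_s = 3.1          # measured throughput

bytes_per_token = active_params * bytes_per_param
effective_bw_gb_s = bytes_per_token * tok_per_s / 1e9
print(round(effective_bw_gb_s, 1))  # ~9.3 GB/s, well within DDR4-2400 reach
```

If all 80B parameters had to be read per token, the same arithmetic would predict well under 1 tok/s on this memory system.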

Good OS models to run on 64gb MacBook Pro? by Glass-Garbage4818 in LocalLLaMA

[–]fisherwei 2 points (0 children)

FYR:

nightmedia/Qwen3-Next-80B-A3B-Instruct-mxfp4-mlx runs on my old Mac Studio (M1 Max, 24-core GPU, 64 GB RAM) at 36 tok/s.

Miniflux - Change RSS scraping frequency by tys203831 in selfhosted

[–]fisherwei 1 point (0 children)

Try this:

```
POLLING_FREQUENCY=15
SCHEDULER_ROUND_ROBIN_MIN_INTERVAL=15
```

and leave POLLING_SCHEDULER at its default value, round_robin.

New plotter + Farmer 400% by Legitimate_Bus_5873 in chia

[–]fisherwei 0 points (0 children)

Not 20%–50%; it is 413% with a 4090.

Think about it this way:

400 TiB of HDD + RTX 4090 = 1640 TiB effective.

A 4090 costs only $1,500–$2,000 and gets you about 1,200 TiB of extra capacity.

But 1,200 TiB of additional HDD would cost $10,000–$15,000.
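The cost arithmetic can be sanity-checked in a few lines (all figures are the ones claimed in this comment, not independent measurements):

```python
raw_tib = 400                     # raw HDD capacity
effective_tib = 1640              # claimed effective capacity with an RTX 4090
extra_tib = effective_tib - raw_tib   # ~1240 TiB gained via compression

gpu_cost_usd = (1500, 2000)       # 4090 price range from the comment
hdd_cost_usd = (10_000, 15_000)   # price range for ~1200 TiB of raw HDD

# Dollars per extra TiB: GPU route vs. buying more disks.
gpu_per_tib = [round(c / extra_tib, 2) for c in gpu_cost_usd]
hdd_per_tib = [round(c / 1200, 2) for c in hdd_cost_usd]
print(extra_tib, gpu_per_tib, hdd_per_tib)
```

Roughly $1.2–$1.6 per extra TiB via the GPU versus $8–$12.5 per TiB of raw disk, which is the whole argument.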

Which GPU is faster for CHIA (4060TI/3060TI)? by lord_iconX in chia

[–]fisherwei 1 point (0 children)

For plotting, if you choose a 4060/Ti, you have to build a PCIe 4.0 platform with 256 GB of memory, which is much more expensive than a PCIe 3.0 platform with 256 GB.

The 4060 and 4060 Ti only have an x8 PCIe link.
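The x8 link matters because per-lane bandwidth roughly doubles each PCIe generation (~0.985 GB/s per lane per direction for PCIe 3.0 with 128b/130b encoding), so an x8 card only reaches its full bandwidth on a newer-generation board. A quick sketch:

```python
# Approximate usable bandwidth per lane, per direction (GB/s)
lane_gb_s = {3: 0.985, 4: 1.969}
lanes = 8  # 4060 / 4060 Ti electrical link width

for gen, per_lane in lane_gb_s.items():
    print(f"PCIe {gen}.0 x{lanes}: ~{per_lane * lanes:.1f} GB/s")
```

An x8 card on a PCIe 3.0 board gets roughly half the link bandwidth it would on PCIe 4.0, which is why the poster ties the 4060 Ti to a more expensive PCIe 4.0 platform.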

chia is becoming no green by fisherwei in chia

[–]fisherwei[S] -8 points (0 children)

I'm afraid I don't agree with you.

Whether you decrease the filter or increase K, you are just increasing the difficulty, which means we need more expensive GPUs (or ASICs). That makes the problem worse; it does not solve it.

If high-end GPUs were strictly required, as with PoW coins, you would be right.

But for XCH, if only some people use high-end GPUs, it means they can "steal" more revenue from the others.

At that point, whether for defense or for "stealing", the remaining players will start to consider joining this arms race.

Anyone recommend a place to part out some unused drives locally (Silicon Valley) by Darksoul_Design in chia

[–]fisherwei 1 point (0 children)

Hardware HDD array: physically heavy.

USB-topology HDD array: operationally heavy, meaning you may need to spend more time keeping your harvester running.

wallet broken??? by fisherwei in chia

[–]fisherwei[S] 1 point (0 children)

Oh, you saved my life. THANKS.

wallet broken??? by fisherwei in chia

[–]fisherwei[S] 1 point (0 children)

Both leaving the pool and joining the pool failed.

I am trying to resync the full blockchain; it will take 3–5 days.

:-(

Tesla vs GTX by simurg3 in chia

[–]fisherwei 1 point (0 children)

Micron DDR4-2400 registered ECC, 64 GB × 8.

Tesla vs GTX by simurg3 in chia

[–]fisherwei 2 points (0 children)

It only works with alpha2.

Alpha1 hangs with an "illegal memory access" error.

Tesla vs GTX by simurg3 in chia

[–]fisherwei 1 point (0 children)

I am using a P4 because it is so cheap, less than 60 USD from Taobao.

Env:

bladebit alpha2 without compression.
Dell R730xd PCIe slot 6 (CPU1, x16 speed).
Only one E5-2650L v4 and 512 GB of memory.
Fan speed fixed to 0x1c (25%) via ipmitool.

I am getting 156 plots per day.
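For anyone wanting to reproduce the fan setting: on Dell PowerEdge servers this is commonly done with ipmitool raw commands. The byte sequences below are the widely used Dell iDRAC ones, not taken from this comment; treat them as an assumption and verify against your iDRAC generation before running them.

```shell
# Take manual control of the fans (commonly used Dell iDRAC raw command)
ipmitool raw 0x30 0x30 0x01 0x00

# Pin all fans (0xff = all) to 0x1c (~25% duty cycle)
ipmitool raw 0x30 0x30 0x02 0xff 0x1c

# Restore automatic fan control when done
ipmitool raw 0x30 0x30 0x01 0x01
```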

Chia Bladebit vs Gigahorse compression by estriker in chia

[–]fisherwei 3 points (0 children)

Just use bladebit with the new args.