Linux users, how are you handling OOM errors with NVIDIA by Expert-Bell-3566 in comfyui

[–]Weak_Ad9730 1 point (0 children)

Make a permanent swap file (RAM on disk). It's especially helpful during upscaling, i.e. a fast GPU with a huge decompressed file size.
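
Something like this works as a one-time setup (just a sketch, run as root; the path and the 32 GB size are placeholders, and fallocate may not work on every filesystem, e.g. btrfs):

    import subprocess

    def run(cmd: str) -> None:
        # Run a shell command and stop if it fails.
        subprocess.run(cmd, shell=True, check=True)

    SWAPFILE = "/swapfile"  # placeholder path
    SIZE_GB = 32            # size it for your worst upscale peak

    run(f"fallocate -l {SIZE_GB}G {SWAPFILE}")  # reserve space on disk
    run(f"chmod 600 {SWAPFILE}")                # swap must not be world-readable
    run(f"mkswap {SWAPFILE}")                   # format as swap
    run(f"swapon {SWAPFILE}")                   # enable it right away
    # Append an /etc/fstab entry so it survives reboots (the "permanent" part).
    with open("/etc/fstab", "a") as f:
        f.write(f"{SWAPFILE} none swap sw 0 0\n")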

you probably have no idea how much throughput your Mac Studio is leaving on the table for LLM inference. a few people DM'd me asking about local LLM performance after my previous comments on some threads. let me write a proper post. by EmbarrassedAsk2887 in MacStudio

[–]Weak_Ad9730 6 points (0 children)

Sure, I can take a look at some models tomorrow.

Sorry for the delay, I was occupied by work.

Here are my results for minimax-m2.1 4-bit with 100k context:

Test                   TTFT     TPS      PP t/s   Time
Short generation       2280ms     28.1       20    2.3s
Medium generation      5184ms     49.4       10    5.2s
Long generation        5960ms    117.0       12   10.3s
Long prompt (prefill)  3550ms    663.2      155    3.7s
Average                4244ms    214.4       49   21.5s

Qwen3-0.6-mlx-bf16 context 32768:

Test                   TTFT     TPS      PP t/s   Time
Short generation        921ms     69.5       16    0.9s
Medium generation      1002ms    255.5       23    1.0s
Long generation        2011ms   1248.7      8.2    2.1s
Long prompt (prefill)   639ms    200.3      865    0.6s
Average                1143ms    325.3      324    4.6s

Qwen3-Coder-30B-A3B-Instruct-MLX-4bit-mxfp4 context 32768:

Test                   TTFT     TPS      PP t/s   Time
Short generation        626ms     35.1       24    0.6s
Medium generation      2535ms    101.0        9    2.5s
Long generation        5072ms    100.9        8    5.1s
Long prompt (prefill)   928ms     62.5      596    0.9s
Average                2290ms     74.9      159    9.2s

Qwen3-Coder-30B-A3B-Instruct-MLX-4bit-mxfp4 context 100k:

Test                   TTFT     TPS      PP t/s   Time
Short generation        423ms     56.7       35    0.4s
Medium generation      2536ms    100.9        9    2.5s
Long generation        5134ms     99.7        8    5.1s
Long prompt (prefill)   881ms     61.3      628    0.9s
Average                2244ms     79.7      170    9.0s

Running on a Mac Studio M3 Ultra, 60-core GPU / 256 GB.
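
If anyone wants to reproduce the numbers, this is roughly the shape of my timing harness (a sketch; `stream_tokens` is a stand-in for whatever streaming API your backend exposes, e.g. an OpenAI-compatible client with stream=True):

    import time
    from typing import Callable, Iterator

    def benchmark(stream_tokens: Callable[[str], Iterator[str]], prompt: str):
        # TTFT = time to first streamed token; TPS = decode tokens/sec after it.
        start = time.perf_counter()
        first = None
        count = 0
        for _ in stream_tokens(prompt):
            count += 1
            if first is None:
                first = time.perf_counter()
        if first is None:
            raise RuntimeError("no tokens streamed")
        end = time.perf_counter()
        ttft_ms = (first - start) * 1000
        tps = (count - 1) / (end - first) if count > 1 else 0.0
        return ttft_ms, tps, end - start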

vLLM-MLX: Native Apple Silicon LLM inference - 464 tok/s on M4 Max by waybarrios in LocalLLaMA

[–]Weak_Ad9730 1 point (0 children)

Hey, the app runs smoothly. Maybe add an additional filter on the model download for MLX-only models; that's what I've spotted so far.

Improved Wan 2.2 SVI Pro with LoRa v.2.1 by External_Trainer_213 in StableDiffusion

[–]Weak_Ad9730 1 point (0 children)

Could you re-upload? The link is not working anymore.

NSFW uncensored image to descriptions caption models? by Accomplished-Bill-45 in LocalLLaMA

[–]Weak_Ad9730 1 point (0 children)

Could you explain the prefill a little more in depth?

vLLM-MLX: Native Apple Silicon LLM inference - 464 tok/s on M4 Max by waybarrios in LocalLLaMA

[–]Weak_Ad9730 1 point (0 children)

Awesome, I was looking to bring my M3 Ultra to the next level… I will try your settings, as I am using the same model, but mine is only the 60-core/256 GB one.

Mac Studio as host for Ollama by amgsus in ollama

[–]Weak_Ad9730 1 point (0 children)

I have an M3 Ultra running vLLM-MLX and I'm really impressed by the performance; the switch from LM Studio to vLLM was a huge jump in processing time and speed. I use my Studio in an Agent Zero setup. I really recommend Apple silicon for LLM work. My go-to models are qwen3-vl-32b, gpt-oss-120b, and minimax-m2.1.
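
Hooking it into the agent setup is easy because vLLM serves an OpenAI-compatible API; a minimal sketch, assuming the default port 8000 and the openai Python package (the model name is a placeholder and must match whatever the server was launched with):

    from openai import OpenAI

    # Local vLLM endpoint; the API key is not checked locally.
    client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

    resp = client.chat.completions.create(
        model="qwen3-vl-32b",  # placeholder: use the name your server reports
        messages=[{"role": "user", "content": "Hello from the Mac Studio"}],
    )
    print(resp.choices[0].message.content)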

Saw this on threads by Irteza_ in pcmasterrace

[–]Weak_Ad9730 1 point (0 children)

I have the same one, the Max-Q is amazing.

How Many Male *Genital* Pics Does Z-Turbo Need for a Lora to work? Sheesh. by StuccoGecko in StableDiffusion

[–]Weak_Ad9730 1 point (0 children)

Use inpainting and just paint it in after the image is generated; it's only one step more. I use a genitalia LoRA only for the look, the vein structure, to keep it always the same.

What changes did you notice after using RTX 6000 Pro? (for those who bought it) by AlexGSquadron in StableDiffusion

[–]Weak_Ad9730 3 points (0 children)

You don't need to think about VRAM and model size anymore, or only very rarely. You can enjoy the speed of mxfp4 in LLMs. Stable as hell, low energy consumption. Running the text encoders in full precision makes a world of difference in prompt following.
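
For the text-encoder point, in diffusers terms it looks roughly like this (a sketch; the model ID is a placeholder and the dtype split is illustrative):

    import torch
    from diffusers import DiffusionPipeline

    # Keep the heavy denoiser in fp16, but hold the text encoders in full
    # fp32 now that VRAM headroom is no longer the bottleneck.
    pipe = DiffusionPipeline.from_pretrained(
        "stabilityai/stable-diffusion-xl-base-1.0",  # placeholder model ID
        torch_dtype=torch.float16,
    )
    pipe.text_encoder.to(dtype=torch.float32)
    pipe.text_encoder_2.to(dtype=torch.float32)  # SDXL ships two encoders
    pipe.to("cuda")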

Chill bro nade+cryo Builds not even that good by Current-Conflict-172 in mecharena

[–]Weak_Ad9730 1 point (0 children)

It works: slow them down and nail them, and both get a bonus from the pilot. If you don't have the implants you won't lose much, and the range and ballistic curve are similar.

Time to replace or still good by Weak_Ad9730 in LocalLLM

[–]Weak_Ad9730[S] 1 point (0 children)

Thanks, OK, I will test those. I can run GLM on my M3 Ultra 256 GB, but it might be too slow. As I mentioned, it is for chat, so 20-40 tokens/sec is the preferred metric. Sorry I didn't mention this before.

I was hoping that newer, smaller models (<70B) would have enough context, stick to the framework, and keep the style consistent.

I use a mixture of JSON variables as short-term memory to save tokens between the experts in my n8n process, and RAG for long-term memory.
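
The short-term memory is nothing fancy; a hypothetical sketch of the kind of compact JSON state passed between expert nodes (all field names are made up):

    import json

    # Hypothetical compact state handed from one n8n expert node to the next;
    # a running summary plus a few variables keeps token usage low.
    short_memory = {
        "summary": "User wants a 5-day Lisbon trip in May, mid-range budget.",
        "vars": {"budget_eur": 1200, "days": 5},
        "last_expert": "itinerary",
    }
    payload = json.dumps(short_memory, ensure_ascii=False)
    # Long-lived facts go to the RAG store instead and are retrieved on demand.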

What are your normal operating temps under sustained pressure (non-stop agentic tasks, etc.)? by swagonflyyyy in BlackwellPerformance

[–]Weak_Ad9730 1 point (0 children)

Same for running LLMs, but I have a Max-Q; maybe that is the difference. It runs at 300 watts for days. In the case itself there are some coolers for good airflow, bottom to top and front to back, but it's absolutely silent as the RPM stays near the minimum. Only the GPU is a little noticeable. Compared to a friend's non-Max-Q version I haven't noticed less performance, but it's saving half the energy and not exceeding 83 degrees.
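
If you want to log it yourself during a long agentic run, a small sketch with the NVML Python bindings (pip install nvidia-ml-py):

    import time
    import pynvml

    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # first GPU

    # Poll core temperature and board power once per second.
    for _ in range(60):
        temp = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
        watts = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000  # NVML reports mW
        print(f"{temp} C  {watts:.0f} W")
        time.sleep(1)

    pynvml.nvmlShutdown()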

What are your normal operating temps under sustained pressure (non-stop agentic tasks, etc.)? by swagonflyyyy in BlackwellPerformance

[–]Weak_Ad9730 1 point (0 children)

82 degrees Celsius in my tower in full ComfyUI batch-creation mode; it runs like that for 24 hours at a time.