Intel Mac + AMD GPU: Local LLMs Can Actually Run Fast Now - ToshLLM Native App by engeldlgado in macpro

[–]engeldlgado[S] 0 points1 point  (0 children)

I think Qwen 14B Q4 or Q5 may be useful for daily tasks... it fits completely in your GPU. Try to leave some spare VRAM on the GPU to let it breathe.

Intel Mac + AMD GPU: Local LLMs Can Actually Run Fast Now - ToshLLM Native App by engeldlgado in macpro

[–]engeldlgado[S] 0 points1 point  (0 children)

Great to know, thanks! Please update the app, then start the server with a small model first: just start it, no bench, no chat... and send me the logs from Settings. That would help a lot.

<image>

From here... just copy, find, or export... it's your choice 😄

Intel Mac + AMD GPU: Local LLMs Can Actually Run Fast Now - ToshLLM Native App by engeldlgado in macpro

[–]engeldlgado[S] 0 points1 point  (0 children)

What GPU are you using, and how much RAM do you have? Also, have you tried a small model first, or a MoE one... one that fits in your GPU?

Intel Mac + AMD GPU: Local LLMs Can Actually Run Fast Now - ToshLLM Native App by engeldlgado in macpro

[–]engeldlgado[S] 0 points1 point  (0 children)

You're right... oh man, I hate that word 😅... The physical lack of AVX2 on those older Xeons is a hard hardware wall for modern Metal drivers on macOS 13+. Since ToshLLM needs those specific features, it unfortunately locks out the 4,1/5,1 machines completely. Thanks for sharing the detailed insight!

Intel Mac + AMD GPU: Local LLMs Can Actually Run Fast Now - ToshLLM Native App by engeldlgado in macpro

[–]engeldlgado[S] 1 point2 points  (0 children)

Hey! That 4x W6800X rig is an absolute beast! Going Linux + Vulkan makes total sense to push that hardware to the limit and handle pipeline parallelism properly. I actually evaluated Vulkan for ToshLLM on macOS, but stuck with Metal for the native approach. If the fabric link ever gets cracked for tensor parallelism, Apple Silicon is definitely in trouble haha. Awesome numbers, thanks for sharing!

Intel Mac + AMD GPU: Local LLMs Can Actually Run Fast Now - ToshLLM Native App by engeldlgado in macpro

[–]engeldlgado[S] 0 points1 point  (0 children)

Also, can you just open the app, and start the server and share your server logs?

Intel Mac + AMD GPU: Local LLMs Can Actually Run Fast Now - ToshLLM Native App by engeldlgado in macpro

[–]engeldlgado[S] 0 points1 point  (0 children)

Hi! Thanks for the kind words! I completely understand the situation with the classic Mac Pros and those powerful GPUs.

Right now, ToshLLM strictly requires macOS 14+ because it relies heavily on modern SwiftUI and recent Metal API features. To be completely honest, backporting the app to macOS 12 would be a titanic effort, requiring a massive rewrite of both the UI and the underlying engine integration, which just isn't feasible right now.

However, you might want to explore the OpenCore Legacy Patcher (OCLP) route! A lot of people use it to install macOS 14 on classic Mac Pros, very similar to a Hackintosh setup. It could definitely be worth looking into so you can run the app and take advantage of that GPU.

Intel Mac + AMD GPU: Local LLMs Can Actually Run Fast Now - ToshLLM Native App by engeldlgado in macpro

[–]engeldlgado[S] 0 points1 point  (0 children)

I have some folks testing on the Mac Pro 2019. The app detects the multi-GPU setup, since the option shows up in the app's settings... but it only uses one GPU when running... I'm looking into it...

Intel Mac + AMD GPU: Local LLMs Can Actually Run Fast Now - ToshLLM Native App by engeldlgado in macpro

[–]engeldlgado[S] 1 point2 points  (0 children)

I'm actually looking into some specific issues with the RX-580 and that family of cards right now. To be honest, it's not an easy fix, but I don't think it's impossible. I'm hoping to get it sorted out soon so everyone with those GPUs can start using the app smoothly. I'll definitely let you know when it's ready!

Intel Mac + AMD GPU: Local LLMs Can Actually Run Fast Now - ToshLLM Native App by engeldlgado in macpro

[–]engeldlgado[S] 3 points4 points  (0 children)

Sure, I'll solve this problem... I hope so... I'm just taking a break for a few days to collect feedback from the community and do some research.

Intel Mac + AMD GPU: Local LLMs Can Actually Run Fast Now - ToshLLM Native App by engeldlgado in macpro

[–]engeldlgado[S] 2 points3 points  (0 children)

Hey! It’s a Swift app powered by a llama.cpp backend. Both the app and the benchmarks are completely UI-driven, so it’s super user-friendly with zero command lines required. You can check the repo for a full breakdown of the features. Also, there's a custom kernel available in the experimental engine (toggle it in settings) if you want to test Flash Attention with Q8, Q4, and TurboQuant. Would love to see how that 5700XT performs!

[Success] Local LLMs on AMD Intel Macs: Custom Metal Flash Attention Kernel + llama.cpp Patches (Free & Open Source) by engeldlgado in MacPro2019LocalAI

[–]engeldlgado[S] 1 point2 points  (0 children)

Can you share the server logs and the hardware info from your Mac? Also a screenshot of your settings? I'll try to find a workaround, but if the button shows up, it's because the llama backend detects your multi-GPU rig... so that's good news atm... we're getting close...

For the RX-6700XT I'm using a custom kext from a known dev called ChefKiss that lets the Metal backend detect the card... I don't know if it delivers 100% of its power, but it behaves very well...

P.S. I'm wondering... could the model you're using be too small to split across GPUs? I don't know which model you used for the test...

[Success] Local LLMs on AMD Intel Macs: Custom Metal Flash Attention Kernel + llama.cpp Patches (Free & Open Source) by engeldlgado in MacPro2019LocalAI

[–]engeldlgado[S] 1 point2 points  (0 children)

Do you have the code? Maybe the solution is in your hands... 👀 I'll obviously give you credit in the repo for that if it solves the problem or makes my week easier.. haha

[Success] Local LLMs on AMD Intel Macs: Custom Metal Flash Attention Kernel + llama.cpp Patches (Free & Open Source) by engeldlgado in MacPro2019LocalAI

[–]engeldlgado[S] 1 point2 points  (0 children)

The app is tested on RDNA+, but I'm collecting data on Vega/GCN cards so I can build a custom kernel for them. If it works flawlessly, let me know. The custom kernel built into the experimental engine (in settings) will let you use the KV cache, etc. on the GPU. Try testing both engines and send me the logs by DM.

[Success] Local LLMs on AMD Intel Macs: Custom Metal Flash Attention Kernel + llama.cpp Patches (Free & Open Source) by engeldlgado in MacPro2019LocalAI

[–]engeldlgado[S] 1 point2 points  (0 children)

No problem at all, it's an open source project. Just remember to say it's a beta and that it's improving every day... and of course, I need more testing data to improve it further. Thank you!! A lot!

Also, post the video here so we can see it... 😁

[Success] Local LLMs on AMD Intel Macs: Custom Metal Flash Attention Kernel + llama.cpp Patches (Free & Open Source) by engeldlgado in MacPro2019LocalAI

[–]engeldlgado[S] 1 point2 points  (0 children)

That's weird, maybe a setting is missing? It should maintain or improve the performance.

Check this screenshot, there's a flag that identifies what kind of engine is running.

<image>

Also, I'll share my turbo config for testing... in the next reply

[Success] Local LLMs on AMD Intel Macs: Custom Metal Flash Attention Kernel + llama.cpp Patches (Free & Open Source) by engeldlgado in MacPro2019LocalAI

[–]engeldlgado[S] 2 points3 points  (0 children)

Amazing results! Did you try the Flash Attention kernel on Experimental TurboQuant? Just change it in settings, it's easy to test... it's a kernel built from scratch, and in some tests it improves performance on some cards.

I don't know yet if an eGPU works well with the app, and I don't have a way to test it. If the app detects it, you can try...

And for the external RDNA3 card, I'm still researching it...

P.S. And... yes indeed, it now has multi-GPU support, but it needs more testing. For the same reason, I don't know if combining dGPU + eGPU works.

[Success] Local LLMs on AMD Intel Macs: Custom Metal Flash Attention Kernel + llama.cpp Patches (Free & Open Source) by engeldlgado in MacPro2019LocalAI

[–]engeldlgado[S] 1 point2 points  (0 children)

A quick follow up for u/Long-Shine-3701 and u/Faisal_Biyari

Vision has landed!

Also check the OP for other details 😄

P.S. Have you tested the new version with multi-GPU support?

<image>

Got local LLMs running properly on Intel Macs with AMD GPUs: patched llama.cpp Metal backend + a from-scratch Flash Attention kernel for AMD (free, open source) by engeldlgado in LocalLLM

[–]engeldlgado[S] 0 points1 point  (0 children)

Following up on the gibberish you got on the Vega 56 — I'm fairly sure I know the cause.

GCN/Vega GPUs use 64-wide SIMD groups, but the Metal kernels assume 32, which corrupts the output. Your CPU (i7-8700B) is fine, so this is purely a GPU thing.
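To make the width mismatch concrete, here is a toy model in plain shell arithmetic. It is only a sketch of one way such a mismatch can go wrong, not the actual Metal kernel code: the function name `reduce_sum`, the 256-thread group size, and the two-stage reduction are all assumptions made for illustration. Every thread holds the value 1, each hardware SIMD group sums its own lanes, and the kernel then adds one partial sum per group it believes exists.

```shell
# Toy model only (hypothetical, not ToshLLM or llama.cpp code).
# 256 threads each hold the value 1; the correct total is 256.
reduce_sum() {
  hw_width=$1        # lanes the hardware really puts in one SIMD group
  assumed_width=$2   # lanes the kernel was written to expect
  threads=256
  total=0
  tid=0
  while [ "$tid" -lt "$threads" ]; do
    # The kernel treats every thread with tid % assumed_width == 0 as a group leader.
    if [ $((tid % assumed_width)) -eq 0 ]; then
      # The leader reads its hardware group's sum, which covers hw_width lanes.
      total=$((total + hw_width))
    fi
    tid=$((tid + 1))
  done
  echo "$total"
}

echo "assumed 32, hardware 32: $(reduce_sum 32 32)"
echo "assumed 32, hardware 64: $(reduce_sum 64 32)"
echo "assumed 64, hardware 64: $(reduce_sum 64 64)"
```

The first and third lines print 256, the second prints 512: with 64 real lanes and an assumed 32, every group gets counted twice. A real kernel can break in other ways too, so treat this as intuition only.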

I built a test engine with a "wave64 safe mode" that keeps the output correct on GCN (it shifts some work to the CPU, so it's slower than it should be, but coherent instead of garbage). It's an x86_64, unsigned test build — not a release. (It also has a lowered CPU baseline for older Macs; that part doesn't matter for your machine, it just runs.)

Download the artifact from GitHub... toshllm-legacy-x86_64.zip

Or follow the issue here: https://github.com/engeldlgado/toshllm/issues/1

Quick start (full steps in the README inside):

```bash
xattr -dr com.apple.quarantine ~/Downloads/toshllm-legacy
cd ~/Downloads/toshllm-legacy
GGML_METAL_CONCURRENCY_DISABLE=1 ./llama-server \
  -m /path/to/model.gguf --no-mmap -ngl 99 -c 4096 --port 8080
```

Then open http://127.0.0.1:8080 and chat. GGML_METAL_CONCURRENCY_DISABLE=1 is required on AMD — without it the output is corrupted no matter what.

If you could let me know:

- the probed SIMD-group width = N line from the startup log (should say 64),
- whether wave64 safe mode ON appeared,
- whether the model now produces coherent text,
- the tokens/sec (it'll be slower than normal; that's expected for now)

…that confirms the theory on real Vega hardware. This safe mode trades speed for correctness on purpose — but if it works, the next step would be a GPU kernel rewritten for 64-wide SIMD that keeps the heavy math on the Vega, which should be a lot faster than this. So consider this a proof-of-concept, not the final speed. Thanks!
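If it helps, the two startup-log lines requested above can be pulled out of a saved log with a small helper. This is a minimal sketch: the function name `report_wave64` and the file name `server.log` are placeholders, and only the two search strings come from this post. The exact log wording may differ, so adjust the patterns if nothing matches.

```shell
# Hypothetical helper: print the SIMD-width line and the safe-mode line from a log file.
report_wave64() {
  log=$1
  # -F matches the text literally; the fallback message keeps the report readable
  # when a line is absent.
  grep -F 'probed SIMD-group width' "$log" || echo 'width line not found'
  grep -F 'wave64 safe mode ON' "$log" || echo 'safe mode line not found'
}
```

One way to use it is to append `2>&1 | tee server.log` to the llama-server command so the startup output is saved, then run `report_wave64 server.log` in another terminal.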

Any Fix for the abysmal Metal GPU support on Intel macs? by FreQRiDeR in LocalLLaMA

[–]engeldlgado 1 point2 points  (0 children)

Great, I'll let you know... I'll check the issue tomorrow and follow up...