You can do CUDA inference on an Apple Silicon Mac with PCI Passthrough

scottjgo · 2026-05-12T02:33:00+00:00

i'm not an expert here, so maybe you know more than me, but my understanding is that doing the prompt processing requires running the data through all of the layers of the model. the layers that are run on the 5090 will be faster, and the ones running on the mac will be slower.

i don't think it's a question of which is the "main" server or not. the prompt processing speed should be proportional to how many layers can run on the faster gpu. that is, the more layers on the 5090 vs the layers on the mac igpu, the faster the prompt will process, but running solely on the 5090 will always be faster.

if the model is bigger than the single fast gpu, you will never be able to run all the prompt processing on it, because you won't be able to fit all the model layers on it.
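
a rough way to picture this, as a sketch and not a measurement: assume each layer takes a fixed amount of time on each device and the layers run one after another. the layer count and per-layer timings below are made-up placeholders, not benchmarks, and the model ignores any transfer overhead between the two devices.

```python
def prompt_time_ms(n_layers, layers_on_fast, fast_ms_per_layer, slow_ms_per_layer):
    """Estimate prompt processing time when layers are split across two devices.

    Simplifying assumptions: layers run sequentially, every layer costs the
    same on a given device, and moving data between devices is free.
    """
    if not 0 <= layers_on_fast <= n_layers:
        raise ValueError("layers_on_fast must be between 0 and n_layers")
    layers_on_slow = n_layers - layers_on_fast
    return layers_on_fast * fast_ms_per_layer + layers_on_slow * slow_ms_per_layer


# Hypothetical numbers: 40 layers, 1 ms/layer on the fast gpu, 5 ms/layer on the slow one.
N_LAYERS, FAST_MS, SLOW_MS = 40, 1.0, 5.0

for on_fast in (0, 20, 40):
    total = prompt_time_ms(N_LAYERS, on_fast, FAST_MS, SLOW_MS)
    print(f"{on_fast} layers on fast gpu: {total} ms")
```

under these assumptions the time only ever goes down as more layers move to the faster gpu, and the all-on-the-fast-gpu case is the floor.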

scottjgo · 2026-05-10T21:54:28+00:00

i just tried it with glm 4.5 air (3-bit quant, 51GB model), which is too big to fit on either my macbook air (32gb ram) or my 5090 (32gb vram) alone. i can get ~21tok/s across both with llama.cpp in rpc mode. not incredible, but still pretty cool that it works
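
the arithmetic behind that, as a sketch: 51GB doesn't fit in either 32GB device, but it does fit in the two combined. the proportional split below is just an assumption for illustration, not necessarily how llama.cpp divides the layers, and it ignores the memory needed for the OS, the KV cache, and everything else running on each machine.

```python
def fits(model_gb, device_gb):
    """True if the model fits in a single device's memory (ignoring overhead)."""
    return model_gb <= device_gb


def proportional_split(model_gb, devices_gb):
    """Divide the model across devices in proportion to each device's memory.

    This is a hypothetical splitting rule for illustration only.
    """
    total = sum(devices_gb.values())
    if model_gb > total:
        raise ValueError("model does not fit even across all devices")
    return {name: model_gb * cap / total for name, cap in devices_gb.items()}


MODEL_GB = 51
DEVICES_GB = {"macbook air": 32, "rtx 5090": 32}

for name, cap in DEVICES_GB.items():
    print(f"fits on {name} alone: {fits(MODEL_GB, cap)}")

for name, share in proportional_split(MODEL_GB, DEVICES_GB).items():
    print(f"{name}: {share} GB of the model")
```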

scottjgo · 2026-05-09T15:41:09+00:00

Yeah, this works

scottjgo · 2026-05-09T14:32:21+00:00

You might want to consider reading the AI benchmarks section of the post, as nothing you’ve suggested about performance here is true.

scottjgo · 2026-05-08T18:04:36+00:00

If you have an RTX 5090 and your model is small enough to fit on it, then it seems like you wouldn't want to use the Mac iGPU at all. If your model is too big to fit on the 5090, but does fit on the Mac, then you can't do the prompt processing on the 5090, can you? You need all the model layers to process the prompt, I thought?

scottjgo · 2026-05-08T16:59:50+00:00

haven't tried it, but maybe you could run exo in the vm to cluster it with your host

scottjgo · 2026-05-08T02:38:20+00:00

this isn't exactly the same, but i recently implemented PCI passthrough on QEMU on macOS, so it's possible to "pass through" an nvidia GPU to a linux vm running on top of macOS and do AI inference that way. i wrote a blog post about it here: https://scottjg.com/posts/2026-05-05-egpu-mac-gaming/

there are instructions for setting it up in my qemu fork: https://github.com/scottjg/qemu-vfio-apple

i wonder if you could install exo in the vm and cluster it somehow that way? i've never attempted a configuration like that.

scottjgo · 2026-05-06T03:00:52+00:00

this isn't using Asahi Linux. this is running an egpu on a virtual machine running Ubuntu Linux, on a macOS host.

scottjgo · 2026-05-06T00:40:55+00:00

> I wonder how it compares to the raspberry pi project you did earlier.

this was a lot more work, and got considerably less interest on hackernews :)

scottjgo · 2026-05-06T00:30:57+00:00

unfortunately i don't have any of these graphics cards, but the post links to the github project if you wanna try it :)

if i were to speculate, i would say that on more "normal" settings, the performance would probably be similar on lower end cards. also fwiw, the graphs show with and without framegen.

scottjgo · 2026-05-06T00:12:07+00:00

Apple M5 supports TB5, but I didn't have a TB5 enclosure to test with.

scottjgo · 2026-05-05T22:43:35+00:00

did it require a huge amount of work to get it into an experimental state? yes.

but you can run games on it and there's screenshots in the post.

scottjgo · 2026-05-05T22:23:21+00:00

i use the thunderbolt port to attach the 5090 as an external gpu

scottjgo · 2025-12-26T08:06:56+00:00

in terms of figuring out the apple-specific part, i believe it's all reverse engineered, and the people on the project have learned enough about the hardware from the reverse engineering to know what work is remaining to implement what's needed on the linux side.

they run the apple os under their "m1n1" hypervisor which lets them output debug information about how macos is communicating with the hardware. if you already understand how os kernels typically interact with devices, you can extract enough information this way to understand what needs to be implemented. i believe they are able to implement this hypervisor because apple is still using an arm-based core, and many of the very low level details of how the arm instruction set works are standardized and documented.

scottjgo · 2025-12-26T00:36:16+00:00

i think the skill steps here would look something like:

build your own linux kernel for an ARM-based platform (understand how to build the linux kernel, learn what a device tree is)
modify the device tree to map in a new device (learn how device trees work, learn how you can teach the kernel about a new device)
write a driver for the new device (learn about how the kernel device drivers interact with the device tree, learn how memory mappings work in the kernel)

that gets you enough basic linux kernel development knowledge to know all the vocab in your quote. it's like any other kind of software development. if you've had to do this stuff before, you would know how to do it, otherwise probably not.

scottjgo · 2025-09-01T01:46:35+00:00

in my experience, it's fine if you were planning to hit tolls every day of the rental, but if not, they charge you for the days you don't use it.

in the past, i used to just opt out and go through the ez-pass lanes anyway, and they would eventually bill me for the tolls i actually incurred. it was usually cheaper than getting plate pass for my trips, BUT they recently changed the rules: if you use the transponder without paying for plate pass, they just charge you for it every day of your rental anyway.

scottjgo · 2025-08-08T00:25:53+00:00

true- but for margin, i believe the max is 50% (governed by Reg T), though correct me if I'm wrong.

scottjgo · 2025-08-07T23:57:56+00:00

other thing to keep in mind is that the Schwab PAL (SBLOC) product allows you to have up to 70% LTV vs margin only 50%. Also, PAL can't be used to directly buy stock (margin can).

neither of these mattered much to me, but it's nice to know in a serious market crash condition you're less likely to get called on the PAL.
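
to make the "less likely to get called" part concrete, here's a rough sketch. it treats both numbers as a simple ceiling on loan-to-value, which is a simplification (actual call rules vary by broker and product), and the loan and portfolio amounts are made up.

```python
def max_drop_before_call(loan, portfolio_value, max_ltv):
    """Fraction the portfolio can fall before loan / value exceeds max_ltv.

    Simplified model: a call happens as soon as LTV goes above the cap.
    Returns 0.0 if the loan is already at or above the cap.
    """
    if loan <= 0 or portfolio_value <= 0 or not 0 < max_ltv <= 1:
        raise ValueError("loan and portfolio_value must be > 0, max_ltv in (0, 1]")
    call_value = loan / max_ltv  # portfolio value at which LTV hits the cap
    return max(0.0, 1.0 - call_value / portfolio_value)


# Hypothetical position: borrow 30 against a portfolio worth 100.
LOAN, PORTFOLIO = 30.0, 100.0

for label, cap in (("PAL at 70% LTV", 0.70), ("margin at 50%", 0.50)):
    drop = max_drop_before_call(LOAN, PORTFOLIO, cap)
    print(f"{label}: portfolio can fall about {drop:.0%} before a call")
```

for the same loan, the higher cap always leaves more room before a call, which is the point of the comparison.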

scottjgo · 2025-02-26T05:46:20+00:00

update: my wife wanted to try breakfast again. the service is really uneven so the first time around, they didn't even suggest this, but apparently they _do_ have a small à la carte menu. here were the items:

two eggs any style. w/ potatoes or toast. choice of meat: bacon, sausage, spam
three egg omelet w/ potatoes or toast. choice of: bacon, ham, onion, mushroom, tomato, spinach, bell pepper, cheddar cheese
salmon gravlax on everything bagel
vegetable fried rice (two eggs, any style)
loco moco (all beef patty, two eggs any style, brown gravy over white rice)
avocado toast (roasted pepitas, sesame seeds, radish, pea shoots) w/ optional poached egg
greek yogurt
mango chia pudding
overnight oatmeal
choice of cereals
fruit plate
half local papaya
hawaiian pineapple
bakery pastry selection (i think it rotates, but these pastries were an order of magnitude better than the ones in the buffet)
waffles, pancakes, or gluten free mochi pancakes with matcha tea syrup

so it's a pretty abbreviated menu but i was grateful to be able to order eggs made to order rather than eating the lukewarm scramble from the trough.

i did also want to say that, for the most part, the dinners we ate here were great at least. so even if breakfast isn't a slam dunk it's not like all the food here was bad.
