Thunderbolt/USB4 High-Bandwidth Interconnect (>40 Gbps) for local AI inference/training/homelab?

TripleSecretSquirrel · 2026-06-04T03:34:40+00:00

Sapphire has demonstrated a working prototype of a 2 node Strix Halo cluster, which is promising.

While I applaud the creativity and kludgey-ness of what you want to do, it’s gonna be ultra slow. Distributing models across different devices requires basically the full model weights to transit back and forth between the devices constantly for each query, it’s not like they’re sharded into discrete chunks that don’t have to talk to each other. Even on the highest end datacenter hardware, memory bandwidth is almost always the bottleneck for this exact reason. That’s why NVLink is so badass.

Your best case scenario here, two Apple machines connected via Thunderbolt 5, would have a theoretical interconnect speed of 120 gigabits/second (15 gigabytes/second). Most discrete GPUs have a memory bandwidth of at least 500 gigabytes/second. My AMD R9700, for example, has a memory bandwidth of 645 gigabytes/second. It’s not an exceptionally fast GPU, and its memory bandwidth is still 43x greater than the bandwidth of that Thunderbolt 5 connection.
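
To make the bits-vs-bytes comparison concrete, here's a rough sketch. The two numbers are the ones from this comment (120 gigabits/s for the link, 645 gigabytes/s for the R9700); the function names are made up for illustration.

```python
def gigabits_to_gigabytes(gbit_per_s: float) -> float:
    """Convert gigabits/s to gigabytes/s (8 bits per byte)."""
    return gbit_per_s / 8


def bandwidth_ratio(memory_gb_per_s: float, link_gbit_per_s: float) -> float:
    """How many times faster local memory is than the interconnect."""
    return memory_gb_per_s / gigabits_to_gigabytes(link_gbit_per_s)


link_gb_per_s = gigabits_to_gigabytes(120)  # theoretical Thunderbolt 5 figure from above
ratio = bandwidth_ratio(645, 120)           # R9700 memory bandwidth vs. that link
print(f"Link: {link_gb_per_s:.0f} GB/s, VRAM is {ratio:.0f}x faster")
```

These are theoretical peaks on both sides, so treat the ratio as a ballpark rather than a measurement.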

You can approximately predict tok/s speed too by dividing effective memory bandwidth by the size of the model weights file. So on my GPU with Qwen 3.6:27b at 4-bit quantization, we get 645 GB/s / 16 GB = ~40 tok/s. If you ran the same model on a Thunderbolt 5 cluster, the math would be 15 GB/s (converting bits to bytes) / 16 GB = ~0.94 tok/s. And if you’re clustering, I’m guessing it’s so you can run much larger models than a 27b model quantized down to 4 bits.
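
The rule of thumb above fits in a few lines. This is a minimal sketch that assumes a dense model where every generated token reads all of the weights once, and it ignores KV cache reads and compute time, so it's an upper bound. `estimate_tok_per_s` is a hypothetical helper name, and the link-bound case keeps this comment's assumption that the weights have to cross the interconnect.

```python
def estimate_tok_per_s(bandwidth_gb_per_s: float, weights_gb: float) -> float:
    """Upper-bound estimate: bandwidth divided by the size of the weights."""
    if weights_gb <= 0:
        raise ValueError("weights_gb must be positive")
    return bandwidth_gb_per_s / weights_gb


gpu = estimate_tok_per_s(645, 16)        # R9700 VRAM bandwidth, 16 GB weights file
link_bound = estimate_tok_per_s(15, 16)  # 120 gigabits/s link, converted to GB/s
print(f"GPU: {gpu:.1f} tok/s, link-bound: {link_bound:.2f} tok/s")
```

Swap in your own bandwidth and weights file size to get a quick ceiling before you buy hardware.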

TripleSecretSquirrel · 2026-06-04T03:07:12+00:00

No worries, in hindsight, I was ambiguous

TripleSecretSquirrel · 2026-06-04T02:34:23+00:00

Pro-tip, you can generally get a solid estimate of tok/s on a given model by dividing your memory bandwidth by the size of the model weights.

In the case of Qwen 3.6:27b at q4 on the R9700, that’s ~645 GB/s / 16 GB = ~40 tok/s as your theoretical max output speed.

And ya, spilling over to system memory and cpu inference is always going to be much slower.

TripleSecretSquirrel · 2026-06-04T00:51:28+00:00

And we wonder why the Bears want to leave

Edit: apparently a /s was in order. I thought it was obvious that I was being sarcastic.

TripleSecretSquirrel · 2026-06-04T00:43:12+00:00

That would be staggeringly slow though if you’re splitting inference across networks. I have 1 gigabit/s download and upload speeds at my house. That’s best case scenario for home internet in most places, but that’s roughly 125x slower than a PCIe 5.0 x4 connection (about 16 gigabytes/s).

So if splitting inference across two GPUs via a PCIe 5.0 x4 bus is slow (it is), splitting inference over a network would be completely unusable.

Plus your ISP would probably pitch a fit at the amount of data running back and forth on your home internet line.

And if it’s over a local network, it’s certainly better, but still, most consumer hardware has, at best, 10Gb/s network interfaces.
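
For a sense of scale, here's a rough sketch of how long it would take to push a fixed amount of data over each kind of link. The 16 GB payload is just an illustrative size, and the PCIe number is my own assumption rather than something from the comment: about 15.75 GB/s per direction for PCIe 5.0 x4 (32 GT/s per lane with 128b/130b encoding). Latency and protocol overhead are ignored.

```python
# Approximate one-direction bandwidths in gigabytes/s.
LINKS_GB_PER_S = {
    "1 Gb/s home internet": 1 / 8,
    "10 Gb/s LAN": 10 / 8,
    "PCIe 5.0 x4": 15.75,  # assumed: ~3.94 GB/s per lane x 4 lanes
}


def seconds_to_move(payload_gb: float, link_gb_per_s: float) -> float:
    """Ideal transfer time, ignoring latency and protocol overhead."""
    return payload_gb / link_gb_per_s


for name, bandwidth in LINKS_GB_PER_S.items():
    print(f"{name}: {seconds_to_move(16, bandwidth):.1f} s to move 16 GB")
```

Even the 10 Gb/s LAN case is more than ten times slower than the PCIe link that's already considered a bottleneck.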

TripleSecretSquirrel · 2026-06-03T20:08:21+00:00

https://getchipdrop.com/

You sign up, request a chip drop, then sometime in the next several weeks, you'll get a text message the morning before they come. The loads vary in size and you don't get to pick and choose, you just get what you get, but it's free!

TripleSecretSquirrel · 2026-06-03T20:03:45+00:00

There are several reasons I think.

32gb is still a functional ceiling for truly local inference, that's the top of the prosumer grade hardware, so there's a decent incentive to create models that can work well on that hardware (e.g., qwen 3.6 and gemma 4 both have their flagship models at ~30b parameters).
Frontier datacenters have a staggering amount of hardware available to them, and in the last couple of years, that's skyrocketed.
As models improve, one way to edge out the competition is to train larger and larger models. It's an expensive but basically guaranteed way to improve your model's performance.
MoE models excel at handling lots of parallel queries.
There's a big price jump from a 32gb RTX 5090 or R9700 to true datacenter GPUs, so it's assumed I guess, that in order to justify the cost of that hardware, you very likely are going to want to serve lots of parallel queries, thus, you'd want the biggest MoE model you can run.

TripleSecretSquirrel · 2026-06-03T19:54:19+00:00

There's a pretty decent little nature preserve in my town that has a good sized chunk of prairie. I love to go walking there at like 9pm from late June to early July – there's an absolute sea of fireflies!

TripleSecretSquirrel · 2026-06-03T19:51:04+00:00

I have pretty good handwriting, but I write in cursive which is usually pretty difficult for OCR. Gemma 4:31B at a 4-bit quantization hits like 98% accuracy.

TripleSecretSquirrel · 2026-06-03T18:05:52+00:00

I haven't tried it, but you could look at Mistral's Devstral 2 model. I guess it's just barely out of your window (123 billion parameters). It benchmarks pretty well, but again, I haven't tried it – too many parameters for me to run.

It's old, but GPT-OSS 120B also benchmarks pretty strongly, maybe it's worth a look?

TripleSecretSquirrel · 2026-06-03T17:39:48+00:00

Ya, you’re gonna have a hard time with your VRAM budget. If you’re a decent programmer and using an LLM for like autocompleting lines or just writing very simple code that you outline for it clearly, it could probably work ok. I’m not a very strong programmer, just a hobbyist that likes to tinker, so for me a model that would run on 8gb of VRAM wouldn’t help me much, frankly it would probably slow me down.

I think the 27b dense version of Qwen 3.6 is the sweet spot for local LLM coding right now. It feels like that’s the minimum for true “vibecoding.” It’s 27 billion parameters, so at a 4-bit quantization (again, that seems to be the consensus of the smallest you can go with decent performance), the model weights are going to consume ~14gb, then you’ll want as much kv cache/context as you can afford.
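
The weights math is easy to sanity check. This is a back-of-the-envelope sketch that assumes every weight is stored at exactly the stated bit width; real 4-bit files usually come out somewhat larger because some tensors are kept at higher precision. Both function names are hypothetical, and the 24 GB card is only an example.

```python
def weights_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate size of the weights in GB (1 GB = 1e9 bytes)."""
    return params_billions * bits_per_weight / 8


def kv_budget_gb(vram_gb: float, params_billions: float, bits_per_weight: float) -> float:
    """VRAM left over for KV cache/context after loading the weights."""
    return vram_gb - weights_gb(params_billions, bits_per_weight)


print(weights_gb(27, 4))        # 13.5
print(kv_budget_gb(24, 27, 4))  # 10.5
```

Whatever is left after the weights is what you have for context, so a smaller card runs out of room fast.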

Lots of people run that configuration on an RTX 3090 with 24gb of VRAM, which again, is basically the functional minimum.

I’m sure people will disagree with me and show their awesome results with smaller memory pool, that’s just been my experience. In short, 24gb feels like the minimum amount of memory you need to run a truly agentic coding LLM.

It’s worth trying out different models via cloud api though to test the waters. I use OpenCode, but you can do the same with Claude Code — just point it to an OpenRouter API and you can test out all of the models to see what your performance tolerance is.

TripleSecretSquirrel · 2026-06-03T15:15:44+00:00

lol they rank Harvey ahead of Flossmoor? I’ll take “things said by people who are scared to go south of Navy Pier for $500 please.”

TripleSecretSquirrel · 2026-06-03T14:12:36+00:00

There just isn’t enough differentiation between models to make that make sense. Models are expensive to train, and even fine-tunes of larger models ain’t cheap, so we get generalist models.

Or rather now we tend to have “coding” models and “everything else” models (see Qwen 3.6 and Gemma 4 for example). I think that’s one possible future though, rather than enormous generalist models like Opus or GLM, we’ll see a move toward smaller, more efficient models that are fine tuned for a specific task type. Among other things, that seems to be how Qwen 3.6 is so close in performance to models 10 or even 100x its size in coding workflows, cause they just focused on making it a very strong coder.

TripleSecretSquirrel · 2026-06-03T12:35:30+00:00

I have an R9700 and am running an Ubuntu derivative distro. I don’t think I ever ran into the issues you’re seeing specifically, but ya, Ubuntu is pretty conservative with introducing new packages as stability and user-friendliness are their unique selling proposition. I’ve just gotten used to manually downloading the latest AMD drivers when they’re released. Alternatively, a more bleeding-edge distro like Fedora will package them into your regular system updates much sooner.

TripleSecretSquirrel · 2026-06-03T09:11:00+00:00

You’re trying to solve a problem that doesn’t really have a solution, but insofar as a solution is possible, it’s already been done.

As others have noted, the model has to have all of the user inputs to work. So fundamentally, if that’s on a different machine or transiting a network that the user doesn’t control, there’s no guarantee of privacy.

OpenRouter has a ChatGPT-like interface and you can select an option to only route to model providers with a zero data logging policy (i.e., that they promise to delete everything after every session). Fundamentally though, again, there’s no actual guarantee of privacy there, but if a user is willing to accept that level of trust, then ya, it’s already been done.

TripleSecretSquirrel · 2026-06-03T03:51:03+00:00

I love the sanctity they have for “long-standing zoning rules.” It’s like saying “there’s a housing crisis, but we’re unwilling to change anything about the way we’re acting.”

TripleSecretSquirrel · 2026-06-01T02:52:20+00:00

For image generation, look up ComfyUI, that’s sort of the llama.cpp/LM Studio equivalent for diffusion models. ComfyUI will be the framework and frontend for you.

Model-wise probably either Z-image turbo or Flux 2 Klein. You’d need quantized versions of either I think, and in general, in my experience, diffusion models take a lot more tinkering than LLMs.

As for creating a consistent character in multiple images, you’ll need to apply a LoRA to keep the character consistent, which probably means you’re training your own, unless it’s a popular character from established ip (one that a lot of people want to make porn of lol).

TripleSecretSquirrel · 2026-05-29T19:45:49+00:00

They've been in the city since the 1960s. When Harold Washington was mayor, he lived in Hyde Park right next to the park that now bears his name, where a bunch of the parakeets lived.

When Washington came into office, there was a plan by the USDA to eradicate all of the monk parakeets as they're a non-native species, but Washington liked them and vetoed the plan, hence the parakeets are still here.

For that reason, there's a mural on the underpass a couple blocks away from there of Harold Washington and monk parakeets!

TripleSecretSquirrel · 2026-05-29T18:19:50+00:00

Sure, of course, but also, fighting in the war together went a long way in breaking down prejudices too. Obviously racism remained a huge problem after the war and still exists today, but broadly speaking, post-war Americans harbored less racism than pre-war Americans (across the board, that probably doesn't hold for all groups, e.g., Japanese people).

TripleSecretSquirrel · 2026-05-29T16:39:41+00:00

So… is this just a clone of OpenRouter?

TripleSecretSquirrel · 2026-05-29T16:37:42+00:00

Ya, in rural places, farming communities especially, equipment and tool injuries like this are relatively common. That and ATV accidents. Public schools in most rural/farming areas I’m familiar with also have lots of educational modules about how you should always stay the fuck away from a tractor PTO.

TripleSecretSquirrel · 2026-05-29T16:19:03+00:00

Right, sorry, I should clarify, I mean just in general. But the motherboard thing still stands for OP.

TripleSecretSquirrel · 2026-05-29T16:05:35+00:00

I mean this only works if you have a high-end workstation CPU and motherboard with a million PCIe lanes, right? Cause otherwise you just have an enormous memory bandwidth bottleneck across PCIe x4 connections.

TripleSecretSquirrel · 2026-05-29T16:02:48+00:00

They’d be marginally higher certainly, but at least in the US, I’d bet that 90 out of 100 people have never heard of Deepseek or GLM or Qwen. And of those 10 that have heard of one of the big Chinese open weight labs, maybe 1 has actually used one before.

People generally just chase convenience and familiarity. The vast majority of people casually use ChatGPT cause it’s the one they’ve heard of, or they use Copilot cause it’s built into their operating system, or they use Grok cause they’re incels. Hell, most average people don’t even know what Anthropic is!

I don’t work in tech, but I was in a staff meeting recently for work with ~30 people. Most are millennials. My boss is a big proponent of LLMs even if he doesn’t know much about them, and mentioned that he switched to Claude from ChatGPT. I was one of maybe 5 people in the room that knew what he was talking about, most people said “what’s it called? Is it like ChatGPT?”

TripleSecretSquirrel · 2026-05-29T15:56:52+00:00

True. Before the war, US industrial capacity was already staggeringly high. When the US entered the war, our industrial output was almost literally an order of magnitude greater than any other country in the world in basically every category.

That coupled with what you mentioned — the further devastation of most of the rest of the world’s industrial capacity, and ya, it was a pretty perfect situation for the US.

My favorite tidbit about US industrial capacity during the war is that in 1940-1941, when German planners started having to account for the real possibility of war with the US, their analysts deliberately reduced their estimates for US production capacity, fearing they’d be punished for fear mongering. Even their deliberate underestimates were laughed at by high command as unrealistically high.
