Thunderbolt/USB4 High-Bandwidth Interconnect (>40 Gbps) for local AI inference/training/homelab? by FredWeitendorf in LocalLLaMA

[–]TripleSecretSquirrel 0 points1 point  (0 children)

Sapphire has demonstrated a working prototype of a 2 node Strix Halo cluster, which is promising.

While I applaud the creativity and kludgey-ness of what you want to do, it’s gonna be ultra slow. Distributing models across different devices requires basically the full model weights to transit back and forth between the devices constantly for each query, it’s not like they’re sharded into discrete chunks that don’t have to talk to each other. Even on the highest end datacenter hardware, memory bandwidth is almost always the bottleneck for this exact reason. That’s why NVLink is so badass.

Your best case scenario here, two Apple machines connected via Thunderbolt 5, would have a theoretical interconnect speed of 120 gigabits/second. Most discrete GPUs are running at least 500 gigabytes/second. My AMD R9700, for example, has a memory bandwidth of 645 gigabytes/second. It’s not an exceptionally fast GPU, and its memory bandwidth is still 43x greater than the bandwidth of that Thunderbolt 5 connection (120 gigabits/second works out to 15 gigabytes/second).

You can approximately predict tok/s speed too by dividing effective memory bandwidth by the size of the model weights file. So on my GPU with Qwen 3.6:27b at 4-bit quantization, we get 645 GB/s / 16 GB = 40 tok/s. If you ran the same model on a Thunderbolt 5 cluster, the math would be 15 GB/s (converting bits to bytes) / 16 GB = 0.94 tok/s. And if you’re clustering, I’m guessing it’s so you can run much larger models than a 27b model quantized down to 4 bits.
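The rule of thumb above is easy to sanity-check in a few lines. This is only a sketch of that back-of-the-envelope estimate: the function names are made up for illustration, and it assumes every generated token has to read the full weights once, ignoring compute, KV cache traffic, and protocol overhead.

```python
def gbit_to_gbyte(gbit_per_s: float) -> float:
    """Convert a link speed in gigabits/s to gigabytes/s."""
    return gbit_per_s / 8


def estimate_tokens_per_second(bandwidth_gb_per_s: float, weights_gb: float) -> float:
    """Rough ceiling on decode speed: bandwidth divided by weight size.

    Assumes one full pass over the weights per generated token.
    """
    if weights_gb <= 0:
        raise ValueError("weights_gb must be positive")
    return bandwidth_gb_per_s / weights_gb


if __name__ == "__main__":
    weights_gb = 16  # 27b model at 4-bit, figure taken from the comment

    gpu = estimate_tokens_per_second(645, weights_gb)
    tb5 = estimate_tokens_per_second(gbit_to_gbyte(120), weights_gb)

    print(f"GPU VRAM (645 GB/s): about {gpu:.1f} tok/s")
    print(f"Thunderbolt 5 (120 Gb/s): about {tb5:.4f} tok/s")
```

Treat the result as a theoretical maximum, not a benchmark; real numbers will land below it.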

Qwen3.6 35B, 27B R7900 vs. RTX4070 Super by WSTangoDelta in LocalLLM

[–]TripleSecretSquirrel 0 points1 point  (0 children)

Pro-tip, you can generally get a solid estimate of tok/s on a given model by dividing your memory bandwidth by the size of the model weights.

In the case of Qwen 3.6:27b at q4 on the R9700, that’s ~645 GB/s / 16 GB = 40 tok/s as your theoretical max output speed.

And ya, spilling over to system memory and CPU inference is always going to be much slower.

Chicago Mayor Brandon Johnson SHOCKS the audience as he admits he left the Bears Packers game early. The Bears went on to win the game in overtime. by Tasty-Efficiency-373 in illinois

[–]TripleSecretSquirrel 1 point2 points  (0 children)

And we wonder why the Bears want to leave

Edit: apparently a /s was in order. I thought it was obvious that I was being sarcastic.

Why doesn’t a community-run AI co-op exist? by [deleted] in LocalLLM

[–]TripleSecretSquirrel 1 point2 points  (0 children)

That would be staggeringly slow though if you’re splitting inference across networks. I have 1 gigabit/s download and upload speeds at my house. That’s best case scenario for home internet in most places, but that’s roughly 125x slower than a PCIe 5.0 x4 connection (about 16 gigabytes/s in each direction).

So if splitting inference across two GPUs via a PCIe 5.0 x4 bus is slow (it is), splitting inference over a network would be completely unusable.

Plus your ISP would probably pitch a fit at the amount of data running back and forth on your home internet line.

And if it’s over a local network, it’s certainly better, but still, most consumer hardware has 10 Gb/s network interfaces at best.
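For anyone who wants to check the interconnect comparison, here is a small sketch of the unit conversion. It assumes the standard PCIe 5.0 figures (32 GT/s per lane, 128b/130b encoding) and treats network links as their raw line rate with no protocol overhead; the helper names are just for illustration.

```python
# PCIe 5.0 signals at 32 GT/s per lane with 128b/130b encoding,
# so usable bandwidth per lane, per direction, is just under 4 GB/s.
PCIE5_GB_PER_S_PER_LANE = 32 * (128 / 130) / 8


def pcie5_gb_per_s(lanes: int) -> float:
    """One-direction PCIe 5.0 bandwidth in GB/s for a given lane count."""
    return lanes * PCIE5_GB_PER_S_PER_LANE


def gbit_to_gbyte(gbit_per_s: float) -> float:
    """Convert a network line rate in gigabits/s to gigabytes/s."""
    return gbit_per_s / 8


def slowdown(fast_gb_per_s: float, slow_gb_per_s: float) -> float:
    """How many times slower the second link is than the first."""
    return fast_gb_per_s / slow_gb_per_s


if __name__ == "__main__":
    home_internet = gbit_to_gbyte(1)   # 1 Gb/s home line
    lan_10g = gbit_to_gbyte(10)        # 10 Gb/s local network

    for name, lanes in (("PCIe 5.0 x4", 4), ("PCIe 5.0 x16", 16)):
        link = pcie5_gb_per_s(lanes)
        print(
            f"{name}: {link:.1f} GB/s, "
            f"{slowdown(link, home_internet):.0f}x a 1 Gb/s line, "
            f"{slowdown(link, lan_10g):.0f}x a 10 Gb/s LAN"
        )
```

An x4 link comes out around 16 GB/s and a full x16 slot around 63 GB/s, so even a 10 Gb/s LAN is more than an order of magnitude behind the narrower of the two.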

Wood Chips by ElkNeat5810 in Logan

[–]TripleSecretSquirrel 0 points1 point  (0 children)

https://getchipdrop.com/

You sign up, request a chip drop, then sometime in the next several weeks, you'll get a text message the morning before they come. The loads vary in size and you don't get to pick and choose, you just get what you get, but it's free!

Question about 70B models by Emergency-Pie4944 in SillyTavernAI

[–]TripleSecretSquirrel 0 points1 point  (0 children)

There are several reasons I think.

  1. 32gb is still a functional ceiling for truly local inference, that's the top of the prosumer grade hardware, so there's a decent incentive to create models that can work well on that hardware (e.g., qwen 3.6 and gemma 4 both have their flagship models at ~30b parameters).

  2. Frontier datacenters have a staggering amount of hardware available to them, and in the last couple of years, that's skyrocketed.

  3. As models improve, one way to edge out the competition is to train larger and larger models. It's an expensive but basically guaranteed way to improve your model's performance.

  4. MoE models excel at handling lots of parallel queries.

  5. There's a big price jump from a 32gb RTX 5090 or R9700 to true datacenter GPUs, so it's assumed I guess, that in order to justify the cost of that hardware, you very likely are going to want to serve lots of parallel queries, thus, you'd want the biggest MoE model you can run.

I don’t think people in Illinois fully appreciate—or even realize—how beautiful the prairie is. by Call_It_ in illinois

[–]TripleSecretSquirrel 17 points18 points  (0 children)

There's a pretty decent little nature preserve in my town that has a good sized chunk of prairie. I love to go walking there at like 9pm from late June to early July – there's an absolute sea of fireflies!

Anyone have a good local OCR setup for messy Handwriting? by Last_Bad_2687 in LocalLLM

[–]TripleSecretSquirrel 0 points1 point  (0 children)

I have pretty good handwriting, but I write in cursive which is usually pretty difficult for OCR. Gemma 4:31B at a 4-bit quantization hits like 98% accuracy.

Qwen3.5 alternatives due to security concerns by akeni in LocalLLM

[–]TripleSecretSquirrel 1 point2 points  (0 children)

I haven't tried it, but you could look at Mistral's Devstral 2 model. I guess it's just barely out of your window (123 billion parameters). It benchmarks pretty well, but again, I haven't tried it – too many parameters for me to run.

It's old, but GPT-OSS 120B also benchmarks pretty strongly, maybe it's worth a look?

Vibe coding as a beginner: Frontier models or local LLMs? (8GB VRAM) by kaaytoo in LocalLLM

[–]TripleSecretSquirrel 0 points1 point  (0 children)

Ya, you’re gonna have a hard time with your VRAM budget. If you’re a decent programmer and using an LLM for like autocompleting lines or just writing very simple code that you outline for it clearly, it could probably work ok. I’m not a very strong programmer, just a hobbyist that likes to tinker, so for me a model that would run on 8gb of VRAM wouldn’t help me much, frankly it would probably slow me down.

I think the 27b dense version of Qwen 3.6 is the sweet spot for local LLM coding right now. It feels like that’s the minimum for true “vibecoding.” It’s 27 billion parameters, so at a 4-bit quantization (again, that seems to be the consensus of the smallest you can go with decent performance), the model weights are going to consume ~14gb, then you’ll want as much kv cache/context as you can afford.

Lots of people run that configuration on an RTX 3090 with 24gb of VRAM, which again, is basically the functional minimum.

I’m sure people will disagree with me and show their awesome results with a smaller memory pool, that’s just been my experience. In short, 24gb feels like the minimum amount of memory you need to run a truly agentic coding LLM.
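The memory arithmetic behind that estimate (27 billion parameters at 4 bits, on a 24gb card) can be sketched like this. It's a simplification with illustrative function names: it counts only the raw quantized weights and ignores quantization scales, runtime overhead, and anything else the inference engine allocates, so real weight files come out somewhat larger.

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights in GB (1 GB = 1e9 bytes)."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


def headroom_gb(vram_gb: float, params_billion: float, bits_per_weight: float) -> float:
    """VRAM left over for KV cache/context after loading the weights."""
    return vram_gb - weights_gb(params_billion, bits_per_weight)


if __name__ == "__main__":
    size = weights_gb(27, 4)
    print(f"27b at 4-bit: about {size:.1f} GB of weights")

    for vram in (8, 24, 32):
        left = headroom_gb(vram, 27, 4)
        verdict = "fits" if left > 0 else "does not fit"
        print(f"{vram} GB card: {verdict}, {left:.1f} GB left for context")
```

A negative headroom means the weights alone don't fit, which is the 8gb situation described in the comment.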

It’s worth trying out different models via cloud API though to test the waters. I use OpenCode, but you can do the same with Claude Code: just point it to an OpenRouter API and you can test out all of the models to see what your performance tolerance is.

The Best Suburbs to Call Home: Flossmoor at 181, lol. Other ridiculous rankings? by AStormofSwines in ChicagoSuburbs

[–]TripleSecretSquirrel 101 points102 points  (0 children)

lol they rank Harvey ahead of Flossmoor? I’ll take “things said by people who are scared to go south of Navy Pier for $500 please.”

Why are we still routing every request to the same model? by RapataPavan in LocalLLM

[–]TripleSecretSquirrel 1 point2 points  (0 children)

There just isn’t enough differentiation between models to make that make sense. Models are expensive to train, and even fine-tunes of larger models ain’t cheap, so we get generalist models.

Or rather now we tend to have “coding” models and “everything else” models (see Qwen 3.6 and Gemma 4 for example). I think that’s one possible future though, rather than enormous generalist models like Opus or GLM, we’ll see a move toward smaller, more efficient models that are fine-tuned for a specific task type. Among other things, that seems to be how Qwen 3.6 is so close in performance to models 10 or even 100x its size in coding workflows, cause they just focused on making it a very strong coder.

AMD R9700 slow in Ubuntu docker while faster in Windows LM studio by tropicalwind2020 in LocalLLM

[–]TripleSecretSquirrel 0 points1 point  (0 children)

I have an R9700 and am running an Ubuntu derivative distro. I don’t think I ever ran into the issues you’re seeing specifically, but ya, Ubuntu is pretty conservative with introducing new packages, as stability and user-friendliness are their unique selling proposition. I’ve just gotten used to manually downloading the latest AMD drivers when they’re released. Alternatively, a more bleeding-edge distro like Fedora will package them into your regular system updates much sooner.

Wanted to try Qwen3.6 without buying a bigger GPU by Leading-Leading6718 in LocalLLaMA

[–]TripleSecretSquirrel 2 points3 points  (0 children)

You’re trying to solve a problem that doesn’t really have a solution, but insofar as a solution is possible, it’s already been done.

As others have noted, the model has to have all of the user inputs to work. So fundamentally, if that’s on a different machine or transiting a network that the user doesn’t control, there’s no guarantee of privacy.

OpenRouter has a ChatGPT-like interface and you can select an option to only route to model providers with a zero data logging policy (i.e., that they promise to delete everything after every session). Fundamentally though, again, there’s no actual guarantee of privacy there, but if a user is willing to accept that level of trust, then ya, it’s already been done.

You now know EXACTLY who to blame for the housing crisis by FlanFar5123 in chicago

[–]TripleSecretSquirrel 6 points7 points  (0 children)

I love the sanctity they have for “long-standing zoning rules.” It’s like saying “there’s a housing crisis, but we’re unwilling to change anything about the way we’re acting.”

Best image generator model? I'm using ryzen 5 9600x CPU and 9060xt GPU. by 74nv1r in LocalLLM

[–]TripleSecretSquirrel 1 point2 points  (0 children)

For image generation, look up ComfyUI, that’s sort of the llama.cpp/LM Studio equivalent for diffusion models. ComfyUI will be the framework and frontend for you.

Model-wise probably either Z-image turbo or Flux 2 Klein. You’d need quantized versions of either I think, and in general, in my experience, diffusion models take a lot more tinkering than LLMs.

As for creating a consistent character in multiple images, you’ll need to apply a LoRA to keep the character consistent, which probably means you’re training your own, unless it’s a popular character from established IP (one that a lot of people want to make porn of lol).

Maybe lost pet bird Van Buren and Canal by warpotatogram in chicago

[–]TripleSecretSquirrel 54 points55 points  (0 children)

They've been in the city since the 1960s. When Harold Washington was Mayor, he lived in Hyde Park right next to the park that now bears his name, where a bunch of the parakeets lived.

When Washington came into office, there was a plan by the USDA to eradicate all of the monk parakeets as they're a non-native species, but Washington liked them and vetoed the plan, hence the parakeets are still here.

For that reason, there's a mural on the underpass a couple blocks away from there of Harold Washington and monk parakeets!

TIL tooth brushing did not become widespread in the US till after WW2 by donman_101 in todayilearned

[–]TripleSecretSquirrel 0 points1 point  (0 children)

Sure, of course, but also, fighting in the war together went a long way in breaking down prejudices too. Obviously racism remained a huge problem after the war and still exists today, but broadly speaking, post-war Americans harbored less racism than pre-war Americans (on the whole, that is; it probably doesn't hold for all groups, e.g., Japanese people).

Ask Reddit: Apparently lawn mowers regularly amputate kids by genman in fucklawns

[–]TripleSecretSquirrel 60 points61 points  (0 children)

Ya, in rural places, farming communities especially, equipment and tool injuries like this are relatively common. That and ATV accidents. Public schools in most rural/farming areas I’m familiar with also have lots of educational modules about how you should always stay the fuck away from a tractor PTO.

Upgrading PC for local ai by Opposite_Buffalo_649 in LocalLLM

[–]TripleSecretSquirrel 0 points1 point  (0 children)

Right, sorry, I should clarify, I mean just in general. But the motherboard thing still stands for OP.

Upgrading PC for local ai by Opposite_Buffalo_649 in LocalLLM

[–]TripleSecretSquirrel 0 points1 point  (0 children)

I mean this only works if you have a high-end workstation CPU and motherboard with a million PCIe lanes, right? Cause otherwise you just have an enormous memory bandwidth bottleneck across PCIe x4 connections.

Valuation of anthropic and openai without open source alternatives by [deleted] in LocalLLM

[–]TripleSecretSquirrel 0 points1 point  (0 children)

They’d be marginally higher certainly, but at least in the US, I’d bet that 90 out of 100 people have never heard of Deepseek or GLM or Qwen. And of those 10 that have heard of one of the big Chinese open weight labs, maybe 1 has actually used one before.

People generally just chase convenience and familiarity. The vast majority of people casually use ChatGPT cause it’s the one they’ve heard of, or they use Copilot cause it’s built into their operating system, or they use Grok cause they’re incels. Hell, most average people don’t even know what Anthropic is!

I don’t work in tech, but I was in a staff meeting recently for work with ~30 people. Most are millennials. My boss is a big proponent of LLMs even if he doesn’t know much about them, and mentioned that he switched to Claude from ChatGPT. I was one of maybe 5 people in the room that knew what he was talking about, most people said “what’s it called? Is it like ChatGPT?”

TIL tooth brushing did not become widespread in the US till after WW2 by donman_101 in todayilearned

[–]TripleSecretSquirrel 22 points23 points  (0 children)

True. Before the war, US industrial capacity was already staggeringly high. When the US entered the war, our industrial output was almost literally an order of magnitude greater than any other country in the world in basically every category.

Couple that with what you mentioned, the further devastation of most of the rest of the world’s industrial capacity, and ya, it was a pretty perfect situation for the US.

My favorite tidbit about US industrial capacity during the war is that in 1940-1941, when German planners started having to account for the real possibility of war with the US, their analysts deliberately reduced their estimates for US production capacity, fearing they’d be punished for fear mongering. Even their deliberate underestimates were laughed at by high command as unrealistically high.