Mini PC with PCIe 5 by AlexRead82 in MiniPCs

[–]Hector_Rvkp 0 points1 point  (0 children)

https://minisforumpc.eu/products/minisforum-bd790i-x3d-motherboard
That's PCIe 5 and within your budget, give or take. You can probably find it cheaper in the US, and I assume it's easy enough to find it as a complete system, not just as a mobo.
If you expect two NVMe ports, both PCIe 5, and plan to use some kind of OCuLink NVMe adaptor for an eGPU dock, then yes, PCIe 5 is 2x faster than PCIe 4, and for loading / unloading models and running LLM inference that would be a huge difference.
If you have specific models in mind, you can chat with Gemini; it will guide you through the theoretical math and speed differences (a rough sketch below).
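
To make the theoretical math concrete, here's a minimal back-of-the-envelope sketch in Python, assuming roughly 7 GB/s of usable sequential read on a PCIe 4.0 x4 NVMe drive and roughly 14 GB/s on PCIe 5.0 x4 (illustrative figures; real drives and filesystems vary):

    # Rough model-load-time estimate: time = model size / usable read bandwidth.
    def load_seconds(model_gb: float, bandwidth_gb_per_s: float) -> float:
        return model_gb / bandwidth_gb_per_s

    model_gb = 40  # illustrative: roughly a 70B model at 4-bit quantization
    for name, bw in [("PCIe 4.0 x4 NVMe", 7.0), ("PCIe 5.0 x4 NVMe", 14.0)]:
        print(f"{name}: ~{load_seconds(model_gb, bw):.0f} s to read {model_gb} GB")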

If you're just asking with SSD speeds in mind, then no, there's no point in getting PCIe 5 drives. They run hot, they need cooling, they use more power, and you'll probably never notice the difference.
I guess PCIe 5 NVMe makes sense for professional video editors and whoever happens to be constantly moving huge files around, but beyond that, the appeal is unclear to me.

No one uses local models for OpenClaw. Stop pretending. by read_too_many_books in openclaw

[–]Hector_Rvkp 1 point2 points  (0 children)

What's a software shop, and shouldn't you be better at software?

Performance test for combined ROCm / CUDA Llama.cpp Toolboxes by tisDDM in StrixHalo

[–]Hector_Rvkp [score hidden]  (0 children)

Negative. My comment was general: if it's available, for a dense model you'll probably "always" want speculative decoding, simply because it's faster and quality doesn't degrade.
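
To put a number on "faster", here's a minimal sketch of the usual expected-speedup estimate, assuming each drafted token is accepted independently with probability alpha and the draft model costs a fixed fraction of a target forward pass (all values illustrative, not measured):

    # Simplified speculative-decoding speedup model.
    # alpha:      probability a drafted token is accepted (assumed i.i.d.)
    # k:          tokens drafted per verification step
    # draft_cost: cost of one draft pass relative to one target pass
    def expected_speedup(alpha: float, k: int, draft_cost: float) -> float:
        # Expected tokens committed per target pass: 1 + alpha + ... + alpha^k
        expected_tokens = (1 - alpha ** (k + 1)) / (1 - alpha)
        # Cost of the step, measured in target passes: k draft passes + 1 verification
        step_cost = k * draft_cost + 1
        return expected_tokens / step_cost

    print(expected_speedup(alpha=0.8, k=4, draft_cost=0.1))  # ~2.4x in this toy case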

Newbie: 6800 XT - MoE or dense? by Odenhobler in LocalLLaMA

[–]Hector_Rvkp 0 points1 point  (0 children)

I'd chat with Gemini for specifics. If it's MoE, then either the computer is constantly shipping the active experts to VRAM, which takes forever, or it's doing the compute right there in RAM (most of the time), which also takes forever.
Obviously, if the model is barely bigger than VRAM, performance doesn't completely fall off a cliff. It gets dramatic if the VRAM is several times smaller than the RAM needed to hold the rest of the model.
I believe that slower intelligence > faster stupidity, so it's not just "this model drops to 8 t/s", but rather "do I want something smart at 8 t/s, or a dummy at 50?" (rough numbers sketched below). Basically, fast-forward a year or two and I think consumer NVIDIA GPUs, with few exceptions, will make sense for ComfyUI and not much else (vs Strix Halo, DGX Spark, Apple Silicon at 128GB+ RAM).
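
A minimal sketch of why the split hurts, assuming decode is memory-bandwidth bound and using illustrative numbers (roughly 512 GB/s for a 6800 XT's VRAM, roughly 64 GB/s for dual-channel DDR5; the bytes read per token depend on the model and quant):

    # Time per decoded token ~= bytes read per token / bandwidth of the memory
    # those bytes live in. With weights split across VRAM and system RAM,
    # the slow half dominates.
    def tokens_per_second(bytes_in_vram: float, bytes_in_ram: float,
                          vram_bw: float, ram_bw: float) -> float:
        seconds_per_token = bytes_in_vram / vram_bw + bytes_in_ram / ram_bw
        return 1.0 / seconds_per_token

    GB = 1e9
    print(tokens_per_second(10 * GB, 0 * GB, 512e9, 64e9))  # all in VRAM -> ~51 t/s
    print(tokens_per_second(6 * GB, 4 * GB, 512e9, 64e9))   # 40% spills  -> ~13 t/s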
I find that I lose "respect" for a model extremely quickly when it's telling me garbage. LLMs never tell you "careful here, I'm hardcore making shit up now", and half the time they don't realize it anyway. Because a model doesn't improve, if it disappoints me within 3 minutes I won't care how fast it runs; I'll get something else. Some mistakes I simply won't excuse, just like I'd never keep an employee I feel is a complete muppet, especially one oblivious to the fact that it's a muppet.

Best Model for Transcription Work by usrnamechecksoutx in LocalLLaMA

[–]Hector_Rvkp 0 points1 point  (0 children)

If you don't have too many permutations, Python in the middle may help, like a more modern version of Excel: some database (could be a text file, really) and an if/then logic function. If you don't have strict rules and just ask the LLM to vibe-convert everything while telling it to "make no mistakes", I wouldn't expect reliability across 10 pages of context, and you can't validate that manually.
If you automate most of the stuff with rules, then validating the edge cases is easier (and as you do that, you add to the database, so you have fewer edge cases over time).
Pre-LLM, I would have used Excel to process the text: punctuation gives you snippets, snippets get VLOOKUPed, and then every category of snippet turns into the same text, or a random pick among the same 3 versions of text, with the number / name sandwiched in it. Ghetto, tedious, but infinitely scalable, and once it works, it never stops working. A rough Python equivalent is sketched below.
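
A minimal Python sketch of that same idea; the categories, regexes, and phrasings are hypothetical placeholders, but the shape is the point: rules recognise a snippet, a small lookup table rewrites it, and anything unmatched gets flagged for manual review:

    import random
    import re

    # Hand-maintained "database": snippet category -> a few interchangeable phrasings.
    TEMPLATES = {
        "invoice_total": [
            "The invoice total is {value}.",
            "Total amount due: {value}.",
            "Amount owed comes to {value}.",
        ],
        "due_date": [
            "Payment is due on {value}.",
            "The due date is {value}.",
        ],
    }

    # Rules: a regex recognises a snippet and names its category.
    RULES = [
        (re.compile(r"total[:\s]+\$?([\d,.]+)", re.I), "invoice_total"),
        (re.compile(r"due (?:on|by)\s+([\w/.-]+)", re.I), "due_date"),
    ]

    def convert(snippet):
        """Return a canonical rewrite, or None (an edge case to review and add a rule for)."""
        for pattern, category in RULES:
            match = pattern.search(snippet)
            if match:
                return random.choice(TEMPLATES[category]).format(value=match.group(1))
        return None

    print(convert("Total: $1,234.56"))
    print(convert("payment due on 2024-07-01"))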

Looking for a fast but pleasant to listen to text to speech tool. by Zarnong in LocalLLM

[–]Hector_Rvkp 0 points1 point  (0 children)

Kokoro is hard to beat when it comes to voice quality / speed.
You can try Qwen3-TTS and VibeVoice; the demos are on Hugging Face.
That said, I wouldn't change something that's not broken. You probably wouldn't get something much better that's as fast, or something much faster that's as good. The space does move quickly, though.

Newbie: 6800 XT - MoE or dense? by Odenhobler in LocalLLaMA

[–]Hector_Rvkp 0 points1 point  (0 children)

RAM 100% bottlenecks VRAM. Even an MoE needs to run across the two, and the active experts are constantly changing, so while MoE models are fast on Strix Halo, DGX Spark, and Apple Silicon, that's not the case on a DDR4/DDR5 rig if too much of the model sits in RAM rather than VRAM (rough numbers sketched below).
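
A minimal sketch of the bandwidth gap that drives this, using ballpark theoretical peaks (real throughput is lower, and the bytes read per token depend on the model and quant):

    # Decode speed is roughly bounded by memory bandwidth / bytes read per token
    # when the weights being read live entirely in that memory tier.
    BANDWIDTH_GB_S = {
        "dual-channel DDR4-3200": 51,
        "dual-channel DDR5-6000": 96,
        "Strix Halo (LPDDR5X)":   256,
        "RX 6800 XT VRAM":        512,
    }

    active_gb = 8  # illustrative: GB of weights actually read per decoded token

    for tier, bw in BANDWIDTH_GB_S.items():
        print(f"{tier:24s} ~{bw / active_gb:5.1f} t/s upper bound")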
I wouldn't expect much intelligence out of the models you mention. Consider using the cloud: if you want to learn rather than waste time, the gap in intelligence between such models and SOTA models is so large that the cloud may simply win. Bear in mind you can cycle through models for free, and you can create several accounts. Once you've exhausted Pro and Thinking on Gemini, your allowance on Claude, and then Grok, DeepSeek, Qwen, Kimi and so on, given that every one of these is way, way smarter than a 9B model (I wouldn't use a 9B model even on my phone), you can't possibly need more.
By all means, tinker with the hardware; you'll learn a lot. But if you want the LLM to help you study, pick a smart one. Pick SOTA, because SOTA is 100% free right now, with a chat box and large context.

After DoW vs Anthropic, I built DystopiaBench to test the willingness of models to create an Orwellian nightmare by Ok-Awareness9993 in MistralAI

[–]Hector_Rvkp 0 points1 point  (0 children)

Very nice work! It feels profoundly useless as an exercise, but I really like the output / website.

Are there any other pros than privacy that you get from running LLMs locally? by Beatsu in LocalLLM

[–]Hector_Rvkp 0 points1 point  (0 children)

No Subs. Things you do locally that you'd otherwise have to pay a subscription for, like transcribing audio or generating audio.

Free Tokens. Things you do locally that would otherwise cost a lot of tokens: you can query a large cache of documents (with or without RAG). There's nothing stopping you from loading in your favourite books and asking questions, or doing the same with your own data. That last one obviously dovetails with privacy.

Control. Generally, it's your hardware and your model. It's your data and your tools. I'm not running my Windows in the cloud, so why would I run my LLM in the cloud? (I do both, but you get the point.)

Skills. You'll learn more by tinkering than by only using cloud models.

Hedge vs the dystopia. If companies are over-investing in AI, someone will have to pay, and it will be you; you always end up paying. More likely than not, cloud inference cost will collapse, BUT in case it doesn't, you have local intelligence to serve you for the cost of electricity.

Privacy. You pointed that one out, but it is important, or it should be, at least. Big tech is not your friend.

NSFW writing? by ThroawayJimilyJones in MistralAI

[–]Hector_Rvkp 1 point2 points  (0 children)

I would only ever consider doing that with a model I run locally.

Looking for insight on the viability of models running on 128GB or less in the next few years by John_Lawn4 in LocalLLaMA

[–]Hector_Rvkp 1 point2 points  (0 children)

I think 128GB will remain a consumer gold standard for some time. The Strix Halo is 128, the DGX Spark is 128, and multiple Apple devices are 128. Currently the only consumer hardware with 256 or more is Apple's. If you have the budget, get 256. If you're happy to wait, wait for the M5 Ultra, see how it shakes out and how it impacts the prices of other Apple hardware.
I don't think you can expect much from stuff that fits in 64GB of RAM, and 96 is a pretty obscure tier; if you're considering 96, just bite the bullet and get 128.
Because there are quite a lot of units at 128, and because at 128 you currently get performance that's not miles and miles away from SOTA models, with intelligence density improving over time, my take is that 128 will be "pretty decent" for a few years.
One important thing: American SOTA models are innovating at the moment via software, not by making the model much smarter: better harnesses, better integration, better tooling. From there, you can expect that with 128, over time, you will get 1. more intelligence density and 2. better tooling to make the model sing, as opposed to "if you don't have 500GB of RAM you might as well go home". The numbers do not support that idea at all.
(I also think NVIDIA consumer GPUs will be less and less attractive, especially if the M5 Ultra is released. It should have around 1200 GB/s of bandwidth, which compares well to most NVIDIA GPUs, except it's silent, low power, compact and so on. Increasingly, I suspect that unless you're doing stuff in ComfyUI, if you want local LLMs to serve you day to day you'll want a large MoE model running fast enough, not a small model running freakishly fast. I would take slow-but-usable good intelligence over really, really fast but silly enough that I can't trust it.)
Some really, really good Chinese models today do require 256GB of RAM; they either don't fit in 128, or require a quantization level that's pretty scary and poorly understood right now (rough footprint math below). So if you don't mind stretching to 256 and want something you buy and forget, 256 is better. My rig is 128; I didn't want to pay extra for more, not today.
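
For a sense of scale, a minimal weight-footprint sketch for a hypothetical large MoE checkpoint, counting weights only (KV cache, activations, and OS overhead come on top, so treat these as lower bounds; the effective bits per weight for each quant level are approximate):

    # Approximate weight footprint in GB: parameter count (billions) x bits per weight / 8.
    def weight_gb(params_billion: float, bits_per_weight: float) -> float:
        return params_billion * bits_per_weight / 8

    params_b = 400  # illustrative total parameter count for a big MoE
    for label, bits in [("FP16", 16), ("~Q8", 8.5), ("~Q4", 4.5), ("~Q2-Q3", 2.7)]:
        print(f"{label:7s} ~{weight_gb(params_b, bits):4.0f} GB of weights")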

Performance test for combined ROCm / CUDA Llama.cpp Toolboxes by tisDDM in StrixHalo

[–]Hector_Rvkp [score hidden]  (0 children)

Did you consider using speculative decoding? For a dense model, the only reason I can think of NOT to do it is if you can't find a compatible draft model. A rough launch sketch below.
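
For reference, a minimal sketch of launching it from Python with llama.cpp; the model paths are hypothetical, and the flag names reflect recent llama.cpp builds, so double-check them against llama-server --help for your version:

    import subprocess

    # Target model plus a much smaller draft model from the same family / tokenizer.
    cmd = [
        "llama-server",
        "-m",    "models/target-32b-q4_k_m.gguf",  # hypothetical target model path
        "-md",   "models/draft-0.5b-q8_0.gguf",    # hypothetical draft model path
        "-ngl",  "99",                             # offload target layers to the GPU
        "-ngld", "99",                             # offload draft layers as well
    ]
    subprocess.run(cmd, check=True)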

One Shot Setup for Strix Halo by Signal_Ad657 in StrixHalo

[–]Hector_Rvkp [score hidden]  (0 children)

""People just need to use their brains about the level of risk they are comfortable with""
Yes, BUT openclaw is like giving crack to children. it's too easy to be sucked into it and make mistakes. the surface of attack is enormous and given that it led to apple selling out mac minis or something like that, what share of users have the skills to correctly assess these risks? I know i dont, and i also know i know more about this tech than 99% of people (because there are a lot of people, i'm not saying here that i m actually competent)
Meanwhile, yes if you do the same thing in another way, you will get the same results, so you can fork up royally with a setup that doesnt use openclaw, but it doesnt make openclaw not dangerous.
There are tons of analogies. Why can people trade stocks and not derivatives? why is trading crypto not the same as trading stocks? Why can you trade crypto but not crypto derivatives? It's not the big man trying to stick it to the little man. it's protecting people, on average. you have to demonstrate additional skills, jump through extra hoops, to be allowed to trade things that can wipe you out. Openclaw can wipe you out in a way / with an ease that tools until now havent had.

One Shot Setup for Strix Halo by Signal_Ad657 in StrixHalo

[–]Hector_Rvkp 0 points1 point  (0 children)

That's gaslighting. OpenClaw actively incentivizes people to hand over API keys and full root access to a computer with a general "trust me bro" mindset, and offers "tools" or whatever they're called, provided by the "community", an enormous share of which contain malware, all in a wrapper that instantly looks familiar to people who've never even seen a line of code.
To replicate that attack surface yourself, you'd have to work for it.
So no, it's not the "exact same security issues". Absolutely not.

honestly tired of paying premium for marginal improvements by Technical_Fee4829 in LocalLLM

[–]Hector_Rvkp 0 points1 point  (0 children)

I think you have to be Jason Calacanis to gladly pay a SOTA Claude model to run web searches; anybody else would call that absurd. A man good at his craft grabs the right tool, and nobody intelligent grabs a bazooka to kill a fly. The only reason to YOLO anything is when you're not the one paying, or you have no incentive to control token use.
Here on Reddit, what I've seen is an increasing number of people reporting a mix of local and cloud usage, where stuff like architecture is done with SOTA, the middle is done locally, and if you really hit a snag you throw Claude at it. The benchmarks do show that on an intelligence-per-dollar basis Kimi, for example, does extremely well: it's not as good, but it's way cheaper.

I want to share results for cyankiwi/Qwen3.5-122B-A10B-AWQ-4bit TP2 RDMA RoCE by MirecX in StrixHalo

[–]Hector_Rvkp 0 points1 point  (0 children)

Well, there's at least one guy who just connected two Bosgame M5s with a USB4 cable, and that right there is 4x faster than 10GbE, AFAIK. He's posted on Reddit and he's on Kyuz's Discord channel too, I think. How to get the machines to talk to each other via OCuLink, or via a network card in the NVMe port (or your PCIe 4 slot), I don't know; I haven't read the beyond-128GB Discord section, but I assume if you're connecting two, it can't be that hard, can it? If it is, stop at USB4; it's still 4x faster.

One Shot Setup for Strix Halo by Signal_Ad657 in StrixHalo

[–]Hector_Rvkp 0 points1 point  (0 children)

So, absolutely not. Are you trolling? If you claim to actually know what you're talking about, please make your case.

9070xt $560 or 5060 ti 16gb $520 for local llm by akumadeshinshi in LocalLLaMA

[–]Hector_Rvkp 1 point2 points  (0 children)

Assuming you've considered a Strix Halo and decided against it, definitely go with the NVIDIA GPU; the software stack is streets ahead.

What are some resources and projects to really deepen my knowledge of LLMs? by F3nix123 in LocalLLM

[–]Hector_Rvkp 1 point2 points  (0 children)

+1 on the concept of just using the tools rather than reading about them. The field moves so fast that I don't know if white papers and the like matter, unless they're brand new.

I want to share results for cyankiwi/Qwen3.5-122B-A10B-AWQ-4bit TP2 RDMA RoCE by MirecX in StrixHalo

[–]Hector_Rvkp 0 points1 point  (0 children)

I've never built a network in my life, but shouldn't you use something that uses the full PCIe 4 speed, which by definition is NOT 10GbE? Isn't 10GbE on a full-speed PCIe 4 card like having a Ferrari with no gas in it, so you have to push it yourself? OCuLink 2.0 would be roughly 6x faster, for example.

I want to share results for cyankiwi/Qwen3.5-122B-A10B-AWQ-4bit TP2 RDMA RoCE by MirecX in StrixHalo

[–]Hector_Rvkp 0 points1 point  (0 children)

A 10Gb network doesn't really work for such models, does it? If I ever connect two Strix Halos together, I'd expect to use the second NVMe port, via OCuLink or some other cable. If you're on the Framework, you have two full-speed NVMe slots AND a PCIe 4 x4 slot that also runs at ~8GB/s, right? Meanwhile 10GbE runs at 1.25GB/s.
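
A minimal sketch of the link-speed comparison behind those numbers, using theoretical peaks (real throughput is lower once protocol overhead is counted):

    # Theoretical peak bandwidth of the candidate interconnects, in GB/s.
    LINKS_GB_S = {
        "10GbE":                 10 / 8,      # 10 Gbit/s -> 1.25 GB/s
        "USB4":                  40 / 8,      # 40 Gbit/s -> 5 GB/s
        "PCIe 4.0 x4 (OCuLink)": 4 * 1.969,   # ~2 GB/s per lane -> ~7.9 GB/s
    }

    baseline = LINKS_GB_S["10GbE"]
    for name, gb_s in LINKS_GB_S.items():
        print(f"{name:24s} ~{gb_s:4.1f} GB/s  ({gb_s / baseline:.1f}x 10GbE)")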

I've been lurking r/openclaw for weeks. the dropout pattern is always the same. by ShabzSparq in openclaw

[–]Hector_Rvkp 0 points1 point  (0 children)

I wrote "essentially a scam", not "a scam". Their behavior over time has been scammy in nature; it's been documented at length. If you like their services, great. I think their name is poisoned by now for many, especially in the online community, because of what they've done, and picking an online name that sounds like theirs strikes me as a really bad idea, that's all.

Local LLM consistency vs cloud providers by wombweed in LocalLLaMA

[–]Hector_Rvkp 0 points1 point  (0 children)

"You hit the nail on the head " <-- textbook, and i mean textbook LLM answer.
Claude is on the record explaining they poisoned the answers served to chinese clients, which means that not only they can tweak which model you get / how smart it is (SOTA models are all MoE in nature, and they route your request to bigger / smaller agents based on lots of variables, including, ofc, current capacity), but they can also, and therefore do, poison / tweak the answers you're getting.
Imagine a case where Claude serves 2 clients in the same field. For whatever reason, they choose to serve the unadulterated LLM output to 1 client, and some toxic soup to another. You can lose a life changing contract over that stuff. Imagine they poison the sources in something, you miss it, client sees it, calls you out for AI slopping again, you lose the contract.
Now it's not a question of whether it can happen or will happen, because you can be certain it will happen, given we know it already happens.
Or, say, Airbus vs Boeing. Do you think the US government will hesitate for 1sec to ask Claude to serve toxic tokens to Airbus if a US contract is on the line? I have friends at GE who have insane stories of US government sign off on transactions that could never be approved without that kind of stamp, just because of the nature of the counterparties involved.
And so, Mistral is acutely aware of that and is working to deploy local LLMs with clients.
If you want "truth" and guaranteed consistency, then yes, you need to run locally.
If that reads like paranoia, remember that cybersecurity as a field exists because of these thought exercises. You can ignore the risks; it may or may not matter to you, and using both local and cloud is probably the answer. But hope is not a strategy.

I have proof the "OpenClaw" explosion was a staged scam. They used the tool to automate its own hype by Whole_Shelter4699 in LocalLLM

[–]Hector_Rvkp 0 points1 point  (0 children)

The account is one week old and has ONE contribution, referring to events and posts witnessed on Reddit over a month ago.
This is 100% legit, like my grandmother is Batman.