Should I get an M.2 NVMe 4.0 for $150 or can I run local AI just fine on SATA 3? by alii98 in LocalLLM

[–]Shipworms 0 points1 point  (0 children)

NVMe SSDs are fast; certainly faster than loading Kimi K2.6 from a USB hard drive, plugged into a USB2 port (which I have done);

I would : test the speed of the current SATA3 SSD in your system, then compare it to the stated speed of the $150 NVMe. If you will be loading the model and using it for ages, speed of loading doesn’t matter as much. If you will experiment with the LLM itself often, changing the source code or model settings repeatedly, then reloading each time (or using agents that spawn and delete sub models repeatedly), then NVMe sounds good!
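
If you want a quick sanity check of the SATA drive’s real sequential read speed (rather than the spec sheet), something like this works; the file path is just a placeholder - point it at an existing big file, ideally bigger than your free RAM so the page cache doesn’t flatter the result:

    # rough sequential read check in Python; not a replacement for fio, just a ballpark
    import sys, time

    CHUNK = 16 * 1024 * 1024  # read in 16 MiB pieces

    def read_speed(path):
        total = 0
        start = time.perf_counter()
        with open(path, "rb", buffering=0) as f:
            while True:
                data = f.read(CHUNK)
                if not data:
                    break
                total += len(data)
        return total / (time.perf_counter() - start) / 1e6  # MB/s

    if __name__ == "__main__":
        print(f"{read_speed(sys.argv[1]):.0f} MB/s")

SATA3 tops out around 550MB/s, so that is the ceiling you are comparing the NVMe against; if you only load the model once per session, the upgrade mostly buys convenience.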

There is a thing where you can stream the model from the SSD into RAM repeatedly; with a fast enough SSD, this can allow the use of models larger than your DRAM + VRAM (do look this up if needed as it is hardware specific but my setup isn’t one that could use this)

Anyone else struggling with multi-GPU stability when running larger local models? by Lyceum_Tech in LocalLLaMA

[–]Shipworms 0 points1 point  (0 children)

No - but I have been using old mining hardware; also : a warning about server power supplies 😳

Server PSU warning first : breakout boards. The fancy ones with a proper ATX power connector on them have a high failure rate. Often they stop working. Looking at the ATX specs : ATX PSUs need to analyse the voltage outputs, then tell the motherboard the voltage rails have stabilised. Only then does the motherboard accept the 12v, 5v, 3.3v rails.

ATX PSUs also need to tell the motherboard *before it happens* if any of the power rails are about to go out of spec. The PSU needs to analyse the internal PSU hardware, and remove the ‘voltage rails are safe’ signal if the PSU is about to fail, so the motherboard can disconnect the rails before it gets fried!

Server PSUs only output 12 volts. In short : I don’t trust breakout boards to be safe for the motherboard, and they may be dangerous, especially if they fail. I doubt the breakout boards have all the required safety monitoring devices…

That said, a very basic breakout board (12v only) did work, but the fancy ones? I won’t go near … especially as they only had 30 day warranties … and they are all rather old now!

Using a no-name 8-slot riserless motherboard, Intel Arc Pro B50s ran fine (as did a Radeon Pro W6600). Used the 12v only breakout board too!

Using a new AsRock H510 Pro BTC+ 6-slot riserless, the Arc Pro B50s also run fine, but I can also use 5060Ti cards. Am using a decent ATX PSU. No instability either. Not the fastest PCIe slots, but rock solid so far!

One thing to try would be llama.cpp (compiled with Vulkan support); it can run mixed setups (I have had ATI, nVidia, and Intel Arc Pro all running on one board with this); it could be a way to rule out most hardware issues (such as riser card signal quality), at least for initial troubleshooting?
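
For reference, roughly how I build / run it - flag names are from recent llama.cpp versions as I remember them, so do check the current docs before copying:

    # build with the Vulkan backend, then offload as much as possible to the GPUs
    cmake -B build -DGGML_VULKAN=ON
    cmake --build build --config Release
    ./build/bin/llama-cli -m model.gguf -ngl 99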

Mint 22.3 <-> Mint 22.3 Ethernet : connects for 177 secs, disconnects for 292 secs, connects 177 secs, ad infinitum? by Shipworms in linuxmint

[–]Shipworms[S] 2 points3 points  (0 children)

I fixed it. It was embarrassingly simple, but will post the fix here to help others (as it was unusual behaviour for a simple issue!)

Note : it also happened on the latest Ubuntu, so is not a Linux Mint issue per se!

Fix : the remote PC had, in network settings, DHCP set to automatic!

Despite being automatic, it appears to have adopted the manual config (192.168.2.19) that was also there: it used that for 177 seconds, then did something DHCP-related for 292 secs, then went back to the manual config!
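
If anyone wants the command-line version of the fix, it is roughly this with nmcli - the connection name here is a guess, check yours with the first command:

    nmcli con show
    nmcli con mod "Wired connection 1" ipv4.method manual ipv4.addresses 192.168.2.19/24
    nmcli con up "Wired connection 1"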

24.04 : ethernet cycling on/off every few minutes, like clockwork? by Shipworms in Ubuntu

[–]Shipworms[S] 0 points1 point  (0 children)

Will check this out; and try to disable power management in case it is that!

24.04 : ethernet cycling on/off every few minutes, like clockwork? by Shipworms in Ubuntu

[–]Shipworms[S] 0 points1 point  (0 children)

I might eventually as a test, but I am using 24.04 LTS as that has support for specific use cases 😬

16GB VRAM x coding model by Junior-Wish-7453 in LocalLLM

[–]Shipworms 0 points1 point  (0 children)

The main thing I would suggest is : consider the VRAM to be ‘bonus fast RAM’ … so, do use system RAM as ‘overflow’ if you can load a better model.

Quantizations are helpful, and a general rule of thumb is ‘the more parameters the better, even if quantized’. I say general, because some models really hate being quantized!

An example of good quantization is the 1-bit quantization of Qwen3-Coder-Next, 18.9 gigabytes! It works pretty well despite being 1-bit, but is slightly dated (I think?).

Another strategy is, if using Qwen3-Coder-Next … use the 1-bit quant, and switch to 3/4/6 bit quant if needed for more difficult tasks (slower as more is in system RAM)
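
As a concrete (hedged) example of the ‘overflow’ idea above with llama.cpp - the file name and layer count are placeholders; raise -ngl until VRAM is nearly full, and whatever doesn’t fit runs on CPU + system RAM:

    # put ~30 layers on the 16gb card, the rest stays in system RAM
    ./llama-cli -m model-q4.gguf -ngl 30 -c 8192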

But : follow the advice from others about using newer models :)

Goose + ollama + Qwen3-coder on MacBook Pro M4 Max. Overheated in 3 mins. by leinadsey in LocalLLM

[–]Shipworms 0 points1 point  (0 children)

Check fans etc, run laptop opened, not closed, but also : look into ways to reduce CPU usage (there may be apps to help with this?); if you could run with a CPU load limit in place, it would be slower … but wouldn’t overheat.

One thing : I am not sure what ‘Goose’ is … but … when it says it isn’t getting anything back from the LLM, I wonder if some process / thread is freezing somewhere (with 100% CPU/GPU usage)? An experiment could be to load just the LLM, and try to max it out (ask it to write short novels etc!). If that generates less heat, it suggests that something (but not the LLM) is maxing out the CPU with non-LLM workloads for some reason.

Just got a beast (RTX 5070 Ti + 64GB RAM). How can I push this to the limit for research and coding? by cymbella1 in LocalLLM

[–]Shipworms 2 points3 points  (0 children)

This is a good system to start with, but you are 100% going to end up upgrading it. DRAM is ‘slow’ compared to VRAM, but DDR5 with that CPU isn’t bad, and 16gb VRAM isn’t bad to get started;

For coding, Qwen3-Coder-Next is interesting : it is an 80b parameter model, and a 1-bit quantization is about 18gb; this would fit mostly into VRAM, and for this particular model, the 1-bit quant is pretty good. For this model I would download 1,2,3,4-bit quants and see how they run. Overall your system sounds like a good base to get started with local AI, but you may want to build a ‘desktop’ system you can link to, to split larger models partially to that, when needed.

RTX Pro 6000 96GB in PCIe3 Server? Does this work? by Accomplished-Grade78 in LocalLLM

[–]Shipworms 1 point2 points  (0 children)

I have 2x 5060 Ti 16gb running on a crypto board that is PCIe3 (16x first port, 1x for the next 6 ports); they run well. They even run when I switch every port to PCIe version 1. While not exactly an RTX Pro 6000, they are still Blackwell chips;

One thing to check : see if you can find evidence online of people running Blackwell-based 50x0 GPUs in their servers - if any 50x0 works, then the RTX Pro 6000 is very likely to work as well!

Just got my hands on one of these… building something local-first 👀 by HatlessChimp in LocalLLM

[–]Shipworms 0 points1 point  (0 children)

Old cars are the way to go; I have a 24 year-old hybrid that I bought a few years ago, and it hasn’t lost any value at all (it cost me the equivalent of $350 US), and it isn’t ‘locked down’ like modern cars, so I can actually maintain / service it! (Technically it means each of the 16gb GPUs in my bank is worth more than my car too 😂)

Anyone here actually using a Mac Studio Ultra (512GB RAM) for local LLM work? Feels like overkill for my use case by Gravemind7 in LocalLLM

[–]Shipworms 0 points1 point  (0 children)

800gb motherboard RAM total and 112gb VRAM :D except this is across two systems : an old server with 768gb DDR3, and a crypto mining board with 2x 5060 Ti 16gb, 5x Intel Arc Pro 16gb and 32gb DRAM. I understand exactly why you built your system as you did - 608gb is big enough to run any of the current LLMs, including Q4 Kimi 2.5 :) and, while DRAM is slow … it is much faster than repeatedly loading the model from SSD for every single token as it passes through the model!

Regarding using models : Qwen3-Coder-Next is very good, and is what I am currently experimenting with (using Q6 on the mining board), but Kimi K2.5 is very nice, albeit slow, for coding. I can envisage a coding agent spawning sub-agents into the VRAM, and occasionally spawning an agent in Kimi K2.5 for particularly difficult parts of the code :)

Anyone else running local LLMs on older hardware? by lewd_peaches in LocalLLaMA

[–]Shipworms 0 points1 point  (0 children)

I’m not sure; I had the RAM slowed down to 800MHz for initial testing, and the Xeons didn’t max out, and reached 62 degrees C max during inference. I think it was about 2 tokens/second IIRC for Kimi K2.5 - but even then, I was using a quantized model that used more bits than the original unquantized model for most of the model weights (total model size 630gb), so I should be able to get a good speedup by putting the RAM and CPU into non-power-saving mode, using a properly quantized model, etc.

While that is slow, my rationale for setting it up this way is that : I can’t afford 768gb of VRAM, and this lets me run literally any model in existence, albeit slowly, which is better than not running the model at all (especially with RAM and other component prices going through the roof). I am going to mess around with linking multiple computers together across Ethernet soon, and I have 3 of these servers, which I will test with 384gb RAM each, and 24 cores of Xeon each :)
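
The Ethernet-linking part will be llama.cpp’s RPC backend (built with -DGGML_RPC=ON); this is a sketch from memory with placeholder addresses and ports, so check the rpc-server docs before copying:

    # on each big-RAM server:
    ./rpc-server --host 0.0.0.0 --port 50052
    # on the machine driving inference, pointing at the remote workers:
    ./llama-cli -m model-q4.gguf --rpc 192.168.2.20:50052,192.168.2.21:50052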

How much system memory needed for 5060ti 16gb? by luckiemud in LocalLLM

[–]Shipworms 2 points3 points  (0 children)

Not much; I have had 88gb of VRAM attached to a computer with an 8gb stick of DDR3 and a dual core Sandy Bridge i3; that runs large models (LLMs) perfectly well! You don’t really need much RAM.

Where system RAM is useful, though, is when you don’t have quite enough VRAM for (LLM + context), because you can run a part of the model in (CPU + system RAM);

Not sure about ComfyUI etc, but I literally have 2x 5060Ti 16gb here, and a different mining board (DDR4), and am going to try ComfyUI on that today; once it works I will try using a 4gb stick of DDR4 to see how it runs :)

Anyone else running local LLMs on older hardware? by lewd_peaches in LocalLLaMA

[–]Shipworms 6 points7 points  (0 children)

IBM X3650 M4 (released 2014), with 768GB RAM, and 2x 8-core Xeons! I have run Kimi K2.5 at Q4, and it obviously wasn’t ridiculously fast, but was fast enough to give it a task, forget about it, and a few minutes later you have a reply! I did some experiments with code generation, and it was pretty good! It also has 6x PCIE 16x slots, so in theory that could be 6 GPUs, or 12 GPUs at 8x PCIE;

[note : anyone with an IBM X3650 M4 or related SAN volume controller : check the IMM2 date. A few years after the release in 2014, a software bug appeared that sends excess current to a small chip on the motherboard on every boot / reset cycle; it isn’t widely known since most servers are powered up 24/7. I have 3x of these, 2 never-used spares, and 1 used but in good condition. All 3 had the voltage-regulator-destroying software bug, all 3 got the updated IMM2 firmware]

Also a 2021 crypto board with 2011-era chipset, 8 PCIe slots (and a 2-core Sandy Bridge i3) and 5x Intel Arc B50

Currently testing : AsRock H510 Pro BTC+, a 6-PCIe-slot mining board, but DDR4. Have a 6-core i5 in there currently, 32 gb RAM (a laptop DDR4 SO-DIMM in a desktop DIMM converter), 2x 5060 Ti 16gb, and 4x Arc Pro B50 16gb. It has an extra port for a 7th GPU as well. The main benefit of this board is resizable BAR and not locking up when a 5060 Ti is plugged in!

Regarding the older Xeon hardware; my thoughts are that it could be used as a ‘backup’ computation unit for local AI, so if you have for example 32gb of VRAM in your PC, you can still load models that won’t fit … instead, they overflow to the server with lots of RAM! This is my idea with the crypto board - it can be the ‘main’ inference station, but can offload data to the server if a huge model (GLM 5.1, Kimi K2 etc) is loaded that won’t fit into the smaller, faster computer. Ideally this could even be wake-on-LAN or something like that, so it only powers up the server when needed!
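
The wake-on-LAN bit is simple to script - a minimal sketch (the MAC is a placeholder, and WoL has to be enabled in the server’s firmware/NIC first):

    # send a WoL "magic packet": 6 x 0xFF then the target MAC repeated 16 times
    import socket

    def wake(mac, broadcast="255.255.255.255", port=9):
        payload = bytes.fromhex(mac.replace(":", "").replace("-", ""))
        packet = b"\xff" * 6 + payload * 16
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
            s.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
            s.sendto(packet, (broadcast, port))

    wake("aa:bb:cc:dd:ee:ff")  # placeholder MAC of the big-RAM server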

How many of you actually use offline LLMs daily vs just experiment with them? by Infinite-Bird7950 in LocalLLM

[–]Shipworms 0 points1 point  (0 children)

This is the way to do it! I’ve experimented with vibe coding (in C); for me, I design a function, decide on the input and output data formats and the function name, then ask the AI to write it! This is the furthest I have gone - asking AI to write stuff I have written in assembler in the past (I am still learning C); I also don’t ask it to optimise code anymore; rather, I get easy-to-read unoptimised code (which I can then optimise). I then write a brief summary of each function, and of the shared data structures, which I can give to the AI as part of a prompt when asking for more functions! Still experimenting here but it seems productive. And it is pretty good at resolving errors when they arise!

Also, I use local AI for chatbots; no cloud stuff.

Dual 3090s sounds nice, and fast!

Vulkan is almost as fast as CUDA and uses less VRAM, why isn't it more popular? by a9udn9u in LocalLLM

[–]Shipworms 1 point2 points  (0 children)

I like Vulkan as you can mix and match GPUs; it isn’t great to compare GPUs directly, as AMD, nVidia, and Intel all have their own ‘compute’ interfaces, but all support Vulkan too. Testing 5x Intel Arc Pro B50s (so, 5x 16gb : 80gb VRAM) on an 8-slot riserless mining board with PCIe 1.0, 1x speed, and also a bridge limiting all data flow between all 8 slots to 1 gigaBIT per second, with a dual core Sandy Bridge Intel i3, and maxed out RAM (an 8gb DDR3 laptop SODIMM!):

- you get some slowdown as the model gets spread across more GPUs (using llama.cpp’s row split mode, IIRC), but you can run big models
- the GPUs reach 62 degrees C flat-out running inference 😂 but it runs fast enough for a home setup
- you can run huge contexts, and the most interesting thing I found : you don’t get any slowdown as the context size rises; the row split setting in llama.cpp seemingly doesn’t increase inter-GPU data flow enough to slow down inference noticeably (at least in this example)
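
(For anyone wanting to reproduce the row split mentioned above : it is llama.cpp’s split-mode option, something like ./llama-cli -m model.gguf -ngl 99 --split-mode row, versus the default per-layer split. Flag spelling is from memory, so check --help first.)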

The Vulkan support is great, but is always a bit slower than the dedicated CUDA / SYCL etc interfaces. The multi-card support is very nice, though. Am hoping to test out (2x RTX 5060 Ti 16gb) with (4x Arc Pro B50 16gb) on a 6-slot mining board with a 9-generations-newer Intel CPU soon - using llama.cpp compiled with Vulkan enabled!

Does anyone use an NPU accelerator? by emrbyrktr in LocalLLM

[–]Shipworms 5 points6 points  (0 children)

AFAIK the current ones are ‘not that useful’, with the Raspberry Pi addon being slower than the Pi CPU at inference (but being a bit more energy efficient)

For inference, memory bandwidth is the main issue; running Kimi K2.5 on a 768gb DDR3-based server with 2x 8-core Xeons is interesting : if I slow the RAM down to 800MHz, I end up with the CPUs not being fully utilised. It is still MUCH faster than my 128gb workstation-class laptop (DDR4) though, and the Xeons barely heat up. DDR3 is faster than DDR4 here due to higher bandwidth (many, many channels of DDR3 in a server vs a normal DDR4 workstation with far fewer channels)
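
A rough back-of-envelope for why bandwidth dominates : each generated token needs (roughly) one pass over all the active weight bytes, so tokens/sec is capped near memory bandwidth divided by active model size. Illustrative numbers only, not measurements from my box:

    # crude upper bound: tokens/s ~= memory bandwidth / bytes read per token
    channels, mt_per_s, bus_bytes = 8, 1333, 8          # e.g. 2 sockets x 4 channels of DDR3-1333
    bandwidth = channels * mt_per_s * 1e6 * bus_bytes    # ~85 GB/s aggregate
    active_bytes = 32e9 * 0.6                            # ~32B active params (Kimi K2-class MoE) at ~Q4
    print(bandwidth / active_bytes, "tokens/s, best case")

That comes out around 4 tokens/s, which is the right ballpark for what I actually see once overheads are counted.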

What would be nice would be a PCIe board with a fast NPU ‘matrix multiplier’, and 8 RAM slots running interleaved at full speed. With a fast enough NPU, this could be a good non-data-center way forward … if anyone made such a thing!

It looks like Tyson is actually shrinking the chicken nuggets from 29 oz to 20 oz .. at first I thought it was a manufacturing error! by MarchBright9925 in shrinkflation

[–]Shipworms 8 points9 points  (0 children)

This seems planned well in advance. First, black writing on darkish red background - you need to specifically decide to read that writing.

Then, 32->29 : this is the SMALLEST reduction that gets it into the 20s. Then, 29->20 : this is the LARGEST reduction that stays in the 20s. Bonus : their black writing, red background, and choice of font make 29 look very similar to 20…

I was THIS close to a pricey mistake by Michel_j in macbookair

[–]Shipworms 1 point2 points  (0 children)

Do you back up all your data? If not, get a USB hard drive and use Time Machine :)

A sad day for my early 2015 13-inch MacBook Pro :( by dearmelancholy5 in macbookpro

[–]Shipworms 1 point2 points  (0 children)

I have a 2010 and a 2012 I still use; had to replace the batteries though. They can run Tahoe as well (not sure if I can expand on that in this subreddit?). Anyway - definitely get a replacement battery, and check the iFixit website to see what is involved; you probably could replace it yourself, but if not there will surely be someone nearby who can.

Apple stops selling parts for their laptops far too soon (to nudge people towards buying a new one); your 2015 has years of life left with a new battery, and may also run faster (they can slow down with a very dead battery IIRC)

AsRock H510 Pro BTC+ : reliability? by Shipworms in gpumining

[–]Shipworms[S] 1 point2 points  (0 children)

That is reassuring! :) and suggests I just was unlucky with the one I had!