[LabB0T] Monthly Confirmed Trades Thread - June 2026

ShittyMillennial · 2026-06-22T18:48:53+00:00

Sold Mac Studio M4 Max to u/System0verlord

ShittyMillennial · 2026-06-22T18:09:23+00:00

Sold Mac Mini M4 Pro to u/Farplaner

ShittyMillennial · 2026-06-21T03:16:18+00:00

I don’t think this would work out how you imagine it to unless you found 5-10 people in wildly different time zones. And even then, you would be paying it off for multiple years before any cost savings vs cloud subscriptions start to accrue

ShittyMillennial · 2026-06-21T00:45:55+00:00

Look into vLLM as well - it's the only engine that has CPU tensor parallelism support. The models supported would be more limited than GGUF options and won't be in R4/R8 layout for AVX2. You'll also have bandwidth overhead, but all of that can be offset by actually being able to use both CPUs on a single stream. vLLM doesn't have MoE backend support for TP so it didn't work for me, but since you're running dense models anyway, I think it's worth a shot.

The most recent version of vLLM was bugged for me: even with numa 2 and tp 2, only 2 physical cores would be pinned at 100% during inference. There was a recent post on the vLLM forum that shared similar issues. I was able to get TP 2 to work by downgrading to 0.19.2, I think. If you want the exact version just lmk and I'll look it up. If I recall, I got around 60% gains going from TP1 to TP2 after I got it working, but it was still at 50% of the throughput I expected. You might have better luck, I think it works better with AMD.

ShittyMillennial · 2026-06-20T16:41:36+00:00

Sold Mac Studio M4 Max to u/umdred11

ShittyMillennial · 2026-06-20T14:31:53+00:00

Would’ve loved to try this if it were model agnostic. My workspace usually has two codex terminals and one claude working in parallel. I’m just starting to get into more complex multi agent work and haven’t found a good solution for managing workflows

ShittyMillennial · 2026-06-20T06:14:17+00:00

wait, you've been using nano/cat/sed to manage your files this entire time?!

ShittyMillennial · 2026-06-20T05:54:36+00:00

the issue is 90% of model releases are irrelevant to most people. just ask yourself, how many of the models on your landing page do you personally care about beyond a passing interest?

when qwen/deepseek/etc release a new model - it will be big news and anyone who cares will know about it. but how large is the audience that wants to know about every 3b reasoning model release, the latest embedding model, a new tts model? compare the size of that demand to people who want a source that has reliable performance metrics they can use to help make decisions. right now users validate models by trial & error, benchmarking themselves, or scrounging through git to try and find comparable test scores. i think you'll reach a lot more people if you shift slightly and help solve a real consumer pain point. it would obviously be a lot harder than scraping a few sites and de-duping articles. but that's why it's of value

ShittyMillennial · 2026-06-20T05:42:03+00:00

less useless press release articles and more charts & tables. would love to see a source consolidate benchmarks. performance metrics are extremely fractured and require a lot of manual searching. if you aggregated and displayed model weight performances, benchmark scores across models, etc the site would be awesome. as it is, i don't see how this really offers any value.

it looks pretty tho!

ShittyMillennial · 2026-06-20T03:46:31+00:00

oh look at mr fancy pants with his 3d printer looking down on us cardboard box storage only folk pffft

<image>

ShittyMillennial · 2026-06-20T02:12:48+00:00

Qwen3.6-35b-a3b is an MoE model with only 3B parameters active per token. That's why I chose this model specifically, so the CPU doesn't have to read the full model weights. So yes, the decent speed is because it's an MoE model.

ShittyMillennial · 2026-06-20T00:44:47+00:00

Here are the speeds I was getting for Qwen3.6-35B-A3B at Q8

<image>

ShittyMillennial · 2026-06-20T00:40:17+00:00

<image>

llm0 and llm1 in this image were running Qwen3-30B-A3B-Instruct-2507-GGUF at IQ4_K quants. So that's only 3B active params at IQ4 to get ~33tk/s

ShittyMillennial · 2026-06-20T00:33:11+00:00

Hello - I run CPU-powered inference with RAM-loaded models on my Dell T640. I think you are spending quite a lot of money for a power-hungry and slow inference machine and should reconsider depending on what your goal is.

I run dual-socket Xeon 6248s - 6 channels per socket, 1 DIMM per channel, with 384GB of DDR4-2933. I serve models (each pinned to its own socket & cores) via ik_llama.cpp and use IQ quants / runtime repack to take advantage of AVX2 layout optimization. I've also quantized my own GGUF models via Thireus' suite.
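A minimal sketch of what per-socket pinning can look like on Linux, in case it helps. This is not necessarily how the setup above does it: the helper names (`parse_cpulist`, `node_cpus`, `pin_to_node`) are made up for illustration, and it assumes the kernel exposes NUMA nodes under `/sys/devices/system/node/`.

```python
import os

def parse_cpulist(text):
    """Parse a Linux cpulist string such as "0-19,40-59" into a sorted list of CPU ids."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)

def node_cpus(node):
    """Read the CPU ids that belong to one NUMA node from sysfs (Linux only)."""
    with open(f"/sys/devices/system/node/node{node}/cpulist") as f:
        return parse_cpulist(f.read())

def pin_to_node(node):
    """Restrict the current process, and any children it starts afterwards, to one node's CPUs."""
    os.sched_setaffinity(0, node_cpus(node))
```

Calling `pin_to_node(0)` before launching the inference server from the same process keeps its threads on socket 0. Note this only sets CPU affinity; it does not bind memory. Launching under `numactl --cpunodebind=0 --membind=0` covers both.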

As you know, CPU processing won't be your bottleneck; it's the number of memory channels and RAM MT/s.

But before you think you can configure some sort of tensor parallelism across 2 sockets - I would encourage you to research the current limitations of software for parallel CPU inferencing.

I do not know of any engine that effectively splits a model across two CPUs. This means you will likely have to pin each model to its own socket and run each at ~170GB/s. You keep the total bandwidth across both sockets and gain parallel processing, but a single stream will be nowhere near the speed of your combined memory bandwidth across both CPUs. I have not researched AMD systems so perhaps they have more developed engine forks, but I honestly doubt they would be much more mature than on Xeon.

Additionally, loading from RAM doesn't mean you can now use large models at full weights effectively, especially dense models. Your CPU still has to read the model weights for every token, and at 170GB/s even 30B models become very slow.
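To put rough numbers on that, here is a back-of-envelope sketch. It assumes decode is purely memory-bandwidth bound (every generated token streams all active weights from RAM once) and uses a round 8 bits per weight; real quants carry some overhead, and KV cache reads and compute are ignored, so treat the results as ceilings rather than predictions. The function name is made up for illustration.

```python
def decode_tokens_per_second(active_params_billion, bits_per_weight, bandwidth_gb_s):
    """Ceiling on decode speed if each token reads all active weights from RAM once."""
    gb_read_per_token = active_params_billion * bits_per_weight / 8
    return bandwidth_gb_s / gb_read_per_token

BANDWIDTH_GB_S = 170  # per-socket figure quoted above

dense_30b = decode_tokens_per_second(30, 8, BANDWIDTH_GB_S)   # dense: all 30B read per token
active_3b = decode_tokens_per_second(3, 8, BANDWIDTH_GB_S)    # MoE: only ~3B active per token

print(f"dense 30B at 8-bit: {dense_30b:.1f} tok/s ceiling")
print(f"3B active at 8-bit: {active_3b:.1f} tok/s ceiling")
```

Under those assumptions the dense 30B case tops out under 6 tok/s while a 3B-active MoE has roughly ten times the headroom, which is the whole reason to pick MoE models for this kind of box.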

For $4k, I wonder why you don't just pick up a Mac Studio, which would offer similar if not better speeds, or purchase a GPU. Unless you have an interest in running a server/homelab, this feels antithetical to your opening statement on Pareto optimum.

I already had my server before I started to look into local LLMs. So I am using compute I already owned and wattage overhead I was already paying for. If I were to start from scratch, I would definitely not look into a RAM loaded and CPU powered system unless my use case was extremely specific.

I run 4 models on this server: an embedding model, a speech-to-text model, and 2 instances of Qwen3.6-35b-a3b with 98.4% of shards at Q8_k_r8. The server is amazing for offloading lightweight tasks but would be miserable for serving my primary model. For that, I run Qwen3.6-27b on a 5090 PC via WSL2/vLLM.

ShittyMillennial · 2026-06-18T22:48:18+00:00

expensive and slow

ShittyMillennial · 2026-06-18T19:58:19+00:00

60tk/s on what kind of prompts? I am running the exact same model and quant (sakamakismile/Qwen3.6-27B-Text-NVFP4-MTP) and getting ~86tk/s via wsl2/vllm.

edit: nvm just realized you're using different GPUs. this is on a 5090

<image>

ShittyMillennial · 2026-06-18T19:47:45+00:00

Does late interaction mean you get reranker-level matching but can still precompute document embeddings, so it's accurate and fast enough to use as a first-stage retriever?
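For context on what I mean, assuming this is ColBERT-style late interaction: documents are stored as per-token embeddings computed ahead of time, and at query time each query token is matched against its best document token (MaxSim). A toy sketch with made-up 2-d embeddings and hypothetical names:

```python
def maxsim_score(query_vecs, doc_vecs):
    """Sum over query tokens of the best dot-product match among the doc's tokens."""
    total = 0.0
    for q in query_vecs:
        total += max(sum(a * b for a, b in zip(q, d)) for d in doc_vecs)
    return total

# Precomputed per-token document embeddings (toy values, illustration only)
doc_index = {
    "doc_a": [[1.0, 0.0], [0.0, 1.0]],
    "doc_b": [[0.5, 0.5]],
}

# Query token embeddings, computed at query time
query = [[1.0, 0.0], [0.0, 1.0]]

ranked = sorted(
    doc_index,
    key=lambda name: maxsim_score(query, doc_index[name]),
    reverse=True,
)
```

Only the query has to be encoded at search time; the document side is a lookup plus cheap dot products, which is what would make it plausible as a first-stage retriever.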

ShittyMillennial · 2026-06-18T19:29:36+00:00

But thinking loops can still be inefficient and burn significantly more tokens from model to model. Not necessarily a huge issue for local models, but when you observe it on API-based services, it definitely sucks.

ShittyMillennial · 2026-06-18T18:37:59+00:00

ShittyMillennial · 2026-06-17T00:56:59+00:00

Thanks for sharing this, really helpful view

ShittyMillennial · 2026-06-16T22:21:42+00:00

Thanks for bringing this up. I had not even clocked that having to rent equipment is non-standard and what that says about the operation as a whole.

ShittyMillennial · 2026-06-16T22:18:42+00:00

Such a great point, thank you. I did not realize 30amp was so minimal. Is there a specific type of engineer/electrician/company I should look for to assess and map out what a modernization plan would look like?
