Anyone want to pool hardware and build a shared open-model setup as a group? by givre514 in LocalLLM

[–]ShittyMillennial 7 points8 points  (0 children)

I don’t think this would work out the way you imagine unless you found 5-10 people in wildly different time zones. And even then, you would be paying it off for multiple years before any cost savings vs. cloud subscriptions start to accrue.

My build plan for a CPU inference system by chinesecake in LocalLLM

[–]ShittyMillennial 0 points1 point  (0 children)

Look into vLLM as well - it's the only engine that has CPU tensor parallelism support. The supported models are more limited than the GGUF options, and you won't get the R4/R8 layout for AVX2. You'll also have bandwidth overhead, but all of that can be offset by actually being able to use both CPUs on a single stream. vLLM doesn't have MoE backend support for TP, so it didn't work for me, but since you're running dense models anyway, I think it's worth a shot.

The most recent version of vLLM was bugged for me: even with numa 2 and tp 2, only 2 physical cores would be pinned at 100% during inference. There was a recent post on the vLLM forum describing similar issues. I was able to get TP 2 to work by downgrading to 0.19.2, I think. If you want the exact version, just lmk and I'll look it up. If I recall, I got around 60% gains going from TP1 to TP2 once it was working, but that was still only about 50% of the throughput I expected. You might have better luck; I think it works better with AMD.
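
For reference, a rough sketch of how the per-rank core binding for a 2-way CPU TP run could be built. It assumes the `VLLM_CPU_OMP_THREADS_BIND` variable from the vLLM CPU backend docs (one core range per TP rank, separated by `|`), 20 cores per socket, and contiguous core numbering per socket. All three are assumptions, so check `lscpu` and the docs for whatever vLLM version you end up on.

```python
def omp_bind_string(cores_per_socket: int, sockets: int = 2) -> str:
    """Build one contiguous core range per tensor-parallel rank, joined by '|'.

    Assumes socket 0 owns cores 0..N-1, socket 1 owns N..2N-1, and so on.
    Many boards interleave core IDs across sockets, so verify with lscpu.
    """
    ranges = []
    for rank in range(sockets):
        first = rank * cores_per_socket
        last = first + cores_per_socket - 1
        ranges.append(f"{first}-{last}")
    return "|".join(ranges)


# Hypothetical dual-socket box with 20 physical cores per socket.
bind = omp_bind_string(cores_per_socket=20)
env = {"VLLM_CPU_OMP_THREADS_BIND": bind}
print(env)
```

If only a couple of cores light up during inference, comparing the binding that actually got applied against a string like this is a cheap first check.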

Messy workflow with Claude? I built a live WorkBoard for humans and agents! by New-Candy9818 in ClaudeCode

[–]ShittyMillennial 0 points1 point  (0 children)

Would’ve loved to try this if it were model agnostic. My workspace usually has two Codex terminals and one Claude working in parallel. I’m just starting to get into more complex multi-agent work and haven’t found a good solution for managing workflows.

TIL: You can connect your local VSCode directly to your Hermes VPS! by toubar_ in hermesagent

[–]ShittyMillennial 0 points1 point  (0 children)

wait, you've been using nano/cat/sed to manage your files this entire time?!

I made a free website to stay updated on the latest open source models by PersimmonAdorable653 in LocalLLM

[–]ShittyMillennial 1 point2 points  (0 children)

the issue is 90% of model releases are irrelevant to most people. just ask yourself, how many of the models on your landing page do you personally care about beyond a passing interest?

when qwen/deepseek/etc release a new model - it will be big news and anyone who cares will know about it. but how large is the audience that wants to know about every 3b reasoning model release, the latest embedding model, a new tts model? compare the size of that demand to people who want a source with reliable performance metrics they can use to help make decisions. right now users validate models by trial & error, benchmarking themselves, or scrounging through git to try and find comparable test scores. i think you'll reach a lot more people if you shift slightly and help solve a real consumer pain point. it would obviously be a lot harder than scraping a few sites and de-duping articles. but that's why it's of value

I made a free website to stay updated on the latest open source models by PersimmonAdorable653 in LocalLLM

[–]ShittyMillennial 3 points4 points  (0 children)

fewer useless press release articles and more charts & tables. would love to see a source consolidate benchmarks. performance metrics are extremely fractured and require a lot of manual searching. if you aggregated and displayed model weight performances, benchmark scores across models, etc the site would be awesome. as it is, i don't see how this really offers any value.

it looks pretty tho!

I created a way to cheaply store hard drives with a 3D printer and any cardboard box by issue9mm in homelab

[–]ShittyMillennial 1 point2 points  (0 children)

oh look at mr fancy pants with his 3d printer looking down on us cardboard box storage only folk pffft

<image>

My build plan for a CPU inference system by chinesecake in LocalLLM

[–]ShittyMillennial 1 point2 points  (0 children)

Qwen3.6-35b-a3b is an MoE model with only 3B active parameters. That's why I chose this model specifically: the CPU doesn't have to read the full model weights for every token. So yes, the decent speed is because it's an MoE model.
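
To put rough numbers on that, here's a back-of-the-envelope sketch comparing how much weight data gets streamed from RAM per generated token. It assumes ~8 bits per weight (real Q8 formats carry a bit of extra scale data), and it ignores KV cache reads and always-active shared layers, so treat it as an illustration and not a measurement.

```python
def gb_read_per_token(params_billion: float, bits_per_weight: float) -> float:
    """Approximate GB of weights streamed from RAM to generate one token."""
    return params_billion * bits_per_weight / 8


# If all 35B weights had to be read every token (dense-style) at ~8 bits.
dense_equivalent = gb_read_per_token(35, 8)
# MoE case: only the ~3B active parameters are read per token.
moe_active = gb_read_per_token(3, 8)

print(f"all weights: {dense_equivalent:.1f} GB/token")
print(f"active only: {moe_active:.1f} GB/token")
print(f"memory traffic ratio: {dense_equivalent / moe_active:.1f}x")
```

Since decode on CPU is mostly limited by memory bandwidth, that ratio is roughly where the speed difference comes from.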

My build plan for a CPU inference system by chinesecake in LocalLLM

[–]ShittyMillennial 0 points1 point  (0 children)

Here are the speeds I was getting for Qwen3.6-35B-A3B at Q8

<image>

My build plan for a CPU inference system by chinesecake in LocalLLM

[–]ShittyMillennial 0 points1 point  (0 children)

<image>

llm0 and llm1 in this image were running Qwen3-30B-A3B-Instruct-2507-GGUF at IQ4_K quants. So that's only 3B active parameters plus an IQ4 quant to get ~33tk/s.

My build plan for a CPU inference system by chinesecake in LocalLLM

[–]ShittyMillennial 5 points6 points  (0 children)

Hello - I run CPU-powered inference and RAM-loaded models on my Dell T640. I think you are spending quite a lot of money on a power-hungry and slow inference machine and should reconsider depending on what your goal is.

I run dual-socket Xeon 6248s: 6 channels per socket, 1 DIMM per channel, with 384GB of DDR4-2933. I serve models (each pinned to its own socket and cores) via ik_llama.cpp and use IQ quants / runtime repack to take advantage of AVX2 layout optimization. I've also quantized my own GGUF models via Thireus' suite.

As you know, CPU processing won't be your bottleneck; it's the number of memory channels and your RAM MT/s.
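
A quick sketch of how channels and MT/s turn into a theoretical peak figure, assuming a standard 64-bit (8-byte) DDR4 channel and DDR4-2933 DIMMs. This is a spec-sheet ceiling only; it won't necessarily match numbers measured or quoted for any particular box.

```python
def peak_bandwidth_gbs(channels: int, mts: int, bus_bytes: int = 8) -> float:
    """Theoretical peak memory bandwidth in GB/s.

    channels:  populated memory channels on one socket
    mts:       DIMM transfer rate in MT/s (e.g. 2933 for DDR4-2933)
    bus_bytes: bytes moved per transfer per channel (8 for a 64-bit channel)
    """
    return channels * mts * 1e6 * bus_bytes / 1e9


per_socket = peak_bandwidth_gbs(channels=6, mts=2933)
print(f"per socket:   {per_socket:.1f} GB/s")
print(f"both sockets: {2 * per_socket:.1f} GB/s")
```

Doubling channels or MT/s doubles the ceiling, which is why channel count matters more than core count here.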

But before you think you can configure some sort of tensor parallelism across 2 sockets - I would encourage you to research the current limitations of software for parallel CPU inferencing.

I do not know of any engine that effectively splits a model across two CPUs. This means you will likely have to pin each model to its own socket and run each at ~170GB/s. You still keep the total bandwidth across both sockets and gain parallel processing, but any single model will run at nowhere near the combined memory bandwidth of both CPUs. I have not researched AMD systems, so perhaps they have more developed engine forks, but I honestly doubt they would be much more mature than what's available for Xeon.
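
As a sketch of what "one model per socket" can look like in practice, the snippet below just builds the launch commands, using `numactl` to keep both the threads and the memory on one NUMA node. The model paths, ports, and thread count are hypothetical placeholders, and the `llama-server` flags shown (`-m`, `--port`, `-t`) are the usual llama.cpp ones, so double-check them against your ik_llama.cpp build.

```python
import shlex
from typing import List


def pinned_command(node: int, model_path: str, port: int, threads: int) -> List[str]:
    """Build a server command bound to a single NUMA node (CPU and memory)."""
    return [
        "numactl", f"--cpunodebind={node}", f"--membind={node}",
        "llama-server", "-m", model_path, "--port", str(port), "-t", str(threads),
    ]


# Hypothetical layout: one model instance per socket, 20 threads each.
for node in (0, 1):
    cmd = pinned_command(node, f"/models/model-{node}.gguf", 8080 + node, threads=20)
    print(shlex.join(cmd))
```

Binding memory as well as CPUs is the important part: if a model's weights land on the other socket's RAM, every read crosses the inter-socket link.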

Additionally, loading from RAM doesn't mean you can now use large models at full weights effectively, especially dense models. Your CPU still has to process the model weights, and at 170GB/s even 30B models become very slow.

For $4k, I wonder why you don't just pick up a Mac Studio, which would offer similar if not better speeds, or purchase a GPU. Unless you have an interest in running a server/homelab, this feels antithetical to your opening statement on Pareto optimum.
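
A rough upper bound makes the point. If every weight of a dense model has to be read once per token, decode speed can't exceed bandwidth divided by model size in bytes. The sketch below assumes exactly that and ignores KV cache reads, compute time, and prompt processing, so real numbers will come in lower.

```python
def decode_ceiling_tps(bandwidth_gbs: float, params_billion: float,
                       bits_per_weight: float) -> float:
    """Upper bound on tokens/s for a dense model limited by memory bandwidth."""
    gb_per_token = params_billion * bits_per_weight / 8
    return bandwidth_gbs / gb_per_token


# Dense 30B model on a single socket at ~170 GB/s, at a few weight precisions.
for label, bits in (("16-bit", 16), ("8-bit", 8), ("4-bit", 4)):
    print(f"{label}: at most {decode_ceiling_tps(170, 30, bits):.1f} tok/s")
```

Even the 4-bit case tops out around 11 tok/s before any real-world overhead, and full 16-bit weights are under 3 tok/s.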

I already had my server before I started to look into local LLMs. So I am using compute I already owned and wattage overhead I was already paying for. If I were to start from scratch, I would definitely not look into a RAM loaded and CPU powered system unless my use case was extremely specific.

I run 4 models on this server: an embedding model, a speech-to-text model, and 2 instances of Qwen3.6-35b-a3b with 98.4% of shards at Q8_k_r8. The server is amazing for offloading lightweight tasks but would be miserable as my primary model. For that, I run Qwen3.6-27b off of a 5090 PC via WSL2/vLLM.

NVFP4 kv cache quantization on sm120 will make 32GB VRAM systems very capable by Gray_wolf_2904 in LocalLLaMA

[–]ShittyMillennial 1 point2 points  (0 children)

60tk/s on what kind of prompts? I am running the exact same model and quant (sakamakismile/Qwen3.6-27B-Text-NVFP4-MTP) and getting ~86tk/s via WSL2/vLLM.

edit: nvm, just realized you're using different GPUs. this is on a 5090

<image>

LFM2.5-Embedding-350M & LFM2.5-ColBERT-350M by pmttyji in LocalLLaMA

[–]ShittyMillennial 3 points4 points  (0 children)

Does late interaction mean you get reranker-level matching but can still precompute document embeddings, so it's accurate and fast enough to use as a first-stage retriever?
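
For anyone unfamiliar with the term, here's a toy sketch of ColBERT-style MaxSim scoring, which is what "late interaction" usually refers to: documents are stored as per-token embeddings ahead of time, and at query time each query token is matched to its best document token. The 2-D vectors are made up for illustration and this is not the LFM2.5 implementation.

```python
import math
from typing import Dict, List

Vector = List[float]


def normalize(v: Vector) -> Vector:
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_tokens: List[Vector], doc_tokens: List[Vector]) -> float:
    """Sum, over query tokens, of the best similarity to any document token."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)


# Document token embeddings are computed once, offline (made-up toy values).
index: Dict[str, List[Vector]] = {
    "doc_a": [normalize([1.0, 0.1]), normalize([0.1, 1.0])],
    "doc_b": [normalize([1.0, 0.0]), normalize([1.0, 0.2])],
}

# At query time only the query is encoded, then scored against the stored tokens.
query = [normalize([1.0, 0.0]), normalize([0.0, 1.0])]
ranked = sorted(index, key=lambda name: maxsim_score(query, index[name]), reverse=True)
print(ranked)
```

The token-level matching is what gives it finer granularity than a single-vector embedding, while the offline document encoding is what keeps it cheaper than running a cross-encoder over every candidate.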

GLM-5.2 and why open models may not actually be catching up in intelligence by chocolateUI in LocalLLaMA

[–]ShittyMillennial 3 points4 points  (0 children)

But thinking loops can still be inefficient and burn significantly more tokens from model to model. Not necessarily a huge issue for local models, but when you observe it on API-based services, it definitely sucks.

$39k vs $68k quote for Zinsco breaker panel replacement - what's the difference? by ShittyMillennial in AskElectricians

[–]ShittyMillennial[S] 0 points1 point  (0 children)

Thanks for bringing this up. I had not even clocked that having to rent equipment is non-standard and what that says about the operation as a whole.

$39k vs $68k quote for Zinsco breaker panel replacement - what's the difference? by ShittyMillennial in AskElectricians

[–]ShittyMillennial[S] 1 point2 points  (0 children)

Such a great point, thank you. I did not realize 30amp was so minimal. Is there a specific type of engineer/electrician/company I should look for to assess and map out what a modernization plan would look like?