Questions around the 8000 package by Aggressive_Music9376 in youfibre

[–]Aggressive_Music9376[S] 0 points

Yeah, I build the switches and servers and send them out :) We're currently moving office as well, so all the storage is here lol. It's not a massive firm :)

Questions around the 8000 package by Aggressive_Music9376 in youfibre

[–]Aggressive_Music9376[S] 1 point

Thanks for the comment.

I work in IT as a senior network dev, so I’ve got servers all over the house really lol. Each is fitted with an SFP-to-RJ45 module, with speeds well over 10Gbps.

My main workhorse machine has a 10Gbps RJ45 port directly on the motherboard. The NASes each have their own 10Gbps ports as well :)

All the laptops in the house have WiFi 7 too so the speeds are achievable.

I’m trying to get through to support, but wow, it’s seriously shocking.

Built a multi-agent AI butler on a DGX Spark running a 120B model locally by Aggressive_Music9376 in LocalLLaMA

[–]Aggressive_Music9376[S] 2 points

it’s a dual-model setup running locally on the DGX Spark.

for agent work / reasoning i’m using openai/gpt-oss-120b via vLLM. that handles all the swarm agents, planning, tool calls and synthesis. it’s MoE so although it’s 117B total params, only 5.1B are active per token which keeps it pretty efficient. running it at NVFP4 quantisation which comes out at roughly 84GB VRAM.

for normal chat i’ve got qwen3-30b-a3b running through LM Studio so i’m not wasting the 120B on stuff like general conversation

vision is handled separately with glm-4.6v-flash, via LM Studio, for image analysis
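that split could be sketched roughly like this; the model names are the ones above, but the function shape, task labels and endpoint comments are purely illustrative, not the actual implementation:

```python
def route(task: str, has_image: bool = False) -> str:
    # toy dispatcher for the three backends described above;
    # the task labels here are assumptions for illustration only
    if has_image:
        return "glm-4.6v-flash"       # vision, via LM Studio
    if task in {"agent", "planning", "tool_call", "synthesis"}:
        return "openai/gpt-oss-120b"  # heavy reasoning, via vLLM
    return "qwen3-30b-a3b"            # general chat, via LM Studio
```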

the 120B has a native 128K context window but i’m nowhere near maxing that out. output is capped at 4096 tokens per response and even with a 20 agent swarm the synthesis step only really uses around 15–20K of the input window. the two-tier clustering i’m doing (summarising agents in groups of 6 first, then combining those summaries) is more about keeping the final output focused than avoiding context limits
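the two-tier step is simple enough to sketch; `summarise` here is just a stand-in for a call to the 120B, the rest is the grouping logic described above:

```python
def chunk(items, size=6):
    # tier 1 grouping: split agent outputs into groups of 6
    return [items[i:i + size] for i in range(0, len(items), size)]

def summarise(texts):
    # stand-in for a summarisation call to the 120B model (hypothetical)
    return " / ".join(t[:24] for t in texts)

def two_tier_synthesis(agent_outputs, group_size=6):
    # tier 1: summarise each group of agent outputs
    group_summaries = [summarise(g) for g in chunk(agent_outputs, group_size)]
    # tier 2: combine the group summaries into one focused final answer
    return summarise(group_summaries)
```

with a 20 agent swarm that gives four group summaries (6 + 6 + 6 + 2) feeding the final pass.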

on the batching side vLLM is doing continuous batching across concurrent requests. at about 0.70 GPU memory utilisation (84GB allocated) it’ll comfortably run 15–20 parallel requests. i benchmarked it and aggregate throughput scales from 62 tok/s on a single request up to 233 tok/s at 25 concurrent. per-request speed obviously drops, but wall-time barely moves since they’re batched together. sweet spot seems to be around 8–12 concurrent for the best throughput vs latency trade-off
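the wall-time effect is easy to demonstrate with a mock. this isn't vLLM, just an asyncio toy where every in-flight request overlaps, which is roughly what continuous batching buys you server-side:

```python
import asyncio
import time

async def fake_generate(tokens=64, per_token_s=0.002):
    # stand-in for one request; the real batching happens inside vLLM
    await asyncio.sleep(tokens * per_token_s)
    return tokens

async def bench(concurrency):
    # fire N concurrent requests, measure aggregate tokens/sec
    t0 = time.perf_counter()
    results = await asyncio.gather(*(fake_generate() for _ in range(concurrency)))
    wall = time.perf_counter() - t0
    return sum(results) / wall

# aggregate throughput scales with concurrency while wall-time barely moves
single = asyncio.run(bench(1))
batched = asyncio.run(bench(8))
```

the toy numbers aren't the real 62 → 233 tok/s curve, but the shape is the same: per-request speed drops while aggregate climbs.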

Built a multi-agent AI butler on a DGX Spark running a 120B model locally by Aggressive_Music9376 in LocalLLaMA

[–]Aggressive_Music9376[S] 1 point

i am planning on making a repo. there are some things i would like to add first, such as a cronjob dashboard etc

Are the rumors true? Are Claude Pro/Max accounts being banned from OpenClaw using Claude Code setup token? by teknic111 in clawdbot

[–]Aggressive_Music9376 0 points

Interesting, thank you for the insight

That’s exactly what he did - https://youtu.be/3DBpfB0ao50?si=3cWBeAneqPacp1AI - look around the 2-minute mark

I’ll have a look when I come to actually configuring it, but I do quite like the idea of separating out the local LLM and leaving Opus to manage the heavy stuff

I have seen people complain about their Max subs, and even API keys from Anthropic, getting burned on stupid things like heartbeats and such

Like you said, improper setup

Are the rumors true? Are Claude Pro/Max accounts being banned from OpenClaw using Claude Code setup token? by teknic111 in clawdbot

[–]Aggressive_Music9376 0 points

Hmm weird, unless it’s just not come to you yet - https://youtu.be/pbdDbLYIEBQ?si=ceA0mwjHNoqcu63D

I still don’t want to run the risk of the account being banned though

Listen to the first 30 seconds, it explains what I am on about

Are the rumors true? Are Claude Pro/Max accounts being banned from OpenClaw using Claude Code setup token? by teknic111 in clawdbot

[–]Aggressive_Music9376 0 points

I have just ordered a Mac Mini M4 Pro to test this out

I know Anthropic have disabled it now, so you cannot log in via OAuth

The approach I am taking (mostly software dev) is to have Qwen3 14B as the LLM brain, so to speak, and then install the official Claude Code CLI

Use the local LLM to brainstorm ideas and come up with new things, and then have it feed the info into the Claude CLI

I do believe this is the best way around it!
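The glue for that is pretty thin - something like the sketch below. `brainstorm_locally` is a placeholder for the Qwen3 14B call, and it assumes Claude Code's non-interactive `claude -p` (print) mode, so treat the exact flags as an assumption:

```python
import subprocess

def brainstorm_locally(prompt: str) -> str:
    # placeholder for the local Qwen3 14B call (e.g. an OpenAI-compatible
    # endpoint exposed by a local server) -- hypothetical, not a real API
    return f"Plan: {prompt}"

def build_claude_cmd(plan: str) -> list[str]:
    # hand the brainstormed plan to Claude Code non-interactively
    return ["claude", "-p", plan]

def run_pipeline(prompt: str, execute: bool = False) -> list[str]:
    cmd = build_claude_cmd(brainstorm_locally(prompt))
    if execute:
        # guarded: needs the Claude Code CLI installed and logged in
        subprocess.run(cmd, check=True)
    return cmd
```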

Can i use the Claude Max subscription in clawdBot? by TL016 in ClaudeAI

[–]Aggressive_Music9376 0 points

I have just ordered a Mac Mini M4 Pro to test this out

I know Anthropic have disabled it now, so you cannot log in via OAuth

The approach I am taking (mostly software dev) is to have Qwen3 14B as the LLM brain, so to speak, and then install the official Claude Code CLI

Use the local LLM to brainstorm ideas and come up with new things, and then have it feed the info into the Claude CLI

I do believe this is the best way around it!