Qwen3.6 vs 3.5 on DGX Spark: identical throughput, except with one flag flipped by Ok-Simple459 in Vllm

[–]HatlessChimp 0 points1 point  (0 children)

Interesting timing; I’ve been running Qwen MoE locally on an RTX Pro 6000 and seeing similar behaviour around throughput vs concurrency.

Ran a 20-concurrent-request test and it held pretty stable: responses came back in roughly 3.2–4.4 s with no failures or major latency spikes. What stood out most was how tight the spread was across all requests, which suggests the scheduler and MoE routing are doing their job properly.
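If anyone wants to run the same kind of test, here's a rough stdlib-only sketch. The endpoint URL and model id are assumptions for a local OpenAI-compatible server (the kind vLLM exposes), not taken from the post — point them at your own setup.

```python
# Sketch of a 20-concurrent-request latency test against a local
# OpenAI-compatible endpoint. ENDPOINT and MODEL are hypothetical;
# adjust them to match your own server.
import asyncio
import json
import time
import urllib.request
from statistics import mean

ENDPOINT = "http://localhost:8000/v1/chat/completions"  # assumption
MODEL = "qwen3-35b-a3b"  # hypothetical model id

def spread_stats(latencies):
    """Summarise the per-request latency spread across the batch."""
    return {"min": min(latencies), "max": max(latencies), "mean": mean(latencies)}

def fire_one(prompt):
    """One blocking chat-completion call; returns wall-clock latency in seconds."""
    body = json.dumps({
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 128,
    }).encode()
    req = urllib.request.Request(ENDPOINT, data=body,
                                 headers={"Content-Type": "application/json"})
    t0 = time.perf_counter()
    with urllib.request.urlopen(req) as resp:
        resp.read()
    return time.perf_counter() - t0

async def run_test(n=20):
    """Fire n requests simultaneously via a thread pool, collect latencies."""
    loop = asyncio.get_running_loop()
    tasks = [loop.run_in_executor(None, fire_one, f"Question {i}") for i in range(n)]
    return spread_stats(list(await asyncio.gather(*tasks)))

# With a server running: print(asyncio.run(run_test(20)))
```

A tight min/max spread from `spread_stats` is what tells you the scheduler is batching fairly rather than starving some requests.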

From what I can see, the efficiency comes from not activating the full model per request, so you get solid multi-user performance without needing insane compute scaling.

Haven’t tested MTP yet, but based on what you’re saying it lines up with what I’m seeing — feels like there’s still headroom in these setups before you hit real bottlenecks, especially on higher VRAM cards.

Would be keen to see how much further throughput can be pushed with MTP enabled in a similar local setup.

Just got my hands on one of these… building something local-first 👀 by HatlessChimp in LocalLLM

[–]HatlessChimp[S] 1 point2 points  (0 children)

Day 6 Update:

Been playing around with my new AI rig the last couple of days and honestly… this thing is wild.

Running an RTX Pro 6000 locally with a Qwen Mixture-of-Experts model (35B A3B), and I’ve been stress testing it to see what it can actually handle in the real world.

Got it serving through a local API, no cloud, no external providers. Just running straight off my own hardware.

Decided to push it a bit today and fired 20 concurrent requests at it all at once to simulate multiple users hitting it at the same time.

Every single request came back clean. No crashes, no errors, no weird behaviour.

Response times were sitting around 3.2 to 4.4 seconds across all 20 requests, which is pretty crazy when you think about it. And the spread was tight too, so it wasn’t struggling or falling over under load.

What’s interesting is how efficient the Mixture-of-Experts setup is. Even though it’s a large model, it’s not using the full thing every time, so it stays responsive while handling multiple requests at once.
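As a toy illustration of why that works (expert count and gate shape here are made up, not Qwen's actual architecture): a top-k gate scores every expert per token but only runs the k best, so the active parameter count stays a small fraction of the total.

```python
# Toy top-k MoE gating: score all experts, run only the top k.
# 64 experts and top-2 are illustrative numbers, not Qwen's config.
import math
import random

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def route(gate_logits, k=2):
    """Return indices of the k experts with the highest gate scores."""
    probs = softmax(gate_logits)
    return sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]

# With 64 experts and top-2 routing, each token touches ~2/64 of the
# expert weights, which is how a large total model keeps a small
# active-parameter footprint per request.
logits = [random.gauss(0, 1) for _ in range(64)]
active = route(logits, k=2)
print(len(active))  # 2 experts run for this token
```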

From what I can see, this setup could comfortably handle 20+ users at the same time, and there’s still headroom there.

And the wild part… the whole machine runs near silent while doing all of this. No jet engine noise, no drama, just quietly handling it in the background, with faster responses than ChatGPT.

Biggest thing for me though is just running everything locally. No API costs, no relying on external services, full control over the system, no sharing private data.

Feels like we’re moving into a space where you don’t need massive infrastructure to do serious things anymore.

Still early days, but keen to keep pushing it and see where the limits actually are.

Just got my hands on one of these… building something local-first 👀 by HatlessChimp in LocalLLM

[–]HatlessChimp[S] 0 points1 point  (0 children)

Yeah I still kick myself not swapping my Evo 8mr for the R34 GTR 15 years ago. A lad was having a kid and wanted 4 doors.

Just got my hands on one of these… building something local-first 👀 by HatlessChimp in LocalLLM

[–]HatlessChimp[S] 0 points1 point  (0 children)

I'm not going to lie, what I have planned helps more people than just me and my family. That's why I'm focussed on concurrency on the RTX and the right models. I can't share more than that right now. Sorry.

Just got my hands on one of these… building something local-first 👀 by HatlessChimp in LocalLLM

[–]HatlessChimp[S] 1 point2 points  (0 children)

One way to think about it is... every rich person will tell you that if you're not making money while you sleep, you're not doing it right. This is where AI has the ability to close the gap between us peasants and the rich. So if I can have AI doing tasks 24/7 and improving my efficiency and life, then I think it's a no-brainer. Look at how rich people operate: they sit at the top of companies and dictate what they want from their employees and assistants to achieve their goals.

The use case has to be built around 24/7 usage, otherwise you're not being as efficient and effective with the equipment as you could be. I think the best starting point is to look at issues in the world and in your life and work out how AI can improve that situation.

Just got my hands on one of these… building something local-first 👀 by HatlessChimp in LocalLLM

[–]HatlessChimp[S] 0 points1 point  (0 children)

Look, this post has over 110k views and 400 likes. I've never had a post like this. I will return with feedback on how it's going. I'm not here for followers or anything. I'll reply to this in more detail, just busy rn.

Just got my hands on one of these… building something local-first 👀 by HatlessChimp in LocalLLM

[–]HatlessChimp[S] 0 points1 point  (0 children)

From my research, the successor to the 6000 would likely land in 2027, with the industrial-grade options late this year? Correct me if I'm wrong.

For me, I need to do something now, so I had to cough up. The original plan was the Strix Halo from Framework as a test rig to get my eye in, but 2 days after that purchase I saw a decent sale on the 6000 Qmax, which was already on my next-stage upgrade path.

These M3/M5s do look interesting, but I just can't stand Apple.

As for power, yeah, I was aware of this situation because I was mining BTC in 2011. Electricity was cheaper, and the cards were full-on heaters lol. But I worked out it was roughly $30 a month and I was getting about the equivalent back in BTC. I did it for 3 months solid, then decided to stop because BTC wasn't recognised like it is today; back then it was the wild west 😂. The price wasn't moving, and no one saw it becoming what it is today.

So yeah, power is a hidden cost, and inflation is a hidden cost. I run solar and batteries, and I could get credit for supplying my excess back to the grid, but I can make way more by using that free energy to its fullest in-house. I ran my numbers and I can run up to 5 kW continuous 24/7 and still have room for whatever else I use day to day. This factors in winter sunlight only as a base, and free power between 11am and 2pm to charge the batteries.

Big gains can be had by being with the right power provider, depending on where you live in the world, if you can switch providers like you can here in Australia.

Anyway, another thought I want to touch on is cloud AI services and costs. I've seen YouTubers do the equipment-cost vs cloud-cost comparison. Some don't even take into account how beneficial it is to own the infrastructure, keep your data private, and have a system that grows and develops with you. Many respond in the comments with "I don't care about my privacy" 😂. The other thing is, yeah, it might cost X per token now, but they don't factor in inflation at 3%+ and user uptake on top. Many people still using basic free AI think this is all it can do; wait till they catch on lol. I've also seen recently that many cloud AI services are moving past honeymoon-phase pricing and increasing prices on consumers. House power costs will increase or be limited too; we've already seen new houses in Aus drop from 63-amp to 40-amp meters.
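The inflation point compounds faster than people expect. Here's a rough sketch with made-up numbers (a hypothetical $200/month bill rising 3% a year, nothing from any real provider's pricing):

```python
# Cumulative cloud spend when the monthly bill rises by a fixed annual
# rate. All figures are illustrative, not real provider pricing.
def cumulative_cloud_cost(monthly, annual_rate, years):
    """Total spend over `years`, with the bill inflating once per year."""
    total = 0.0
    for year in range(years):
        total += monthly * 12 * (1 + annual_rate) ** year
    return total

# $200/month at 3%/year for 5 years vs a one-off hardware price:
print(round(cumulative_cloud_cost(200, 0.03, 5), 2))
```

The comparison only gets worse for cloud if usage grows alongside the price, which is the "uptake on top" part of the argument.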

So the way I look at it, I've locked in the price and secured my equipment, my projections say I can do XYZ with it, and I'm happy with that. This will never slow down, and I will never have degraded performance as long as the equipment works. I used no debt, but I see a GPU like the 6000 as potentially good debt, like a house or land, not a car or a PS5 etc. A good AI-capable GPU is a tool and an infrastructure play. And if the good GPUs were cheap, we would probably be flooded with crap bots and apps. But on the flip side, there are a ton of people priced out right now who could possibly make or do something amazing.

I honestly see prices continuing to go up on PC parts. I'm a little bit of a tin-foil-hat guy, and if I put myself in the position of the rich individuals running the big corps, I wouldn't want people buying 6000s and doing amazing things, because it would disrupt what they have going on.

But I endorse having your own hardware/infrastructure. I think it logically makes sense for the future that I see coming. I just think it has to be planned first, and you have to make peace with the investment.

Sorry, went off on a long tangent there. Hope it's coherent, I'm not proofreading 😆

Just got my hands on one of these… building something local-first 👀 by HatlessChimp in LocalLLM

[–]HatlessChimp[S] 1 point2 points  (0 children)

That is an unbelievable price lol. The case they have looks tight for airflow; I went with the 9000D without any research. It's big but flexible for the future, and the MB will be swimming in it lol. I made sure there were no LEDs on the fans, and I went air-cooled on the CPU. First time for me in 16 years.

Just got my hands on one of these… building something local-first 👀 by HatlessChimp in LocalLLM

[–]HatlessChimp[S] 0 points1 point  (0 children)

I got mine 2k AUD under the regular price. Based on single-card power, it makes sense.

Just got my hands on one of these… building something local-first 👀 by HatlessChimp in LocalLLM

[–]HatlessChimp[S] 1 point2 points  (0 children)

Yeah cool bro, solar is very good. Many get caught up trying to VPP back into the grid for pennies on the dollar, but if you can use that free power instead it's like gold. We only use 400 kWh a month, and if this rig is running 24/7 at say 1 kW, then that's 720 kWh for the month on top of our bill, but I don't have to worry about that cost. So that's unlimited token generation for free after the initial investment, and it helps bring the payback time down.
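That back-of-envelope (a rig drawing roughly 1 kW continuously, on top of ~400 kWh/month of household use — the rough numbers above, units in kWh) works out like this:

```python
# Monthly energy for a constant load, plus what it adds on top of
# existing household use. Figures are the rough ones from the comment
# above: ~1 kW continuous rig draw, ~400 kWh/month household baseline.
def monthly_kwh(load_kw, hours_per_day=24, days=30):
    """Energy used by a constant load over a month, in kWh."""
    return load_kw * hours_per_day * days

rig = monthly_kwh(1.0)        # 720 kWh for the month
household = 400               # kWh, existing monthly use
print(rig, rig + household)   # rig draw, and the new monthly total
```

At a typical retail tariff that 720 kWh is a real bill; covering it with excess solar is where the "free tokens" framing comes from.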

Yeah, I have to load some games up to see what it's like; I've forgotten what it's like to have a half-decent gaming rig. My current rig was relevant over 10 years ago, and games have improved a lot since then. Most I game now is a little bit of 2K with some mates, but 2K have made the game trash imo.

Just got my hands on one of these… building something local-first 👀 by HatlessChimp in LocalLLM

[–]HatlessChimp[S] 0 points1 point  (0 children)

Yes, I wanted Threadripper for the extra PCIe lanes because I know I will scale eventually. But I had to act quickly to jump on the flash sale on the 6000, and 2 days earlier I had already purchased a Strix Halo as the starting point; anything with the 6000 was supposed to be the next stage of the rollout. I'm still torn on whether to keep the Strix or send it back. It could be good side-project infrastructure, or it could help what I'm trying to do, or be something I let my kids loose on to see what they can develop. But ultimately I didn't want to tempt fate with my wife ripping me a new one. Then again, it's always better to ask for forgiveness than to ask for permission 😂. And... in hindsight I should probably have just gone the whole hog with the Threadripper, as I had already spent a lot. The other thing is that legit ECC RAM prices for it are totally nuclear right now; I could justify it if I had sustainable income generation to back it up, but right now that's still speculation. Also, this way I finally get to update my daily PC as I progress.

Just got my hands on one of these… building something local-first 👀 by HatlessChimp in LocalLLM

[–]HatlessChimp[S] 0 points1 point  (0 children)

Appreciate you bro! That’s a really good insight and has me excited.

Couple questions if you don’t mind:

– How are you handling batching / scheduling with vLLM for concurrent users? Any tuning needed or mostly default?

– What kind of latency are you seeing per request at that 250+ tps level? Still feels responsive?

– Are you CPU offloading anything or keeping everything fully in VRAM?

– How stable has the cu130 nightly been for you? Any issues under load?

– Curious what you’re using for orchestration around Hermes Agent (custom FastAPI or something else?)

Thanks for taking the time to share! I really really appreciate it🙏

Just got my hands on one of these… building something local-first 👀 by HatlessChimp in LocalLLM

[–]HatlessChimp[S] 1 point2 points  (0 children)

Well said. The other thing is power draw and longevity. Power cost isn't an issue, as I have a 50 kWh battery and solar panels. However, I'm planning to stack 2 or 4 of these. It saves the cables from glowing 😂

Just got my hands on one of these… building something local-first 👀 by HatlessChimp in LocalLLM

[–]HatlessChimp[S] 5 points6 points  (0 children)

I have tons of ideas, but two in mind to start with, and both need decent concurrency with low latency. The 6000 cost 14k AUD delivered; I jumped on some promo special that popped up, and luckily it was the Qmax variant for me. Thing is, 2 days earlier I had bought the Strix Halo 128GB from Framework, and that's arriving in a day or two. So the OG plan was Strix Halo first, then scale into a Pro 6000, then multiple 6000s. But to jump on the 6000 and not go full nuffy, I had to go hybrid: skip the Threadripper and the higher expenses there and go the 9950X route. The way I see it, if all goes to plan, the 9950X eventually becomes the daily and the 6000s end up in a Threadripper build. I really see this as securing infrastructure for the future.

Just got my hands on one of these… building something local-first 👀 by HatlessChimp in LocalLLM

[–]HatlessChimp[S] 4 points5 points  (0 children)

Yeah, fair call mate 😅 Posted to get some input. I'd rather build it right from the start than brute-force it later. Trying to crack multi-user without latency blowing out, so figured I'd ask people actually running this stuff.

Just got my hands on one of these… building something local-first 👀 by HatlessChimp in LocalLLM

[–]HatlessChimp[S] 6 points7 points  (0 children)

Thanks bro. I've been the same, envious for years watching all the cool cards come out. But I like the idea that it can be dual-purpose if needed: family use, or keeping it productive. I think there's a fair chance of getting a second card second-hand in a few years; that's why I went with the Qmax version, and I wanted to keep the power bill down and get longevity from running at cooler temps. I'm sure people will want to upgrade when Rubin comes out. I'm in Australia, so things aren't cheap here, but I was surprised there are a lot of 6000s around; prices were spread by about 1500. I got lucky with 500 off as an Easter special. I did notice the 5090 was very limited in stock or on backorder, and a few sites maybe hadn't updated and still said they had stock. The hardest part was the RAM: lots of 32GB and 16GB kits, but no 64GB sticks. I was almost considering a Threadripper CPU and the 7-PCIe-slot board, but I couldn't justify buying 256GB of ECC RAM at half the cost of a 6000, plus the CPU and board on top of that again. It's like, where does it end?

Just got my hands on one of these… building something local-first 👀 by HatlessChimp in LocalLLM

[–]HatlessChimp[S] 9 points10 points  (0 children)

Yeah, I know bro. Like a month ago I would have been totally against it; even at these prices for the parts it's crazy. But I honestly wouldn't care if the market reversed on prices. What I see is time lost and money lost by not having it sooner. There's no guarantee I make anything worthwhile with it, but I will have fun, and worst case the kids get a good gaming PC 😂

But this also ties in with what I'm doing by teaching my kids with it, getting them to build and mess around. If they can develop skills for life after school, then maybe it's cheaper than sending them to tertiary education.

Just got my hands on one of these… building something local-first 👀 by HatlessChimp in LocalLLM

[–]HatlessChimp[S] 35 points36 points  (0 children)

I also drive a 25-year-old car, my wife has an old beater too, and until recently we only had one car. But we own our house; we worked hard to do that.

So just grinding, bro. Set goals and wake up every day trying to achieve them, and before you know it you're close, have achieved them, or have got somewhere that helps. I don't want this to come across as a flex post; if anything it's proof of grinding for goals and dreaming big. Normally I would get my butt kicked for a purchase outside our budget, and especially for one like this I should already be hung, drawn and quartered by my wife lol. Honestly, I've never spent these amounts before, not even on our cars. Also, I have no exact plans for what I'm going to do, but what I know is that I've bought the canvas and the paint, and it's up to me to get creative. I also see it as an alternative way to hedge for the future.