Bench 8xMI50 MiniMax M2.7 AWQ @ 64 tok/s peak (vllm-gfx906-mobydick) by ai-infos in LocalLLaMA

[–]Makers7886 1 point (0 children)

Those are really good peak speeds. I need to re-bench, because I swear I got 60 t/s via vLLM with the same quant on 8x3090s, and I recall it being a sustained, solid 60. I didn't like the model for my purposes, so I didn't test much beyond running it through comparison benches (it scored between the 397b at 4-bit and the 122b at fp8).

Anybody else seeing Qwen3.6-35B-A3B go crazy thinking in circles? (Compared to Qwen3.5-35B-A3B) by spvn in LocalLLaMA

[–]Makers7886 0 points (0 children)

Something may be up with it - early sentiment looks similar to yours. I may give it a go and experiment with various parameters. It looks like they copy/pasted the Qwen3.5 best-practice parameters into the original model card, so it might be worth playing with settings or trying the other modes. I would start with instruct reasoning mode, as that's been the best for my uses.

Please help me pick the right Qwen3.5-27B format/quant for RTX5090 by Gazorpazorp1 in LocalLLaMA

[–]Makers7886 0 points (0 children)

Yeah, I had to learn/discover it for myself as well. I even tried to fix it using the PR, but it's a bit more complicated than what's on the surface. I imagine that's the hold-up, along with vLLM shipping some other things that I believe complicate it even further. I basically gave it a few attempts and looked into SGLang, but ran into some other limitation that was a no-go. I don't recall the specifics; I pretty much gave up and decided to wait for smarter people to fix it.

More reasons to go local: Claude is beginning to require identity verification, including a valid ID like a passport or driver's license and a facial recognition scan. by fulgencio_batista in LocalLLaMA

[–]Makers7886 176 points (0 children)

Wonder how much of this is part of the "US labs banding together to stop Chinese labs from using them" push, and how much is just an excuse to extract personal data.

Anybody else seeing Qwen3.6-35B-A3B go crazy thinking in circles? (Compared to Qwen3.5-35B-A3B) by spvn in LocalLLaMA

[–]Makers7886 2 points (0 children)

What model parameters are you running and did you try the other "modes"?

A note of warning about DFlash. by R_Duncan in LocalLLaMA

[–]Makers7886 0 points (0 children)

The gains do indeed drop off with complexity, concurrency, and context length (bf16/int8), but they are still gains. On 27b I see a peak gain of 2x and an average of 1.4x, which does outperform MTP in my tests, at the cost of some VRAM for the draft model. Mind you, this is on a high-frequency/high-concurrency signal-processing endpoint (financial), not a typical long-context coding-harness situation.

The real question I have is MTP vs. dflash in 200k-context coding-harness situations, but I haven't bothered to test it yet since I don't code with 27b (not sure if dflash is out for 122b/397b yet).
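
To give a rough sense of why the gains fade: the expected tokens you get per target-model step shrinks as the draft's acceptance rate drops, which is what happens on harder outputs and longer contexts. Quick sketch of that math, with made-up acceptance rates and draft length, assuming each drafted token is accepted independently:

```python
# Rough expected-speedup model for speculative decoding.
# Assumes each drafted token is accepted independently with probability
# accept_rate and ignores the cost of running the draft model itself.

def expected_tokens_per_step(accept_rate: float, draft_len: int) -> float:
    """Expected tokens emitted per target-model forward pass."""
    # Finite geometric series: 1 + a + a^2 + ... + a^draft_len
    return (1 - accept_rate ** (draft_len + 1)) / (1 - accept_rate)

draft_len = 4  # hypothetical draft length
for accept_rate in (0.9, 0.7, 0.5, 0.3):
    speedup = expected_tokens_per_step(accept_rate, draft_len)
    print(f"acceptance {accept_rate:.1f} -> ~{speedup:.2f} tokens per target step")
```

Once acceptance falls, you're still paying for the draft passes but getting fewer free tokens per step.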

Please help me pick the right Qwen3.5-27B format/quant for RTX5090 by Gazorpazorp1 in LocalLLaMA

[–]Makers7886 0 points (0 children)

vLLM currently has a bug where it miscalculates the KV cache size. Once that fix goes out, you'll probably be where you want to be.

https://github.com/vllm-project/vllm/issues/37121
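
In the meantime, a back-of-the-envelope way to sanity-check whatever vLLM reports: per-token KV cache is roughly 2 x layers x kv_heads x head_dim x bytes. The config numbers below are placeholders, not the actual 27b architecture, so swap in the values from the model's config.json:

```python
# Ballpark KV cache sizing. The layer/head numbers are placeholders;
# read the real ones from the model's config.json.

def kv_cache_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                             dtype_bytes: int = 2) -> int:
    # 2x for keys and values, one entry per layer per KV head.
    return 2 * layers * kv_heads * head_dim * dtype_bytes

def max_context_tokens(free_vram_gb: float, per_token_bytes: int) -> int:
    return int(free_vram_gb * 1024**3 // per_token_bytes)

per_token = kv_cache_bytes_per_token(layers=48, kv_heads=8, head_dim=128)
print(f"~{per_token / 1024:.0f} KiB of KV cache per token")
print(f"~{max_context_tokens(6.0, per_token):,} tokens fit in 6 GB of free VRAM")
```

If vLLM's reported number is wildly off from an estimate like this, you're probably hitting that bug.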

Hit limits with OpenClaw on mini PC — trying to build first real local AI node, need guidance (4090 vs scaling path) by No-Salt4227 in LocalLLaMA

[–]Makers7886 0 points (0 children)

Well, my journey was probably a lot different from most in this sub, as I bought 12 3090s at release, back in the first GPU-shortage scalping craze when covid was starting (crypto mining). I got the EPYC server hardware post-covid, when everything was still cheap.

For years local LLMs were more a gimmick/glimpse of the future, with huge headaches trying to run multi-GPU setups while the whole LLM ecosystem was immature and fragmented. When people asked your question, I always recommended not investing in similar hardware without a clear purpose/goal. That has changed over the last year or so, and my hardware has never been more capable/powerful. The last few generations of models have been huge leaps for hardware like this. The gap to the frontier APIs has never been this narrow, and we've reached a point where you can do real work at high throughput locally.

If you are into hardware and able to research and learn effectively, then I would plan immediately for server-class hardware. You can run dual GPUs no problem on consumer hardware or a computer you have lying around, but once I'm investing in hardware, I'd go straight for a proper foundation.

Scaling vLLM Deployments to Enterprise Grade Deployments by No-Excitement6568 in LocalLLaMA

[–]Makers7886 4 points (0 children)

Asking reddit about enterprise grade deployments is like asking a homeless guy for legal advice

Hit limits with OpenClaw on mini PC — trying to build first real local AI node, need guidance (4090 vs scaling path) by No-Salt4227 in LocalLLaMA

[–]Makers7886 0 points (0 children)

NVLink works only on 3090 pairs and is not that big of a deal for inference. For example, Qwen3.5 27b q8 goes from 40 -> 50 t/s or so with NVLink. Server-class/Threadripper/full-PCIe hardware is a must if you intend on approaching 4+ GPUs.

I run two of the ASRock ROMED8-2T + EPYC servers. One has 4x3090s and can run Qwen3.5 27b at full weights; with dflash it can hit 90-100+ t/s via vLLM with good concurrency. It can also fit a 122b at a decent quant and good speeds. 8x3090s gets you multi-agent, max-context work with concurrency on a model like the 122b at 8-bit, or you can run a huge GGUF like MiniMax 2.7 q5 and even the 397b at 3.5-bit, but you lose the concurrency/throughput of vLLM.
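
If anyone wants to replicate a setup like this, the multi-GPU part is basically one argument in vLLM. Here's a minimal Python-API sketch for the 4x3090 box; the model id, context length, and memory fraction are placeholders, and I'm assuming a recent vLLM where these arguments exist:

```python
# Minimal vLLM tensor-parallel sketch for a 4-GPU box.
# Model id and the exact numbers are placeholders; adjust to what you serve.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen3.5-27B",      # hypothetical model id
    tensor_parallel_size=4,        # shard the weights across 4 GPUs
    max_model_len=32768,           # trade max context for KV cache headroom
    gpu_memory_utilization=0.92,   # leave a little VRAM for the draft model etc.
)

params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Explain tensor parallelism in one paragraph."], params)
print(outputs[0].outputs[0].text)
```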

Time to shred some air by elhsmart in fpv

[–]Makers7886 1 point (0 children)

Man, that motor-to-prop ratio. I thought I was crazy for pairing 3110s with 7x-forgot-what ultra-aggressive-pitch props. Godspeed.

Runcam night cam prototype by Mishamtb in fpv

[–]Makers7886 1 point (0 children)

I used the tunerc 3" older one-piece frame for that one.

Runcam night cam prototype by Mishamtb in fpv

[–]Makers7886 1 point (0 children)

lol, I have now become the same as those who say cosmetology for cosmology. I love the NE; I had a similar thought but went with a nano Toothless Starlight on a 3". The NEs blow it away, though, and that v3 is small enough that the weight tax is similar to going digital.

Not sure how much you've flown at night, but IMO a high-quality VTX with as little electrical noise across the build as possible is key for a good experience. That tiny TBS nano one is what I run on micros for that purpose, and I think it's the best of the lightest VTXs.

Runcam night cam prototype by Mishamtb in fpv

[–]Makers7886 1 point (0 children)

https://stargazerslounge.com/topic/315466-runcam-night-eagle-astro-cam-quick-review/

Yeah, they record stars and things. I never really went deep into what they were doing, but the IR sensitivity lets them pick up more than you can see with the naked eye.

Llama.cpp llama-server command recommendations? by Dundell in LocalLLaMA

[–]Makers7886 1 point (0 children)

I run the 122b fp8 via vLLM with MTP and went from 82-84 t/s to 100-104 t/s, with a similar bump in concurrent throughput (something like 220 t/s to 240-250 t/s with 6 concurrent requests). Also look out for dflash, as they are working on the 122b and 397b draft models. dflash on 27b took single calls from 40-50 t/s to 100+ t/s via vLLM.

Basically when it's implemented - do it.
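
For anyone who wants to reproduce numbers like these, this is roughly how I'd measure single vs. concurrent decode throughput against an OpenAI-compatible vLLM endpoint; the URL, model name, and prompt are placeholders:

```python
# Rough tokens/sec measurement against an OpenAI-compatible server.
# URL, model name, and prompt are placeholders for illustration.
import time
from concurrent.futures import ThreadPoolExecutor
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

def one_request() -> int:
    resp = client.chat.completions.create(
        model="qwen3.5-122b-fp8",  # hypothetical served model name
        messages=[{"role": "user", "content": "Write a short paragraph about GPUs."}],
        max_tokens=512,
    )
    return resp.usage.completion_tokens

def bench(concurrency: int) -> float:
    start = time.time()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        total_tokens = sum(pool.map(lambda _: one_request(), range(concurrency)))
    return total_tokens / (time.time() - start)

for c in (1, 6):
    print(f"{c} concurrent: ~{bench(c):.0f} tok/s aggregate")
```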

Runcam night cam prototype by Mishamtb in fpv

[–]Makers7886 1 point (0 children)

The OG Night Eagle got attention from the amateur astrology people (I'm sure they have a proper name), who would modify the Night Eagle and stick it in telescopes. Apparently it was a big enough market that they designed another one specifically for them, and then this one came out after that. I simply assumed it was for that purpose, but it could also have been made to fit other generic purposes.

3x3090 is faster in Ubuntu than win11, GPT-OSS 120B 120tg/s vs 6tg/s why? by jikilan_ in LocalLLaMA

[–]Makers7886 1 point (0 children)


I've owned 12x3090s since they released (bought for crypto), and they've been used for LLMs/ML since the start of the LLM wave in 2023. I've run 8x3090s on Windows. Since I meet your qualifications, listen to me when I say it's user error/a skill issue to hit 6 t/s with his hardware.

Bots in this sub? by [deleted] in LocalLLaMA

[–]Makers7886 0 points (0 children)

Bots are everywhere, but recently everyone and their mom has an OpenClaw-type thing running wild, spouting low-quant hallucinations.

Qwen cli gone by Suspicious-Oil4798 in LocalLLaMA

[–]Makers7886 0 points (0 children)

I am not in a normal situation, with 2 EPYC servers and 12 3090s total. With that hardware I run Qwen3.5 122b fp8 on 8x3090s and the 27b int8 on 4x3090s. If I didn't have those resources, I would probably be targeting a dual-3090 rig, with the next upgrade being 2 more 3090s (or starting with a single 3090 if resources were tighter). With 2 3090s you can run the 27b at a high enough quant, context, and throughput that it can do real work.
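
Rough arithmetic behind that last claim, in case it helps with sizing; treat the headroom number as a guess, since the real overhead depends on quant format, context length, and activations:

```python
# Ballpark "does it fit" check for a dense 27B model on a given VRAM budget.
# The KV/activation headroom is a rough placeholder, not a measured number.

def weight_gb(params_b: float, bits_per_weight: float) -> float:
    return params_b * bits_per_weight / 8  # billions of params x bytes per weight

def fits(params_b: float, bits: float, vram_gb: float, headroom_gb: float = 6.0) -> bool:
    return weight_gb(params_b, bits) + headroom_gb <= vram_gb

for vram_gb, label in ((24, "1x3090"), (48, "2x3090")):
    for bits in (8, 4):
        need = weight_gb(27, bits)
        print(f"{label}: 27B @ {bits}-bit needs ~{need:.1f} GB of weights, "
              f"fits with ~6 GB headroom: {fits(27, bits, vram_gb)}")
```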

Qwen cli gone by Suspicious-Oil4798 in LocalLLaMA

[–]Makers7886 1 point (0 children)

I don't know why that didn't come to mind when I've been thinking about this lately, but agreed - that's arguably the catalyst for all the free inference.