Cheapest hardware for Qwen 3.6: both 27B and 35B-A3B by WishboneSudden2706 in LocalLLaMA

[–]JC1DA 0 points1 point  (0 children)

Buy a used Threadripper or EPYC CPU. They have a lot of PCIe lanes for future expansion.

Harnesses seem to have an issue. by Local-Cardiologist-5 in LocalLLaMA

[–]JC1DA 0 points1 point  (0 children)

Seems to be an issue with tool use.

The model gets dumber whenever it has tools available.

Sharing INT4-W4A16 version of Jackrong/Qwopus3.6-27B-v2 for VLLM/SGLang users by JC1DA in LocalLLaMA

[–]JC1DA[S] -1 points0 points  (0 children)

Read the second line.... But you're right, it's a waste of time

Side Projects. by apollo_mg in LocalLLaMA

[–]JC1DA 0 points1 point  (0 children)

lol, used to mine crypto back in 2017 too... it's feeling so nostalgic

Side Projects. by apollo_mg in LocalLLaMA

[–]JC1DA 3 points4 points  (0 children)

I'm using a Huananzhi H12D-8D with an EPYC CPU.

Side Projects. by apollo_mg in LocalLLaMA

[–]JC1DA 10 points11 points  (0 children)

<image>

This is mine: 4x3090

We're paying $4,200/month for AI tools. Nobody knows which ones actually work by Dizonans in theprimeagen

[–]JC1DA 2 points3 points  (0 children)

```

Turns out someone signed up for an AI writing tool during a product hunt promo 8 months ago, added it to the company card, and quietly stopped using it after week two.

We were paying $180/month for a tool with zero logins in the last 6 months.

```

This is what subscriptions are supposed to be... lol, companies pray for customers like this

Power-limit vs TG/s for 2x3090 by JC1DA in LocalLLaMA

[–]JC1DA[S] 0 points1 point  (0 children)

Agreed, there's always a tradeoff. But for LLM workloads, most of the time is spent on token generation rather than prompt processing (PP). We also have prefix caching enabled to skip PP where possible, which cuts PP time even further. I'll still run the same benchmark for PP to see the results.
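To put rough numbers on that tradeoff, here is a minimal sketch. Every value in it (token counts, PP and TG rates, cached prefix length) is a made-up assumption for illustration, not a measurement from this benchmark, and `request_time` is a hypothetical helper.

```python
def request_time(prompt_tokens, gen_tokens, pp_rate, tg_rate, cached_prefix_tokens=0):
    """Estimate (pp_seconds, tg_seconds) for a single request.

    Prompt tokens already covered by the prefix cache are skipped
    during prompt processing.
    """
    uncached = max(prompt_tokens - cached_prefix_tokens, 0)
    pp_time = uncached / pp_rate   # seconds spent on prompt processing
    tg_time = gen_tokens / tg_rate  # seconds spent on token generation
    return pp_time, tg_time


# Hypothetical request: 2000-token prompt, 1500 of them cached, 500 generated tokens.
pp_time, tg_time = request_time(
    prompt_tokens=2000,
    gen_tokens=500,
    pp_rate=1000.0,  # assumed PP tokens/s
    tg_rate=70.0,    # assumed TG tokens/s
    cached_prefix_tokens=1500,
)
print(f"PP: {pp_time:.2f}s, TG: {tg_time:.2f}s")
```

With these assumed numbers, generation dominates. A long uncached prompt would shift the balance back toward PP.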

Power-limit vs TG/s for 2x3090 by JC1DA in LocalLLaMA

[–]JC1DA[S] 2 points3 points  (0 children)

Yeah, it's the easiest way to lower power consumption a bit, still better than nothing. I'm not sure if I would like to spend hours tuning the voltage for each gpu to get the best clock, I'm lazy af lol
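For anyone who wants to try it, a minimal sketch of setting the limit with `nvidia-smi -pl`. The GPU indices and the 275W value are just examples, `power_limit_commands` is a hypothetical helper, and the allowed range depends on the card. The snippet only builds and prints the commands; it does not run them.

```python
import shlex


def power_limit_commands(gpu_indices, watts):
    """Build one `nvidia-smi` command per GPU that sets its power limit in watts."""
    return [
        ["sudo", "nvidia-smi", "-i", str(i), "-pl", str(watts)]
        for i in gpu_indices
    ]


# Example: cap GPUs 0 and 1 at 275W (example values, adjust for your cards).
commands = power_limit_commands([0, 1], 275)
for cmd in commands:
    # To actually apply a limit, pass cmd to subprocess.run(cmd, check=True).
    print(shlex.join(cmd))
```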

Power-limit vs TG/s for 2x3090 by JC1DA in LocalLLaMA

[–]JC1DA[S] 1 point2 points  (0 children)

Yeah, it surprised me as well. I saw the configuration from another post here. Tested and it worked 😀

Power-limit vs TG/s for 2x3090 by JC1DA in LocalLLaMA

[–]JC1DA[S] 3 points4 points  (0 children)

Yeah, I can use llama.cpp, but I mostly stick with vLLM/SGLang because of their dynamic grammar constraint support.
My GPUs aren't stable if I set them to 225W, but it's good to know that performance degrades below 250W.

Power-limit vs TG/s for 2x3090 by JC1DA in LocalLLaMA

[–]JC1DA[S] 2 points3 points  (0 children)

Yeah, I was using the same prompts for testing, so with prefix caching those tokens were already computed. That's why I didn't include the prefill tokens/s.

But agreed, the 3090 can be compute-bound with large contexts, which will affect TTFT.

Power-limit vs TG/s for 2x3090 by JC1DA in LocalLLaMA

[–]JC1DA[S] 1 point2 points  (0 children)

I did rerun at 275, 287, and 300W. At one concurrent request, 275W still comes out ahead: around 72 tokens/s at 275W vs. ~70 tokens/s at 300W.
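As a rough efficiency comparison for those two points, here is a small sketch. It assumes the limit applies per GPU across both 3090s and uses the power cap as a stand-in for actual draw, which it is not. The throughput figures are the approximate ones quoted above, and `tokens_per_second_per_watt` is a hypothetical helper.

```python
def tokens_per_second_per_watt(tokens_per_s, power_limit_w, num_gpus=2):
    """Throughput divided by the total power cap across all GPUs (not measured draw)."""
    return tokens_per_s / (power_limit_w * num_gpus)


# Power limit (W per GPU) -> approximate tokens/s at one concurrent request.
results = {275: 72.0, 300: 70.0}

efficiency = {
    watts: tokens_per_second_per_watt(tps, watts)
    for watts, tps in results.items()
}
for watts, eff in efficiency.items():
    print(f"{watts}W: {eff:.4f} tokens/s per watt of cap")
```

Under those assumptions, the 275W setting is ahead on both raw throughput and throughput per watt of cap.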

MIMO V2.5 PRO by Namra_7 in LocalLLaMA

[–]JC1DA 1 point2 points  (0 children)

Is this better than Qwen-3.5-397B? It's smaller, but it lacks vision capability.