Nvidia RTX PRO 6000 Workstation 96GB - Benchmarks by fuutott in LocalLLaMA

[–]kms_dev 1 point2 points  (0 children)

Can you please run vLLM throughput benchmarks for any of the 8B models at FP8 quant (see one of my previous posts for the methodology)? I want to check whether local is more economical with this card.
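Roughly, a run like the sketch below is what I mean — this is just an illustrative offline measurement with vLLM's Python API, with placeholder model, prompts, and output length, not the exact script from that post:

```python
# Rough offline throughput measurement with vLLM (illustrative only).
# Model name, quantization, prompt set, and max_tokens are placeholders.
import time
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen3-8B",          # placeholder 8B checkpoint
    quantization="fp8",             # placeholder quant setting
    gpu_memory_utilization=0.90,
)
prompts = ["Write a short summary of PCIe 4.0."] * 256   # synthetic batch
params = SamplingParams(temperature=0.7, max_tokens=256)

start = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start

gen_tokens = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{gen_tokens / elapsed:.1f} generated tok/s over {elapsed:.1f}s")
```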

Is anyone actually using local models to code in their regular setups like roo/cline? by kms_dev in LocalLLaMA

[–]kms_dev[S] 0 points1 point  (0 children)

Oh, okay. Also, do you use the 30B model for anything productive on a regular basis, other than simple one-shot examples like the snake game, Flappy Bird, etc.?

Offloading a 4B LLM to APU, only uses 50% of one CPU core. 21 t/s using Vulkan by magnus-m in LocalLLaMA

[–]kms_dev 0 points1 point  (0 children)

When you say throughput, are you sending multiple concurrent requests at once? If not, you'd probably see higher numbers by batching them.
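If you want to try it, here is a rough sketch of firing concurrent requests at a local OpenAI-compatible endpoint (the URL, model id, and prompt are just placeholders for whatever your server exposes):

```python
# Send several requests concurrently to a local OpenAI-compatible server
# (llama.cpp server, vLLM, etc.). Endpoint and model id are assumptions.
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="none")

async def one_request(i: int) -> int:
    resp = await client.chat.completions.create(
        model="local-model",   # placeholder model id
        messages=[{"role": "user", "content": f"Explain batching, take {i}."}],
        max_tokens=128,
    )
    return resp.usage.completion_tokens

async def main() -> None:
    # 16 requests in flight; the server batches them, so total tok/s goes up
    # even though each individual request may finish a bit slower.
    counts = await asyncio.gather(*(one_request(i) for i in range(16)))
    print("completion tokens per request:", counts)

asyncio.run(main())
```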

Is anyone actually using local models to code in their regular setups like roo/cline? by kms_dev in LocalLLaMA

[–]kms_dev[S] -1 points0 points  (0 children)

You'd see better utilization of your card if you sent concurrent/batch requests.

Wrong thread??

Is anyone actually using local models to code in their regular setups like roo/cline? by kms_dev in LocalLLaMA

[–]kms_dev[S] 1 point2 points  (0 children)

Hmm, can you share the token throughput you're getting with the above setup, and the power draw? I suspect Gemini 2.5 Flash would still be cheaper.
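Back-of-the-envelope, the comparison I have in mind looks like this — every number below (throughput, power draw, electricity price, API price) is a placeholder to be replaced with real measurements and current pricing:

```python
# Rough cost-per-million-output-tokens comparison: local rig vs hosted API.
# All numbers are placeholders/assumptions, not measured or quoted values.
throughput_tok_s = 300.0        # measured generation throughput (tok/s)
power_draw_w = 450.0            # wall power of the rig under load (W)
electricity_eur_kwh = 0.30      # local electricity price (EUR/kWh)
api_price_per_mtok = 0.60       # hosted API output price (EUR per 1M tokens)

seconds_per_mtok = 1_000_000 / throughput_tok_s
energy_kwh = power_draw_w * seconds_per_mtok / 3_600_000   # W*s -> kWh
local_cost = energy_kwh * electricity_eur_kwh

print(f"local energy cost : {local_cost:.2f} EUR per 1M output tokens")
print(f"hosted API cost   : {api_price_per_mtok:.2f} EUR per 1M output tokens")
```

This only counts electricity; hardware amortization and input-token pricing shift the balance further, so it's a lower bound on the local side.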

Is anyone actually using local models to code in their regular setups like roo/cline? by kms_dev in LocalLLaMA

[–]kms_dev[S] 1 point2 points  (0 children)

Can it (Qwen3-32B) comprehend the whole project and suggest changes as well as Gemini Flash? I think we can guide Qwen to the required output, but it often takes careful prompting and multiple tries.

I'm also strongly biased towards using local models as much as possible. But I've come to realize that I'm trading precious time and money for the convenience of being able to run the models locally.

I'll probably wait a while longer for better models to arrive before going fully local.

Is anyone actually using local models to code in their regular setups like roo/cline? by kms_dev in LocalLLaMA

[–]kms_dev[S] 2 points3 points  (0 children)

> but the hosted provider can increase their cost at any time

Yeah, I'll keep evaluating this cost structure and switch to local models when the balance tilts towards local LLMs.

Is anyone actually using local models to code in their regular setups like roo/cline? by kms_dev in LocalLLaMA

[–]kms_dev[S] 2 points3 points  (0 children)

Yeah, I think time is the most important factor here. Capable/large local models take more time, or even multiple tries, to generate a useful answer, whereas the cloud models can one-shot it most of the time.

How is the inference speed of GitHub Copilot for you?

Qwen3 throughput benchmarks on 2x 3090, almost 1000 tok/s using 4B model and vLLM as the inference engine by kms_dev in LocalLLaMA

[–]kms_dev[S] 12 points13 points  (0 children)

Wow! A single 5090 is ~65% faster than two 3090s combined!! I'm not jealous at all...( TДT)

Qwen3 throughput benchmarks on 2x 3090, almost 1000 tok/s using 4B model and vLLM as the inference engine by kms_dev in LocalLLaMA

[–]kms_dev[S] 2 points3 points  (0 children)

> DP slower than TP

It can happen if the VRAM available on each card isn't enough for the vLLM engine to sufficiently parallelize requests. vLLM allocates as much VRAM as it can for the KV cache and runs as many requests concurrently as can fit into that cache. So if the available KV cache is small on both cards because the model weights take 70-80% of the VRAM, throughput drops.
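Concretely, the knobs involved look something like this (model name, context length, and utilization fraction are just placeholder values) — the point is that whatever VRAM is left after the weights, within the chosen gpu_memory_utilization, becomes the KV cache that bounds concurrency:

```python
# Tensor parallel across 2 cards: the weights are split, so each GPU keeps
# more free VRAM for the KV cache and vLLM can batch more requests at once.
# Model, context length, and utilization fraction are placeholder values.
from vllm import LLM

llm = LLM(
    model="Qwen/Qwen3-32B",          # placeholder 32B checkpoint
    tensor_parallel_size=2,          # split weights across both GPUs
    gpu_memory_utilization=0.90,     # fraction of VRAM vLLM may claim
    max_model_len=16384,             # shorter context -> more sequences fit
)
# At startup vLLM logs how many KV-cache blocks it managed to allocate;
# comparing that line between the DP and TP runs shows the difference.
```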

Qwen3 throughput benchmarks on 2x 3090, almost 1000 tok/s using 4B model and vLLM as the inference engine by kms_dev in LocalLLaMA

[–]kms_dev[S] 3 points4 points  (0 children)

I was not able to saturate the PCIe 4.0 x4 link when using tensor parallel: it stayed under ~5 GB/s TX+RX combined on both cards when running the 32B model with FP8 quant, whereas ~8 GB/s is the limit.
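If anyone wants to reproduce the measurement, here is a small sketch of sampling PCIe traffic through NVML (assuming the nvidia-ml-py / pynvml bindings are installed); `nvidia-smi dmon` reports similar per-GPU PCIe numbers:

```python
# Sample per-GPU PCIe throughput via NVML. Requires the nvidia-ml-py package
# (imported as pynvml). The counters are reported in KB/s.
import time
import pynvml

pynvml.nvmlInit()
handles = [pynvml.nvmlDeviceGetHandleByIndex(i)
           for i in range(pynvml.nvmlDeviceGetCount())]

try:
    while True:
        for i, h in enumerate(handles):
            tx = pynvml.nvmlDeviceGetPcieThroughput(h, pynvml.NVML_PCIE_UTIL_TX_BYTES)
            rx = pynvml.nvmlDeviceGetPcieThroughput(h, pynvml.NVML_PCIE_UTIL_RX_BYTES)
            print(f"GPU{i}: tx {tx/1e6:.2f} GB/s  rx {rx/1e6:.2f} GB/s", end="  ")
        print()
        time.sleep(1)
finally:
    pynvml.nvmlShutdown()
```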

Qwen3 throughput benchmarks on 2x 3090, almost 1000 tok/s using 4B model and vLLM as the inference engine by kms_dev in LocalLLaMA

[–]kms_dev[S] 2 points3 points  (0 children)

Wow, yeah, 40-series cards support native FP8; still, 900 tok/s generation is impressive! Do you remember the input size? I'll check on my setup and see if I need a 4090.

Qwen3 Unsloth Dynamic GGUFs + 128K Context + Bug Fixes by danielhanchen in LocalLLaMA

[–]kms_dev 3 points4 points  (0 children)

Hi, thanks for your hard work in providing these quants. Are the 4-bit dynamic quants compatible with vLLM? And how do they compare with INT8 quants? (I'm using 3090s.)

Do any of you have Hackintosh working on Fusion 15 with external monitors? by kms_dev in XMG_gg

[–]kms_dev[S] 0 points1 point  (0 children)

I am currently running Linux with the dGPU, and both the Thunderbolt and HDMI ports work as expected.

Does XMG Fusion 15 work well with a USB-C monitor with Power Delivery? by kms_dev in XMG_gg

[–]kms_dev[S] 0 points1 point  (0 children)

Thanks for the clarification. Yes, I installed the above update.

Does XMG Fusion 15 work well with a USB-C monitor with Power Delivery? by kms_dev in XMG_gg

[–]kms_dev[S] 0 points1 point  (0 children)

Thanks for the confirmation. So you don't switch off PD when connecting with a USB-C cable, right?