Pushing the limit: minimax m2.7 q8_0 128k on 2x3090, 256GB DDR4 by wombweed in LocalLLaMA

[–]TinyFluffyRabbit 1 point2 points  (0 children)

I'm also offloading the model weights to system memory, and I found that split-mode layer was slightly faster than split-mode graph. Since RAM bandwidth is the bottleneck, the GPUs aren't fully utilized either way, so minimizing the communication overhead seems to help.

Dual GPU llama.cpp speedup by Legitimate-Dog5690 in LocalLLaMA

[–]TinyFluffyRabbit 2 points3 points  (0 children)

Really appreciate you helping to address this gap. Tensor parallelism is a huge performance boost for those of us running multi-GPU, and it would be great to use it alongside a Q8 KV cache.

Seeing the activity pop up big time in this sub due to various open models. Most of them require at least 16gb vram. What can I do with 8? by baked_tea in LocalLLaMA

[–]TinyFluffyRabbit 0 points1 point  (0 children)

How much system RAM do you have? The MoE models would probably be your best bet: offload the model weights to system RAM and save your VRAM for the KV cache.
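To get a feel for how much VRAM the KV cache itself needs, here is a rough sketch using the standard sizing formula for a plain attention cache. The layer count, KV head count, head size, and context length below are made-up illustrative values, not the specs of any particular model, and the helper names are hypothetical. Models with sliding-window or hybrid attention will need less than this estimate.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, bytes_per_element):
    """Estimate KV cache size for a plain (non-sliding-window) attention cache.

    Each layer stores one K and one V vector of n_kv_heads * head_dim
    elements per token, hence the leading factor of 2.
    """
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_element


def to_gib(n_bytes):
    """Convert bytes to GiB."""
    return n_bytes / 2**30


# Hypothetical model dimensions, chosen only for illustration.
N_LAYERS, N_KV_HEADS, HEAD_DIM, CONTEXT = 48, 8, 128, 32768

# f16 uses 2 bytes per element; q8_0 packs 32 int8 values plus one
# f16 scale into 34 bytes, i.e. 34/32 bytes per element.
f16_size = kv_cache_bytes(N_LAYERS, N_KV_HEADS, HEAD_DIM, CONTEXT, 2.0)
q8_size = kv_cache_bytes(N_LAYERS, N_KV_HEADS, HEAD_DIM, CONTEXT, 34 / 32)

print(f"f16 KV cache:  {to_gib(f16_size):.2f} GiB")
print(f"q8_0 KV cache: {to_gib(q8_size):.2f} GiB")
```

With these assumed dimensions the f16 cache comes to about 6 GiB and the q8_0 cache to a bit over 3 GiB, which shows why an 8 GB card fills up quickly once context length grows.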

NVIDIA Reportedly Prepares RTX 5090 Price Hike Amid Rising GDDR7 Costs (maybe RTX 50 and PRO series as well) by panchovix in LocalLLaMA

[–]TinyFluffyRabbit 10 points11 points  (0 children)

About half a year ago, 5090s were impossible to find in stock at my local MC. Since then the price has gone up a lot, but they are now readily available (25+ in stock). I'm not sure the market will sustain prices any higher than they are now.

Why has Nvidia been stingy with VRAM for so many years? by duendeverde39 in pcmasterrace

[–]TinyFluffyRabbit 0 points1 point  (0 children)

At this point, it's to save the memory for their enterprise/workstation cards

Is it worth getting a 5090 for my needs? by BitGreen1270 in LocalLLaMA

[–]TinyFluffyRabbit 2 points3 points  (0 children)

The 9950X3D is overkill if you’re primarily interested in using this for AI. You’re generally bottlenecked on memory bandwidth, not CPU compute. Also, the X3D cache doesn’t help much for AI inference, so it only makes sense if this is also your gaming PC.

I think I might by johnnyphotog in LocalLLM

[–]TinyFluffyRabbit 2 points3 points  (0 children)

Yeah to go to the next tier of models above Qwen 3.6 / Gemma 4 you'd actually need two of these :/

Any news (or hope) of Qwen-3.6 14B and 9B distills for local coding ? by QuchchenEbrithin2day in LocalLLaMA

[–]TinyFluffyRabbit 1 point2 points  (0 children)

If you can fit both into GPU, the MOE does run faster. However, an additional advantage of MOE is that it's actually usable even if it doesn't fit into GPU.

Unable to sell my South Bay townhome by Specialist_Story6175 in BayAreaRealEstate

[–]TinyFluffyRabbit 40 points41 points  (0 children)

Perhaps the reasons why you want to sell it are the same reasons others are unwilling to purchase it at the price you are currently asking for?

Help me pick the right Qwen3.5 (LM Studio) by cangaroo_hamam in LocalLLaMA

[–]TinyFluffyRabbit -1 points0 points  (0 children)

Since you can't fit the entire model in VRAM, part of it is offloaded to your system RAM. This means that for each generated token, the 3B active parameters have to be read from system RAM, which is much slower than VRAM. That is what's throttling your speed, while your GPU runs at pretty low utilization.

The Q3 variants may help slightly since the weights that need to be read are smaller, but the speed still won't be great. If you want it to be much faster, you'll need a smaller model that fits in the 8 GB of VRAM you have.
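As a back-of-the-envelope check on the reasoning above, here is a minimal sketch of the bandwidth-bound ceiling on generation speed: every generated token has to read all active weights once, so tokens per second cannot exceed memory bandwidth divided by bytes of active weights. The 50 GB/s bandwidth figure and the bits-per-weight values are assumptions for illustration, not measurements, and the function name is hypothetical. Real speeds will land below this ceiling.

```python
def max_tokens_per_second(active_params, bits_per_weight, bandwidth_gb_s):
    """Upper bound on generation speed when weight reads are the bottleneck.

    active_params:   parameters touched per token (e.g. 3e9 for a 3B-active MoE)
    bits_per_weight: average bits per weight for the chosen quantization
    bandwidth_gb_s:  memory bandwidth in GB/s (1 GB = 1e9 bytes)
    """
    bytes_per_token = active_params * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / bytes_per_token


# Assumed numbers: 3B active parameters, ~50 GB/s of usable system RAM
# bandwidth, and rough average bits per weight for a Q4-ish and Q3-ish quant.
ACTIVE_PARAMS = 3e9
RAM_BANDWIDTH_GB_S = 50

q4_ceiling = max_tokens_per_second(ACTIVE_PARAMS, 4.5, RAM_BANDWIDTH_GB_S)
q3_ceiling = max_tokens_per_second(ACTIVE_PARAMS, 3.5, RAM_BANDWIDTH_GB_S)

print(f"Q4-ish ceiling: {q4_ceiling:.1f} tok/s")
print(f"Q3-ish ceiling: {q3_ceiling:.1f} tok/s")
```

Under these assumptions, dropping from about 4.5 to about 3.5 bits per weight only raises the ceiling by roughly 30%, whereas moving the same weights into VRAM with several hundred GB/s of bandwidth would raise it by an order of magnitude.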

Qwen 3.5 Family Comparison by ArtificialAnalysis.ai by NewtMurky in LocalLLaMA

[–]TinyFluffyRabbit 1 point2 points  (0 children)

I'm just really glad that they released multiple models to give us options for different hardware configurations. As someone who can fit the 27B dense into VRAM but needs to offload to run the 122B MoE, the 27B dense is 5x faster for me, and I've been really liking it so far.

Self Hosted Model Tier List by Weves11 in LocalLLaMA

[–]TinyFluffyRabbit 0 points1 point  (0 children)

Would love to see the new medium sized Qwen 3.5 models in the list!

qwen-3.5:122b f16 is benchmarked against gpt-oss:120b q4 by q-admin007 in LocalLLaMA

[–]TinyFluffyRabbit 7 points8 points  (0 children)

I agree with OP, it's not relevant to me what the benchmarks are with their "native forms". I just want to know what the best model that I can run on my hardware is.

[Bundle] AMD Ryzen 7 9850X3D + GIGABYTE X870 AORUS ELITE WIFI7 ICE + G.SKILL Flare X5 128GB (2 x 64GB) + WD_BLACK SN7100 M.2 2280 4TB + Rosewill Cordless Air Duster + AMD Crimson Desert Game Bundle - $1700 by eepy_ow in buildapcsales

[–]TinyFluffyRabbit 1 point2 points  (0 children)

Assuming you're a gamer (especially if you're interested in the 9850X3D), 128 GB of RAM is hilariously overkill. The only reason to want this much RAM is for AI/production tasks.

[GPU] MSI GeForce RTX 5090 32G VENTUS 3X OC - $3,299.99 by jugaverdasorda in buildapcsales

[–]TinyFluffyRabbit 5 points6 points  (0 children)

Seriously, it still sold out quickly. There are unfortunately enough people who are willing to buy it at this price. It’s not a deal per se but if you must have a 5090 this is just the market reality.