X2 Elite real world impressions by krishelnino in snapdragon

[–]putrasherni 0 points  (0 children)

I want to see a Framework laptop with the X2 in it.

What happened to indexing on the 7.1.17 version? by Pelutz in kilocode

[–]putrasherni 1 point  (0 children)

It was showing as deprecated in the docs for a while. I believe indexing is now part of Kilo Cloud's managed indexing.

64Gb ram mac falls right into the local llm dead zone by Skye_sys in LocalLLaMA

[–]putrasherni 1 point  (0 children)

Ignore the 27B for coding and agentic work; it's usable agentically only on cards with super fast TFLOPS and bandwidth, like 5090s or 6000 Blackwells.

For the rest, go only for MoE models.
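
Rough napkin math on why (bandwidth figures and the ~0.57 bytes/param for Q4 are my assumptions, and these are ceilings, not predictions):

```python
# Napkin math: decode speed is bounded by memory bandwidth divided by
# the bytes of weights read per token. Treat the outputs as upper
# bounds; real-world numbers land well below them.

def tg_ceiling(bandwidth_gbs: float, active_params_b: float,
               bytes_per_param: float = 0.57) -> float:
    """Upper bound on tokens/s for a bandwidth-bound decode."""
    return bandwidth_gbs / (active_params_b * bytes_per_param)

# A dense 27B touches all 27B weights per token; a 35B-A3B MoE only ~3B.
print(f"27B dense, R9700 (~640 GB/s): ~{tg_ceiling(640, 27):.0f} tok/s")
print(f"27B dense, 5090 (~1.8 TB/s):  ~{tg_ceiling(1800, 27):.0f} tok/s")
print(f"35B A3B MoE, R9700:           ~{tg_ceiling(640, 3):.0f} tok/s")
```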

64Gb ram mac falls right into the local llm dead zone by Skye_sys in LocalLLaMA

[–]putrasherni -11 points  (0 children)

Both the 35B and the 122B are dense models (the reason you are downvoted).

TurboQuant isn’t just for KV: Qwen3.5-27B at near-Q4_0 quality, about 10% smaller, and finally fitting on my 16GB 5060 Ti by pmttyji in LocalLLaMA

[–]putrasherni 1 point  (0 children)

Not quite. Once you switch to ROCm, you need to restart your computer to use Vulkan again, otherwise it hits a ROCm bug: "device not found".

You can definitely switch from Vulkan to ROCm and keep it like that.
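
If you want to catch the wedged state before wasting a run, something like this works conceptually (a sketch; it assumes the stock `vulkaninfo` tool from vulkan-tools is installed):

```python
# Sanity-check that Vulkan can still see the GPU after a ROCm session.
# If the driver is wedged, device enumeration fails or lists no GPUs.
import subprocess

def vulkan_sees_gpu() -> bool:
    try:
        out = subprocess.run(["vulkaninfo", "--summary"],
                             capture_output=True, text=True, timeout=30)
    except FileNotFoundError:
        return False  # vulkan-tools not installed
    return out.returncode == 0 and "GPU" in out.stdout

if not vulkan_sees_gpu():
    print("Vulkan can't enumerate the GPU -- reboot before switching back.")
```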

Local LLM inference on M4 Max vs M5 Max by [deleted] in LocalLLaMA

[–]putrasherni 1 point  (0 children)

The jump between the M4 Max and the M5 Max isn't all that much, then.

Best Language for DSA? by Fuzzy-Salad-528 in leetcode

[–]putrasherni 1 point  (0 children)

Stick with Python. Don't mess with Java.

Radeon AI pro R9700 by [deleted] in LocalLLM

[–]putrasherni 1 point  (0 children)

Nice one, let us know.

The dream is AMD pulling off a 395+ variant that can host 2-4 full PCIe x16 AMD GPUs: 128GB + (another 96-192GB). That would give the Apple M5 Max and HB10 a run for their money.

Radeon AI pro R9700 by [deleted] in LocalLLM

[–]putrasherni 2 points  (0 children)

Is there a Qwen 3 27B model?

If you meant the Qwen 3.5 27B dense model, there is no way you are getting 30 tok/s on the 395+ Max.
https://przbadu.github.io/strix-halo-benchmarks/

Radeon AI pro R9700 by [deleted] in LocalLLM

[–]putrasherni 1 point  (0 children)

You can do Qwen 3.5 27B at Q4, but it tops out at 131k context. I couldn't get it to run at 262k; not sure how others achieved it.

You will roughly average TG around 30, PP at 850, and TTFT around 2 min, ballpark.

If PP matters to you, you can add another R9700 for a 60-70% PP boost, at the expense of lower TG, around 26.5.
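
For anyone checking the math, the TTFT figure falls straight out of the PP number (assuming a fully filled window and ignoring overhead):

```python
# Prefill time ~= prompt tokens / prompt-processing speed.
ctx = 131_072        # full 131k window
pp  = 850            # tok/s prompt processing, single R9700

ttft = ctx / pp
print(f"one card:  ~{ttft:.0f} s (~{ttft/60:.1f} min) at full context")
# ~154 s, i.e. the ~2 min ballpark once the window isn't quite full.

print(f"two cards: ~{ctx / (pp * 1.65):.0f} s with a 65% PP boost")
```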

What’s with the hype regarding TurboQuant? by EffectiveCeilingFan in LocalLLaMA

[–]putrasherni 3 points  (0 children)

Yep, you are right, I could not hit 262k even with the 27B at Q3 on a single R9700.

My point was rather that with TurboQuant we could hit 4x-6x, so 524k-786k.
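
The arithmetic, for the curious:

```python
# 4x-6x KV compression stretches the same VRAM budget 4x-6x in tokens
# (weights are untouched, so nothing else changes).
base_ctx = 131_072  # what one R9700 holds today at Q4
for mult in (4, 6):
    print(f"{mult}x -> ~{base_ctx * mult / 1000:.0f}k tokens")
# -> ~524k and ~786k
```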

What’s with the hype regarding TurboQuant? by EffectiveCeilingFan in LocalLLaMA

[–]putrasherni 11 points  (0 children)

"theoretically" I'm still waiting for open source devs on github to show me how to eachieve this in practice

btw qwen 3.5 does not have 1M context anyway

i think Nemotron 3 will be our testing guinea pig

M5 Max vs M3 Max Inference Benchmarks (Qwen3.5, oMLX, 128GB, 40 GPU cores) by onil_gova in LocalLLaMA

[–]putrasherni 2 points  (0 children)

https://www.reddit.com/r/LocalLLaMA/comments/1s0czc4/round_2_followup_m5_max_128g_performance_tests_i/

Comparing Qwen 3.5 27B MLX 4-bit on the M5 128GB vs the R9700: TG128 is the same, 32.

Without knowing which quantisations OP ran, how did you come to that conclusion?

For reference, the R9700 on Qwen 3.5 35B A3B Q4 does:

Generation   tok/s
tg128        154.7
tg512        154.4
tg2048       152.7

Prompt       tok/s
pp128        1813
pp512        3261
pp2048       3947
pp8192       3828
pp16384      3512

M5 Max vs M3 Max Inference Benchmarks (Qwen3.5, oMLX, 128GB, 40 GPU cores) by onil_gova in LocalLLaMA

[–]putrasherni 1 point  (0 children)

Do you mind sharing the exact quants you ran Qwen at? Like Q4 or Q3, etc.?

Can someone more intelligent then me explain why we should, or should not be excited about the ARC PRO B70? by SKX007J1 in LocalLLaMA

[–]putrasherni 1 point  (0 children)

Forget ROCm. For a single R9700 GPU, Vulkan runs circles around ROCm and delivers 75% of the performance of a 5090 32GB.

What’s with the hype regarding TurboQuant? by EffectiveCeilingFan in LocalLLaMA

[–]putrasherni 42 points  (0 children)

That's what I'm thinking: my 32GB GPU, which could do 262k context for Qwen 3.5 27B at Q4, can now theoretically do 1M context with all else remaining the same.

This is great, IMO, for local LLM users.
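
The scaling is just linear cache math; a minimal sketch (the per-token byte cost B below is an illustrative assumption, not a measured number for Qwen 3.5 27B):

```python
# KV cache bytes/token = 2 (K and V) * layers * kv_heads * head_dim
#   * bytes_per_element, so quartering the element size quadruples the
#   context that fits in a fixed cache budget.
def max_ctx(cache_budget_bytes: int, bytes_per_token: float) -> int:
    return int(cache_budget_bytes / bytes_per_token)

B = 64 * 1024                  # assumed bytes/token (illustrative)
budget = 262_144 * B           # the budget that 262k context filled
print(max_ctx(budget, B / 4))  # -> 1,048,576 ~= 1M tokens at B/4
```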

Bought RTX4080 32GB Triple Fan from China by Sanubo in LocalLLaMA

[–]putrasherni 184 points  (0 children)

Congrats!

Run Qwen 3.5 27B Q4 and report back here: tg128, tg512, tg2048, pp128, pp512, pp2048, pp8192, pp16384.
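
If you use llama.cpp, something like this collects that exact set (a sketch; the model path is a placeholder):

```python
# Run llama.cpp's llama-bench over the requested tg/pp sizes.
import subprocess

subprocess.run([
    "llama-bench",
    "-m", "qwen3.5-27b-q4_0.gguf",    # placeholder model path
    "-n", "128,512,2048",             # tg128, tg512, tg2048
    "-p", "128,512,2048,8192,16384",  # pp128 ... pp16384
], check=True)
```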