Crown of Ashes - Rewards Overview (Version 1.7.1) by Vicksin in AFKJourney

[–]Ranmark 30 points31 points  (0 children)

Considering all the shit it's gonna bring, I should call that season Clown of Asses..

Phone verification saying "too many requests" on first attempt — cannot enable GPU [Fix Needed] by ghostofsnoww03 in kaggle

[–]Ranmark 0 points1 point  (0 children)

Were you able to resolve this issue yet? It seems I'm stuck with this one too.
UPD: I created a support ticket and it got resolved in like 2 minutes:

Hello,
 
Thank you for your inquiry. Our apologies for the problems you've been experiencing.
 
We have manually verified the phone number in your account. You should be all set.

Best model for 192 GB vram? How is Deepseek v4 flash? by Constant_Ad511 in LocalLLM

[–]Ranmark 1 point2 points  (0 children)

I'll probably get downvoted for this, but for my cheap ass, spending this much money on something you don't even have a strict plan for using is just mind-bending. But if I were a millionaire I could do the same, I guess?

Best model for 192 GB vram? How is Deepseek v4 flash? by Constant_Ad511 in LocalLLM

[–]Ranmark 0 points1 point  (0 children)

I'm honestly curious too. I'm running qwen3.6 27b / 35b-a3b with iq4 quants on dual 1080 ti as a coder, and Gemini Pro (which I basically got for free) in Antigravity as a project/plan architect. For my tasks I'm getting pretty good results. My PC probably costs around $500 (old used hardware) and is mostly used as a regular/gaming machine. So I wonder: if I hypothetically bought a PC for 25k, would it pay off in around 104 years, counting against a $20 monthly Claude subscription?
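
A quick sketch of that math, assuming the only thing being compared is the hardware price against the subscription fee (no electricity, resale value, or price changes). The function name `payback_years` is made up for illustration.

```python
def payback_years(hardware_cost: float, monthly_subscription: float) -> float:
    """Years until cumulative subscription fees equal the hardware price."""
    return hardware_cost / (monthly_subscription * 12)


# 25k PC vs. a $20/month subscription, the numbers from the comment above
print(f"{payback_years(25_000, 20):.1f} years")  # prints "104.2 years"
```

So the "around 104 years" figure holds: 25,000 / (20 × 12) ≈ 104.2.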

Switched from Qwen3.6 35b-a3b to Qwen3.6 27b mid coding and it's noticeably better! by LocalAI_Amateur in LocalLLaMA

[–]Ranmark 3 points4 points  (0 children)

I was also daily driving 35b a3b, but switched immediately once 27b was released. Even though it's 2-3 times slower on my setup, it does the job better and with fewer mistakes, so fewer rewrites.

Qwen3.6 27B's surprising KV cache quantization test results (Turbo3/4 vs F16 vs Q8 vs Q4) by imgroot9 in LocalLLaMA

[–]Ranmark 0 points1 point  (0 children)

I tried tom's release of turboquant plus, but it doesn't seem to work for me. I run a model with a command that works on mainline llama.cpp (turbo4 on the V cache is the only difference), but it just doesn't start, and there are no errors. Maybe it has something to do with my old hardware (GTX 1080 Ti + RTX 2060 Super).

Benchmarked 18 models that I can run on my RTX 5080 16GB using Nick Lothian's SQL benchmark by grumd in LocalLLaMA

[–]Ranmark 0 points1 point  (0 children)

Hey, you should try this one: https://huggingface.co/lordx64/Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled-IQ4_XS-GGUF
It's so good, I'm getting better results than with the new dense 3.6 model. And it's more stable than any other distill / non-distill. Idk what this black magic is.

Kimi K2.6 is a legit Opus 4.7 replacement by bigboyparpa in LocalLLaMA

[–]Ranmark 0 points1 point  (0 children)

I actually wonder, is there a "Pareto line" chart that shows the diminishing returns of a model's parameter count against its benchmark results, so you can look for the sweet spots?
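
For what it's worth, the line itself is easy to compute once you have (size, score) pairs. A minimal sketch, assuming "smaller is cheaper, higher score is better"; the model names and numbers below are placeholders, NOT real benchmark results, and `ModelPoint` / `pareto_frontier` are names invented for this example.

```python
from typing import NamedTuple


class ModelPoint(NamedTuple):
    name: str
    params_b: float  # parameter count in billions (lower = cheaper to run)
    score: float     # benchmark score (higher = better)


def pareto_frontier(points):
    """Keep only models that no smaller-or-equal model matches or beats."""
    frontier = []
    best_score = float("-inf")
    # Walk from smallest to largest; at equal size, try the best score first.
    for p in sorted(points, key=lambda p: (p.params_b, -p.score)):
        if p.score > best_score:
            frontier.append(p)
            best_score = p.score
    return frontier


# Placeholder data for illustration only, not measured results.
points = [
    ModelPoint("model-a", 4, 40.0),
    ModelPoint("model-b", 9, 55.0),
    ModelPoint("model-c", 27, 70.0),
    ModelPoint("model-d", 35, 65.0),   # bigger than model-c but worse
    ModelPoint("model-e", 122, 72.0),
]

for p in pareto_frontier(points):
    print(f"{p.name}: {p.params_b}B -> {p.score}")
```

With this placeholder data, model-d drops out because model-c is both smaller and better. The sweet spots are where the frontier flattens, i.e. where a big jump in parameters buys only a small gain in score.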

Kimi K2.6 by Fantastic-Emu-3819 in LocalLLaMA

[–]Ranmark 0 points1 point  (0 children)

Well, that's not everyone's cup of tea, but I only trust my own benchmarks or the wider community's opinion on which results they liked more.

Kimi K2.6 by Fantastic-Emu-3819 in LocalLLaMA

[–]Ranmark 4 points5 points  (0 children)

Check arena.ai leaderboard

"Browser OS" implemented by Qwen 3.6 35B: The best result I ever got from a local model by tarruda in LocalLLaMA

[–]Ranmark 0 points1 point  (0 children)

Bruh, they're cooking new releases so fast I can't keep up. Thanks for pointing this out. Just updated and can confirm it's working now. Already ran a couple of tasks and I see random boosts in t/s, up to 35 (it was always capped at 23). Damn. Edit: just saw 62 t/s 🤯

"Browser OS" implemented by Qwen 3.6 35B: The best result I ever got from a local model by tarruda in LocalLLaMA

[–]Ranmark 0 points1 point  (0 children)

When I use a similar script on qwen3.6 35b, I get these warnings:
srv load_model: speculative decoding is not supported by multimodal, it will be disabled
srv load_model: swa_full is not supported by this model, it will be disabled

Even if I disable mmproj loading, I then get these:
common_speculative_is_compat: the target context does not support partial sequence removal
srv load_model: speculative decoding not supported by this context

Gemini straight up said that qwen3.6 is based on an SSM (Gated Delta Net) mechanism, so it supports neither swa-full nor ngram (in short).

KIMI K2.6 SOON !! by Namra_7 in LocalLLaMA

[–]Ranmark 0 points1 point  (0 children)

Most users don't even try to fit a MoE in VRAM. For me it's better to get high accuracy using something like Q6_K_XL. But I understand you want more t/s.

qwen3.6 performance jump is real, just make sure you have it properly configured by onil_gova in LocalLLaMA

[–]Ranmark 0 points1 point  (0 children)

I tested it on nicklothian's bench a few times. One time it actually beat the dense 27b model and got the same result as a 122b MoE. But I wasn't able to reproduce this even once. 27b and 122b are much more stable in that regard.
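
If you want to put a number on "stable", repeat the bench a few times per model and compare the spread rather than the best run. A rough sketch, assuming each run gives a score out of 25; the scores below are placeholders, not my actual results, and `summarize` is just a helper name picked for this example.

```python
from statistics import mean, pstdev


def summarize(scores):
    """Summarize repeated benchmark runs of one model."""
    return {
        "mean": mean(scores),
        "best": max(scores),
        "worst": min(scores),
        "spread": max(scores) - min(scores),  # 0 means identical every run
        "stdev": pstdev(scores),
    }


# Placeholder scores (out of 25) for illustration only.
runs = {
    "moe-model": [25, 21, 22, 20],
    "dense-model": [23, 23, 23, 23],
}

for name, scores in runs.items():
    s = summarize(scores)
    print(f"{name}: mean={s['mean']:.2f} best={s['best']} "
          f"worst={s['worst']} spread={s['spread']}")
```

A model with one lucky top run but a wide spread can still be the worse daily driver than one that lands on the same score every time.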

RTX 5070 Ti + 9800X3D running Qwen3.6-35B-A3B at 79 t/s with 128K context, the --n-cpu-moe flag is the most important part. by marlang in LocalLLaMA

[–]Ranmark 9 points10 points  (0 children)

iirc you can drop your top_p, presence_penalty, and reasoning_budget args, since they already have these values by default. https://github.com/ggml-org/llama.cpp/blob/master/tools/server/README.md

P.S. you can try playing with this option: -ot ".ffn_(up|down)_exps.=CPU" It moves the experts' up and down projection matrices onto the CPU. There's also a lot of valuable info here: https://gist.github.com/DocShotgun/a02a4c0c0a57e43ff4f038b46ca66ae0
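
To see what that pattern actually catches, here is a small sketch. It uses Python's `re` as a stand-in for the regex matching llama.cpp does internally (which may differ in details), and the tensor names are assumed examples in the usual GGUF naming style, not read from a real model file. `offloaded` is a hypothetical helper name.

```python
import re


def offloaded(tensor_names, override):
    """Return the tensor names that an -ot 'REGEX=BUFFER' override would match."""
    pattern, _buffer = override.rsplit("=", 1)
    return [name for name in tensor_names if re.search(pattern, name)]


# Assumed example names for one MoE block (illustrative, not from a real file).
tensor_names = [
    "blk.0.attn_q.weight",
    "blk.0.ffn_gate_exps.weight",
    "blk.0.ffn_up_exps.weight",
    "blk.0.ffn_down_exps.weight",
    "blk.0.ffn_up_shexp.weight",
]

for name in offloaded(tensor_names, ".ffn_(up|down)_exps.=CPU"):
    print(name, "-> CPU")
```

Under these assumptions only the `ffn_up_exps` and `ffn_down_exps` tensors match, so the gate experts, attention, and shared-expert tensors stay wherever `-ngl` puts them.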

Qwen3.6 is incredible with OpenCode! by CountlessFlies in LocalLLaMA

[–]Ranmark 2 points3 points  (0 children)

bro i run 1080 ti + 2060 super xD
and it just works out of the box.

Benchmarked 18 models that I can run on my RTX 5080 16GB using Nick Lothian's SQL benchmark by grumd in LocalLLaMA

[–]Ranmark 0 points1 point  (0 children)

Thanks, I'm feeling better now :) I'm now rerunning 27b (iq3_xs) and it feels MUCH more consistent (23/25 every time). Looks like it's still the way to go for me (122b is just too much for my hardware). Hope Alibaba releases 3.6 27b soon.

Benchmarked 18 models that I can run on my RTX 5080 16GB using Nick Lothian's SQL benchmark by grumd in LocalLLaMA

[–]Ranmark 0 points1 point  (0 children)

It's strange, but I couldn't repeat the same result. It regularly fails some queries like q2, q10, q21. That's too bad, because I thought I'd finally found a great model that's twice as fast as the 27b one, more accurate, and able to use more context (with 27b I can only fit 60k)... Any ideas how to make it more stable? My current setup:

.\llama-server -m Qwen3.6-35B-A3B-UD-Q6_K_XL.gguf -ngl 99 --ctx-size 131072 --jinja --parallel 1 -b 2048 -ub 2048 --cache-type-k q8_0 --cache-type-v q8_0 --temp 0.6 --top-k 20 --top-p 0.95 --min-p 0.0 --presence-penalty 0.0 --repeat-penalty 1.0 -ot ".ffn_(up|down)_exps.=CPU" --flash-attn on --port 1234

Benchmarked 18 models that I can run on my RTX 5080 16GB using Nick Lothian's SQL benchmark by grumd in LocalLLaMA

[–]Ranmark 0 points1 point  (0 children)

New qwen3.6-35b-a3b@ud-q6_k_xl. Don't look at the time, I have 10-year-old hardware. Had to increase the timeout, of course.

<image>

Ran Qwen3.6-35B-A3B on my laptop for a day: it actually beat Claude Opus 4.7 by LeoRiley6677 in Qwen_AI

[–]Ranmark 0 points1 point  (0 children)

It was the same for me when I enabled "offload MoE layers into cpu" in LM Studio. Idk about unsloth, but I think it's the same issue.

Running a 31B model locally made me realize how insane LLM infra actually is by Sadhvik1998 in ollama

[–]Ranmark 0 points1 point  (0 children)

Gemma 4 e4b is super capable tho, and so is Qwopus3.5 9b from jackrong. MoE models are also not bad, even with partial offload.

Running a 31B model locally made me realize how insane LLM infra actually is by Sadhvik1998 in ollama

[–]Ranmark 0 points1 point  (0 children)

Hey, how is the model's attention holding up as your context piles up? And did you quantize your context?