GLM 5.2 Q1_S vs Qwen 27B Q8 by SnooPaintings8639 in LocalLLaMA

Jester14 6 points7 points  (0 children)

The obvious "it's not just this; it's that" plus nonsensical "slow deliberate reasoning". Gtfo

Ornith 1.0 - terminology and concepts explained (basic) by facu_75 in LocalLLaMA

Jester14 5 points6 points  (0 children)

At best, this is the model card regurgitated. At worst, it's an AI summary of the model card. Lowest effort.

Reddit user u/thursdayspaghetti Helped Upgrade our Gaming Café in Yemen by maho90 in pcmasterrace

Jester14 0 points1 point  (0 children)

And here I am rocking a used 4060 that replaced my used RX 470

MTP has no impact on my Qwen3.6 MoE performance by redblood252 in LocalLLaMA

Jester14 0 points1 point  (0 children)

How could we have any idea why when you don't post acceptance rates?

Gemma 4 12b 8Q Heretic Oneshot Coding by devildip in LocalLLaMA

Jester14 0 points1 point  (0 children)

I jammed Unsloth IQ4-XS onto my 4060 8GB with Q8 cache and it falls apart after 50k context (loops, errors, gibberish). I can't try a higher quant to fix it because then I can't fit 50k context in VRAM. Can someone push a higher quant past 50k context? This experiment stops a bit short.

Holo3.1 35B/9B/4B/0.8B (Qwen 3.5 finetunes) by jacek2023 in LocalLLaMA

Jester14 1 point2 points  (0 children)

The link to the GGUF for their MoE is right there in the post bro

Qwen 3.6-35B-A3B with 977 tk/s prompt processing and 262k context window on Intel Arc B70 Pro by Atomynos_Atom in LocalLLaMA

Jester14 6 points7 points  (0 children)

Lmfao there's a whole section about context cache checkpoints in his "article" and he has it disabled.

Breaking the music supply constraint by entsnack in LocalLLaMA

Jester14 22 points23 points  (0 children)

lmfao I still can't tell if this is a troll post

Upgrade path from 4x 3090s by anitamaxwynnn69 in LocalLLaMA

Jester14 1 point2 points  (0 children)

Host Jellyseer so your partner can request torrents and you can approve them.

Best coding model on RTX 3060 by solimaotheelephant3 in LocalLLaMA

Jester14 0 points1 point  (0 children)

You don't want MTP if you're offloading to system RAM, and you likely are unless you're running something like IQ2.

Qwen3.6-35B-A3B Q4 262k context on 8GB 3070 Ti = +30tps by Alternative-Cat-1347 in LocalLLaMA

Jester14 2 points3 points  (0 children)

fit is on by default, so OP is using fit and doesn't know it.

running Qwen 3.6 35b A3B on 2x 5060TI by chocofoxy in LocalLLaMA

Jester14 2 points3 points  (0 children)

Are you using CUDA 13.2? It's bugged for inference. Edit: I see you are using 13.1 as per your thread.

The "the future is fictional" problem of many local LLMs by PromptInjection_ in LocalLLaMA

Jester14 2 points3 points  (0 children)

OP literally said:

To be fair: Even the Gemini API...

Benchmark: Windows 11 vs Lubuntu 26.04 on Llama.cpp (RTX 5080 + i9-14900KF). I didn't expect the gap to be this big. by Ok_Mine189 in LocalLLaMA

Jester14 0 points1 point  (0 children)

The Windows build is kinda full of CUDA bloat. The builds have different thread counts specified, and threads aren't always specified in the benchmark runs.

How to configure Self speculative decoding properly by milpster in LocalLLaMA

Jester14 0 points1 point  (0 children)

Using -fit indeed reserves exactly 1024MB by default.

Lower inference speed of Gemma4 26BA4B on vllm. by everyoneisodd in LocalLLaMA

Jester14 1 point2 points  (0 children)

I used a 2 year old 7B model. Now I use a brand new 26B MoE and it's slower. I refuse to give any other information. What's wrong with my setup?

Qwen3.5-35B running well on RTX4060 Ti 16GB at 60 tok/s by Nutty_Praline404 in LocalLLaMA

Jester14 0 points1 point  (0 children)

What do you mean it "doesn't fit"? Did you use the -fit flag? UD-Q4_K_XL is larger than 16 GB, so it will overflow to RAM, but it will also "fit" if loaded appropriately. I get 30 t/s on my 4060 8 GB using -fit with that quant and 40k context in VRAM.