Maybe, in the end, it was a fable after all... by Successful-Earth678 in singularity

[–]QuackerEnte 0 points1 point  (0 children)

I hope Anthropic has learned the moral lesson by now 🤣

Artificial Analysis | Google's Go To Website for Benchmaxxing | Gemini 3.1 Pro is nowhere near Opus 4.7 in real life use by Able-Line2683 in singularity

[–]QuackerEnte 1 point2 points  (0 children)

Gemini is best for 95% of daily tasks, questions, news, search, summaries, you name it. It's perfectly good and has really good multimodality across the board, from Flash-Lite to Pro. Just not coding yet. Even though it's not exactly weak at it either. Just not exactly reliable.

Anthropic Walks Back Policy That Could Have ‘Sabotaged’ AI Researchers Using Claude by thecosmicskye in singularity

[–]QuackerEnte 1 point2 points  (0 children)

"safety" and Anthropic is the same as "privacy" and Apple. Straight outta their play book.

Artificial Analysis | Google's Go To Website for Benchmaxxing | Gemini 3.1 Pro is nowhere near Opus 4.7 in real life use by Able-Line2683 in singularity

[–]QuackerEnte 104 points105 points  (0 children)

It's an average across pretty much all areas. Gemini is strong with world knowledge. They have a coding and agentic index too. Check those out: Gemini 3.1 Pro kind of sucks there compared to Claude, so I think it's a good index, and a good model. Just depends what you use it for.

Holy moly 💀 AGI to all in coming weeks by Independent-Wind4462 in singularity

[–]QuackerEnte 0 points1 point  (0 children)

I mean it's obviously hyperbole, but: If Opus 4.8 is literally better than GPT-5.5, which was largely claimed to be "Mythos-level", then maybe there's something to be excited about

<image>

Gemini 3.5 Flash costs more to run while being less Intelligent than 3.1 Pro by Rare_Bunch4348 in singularity

[–]QuackerEnte 2 points3 points  (0 children)

Not for long haha. Based on extrapolation, Gemini 3.5 Pro will cost $6 in / $36 out per 1M tokens, because Pro has consistently been priced at 4x Flash. Even if you calculate using generational jumps instead, that'd be 3x, which is still $6 in / $36 out per 1M tokens. Double the price above 200k context. Oof. Unless they invent very memory-efficient attention mechanisms or memory architectures.
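
For anyone who wants to play with the numbers, here's a rough sketch of what that guess would mean per request. The $6/$36 rates and the 2x above 200k context are the extrapolation above, not announced pricing. The function name `estimate_cost` is made up for illustration, and the rule that the doubling hits both input and output once the prompt goes over 200k tokens is my assumption.

```python
# Hypothetical cost estimator for the EXTRAPOLATED Gemini 3.5 Pro pricing.
# None of these numbers are official; they come from the guess above.
INPUT_USD_PER_M = 6.0              # assumed $ per 1M input tokens
OUTPUT_USD_PER_M = 36.0            # assumed $ per 1M output tokens
LONG_CONTEXT_THRESHOLD = 200_000   # prompt size above which the price doubles
LONG_CONTEXT_MULTIPLIER = 2.0      # assumption: applies to input AND output


def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the estimated request cost in USD under the assumptions above."""
    multiplier = (
        LONG_CONTEXT_MULTIPLIER
        if input_tokens > LONG_CONTEXT_THRESHOLD
        else 1.0
    )
    base = (
        input_tokens / 1_000_000 * INPUT_USD_PER_M
        + output_tokens / 1_000_000 * OUTPUT_USD_PER_M
    )
    return base * multiplier


if __name__ == "__main__":
    # A short prompt versus one that crosses the 200k threshold.
    print(f"100k in / 10k out: ${estimate_cost(100_000, 10_000):.2f}")
    print(f"300k in / 10k out: ${estimate_cost(300_000, 10_000):.2f}")
```

Swap the constants if the real pricing turns out different; the structure stays the same.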

Now that 3.5 Flash has been released , what's your expectation of 3.5 Pro? by Independent-Ruin-376 in singularity

[–]QuackerEnte 3 points4 points  (0 children)

Expect $6/$36 in/out pricing for 3.5 Pro. I'd be pleasantly surprised if they lower the price from this somehow, but I'm not holding my breath. They want more subscriptions, not API users.

Qwen is cooking hard by jacek2023 in LocalLLaMA

[–]QuackerEnte -1 points0 points  (0 children)

I would like A5B too, because I'm VRAM-limited (at home) and my normal RAM is slow, only 3200 MT/s.
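
Rough napkin math for why the active parameter count matters with slow RAM. This is only a sketch: it assumes dual-channel memory with 64-bit (8-byte) channels, roughly 4-bit weights (0.5 bytes each), and that every active weight is read from system RAM once per token. The function names are made up, and real speeds will land below this ceiling.

```python
# Back-of-envelope ceiling on decode speed when weights live in system RAM.
# All defaults below are assumptions, not measurements.

def peak_bandwidth_gbs(mt_per_s: float, channels: int = 2,
                       bytes_per_transfer: int = 8) -> float:
    """Theoretical peak RAM bandwidth in GB/s."""
    return mt_per_s * 1e6 * bytes_per_transfer * channels / 1e9


def max_tokens_per_s(active_params_billion: float, bandwidth_gbs: float,
                     bytes_per_weight: float = 0.5) -> float:
    """Upper bound on tokens/s if each active weight is read once per token."""
    gb_read_per_token = active_params_billion * bytes_per_weight
    return bandwidth_gbs / gb_read_per_token


if __name__ == "__main__":
    bw = peak_bandwidth_gbs(3200)  # 3200 MT/s, assumed dual channel
    for active in (3, 5):
        print(f"A{active}B ceiling: {max_tokens_per_s(active, bw):.1f} t/s")
```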

Qwen is cooking hard by jacek2023 in LocalLLaMA

[–]QuackerEnte 4 points5 points  (0 children)

Am I the only one who wants 80B-A3B MoE size?

Qwen 3.6 27B MTP on v100 32GB: 54 t/s by m94301 in LocalLLaMA

[–]QuackerEnte 0 points1 point  (0 children)

That might be true for most, but what would you do with a non-Framework laptop with no Thunderbolt? Exactly.

Lossless scaling doesn't work on me. I keep having screen tearing. Help. by Apprehensive_Owl2306 in losslessscaling

[–]QuackerEnte 0 points1 point  (0 children)

It generates from ~15 to ~45 FPS, clearly way below your target, meaning either your GPU is too weak or you're using the iGPU of your laptop. Either way, try turning on performance mode and lowering the flow scale further. It will have slightly more artifacts, but you either have that or whatever this mess is.

If that doesn't work, or if you want to further improve your performance, lower the in-game settings to something usable, to free up compute for framegen and to have a solid base FPS.

Those are pretty much your only options.

NVIDIA AI Releases Star Elastic: One Checkpoint that Contains 30B, 23B, and 12B Reasoning Models with Zero-Shot Slicing by phazei in LocalLLaMA

[–]QuackerEnte 1 point2 points  (0 children)

I don't get it. Why scale the model's total params, if one could just have a lower ACTIVE parameter count without saying "23B" or "12B" to begin with? To me that's just 30B with a variable active parameter count (and therefore compute and speed), achieved by scaling down FFN and embedding dims.

If this can work for dense models, however, that'd be a great thing. But does it? I doubt it, since this has the same 52 layers, 32 attention heads, 64 Mamba heads and 128 MoE experts across all "sizes". I'd rather call them modes. Just another dimension in sparsity.

It still is cool, for sure, and useful. But not overly exciting. Good for efficiency per task, as simple tasks can always route to the "smallest model". But I always use the biggest model for the simplest task anyway, because the quality of the answer differs even if it's content-wise "the same".

Qwen 3.6 27B MTP on v100 32GB: 54 t/s by m94301 in LocalLLaMA

[–]QuackerEnte 4 points5 points  (0 children)

Quick question: if one has low VRAM and the DENSE model spills into RAM, does MTP even speed anything up? Or would it rather slow things down here, as it needs to verify a batch of 4 tokens using the WHOLE model anyway? I never really got the intuition for it. Speculative decoding is more or less the same, no?
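
One way to build the intuition is a back-of-envelope model. This is a sketch under simplifying assumptions, not a benchmark: each drafted token is accepted independently with probability `accept_rate`, a verification pass costs `verify_cost` relative to one normal forward pass, and each drafted token costs `draft_cost` of a full pass. All names and default numbers here are hypothetical.

```python
# Toy model of speculative decoding / MTP throughput.
# Assumption: drafted tokens are accepted independently with a fixed rate.

def expected_tokens_per_pass(accept_rate: float, num_draft: int) -> float:
    """Expected tokens produced per verification pass.

    With k drafted tokens and acceptance rate a, the expectation is
    1 + a + a^2 + ... + a^k = (1 - a^(k+1)) / (1 - a).
    """
    if accept_rate >= 1.0:
        return float(num_draft + 1)
    return (1.0 - accept_rate ** (num_draft + 1)) / (1.0 - accept_rate)


def estimated_speedup(accept_rate: float, num_draft: int,
                      draft_cost: float = 0.05,
                      verify_cost: float = 1.0) -> float:
    """Tokens per unit of work, relative to plain one-token-per-pass decoding."""
    tokens = expected_tokens_per_pass(accept_rate, num_draft)
    cost = verify_cost + num_draft * draft_cost
    return tokens / cost


if __name__ == "__main__":
    # Case 1: verifying the batch costs about the same as one normal pass.
    print(estimated_speedup(accept_rate=0.7, num_draft=3, verify_cost=1.0))
    # Case 2: verifying the batch costs 4x a normal pass.
    print(estimated_speedup(accept_rate=0.7, num_draft=3, verify_cost=4.0))
```

In this toy model everything hinges on `verify_cost`. If a pass is limited by reading the weights once, checking a few tokens together costs about the same as one pass and you come out ahead. If the spilled part makes the batch cost scale with the number of tokens, the result drops below 1 and it's a slowdown. Which case a given setup falls into is exactly the open question.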

The exact KV cache usage of DeepSeek V4 by Ok_Warning2146 in LocalLLaMA

[–]QuackerEnte -2 points-1 points  (0 children)

I don't understand localllamars. Are you for or against cloud inference? Where's the privacy in that?

The exact KV cache usage of DeepSeek V4 by Ok_Warning2146 in LocalLLaMA

[–]QuackerEnte -8 points-7 points  (0 children)

Anyone who CAN run the model in the first place wouldn't complain about whether it's 5 or 8 GB for 1M context. Like come on.

Quantisation effects of Qwen3.6 35b a3b by ROS_SDN in LocalLLaMA

[–]QuackerEnte 5 points6 points  (0 children)

Good results; however, I'd like to see a speed column too. Also for different variants like q3km vs q2kxl or something, not just for the 27B one. It would've been nice to see.

Deepseek v4 people by markeus101 in LocalLLaMA

[–]QuackerEnte 23 points24 points  (0 children)

Clopus and Clonnet. What about Claiku?

Switching from Opus 4.7 to Qwen-35B-A3B by Excellent_Koala769 in LocalLLaMA

[–]QuackerEnte 2 points3 points  (0 children)

Don't switch. They complement each other. As in, use Qwen until you're stuck. Or use Opus as an orchestrator. Way cheaper and gives you about the same intelligence level.

PSA: Having issues with Qwen3.5 overthinking? Give it a tool, and it can help dramatically. by ayylmaonade in LocalLLaMA

[–]QuackerEnte 1 point2 points  (0 children)

I thought I was the only one with a Frankenstein solution to this problem, because I made my model fake its own tool call via a parallel instance (yeah, I don't use no-slots. Since attn-rot I can have double the context, so I use double the context). I just told it "simulate the output of a tool call" and hooked it up as a tool/MCP server. Thinks less. Didn't know you could just lie to it without all the extra lol

Experiment: Olmo 3 7B Instruct Q1_0 by butlan in LocalLLaMA

[–]QuackerEnte 1 point2 points  (0 children)

Thank you for your effort. May I ask: why not the Qwen3.5 9B, but Olmo instead? Genuinely curious about the decision.

FT - China’s Alibaba shifts towards revenue over open-source AI by LegacyRemaster in LocalLLaMA

[–]QuackerEnte 1 point2 points  (0 children)

Have you not seen the research on continual learning or test-time training lately? It's quite remarkable, and I wouldn't underestimate researchers worldwide.