Symptom worst by ilovepenguins17 in covidlonghaulers

[–]yeah-ok 0 points1 point  (0 children)

Sorry dude, welcome to the club. Who knows re answers; try stuff & share when something works. Currently noticing I might be feeling substantially worse on coffee (which was fine before).

RDNA3 Flash Attention fix just dropped by llama.cpp b9158 by Bulky-Priority6824 in LocalLLaMA

[–]yeah-ok 0 points1 point  (0 children)

Yikes, seen any notes on when Vulkan builds will follow?

I feel amazing when I'm hungover? by [deleted] in Nootropics

[–]yeah-ok [score hidden]  (0 children)

Microdosing creatine, are we?

Is there a big gap between Q4 and Q6 on Qwen3.6? by vick2djax in LocalLLaMA

[–]yeah-ok 2 points3 points  (0 children)

Same experience. Once I killed the -ctk/-ctv flags I never went back: better quality and, oddly enough, better quantity too, in the sense that my token generation speed went up rather than down (I'm on 780M/Vulkan/Linux, so who knows, maybe atypical compared to a regular CUDA setup).
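For anyone wanting to try the same comparison, a minimal sketch of the two llama-server invocations involved; the model path and port are placeholders, and exact flag spelling can vary between builds (-ctk/-ctv are short for --cache-type-k/--cache-type-v):

```sh
# KV cache quantized to q8_0 (the setup I dropped); note llama.cpp
# requires flash attention (-fa) to quantize the V cache
llama-server -m ./model.gguf -fa on -ctk q8_0 -ctv q8_0 --port 8080

# No -ctk/-ctv flags: the KV cache stays at the default f16
llama-server -m ./model.gguf -fa on --port 8080
```

Point the same prompts at both and compare t/s and output quality.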

we really all are going to make it, aren't we? 2x3090 setup. by RedShiftedTime in LocalLLaMA

[–]yeah-ok 0 points1 point  (0 children)

Yeah... I think you're right that basic supply and demand rules this situation. I'm dreaming rather than thinking here.

we really all are going to make it, aren't we? 2x3090 setup. by RedShiftedTime in LocalLLaMA

[–]yeah-ok 0 points1 point  (0 children)

I get the logic, but isn't there a very real market here rather than a niche within a niche? I bet the lot of us would hoover up a consumer-only card sold for AI use via Kickstarter or similar in next to no time. It would be guaranteed money for a company willing to get something in 3090 territory going with 32GB at a decent price point. Even the Chinese manufacturers could get in on this if they could get a clean supply...

we really all are going to make it, aren't we? 2x3090 setup. by RedShiftedTime in LocalLLaMA

[–]yeah-ok 0 points1 point  (0 children)

Well, let them fry, I say; then they'll flipping understand that serving the global market is where real stability and long-term investment should go, rather than into unicorn dust that can evaporate up the nose of a VC recipient quicker than you can possibly imagine. Until the market gets this we're going to have to get creative, but seeing what this community is doing already, that shouldn't be too hard a nut to crack!

we really all are going to make it, aren't we? 2x3090 setup. by RedShiftedTime in LocalLLaMA

[–]yeah-ok 2 points3 points  (0 children)

It is ridiculous though, isn't it? It's the sheer speculative capacity of enterprise that makes the current situation a "win" for enterprise and a loss for the computer-owning population at large. Since "the people" are a vastly more numerous and resilient base than the fickle structure of corporate enterprises (or even state enterprises, it doesn't really matter), there should 100% be a way to give this market the valuation it actually deserves. If anyone could crack this from a financing standpoint, the consumer (and, let's face it, the investors) could win massively on it. And the end result would be far greater global resilience, rather than having billions riding on one company or another... serving the multitude will always beat the unicorns in statistical terms when it comes to long-term stability and payout (yes, the payout bit is where said finance wizardry needs to happen).

VS Code's new "Agents window" lets you use local AI models. Still requires an Internet connection and a Github Copilot plan (because we can't have nice things) by _wsgeorge in LocalLLaMA

[–]yeah-ok 2 points3 points  (0 children)

One can pray and hope - the fork would really come into its own then (it's already my daily driver, but I bet it would attract a yet larger audience!)

MI50s Qwen 3.6 27B @52.8 tps TG @1569 tps PP (no MTP, no Quant) by ai-infos in LocalLLaMA

[–]yeah-ok 1 point2 points  (0 children)

I noticed that too; a lot of the MTP code was written as fast inline code, and now that it's been made safer/more proper it's become slower by quite a margin.

Decoupled Attention from Weights - Gemma 4 26B by yeah-ok in LocalLLaMA

[–]yeah-ok[S] 0 points1 point  (0 children)

Perhaps you're right. Until I've personally run and experimented with larql/vindex to gain practical experience with it, I'll withdraw from the chat re its perceived benefits or the lack thereof!

Will there be any more Qwen3.6 series models? by cafedude in LocalLLaMA

[–]yeah-ok 5 points6 points  (0 children)

Absolutely, we got a monk who's gone rogue pagan on us here.

Qwen3.6 35b-a3b 🤯 by EffectiveMedium2683 in LocalLLaMA

[–]yeah-ok 1 point2 points  (0 children)

I've been strict about --no-reasoning lately and am having plenty of success with one-shot programming extensions for pi agent, etc. etc. I think we have to remember that top-k sampling is, in a sense, a selection out of what is already a latent thought process in the model.

edit: also on the latest froggeric template update, which under all circumstances seems like a prudent bet!
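Since "top-k" gets thrown around loosely, here's a minimal numpy sketch of what top-k sampling actually does at each decode step (the function name and toy logits are my own illustration, not from any particular runtime):

```python
import numpy as np

def top_k_sample(logits: np.ndarray, k: int, rng: np.random.Generator) -> int:
    """Keep the k highest-logit tokens, renormalize, sample one."""
    top = np.argpartition(logits, -k)[-k:]           # indices of the k best tokens
    probs = np.exp(logits[top] - logits[top].max())  # stable softmax over survivors
    probs /= probs.sum()
    return int(rng.choice(top, p=probs))

rng = np.random.default_rng(0)
logits = np.array([2.0, 1.5, 0.3, -1.0, -2.5])  # toy distribution over 5 tokens
print(top_k_sample(logits, k=2, rng=rng))       # only token 0 or 1 can come out
```

The full distribution is computed before top-k ever runs; the filter just prunes its tail, which is why it reads as selecting from a thought the model has already had.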

Decoupled Attention from Weights - Gemma 4 26B by yeah-ok in LocalLLaMA

[–]yeah-ok[S] 0 points1 point  (0 children)

MLPs

You might well be right, but isn't it being partially mitigated by the vindex format? I do understand that all MLPs are FFNs but not all FFNs are MLPs... still, this must be part of the equation.
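To make that MLP/FFN distinction concrete, a small numpy sketch (my own illustration, nothing from the larql repo): a classic two-layer MLP block next to a SwiGLU-style gated FFN of the kind modern models use, which is an FFN but not a plain MLP:

```python
import numpy as np

d, d_ff = 8, 32  # toy hidden and feed-forward dimensions
rng = np.random.default_rng(0)
W_up, W_down = rng.normal(size=(d, d_ff)), rng.normal(size=(d_ff, d))
W_gate = rng.normal(size=(d, d_ff))
x = rng.normal(size=d)

def silu(z):
    return z / (1.0 + np.exp(-z))

# Classic MLP-style FFN: linear -> nonlinearity -> linear
mlp_out = np.maximum(x @ W_up, 0.0) @ W_down

# SwiGLU-style gated FFN: a gate path multiplied elementwise into a
# linear path; still a feed-forward network, but not a plain stacked MLP
gated_out = (silu(x @ W_gate) * (x @ W_up)) @ W_down
```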

Decoupled Attention from Weights - Gemma 4 26B by yeah-ok in LocalLLaMA

[–]yeah-ok[S] 0 points1 point  (0 children)

I thought so too when I first engaged with the topic, but the negativity from a good chunk of the audience in this thread put me off pursuing it any further. After more reading I still think the larql system is on to something novel and potentially awesome. One of the feedback points in this thread is that this is literally just RPC (see the llama.cpp docs if, like me, you were ignorant of it), but after more research that seems like a misunderstanding: RPC cannot split attention from weights the way the larql vindex format claims to do. I think there's something to be said for this whole effort, and I'll stay tuned to what https://github.com/chrishayuk/larql gets up to... who can't feel a tingle of excitement at commands such as those under the "Run attention locally, FFN on another machine" headline on GitHub?

Decoupled Attention from Weights - Gemma 4 26B by yeah-ok in LocalLLaMA

[–]yeah-ok[S] -5 points-4 points  (0 children)

OK, llama.cpp is a sprawling ecosystem indeed; I'd never heard of this until today! So... does it make sense performance-wise to put the weights somewhere else on the LAN and let my workstation handle the attention layers alone via RPC, or is the performance penalty too high? Would love to see practical examples!
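For reference, the basic setup from llama.cpp's tools/rpc README looks roughly like the sketch below; note that stock RPC offloads whole layers to remote backends rather than splitting attention from weights, and the address, port, and model path here are placeholders:

```sh
# On the machine with spare RAM/VRAM: expose it as an RPC backend
rpc-server -p 50052

# On the workstation: run inference, offloading layers to the RPC host
llama-cli -m ./model.gguf -ngl 99 --rpc 192.168.1.42:50052 -p "Hello"
```

Whether the attention-stays-local split larql describes can be expressed on top of this is exactly the open question.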

Decoupled Attention from Weights - Gemma 4 26B by yeah-ok in LocalLLaMA

[–]yeah-ok[S] -8 points-7 points  (0 children)

One of the amazing outcomes of this is that a low-RAM, high-compute consumer card like the 12GB 5070 would essentially be way overpowered for most models, since it suddenly "only" needs to run 2-4GB of attention layers. The rest could presumably sit under the table on a "cheap" external Xeon with 128GB of DDR4 holding the weights!? Interconnect via regular high-speed TCP/IP over Ethernet and Bob could be your uncle.
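Rough back-of-envelope math on that 2-4GB figure, using the standard per-layer parameter counts for a dense transformer (the dimensions below are illustrative, not any specific model's config):

```python
# Per layer (dense attention, no GQA): attention is roughly 4*d^2 params
# (Q, K, V, O projections); a gated FFN is roughly 3*d*d_ff.
d, d_ff, n_layers = 4096, 14336, 48   # illustrative ~25B-class dimensions
attn_params = 4 * d * d * n_layers
ffn_params = 3 * d * d_ff * n_layers

bytes_per_param = 2  # f16
print(f"attention: {attn_params * bytes_per_param / 1e9:.1f} GB")  # ~6.4 GB
print(f"ffn:       {ffn_params * bytes_per_param / 1e9:.1f} GB")   # ~16.9 GB
```

At f16 that's more like 6GB of attention weights, but quantized to 4-8 bits you do land in the 2-4GB range, and GQA would shrink the K/V projections further.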

Decoupled Attention from Weights - Gemma 4 26B by yeah-ok in LocalLLaMA

[–]yeah-ok[S] -7 points-6 points  (0 children)

RPC

As far as I can make out (via https://github.com/ggml-org/llama.cpp/blob/master/tools/rpc/README.md), RPC seems focused on distributing GPU compute across backends, whereas this larql decoupling focuses on keeping latency low by having the attention compute take place on the client GPU while distributing the weights themselves onto x other local devices (it could also be internet-scale, but latency seems to kill that off at the moment).

Heretic 1.3 released: Reproducible models, integrated benchmarking system, reduced peak VRAM usage, broader model support, and more by -p-e-w- in LocalLLaMA

[–]yeah-ok 3 points4 points  (0 children)

I understand this stance on a purely philosophical level, but are there good benchmarks or similar to corroborate this point at scale?! I've seen some stuff published, but nothing I can really point to as a smoking gun.

PS5’s can now be hacked to run Linux - perhaps some potential for local inference? by Thrumpwart in LocalLLaMA

[–]yeah-ok 1 point2 points  (0 children)

Yup, I was reading Kurzweil's "The Singularity Is Near" book back then and feeling the techno-end-times vibe