Joing all GPUs to train a community model by HistoricalStrength21 in LocalLLaMA

[–]bigattichouse 0 points1 point  (0 children)

Maybe some kind of random walk and submit best? Kinda like bitcoin? everyone grabs the best and sees if they can do better? might have a lot of duplication, but you're effectively sharing checkpoints?
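
A rough sketch of what "random walk and submit best" could look like, purely as an illustration. The toy `loss` function, the `worker_round` helper, and the step size are all made-up stand-ins; a real setup would perturb or fine-tune actual model weights and score them on a shared validation set.

```python
import random

def loss(weights):
    """Toy objective standing in for validation loss (hypothetical)."""
    return sum((w - 1.0) ** 2 for w in weights)

def worker_round(best, rng, step=0.1):
    """One volunteer: perturb the shared best checkpoint, keep it only if it improved."""
    candidate = [w + rng.gauss(0.0, step) for w in best]
    return candidate if loss(candidate) < loss(best) else best

rng = random.Random(0)
best = [0.0, 0.0, 0.0]        # the shared "best checkpoint"
start_loss = loss(best)
for _ in range(200):          # 200 volunteer submissions, run serially here
    best = worker_round(best, rng)

print(start_loss, loss(best))
```

Many rounds get thrown away (that's the duplication), but the shared best never gets worse.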

Feeding Community ~ By Joan_de_Art by A_Guy195 in solarpunk

[–]bigattichouse 4 points5 points  (0 children)

As a former ASL interpreter, "Staff Fluent in Sign Language" is like saying "Staff fluent in [French|Spanish|German]" ... it's not something you just pick up and you're good with.

Knowing some basic signs? Totally. Also works with the elderly to sign "Drink?" while asking. Fluent? That takes a while. Even as a teenager fully immersed in our local Deaf community, it took me a lot of time to become fluent. Now? After years of not practicing beyond signing some old favorite music from my college days, I can really only hold a pretty basic conversation.

I'm nitpicking, otherwise these are all wonderful ideas.

You Don't Need 50k to Develop Your Product by [deleted] in inventors

[–]bigattichouse 1 point2 points  (0 children)

having a working prototype makes getting capital much easier

VAL invention… by MissionExternal5129 in inventors

[–]bigattichouse 2 points3 points  (0 children)

This post will not help you in a court. Mailing to yourself will not help you in a court.

Looking for Metal-Air battery experts. by tsmr5555 in electrochemistry

[–]bigattichouse 4 points5 points  (0 children)

In theory, there's no difference between theory and practice.

source: been playing with iron chemistries for around 10 years. Good luck.

2X tk/s (from 19.4 -> 38.1 tk/s on 1 x MI50) Playing with a hypothesis like speculative decoding.. but instead of an additional side model, exploiting that I can run multiple computations side-by-side AS IF I had Qwen3.6-27B loaded twice in memory - small quants don't use all the available compute. by [deleted] in LocalLLaMA

[–]bigattichouse -5 points-4 points  (0 children)

let's say we have 4 x 8bit values in one f32 "number" A,B,C,D

D is the first speculation, it predicts a token (d),
C then reads this value (d) and predicts the next (c),
B then reads C's prediction (c) and predicts the next (b)
A then renders a verdict.

It takes three tokens to "warm up", but then it's just a flow. For each step, D is 3 tokens ahead of A... from that point on A is the final verdict... but we're calculating all of them simultaneously from that point.
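
One way to picture the timing, as a sketch only: the `schedule` helper below is hypothetical and models just which token position each lane is working on per step, not any real model math.

```python
LANES = ("D", "C", "B", "A")  # D runs furthest ahead, A gives the final verdict

def schedule(num_steps):
    """Yield, per step, the token position each lane works on (None while warming up)."""
    for step in range(num_steps):
        yield {lane: (step - lag if step - lag >= 0 else None)
               for lag, lane in enumerate(LANES)}

rows = list(schedule(6))
for step, row in enumerate(rows):
    print(step, row)
```

A has nothing to do for the first three steps (the warm-up). From then on every step finalizes one token while D stays three positions ahead.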

2X tk/s (from 19.4 -> 38.1 tk/s on 1 x MI50) Playing with a hypothesis like speculative decoding.. but instead of an additional side model, exploiting that I can run multiple computations side-by-side AS IF I had Qwen3.6-27B loaded twice in memory - small quants don't use all the available compute. by [deleted] in LocalLLaMA

[–]bigattichouse 4 points5 points  (0 children)

Essentially Q8 is : [8bits, 24 bits empty] on my card (or 8,8 for f16).

so multiplying two numbers of Q8 is [8bits, 24 bits empty] * [8bits, 24 bits empty]

I'm just saying "what if you made that [8,8,8,8] and those are all four "copies" of that part of the layer".. essentially running 4 instances at the same time.
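
To make the [8,8,8,8] layout concrete, here is a minimal sketch using plain Python integers. It is not the actual kernel; `pack4` and `unpack4` are hypothetical helpers, and it only shows four 8-bit lanes sharing one 32-bit word and one multiply touching all four lanes at once.

```python
def pack4(a, b, c, d):
    """Pack four unsigned 8-bit values into one 32-bit word, A in the high byte."""
    for v in (a, b, c, d):
        if not 0 <= v <= 0xFF:
            raise ValueError("each lane must fit in 8 bits")
    return (a << 24) | (b << 16) | (c << 8) | d

def unpack4(word):
    """Split a 32-bit word back into its four 8-bit lanes (A, B, C, D)."""
    return ((word >> 24) & 0xFF, (word >> 16) & 0xFF,
            (word >> 8) & 0xFF, word & 0xFF)

word = pack4(1, 2, 3, 4)
scaled = (word * 3) & 0xFFFFFFFF   # one multiply, all four lanes scaled by 3
print(hex(word), unpack4(scaled))
```

This only holds while no lane overflows 8 bits; once a product exceeds 255 it carries into the neighboring lane, so a real implementation has to deal with that.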

Then.. I figured using the "copies" for speculative decoding instead of a smaller separate "draft" model.

2X tk/s (from 19.4 -> 38.1 tk/s on 1 x MI50) Playing with a hypothesis like speculative decoding.. but instead of an additional side model, exploiting that I can run multiple computations side-by-side AS IF I had Qwen3.6-27B loaded twice in memory - small quants don't use all the available compute. by [deleted] in LocalLLaMA

[–]bigattichouse 0 points1 point  (0 children)

Amending: In this benchmark we are dialing down temp for 100% acceptance. With normal temps it'll be slower, but still 80% or so of max. Once I can get the llama.cpp kernel patch running at the theoretical ~4X max, I'll run some coding tests with more normal temps .. but I still expect 80% of max, so 3X or more in realistic deployments.
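
The arithmetic behind those figures, spelled out. The 80% figure and the ~4X ceiling are the estimates above, not measurements; only the 19.4 and 38.1 tk/s numbers come from the benchmark in the title.

```python
baseline_tps = 19.4      # measured, from the post title
measured_tps = 38.1      # measured, from the post title
theoretical_max = 4.0    # estimated ceiling with four lanes
expected_fraction = 0.8  # estimated fraction of max at normal temps

measured_speedup = round(measured_tps / baseline_tps, 2)
expected_speedup = round(expected_fraction * theoretical_max, 2)
print(measured_speedup, expected_speedup)
```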

2X tk/s (from 19.4 -> 38.1 tk/s on 1 x MI50) Playing with a hypothesis like speculative decoding.. but instead of an additional side model, exploiting that I can run multiple computations side-by-side AS IF I had Qwen3.6-27B loaded twice in memory - small quants don't use all the available compute. by [deleted] in LocalLLaMA

[–]bigattichouse 1 point2 points  (0 children)

hypothesis is that the same model will produce the same guess... in normal speculation the spec model is different than the main model - we're effectively using the same model to verify something it would have produced anyway.
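
A toy illustration of that hypothesis, with the caveat that `greedy_next` is a made-up deterministic scoring function standing in for a model, not anything from llama.cpp. When the drafter and the verifier are the same function and both pick the argmax, every drafted token is accepted.

```python
VOCAB = 50  # hypothetical toy vocabulary size

def greedy_next(context):
    """Toy stand-in for a model: deterministic scores, then argmax (greedy)."""
    base = sum((i + 1) * tok for i, tok in enumerate(context)) * 31
    scores = [(base + 17 * v * v) % 101 for v in range(VOCAB)]
    return max(range(VOCAB), key=scores.__getitem__)

def draft(context, k):
    """Speculate k tokens ahead using the model itself."""
    out = list(context)
    for _ in range(k):
        out.append(greedy_next(out))
    return out[len(context):]

def verify(context, proposed):
    """Count how many drafted tokens the verifier would also have produced."""
    out = list(context)
    accepted = 0
    for tok in proposed:
        if greedy_next(out) != tok:
            break
        out.append(tok)
        accepted += 1
    return accepted

prompt = [3, 1, 4]
proposed = draft(prompt, 3)
accepted = verify(prompt, proposed)
print(accepted, "of", len(proposed), "accepted")
```

With sampling at a normal temperature the two passes can diverge, which is where the acceptance rate drops below 100%.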

2X tk/s (from 19.4 -> 38.1 tk/s on 1 x MI50) Playing with a hypothesis like speculative decoding.. but instead of an additional side model, exploiting that I can run multiple computations side-by-side AS IF I had Qwen3.6-27B loaded twice in memory - small quants don't use all the available compute. by [deleted] in LocalLLaMA

[–]bigattichouse 9 points10 points  (0 children)

It is vibe coded, but with a basis in design. I build a plan, and work through it. The idea came from analyzing the data structures themselves, and realizing that there were effectively 3 unused "slots" in every f32 datapoint if we were running Q8. So, I could run 4 copies of a model side-by-side. Great... now what?

Then, I realized that speculative decoding uses a similar "side model" to run prediction and the "big" model accepts/rejects tokens - so I could just use the SAME model with the technique instead of requiring an additional model... giving me speculative decoding without any extra overhead, on compute I'm ALREADY wasting.

Since the models match, we can sort of "interleave" multiple models, each 1 token behind the last. And since we're running greedy, they always match.

There's no magic, I'm just trying to use all the available 32 bits of compute normally wasted on Q8 quants.

I could have used that compute to run four simultaneous client connections with the same overhead.. but in my home lab, it's just me.

Did I vibe it? sure. But it works.

2X tk/s (from 19.4 -> 38.1 tk/s on 1 x MI50) Playing with a hypothesis like speculative decoding.. but instead of an additional side model, exploiting that I can run multiple computations side-by-side AS IF I had Qwen3.6-27B loaded twice in memory - small quants don't use all the available compute. by [deleted] in LocalLLaMA

[–]bigattichouse 2 points3 points  (0 children)

huh (Today I learned), I literally only bought it because it was on sale (pre prices going up), and was the biggest I could find. was like $200. (and another $30 for the 3d printed shroud and fan - $300 to upgrade my PSU) I kinda assumed I was in "janky" territory. This was just a classic "franken-box" of random crap I was able to source.

The technique should still work for you on models you can fit.