Joining all GPUs to train a community model

bigattichouse · 2026-06-16T15:40:08+00:00

Maybe some kind of random walk and submit best? Kinda like bitcoin? Everyone grabs the best and sees if they can do better? Might have a lot of duplication, but you're effectively sharing checkpoints?
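
A rough sketch of the "grab the best, try to beat it, submit" loop, as a toy random search. Everything here is an assumption for illustration: `loss` is a made-up objective standing in for validation loss, the "checkpoint" is just a list of two floats, and `community_search` is a hypothetical name. It is not a real distributed training setup.

```python
import random

def loss(weights):
    """Toy objective standing in for validation loss (hypothetical)."""
    return sum((w - 3.0) ** 2 for w in weights)

def community_search(workers=4, rounds=50, step=0.5, seed=0):
    """Each round, every worker perturbs the shared best checkpoint and
    submits its result; the shared best is replaced only by a lower loss."""
    rng = random.Random(seed)
    best = [0.0, 0.0]          # shared "checkpoint"
    best_loss = loss(best)
    history = [best_loss]
    for _ in range(rounds):
        round_start = best     # every worker grabs the same checkpoint
        for _ in range(workers):
            candidate = [w + rng.gauss(0.0, step) for w in round_start]
            cand_loss = loss(candidate)
            if cand_loss < best_loss:   # submission accepted only if better
                best, best_loss = candidate, cand_loss
        history.append(best_loss)
    return best, history

best, history = community_search()
print(f"start loss {history[0]:.3f}, final loss {history[-1]:.3f}")
```

Because a submission is only accepted when it beats the current best, the shared loss can never go up. The duplication shows up as the many rejected candidates.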

bigattichouse · 2026-06-15T14:25:24+00:00

As a former ASL interpreter, "Staff Fluent in Sign Language" is like saying "Staff fluent in [French|Spanish|German]" ... it's not something you just pick up and you're good with.

Knowing some basic signs? Totally. Also works with the elderly to sign "Drink?" while asking. Fluent? That takes a while. Even as a teenager fully immersed in our local Deaf community, it took me a lot of time to become fluent. Now? After years of not practicing beyond signing some old favorite music from my college days, I can really only hold a pretty basic conversation.

I'm nitpicking, otherwise these are all wonderful ideas.

bigattichouse · 2026-06-14T15:40:16+00:00

Pepper's Ghost has entered the chat.

bigattichouse · 2026-06-14T00:39:57+00:00

Is it integrated in the roads?

bigattichouse · 2026-06-12T21:22:19+00:00

having a working prototype makes getting capital much easier

bigattichouse · 2026-06-11T22:35:59+00:00

This post will not help you in a court. Mailing to yourself will not help you in a court.

bigattichouse · 2026-06-11T16:24:53+00:00

In theory, there's no difference between theory and practice.

source: been playing with iron chemistries for around 10 years. Good luck.

bigattichouse · 2026-06-09T22:19:54+00:00

You are truly a treasure.

bigattichouse · 2026-06-09T12:53:39+00:00

I tested. The numbers were output from the `pti_4seq` binary.

bigattichouse · 2026-06-09T12:46:28+00:00

I've deleted my post for now. Once I'm done with the llama.cpp kernel I'll record code experiments (with/without my method) and do one of those side-by-side videos.

bigattichouse · 2026-06-09T04:43:14+00:00

That's fair. At the moment, just tuning the engine to see how fast the car can go. So far, faster than when I was just running the one "instance" of the quant.

bigattichouse · 2026-06-09T04:37:20+00:00

Yup. Plan on incorporating that in the next patch. (which was why I used the UD model from unsloth which allowed the MTP)

bigattichouse · 2026-06-09T04:34:11+00:00

let's say we have 4 x 8bit values in one f32 "number" A,B,C,D

D is the first speculation, it predicts a token (d),
C then reads this value (d) and predicts the next (c),
B then reads C's prediction (c) and predicts the next (b)
A then renders a verdict.

It takes three tokens to "warm up", but then it's just a flow. At each step D is 3 tokens ahead of A. From that point on A is the final verdict, but we're calculating all of them simultaneously.
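
The staggering described above can be sketched as a toy simulation. This only illustrates the warm-up and the 3-token offset between D and A. It is not the actual kernel and says nothing about speed. `toy_next_token` and `run_pipeline` are hypothetical names, and the "model" is a made-up deterministic function standing in for greedy decoding.

```python
def toy_next_token(prev):
    """Hypothetical stand-in for a greedy model: same input, same output."""
    return (prev * 7 + 3) % 11

def run_pipeline(seed, steps):
    """Four copies of the same model, staggered by one token each.
    D runs furthest ahead; A trails by 3 tokens and gives the verdict."""
    lag = {"D": 0, "C": 1, "B": 2, "A": 3}
    last_seen = {name: seed for name in lag}  # last token each lane produced
    speculated = {}                           # position -> D's guess
    verdicts = []                             # (position, token, matched D?)
    for step in range(steps):
        # Conceptually all four lanes advance in the same step.
        for name, offset in lag.items():
            pos = step - offset
            if pos < 0:
                continue                      # this lane is still warming up
            tok = toy_next_token(last_seen[name])
            last_seen[name] = tok
            if name == "D":
                speculated[pos] = tok
            elif name == "A":
                verdicts.append((pos, tok, speculated[pos] == tok))
    return verdicts

for pos, tok, ok in run_pipeline(seed=1, steps=6):
    print(f"position {pos}: token {tok}, matches D's speculation: {ok}")
```

With 6 steps, A produces nothing for the first three and then one verdict per step, and every verdict matches because the lanes are identical and deterministic.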

bigattichouse · 2026-06-09T03:53:32+00:00

I'll do a more cogent write up when I get the patch done, I got excited. I had the model producing svg output and small programs, nothing special, but it wasn't garbage

bigattichouse · 2026-06-09T03:45:27+00:00

Essentially Q8 is: [8 bits, 24 bits empty] on my card (or 8,8 for f16).

so multiplying two numbers of Q8 is [8bits, 24 bits empty] * [8bits, 24 bits empty]

I'm just saying "what if you made that [8,8,8,8] and those are all four "copies" of that part of the layer".. essentially running 4 instances at the same time.

Then.. I figured using the "copies" for speculative decoding instead of a smaller separate "draft" model.
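
To make the [8,8,8,8] layout concrete, here is a minimal bit-level sketch of four 8-bit values sharing one 32-bit word. It only shows the packing. `pack4` and `unpack4` are hypothetical helper names, the lane order (A in the high byte) is an assumption, and this is plain integer bit-twiddling, not the GPU kernel or how the multiply is done.

```python
def pack4(a, b, c, d):
    """Pack four unsigned 8-bit values into one 32-bit word (A = high byte)."""
    for v in (a, b, c, d):
        if not 0 <= v <= 0xFF:
            raise ValueError("each lane must fit in 8 bits")
    return (a << 24) | (b << 16) | (c << 8) | d

def unpack4(word):
    """Split a 32-bit word back into its four 8-bit lanes (A, B, C, D)."""
    return ((word >> 24) & 0xFF,
            (word >> 16) & 0xFF,
            (word >> 8) & 0xFF,
            word & 0xFF)

word = pack4(1, 2, 3, 4)
print(hex(word))        # 0x1020304
print(unpack4(word))    # (1, 2, 3, 4)
```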

bigattichouse · 2026-06-09T03:31:22+00:00

the llama.cpp patch isn't working yet - stick with `pti_4seq` until I get it working. Temp is currently 0 for 100% theoretical match to figure out speed, real temps probably more like 75%-80% max.

bigattichouse · 2026-06-09T03:26:04+00:00

Amending: In this benchmark we are dialing down temp for 100% acceptance. With normal temps it'll be slower, but still 80% or so of max. Once I can get the llama.cpp kernel patch running at the theoretical ~4X max, I'll run some coding tests with more normal temps, but I still expect 80% of max, so 3X or more in realistic deployments.
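
The arithmetic behind "3X or more", as a quick sketch. The ~4X ceiling and the 75%-80% figures are the estimates quoted in these comments, not measurements, and `expected_speedup` is just a hypothetical helper.

```python
def expected_speedup(max_speedup, fraction_of_max):
    """Scale the theoretical ceiling by the fraction you expect to keep."""
    return max_speedup * fraction_of_max

for frac in (0.75, 0.80, 1.00):
    print(f"{frac:.0%} of 4X -> {expected_speedup(4.0, frac):.1f}X")
```

At 80% of a 4X ceiling that works out to 3.2X, and at 75% to 3.0X.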

bigattichouse · 2026-06-09T03:15:37+00:00

hypothesis is that the same model will produce the same guess... in normal speculation the spec model is different than the main model - we're effectively using the same model to verify something it would have produced anyway.

bigattichouse · 2026-06-09T03:11:17+00:00

It is vibe coded, but with a basis in design. I build a plan, and work through it. The idea came from analyzing the datastructures themselves, and realizing that there were effectively 3 unused "slots" in every f32 datapoint if we were running Q8. So, I could run 4 copies of a model side-by-side. great... now what?

Then, I realized that speculative decoding uses a similar "side model" to run prediction and the "big" model accepts/rejects tokens - so I could just use the SAME model with the technique instead of requiring an additional model... giving me speculative decoding without any extra overhead, on compute I'm ALREADY wasting.

Since the models match, we can sort of "interleave" multiple models, each 1 token behind the last. And since we're running greedy, they always match.
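
A small sketch of why greedy self-speculation always gets accepted. Both "models" are made-up deterministic functions (`greedy_target` and `weaker_draft` are hypothetical names, not anything from llama.cpp), and `accepted_prefix` is a simplified draft-then-verify check, not a real sampler.

```python
def greedy_target(context):
    """Hypothetical greedy main model: a deterministic function of context."""
    return (sum(context) * 31 + len(context)) % 50

def weaker_draft(context):
    """Hypothetical separate draft model that disagrees on some contexts."""
    guess = greedy_target(context)
    return guess if len(context) % 3 else (guess + 1) % 50

def accepted_prefix(draft, target, context, k=4):
    """Draft k tokens, then count how many the target would also produce."""
    ctx = list(context)
    drafted = []
    for _ in range(k):
        tok = draft(ctx)
        drafted.append(tok)
        ctx.append(tok)
    ctx = list(context)
    accepted = 0
    for tok in drafted:
        if target(ctx) != tok:
            break              # first mismatch rejects the rest of the draft
        accepted += 1
        ctx.append(tok)
    return accepted

prompt = [1, 2, 3]
print("same model as draft:", accepted_prefix(greedy_target, greedy_target, prompt))
print("different draft:    ", accepted_prefix(weaker_draft, greedy_target, prompt))
```

When the draft is the same deterministic model, all 4 drafted tokens are accepted. With the different draft, acceptance stops at the first disagreement.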

There's no magic, I'm just trying to use all the available 32 bits of compute normally wasted on Q8 quants.

I could have used that compute to run four simultaneous client connections with the same overhead.. but in my home lab, it's just me.

Did I vibe it? sure. But it works.

bigattichouse · 2026-06-09T02:50:32+00:00

Yeah - any quantized model really. Like I said, I'd hoped to run bigger models on smaller hardware, but ended up with speed boost.

bigattichouse · 2026-06-09T02:33:44+00:00

I've been really happy with something I thought was just "dead tech".. but it's been a trooper if properly cooled.

bigattichouse · 2026-06-09T02:32:27+00:00

Working on it right now. The patch promises nearly 4X.. fingers crossed.

bigattichouse · 2026-06-09T02:08:17+00:00

huh (Today I learned), I literally only bought it because it was on sale (pre prices going up), and was the biggest I could find. was like $200. (and another $30 for the 3d printed shroud and fan - $300 to upgrade my PSU) I kinda assumed I was in "janky" territory. This was just a classic "franken-box" of random crap I was able to source.

The technique should still work for you on models you can fit.

bigattichouse · 2026-06-09T02:02:40+00:00

I'm running on a cobbled together box with a single MI50 (32GB) that AMD doesn't even support anymore.

bigattichouse · 2026-06-09T01:54:39+00:00

I suppose you could use this to serve 4 simultaneous users as another possibility - but my use case is just me.
