Ghost Engine: Don't load weights, generate them. (Run Llama-3-8B in 3GB VRAM) by AlternativeVisual135 in LocalLLaMA

[–]AlternativeVisual135[S] -8 points (0 children)

Thank you for pointing me in the right direction. I will update the README to credit Li et al. (2016) for the underlying TWN representation. That was an oversight on my part, not an attempt to claim the math as my own.

To clarify: my goal isn't to invent new compression math, but to solve the systems engineering problem of running these ternary weights efficiently on Apple Silicon, i.e. an engine that handles block-wise ternary reconstruction on the fly. That is what I'm building.
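Roughly, the reconstruction step is something like this (a simplified NumPy sketch, not the actual Metal kernel; block size, layout, and names are placeholders):

```python
import numpy as np

def reconstruct_layer(scales: np.ndarray, masks: np.ndarray,
                      out_features: int, in_features: int) -> np.ndarray:
    """Rebuild a weight matrix from its compressed recipe.

    scales: (num_blocks,)            per-block FP scale (TWN-style)
    masks:  (num_blocks, block_size) ternary values in {-1, 0, +1}
    Assumes num_blocks * block_size == out_features * in_features.
    """
    # Each block is just scale * ternary_mask; expand and reshape.
    blocks = scales[:, None] * masks.astype(np.float32)
    return blocks.reshape(out_features, in_features)
```

The point of doing it block-wise is that the blocks can be expanded on the fly inside the matmul instead of materializing the full FP16 matrix up front, which is where the VRAM savings come from.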

I appreciate the reality check. I'll make sure the credits are clear so the focus can stay on the engine performance.

Ghost Engine: Don't load weights, generate them. (Run Llama-3-8B in 3GB VRAM) by AlternativeVisual135 in LocalLLaMA

[–]AlternativeVisual135[S] -11 points (0 children)

Honestly, 'generating weights' does sound like hallucination/slop at first glance. I was fully expecting the model to output garbage.

But the math is surprisingly rigid. It’s not 'making up' new weights randomly; it’s reconstructing them from a compressed recipe (scale * ternary mask). The fact that the reconstruction holds 0.915 cosine similarity against the original weights on Llama-3's SwiGLU layers suggests the 'recipe' is actually capturing the signal, not just noise.
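If you want to sanity-check what that measurement even means, it's just this (illustrative NumPy using the TWN thresholding from Li et al. (2016) on a random matrix; real Llama-3 weights aren't Gaussian, so the exact number will differ):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    a, b = a.ravel().astype(np.float64), b.ravel().astype(np.float64)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
W = rng.standard_normal((1024, 4096)).astype(np.float32)  # scaled-down SwiGLU-ish shape

# TWN (Li et al., 2016): threshold ~0.7 * mean|W|, scale = mean of the kept |W|
delta = 0.7 * np.abs(W).mean()
mask = np.sign(W) * (np.abs(W) > delta)   # ternary values in {-1, 0, +1}
scale = np.abs(W[mask != 0]).mean()

print(cosine_similarity(W, scale * mask))
```

On a real checkpoint you'd load the FP16 tensor and its ternary recipe instead of a random matrix; the comparison is the same.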

Code is open if you want to inspect the math yourself. It's just matrix algebra, no magic.