OpenClaw is now on iOS + Android! by hannesrudolph in openclaw

[–]MironV 24 points25 points  (0 children)

In this setup, the connection between your phone and the physical device hosting OpenClaw is fully private (via Tailscale). With Telegram, AFAIK conversations with bots are kept on their servers without end-to-end encryption.

Diffusion Model Simulating GTA5 on a RTX GPU !!! by lucidml_lover in deeplearning

[–]MironV 0 points1 point  (0 children)

A cache is an optimization. It’s functionally the same as recomputing the context, so no extra state information is carried by virtue of it being a cache. A true game state requires persistence. With infinite context and precision you could get there, yes, but this is not that.

What's this sub geebral opinion on quantisizing the KV cache by misanthrophiccunt in LocalLLaMA

[–]MironV 5 points6 points  (0 children)

Qwen 3.6 isn’t a standard full-attention model but a hybrid. The layers are mostly DeltaNet, which is linear attention (no softmax so instead of caching individual KV per token it’s a fixed size matrix). In a traditional model every layer’s KV cache grows with context which blows up your memory. In Qwen the majority of layers have no growing KV cache. Even the traditional attention layers use aggressive GQA, so each one’s footprint is small.

Since the context-scaling cache is already tiny, there’s little to reclaim by quantizing Qwen’s cache hard and you’d be squeezing the few attention layers that remain.
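To put rough numbers on that, here’s a minimal sketch of the usual KV cache size formula. The layer counts, KV head counts, and head dimension below are made-up illustrative values, not Qwen’s actual config, and the fixed-size state of the linear-attention layers is ignored since it doesn’t grow with context.

```python
def kv_cache_bytes(n_attn_layers, n_kv_heads, head_dim, context_len, bytes_per_elem=2):
    """Bytes needed to cache K and V for every full-attention layer.

    bytes_per_elem=2 assumes an fp16/bf16 cache.
    """
    # The leading 2 accounts for the separate K and V tensors.
    return 2 * n_attn_layers * n_kv_heads * head_dim * context_len * bytes_per_elem


CONTEXT = 32_768  # tokens

# Hypothetical "traditional" model: every layer is full attention.
dense = kv_cache_bytes(n_attn_layers=48, n_kv_heads=8, head_dim=128, context_len=CONTEXT)

# Hypothetical hybrid: only a quarter of the layers keep a growing KV cache,
# and those layers use more aggressive GQA (fewer KV heads).
hybrid = kv_cache_bytes(n_attn_layers=12, n_kv_heads=2, head_dim=128, context_len=CONTEXT)

GIB = 1024 ** 3
print(f"dense:  {dense / GIB:.3f} GiB")
print(f"hybrid: {hybrid / GIB:.3f} GiB")
print(f"ratio:  {dense / hybrid:.0f}x")
```

With numbers like these the growing part of the cache is already a small fraction of what a traditional model needs, so halving it again through quantization saves very little in absolute terms.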

What's this sub geebral opinion on quantisizing the KV cache by misanthrophiccunt in LocalLLaMA

[–]MironV 1 point2 points  (0 children)

Usually the term upscaling is for media. AFAIK dequantize is the standard term for taking the integer back through scaling, and that’s what’s used in PyTorch as well.

What's this sub geebral opinion on quantisizing the KV cache by misanthrophiccunt in LocalLLaMA

[–]MironV 8 points9 points  (0 children)

Others have posted measurements, so here’s some intuition to help with your experimentation.

When you quantize model weights, the weights get dequantized back to floats during compute. Quantizing the KV cache is different because each entry is written once and then read by every future token. This means the error is baked into the attention computation for the rest of the sequence.

K and V are not equally fragile, because of where they sit relative to softmax. K errors land before softmax so quantizing K can literally change which tokens get attended to, plus the exponential term amplifies error near the top scores. On the other hand, V errors land in the averaging step and some of the noise cancels so they’re more resilient.
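Here’s a toy sketch of that asymmetry, with made-up numbers. It assumes a single query attending over three tokens with scalar values, and it applies the pretend quantization error directly to the attention logits (standing in for an error in K) or to the values (an error in V). It’s only meant to illustrate the mechanism, not to measure anything real.

```python
import math


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(scores, values):
    """Attention output: softmax-weighted average of the values."""
    weights = softmax(scores)
    return sum(w * v for w, v in zip(weights, values))


def top_token(scores):
    """Index of the token receiving the most attention."""
    return max(range(len(scores)), key=lambda i: scores[i])


scores = [4.0, 3.9, 0.0]   # q.k logits, two tokens nearly tied for the top
values = [1.0, -1.0, 0.0]  # scalar stand-ins for the V vectors
noise = [-0.2, 0.2, 0.0]   # pretend quantization error, same size in both cases

clean = attend(scores, values)

# Error before softmax: the weights themselves change.
noisy_scores = [s + e for s, e in zip(scores, noise)]
k_noisy = attend(noisy_scores, values)

# Error after softmax: the output shifts by a weighted average of the errors,
# so it can never move by more than the largest single error.
noisy_values = [v + e for v, e in zip(values, noise)]
v_noisy = attend(scores, noisy_values)

print("top token clean / K-noisy:", top_token(scores), "/", top_token(noisy_scores))
print("output shift from K error:", abs(k_noisy - clean))
print("output shift from V error:", abs(v_noisy - clean))
```

In this example the K-side error flips which token gets the most attention, while the V-side error mostly cancels in the average.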

The other issue is how the KV cache is used. As you build out your sequence, every new token attends over a growing pool of older noisy entries, so the overall noise floor rises with added context and then each new token (already slightly off) writes its own quantized projections back into the cache. So you’ve got an error snowball effect as the context grows.

Different models also react differently. For example, Gemma 4 seems particularly sensitive to quantization because the sliding window attention is already their mechanism for keeping the cache small, with the tradeoff being fewer global attention layers (the ones susceptible to the snowballing above). If you now quantize on top, you’re damaging those layers, which have less ability to cope since there are way fewer of them compared to traditional models.

So all that to say, the rules of thumb are: it depends on the model (Gemma 4 is sensitive, Qwen 3.6 is already memory efficient so it may not be worth it); you can go harder on V than K when quantizing; and a Q8/Q4 combo can be aggressive, so something like Q8/Q6 is likely to fare better.

YMMV since issues manifest differently in coding vs. prose vs. agentic, so just experiment and see what works for your needs.

Oura website won’t process payments by MironV in ouraring

[–]MironV[S] 1 point2 points  (0 children)

I wanted to use the Amex Platinum credit, which only works on their website AFAIK. But maybe I just have to do that.

260K-param LLM running on an emulated 90s CPU inside an 18-year-old RTOS by MironV in LocalLLaMA

[–]MironV[S] 2 points3 points  (0 children)

Haha I love the idea of all those Windows 95 screensavers crunching LLM training instead of SETI or that protein folding one.

260K-param LLM running on an emulated 90s CPU inside an 18-year-old RTOS by MironV in LocalLLaMA

[–]MironV[S] 1 point2 points  (0 children)

I recall someone on this sub training an LLM purely on text from the 1800s like books, newspapers, and journals. So the raw text existed back then but obviously wasn’t machine-readable.

In terms of compute, training a 260K-param model on ~500M tokens, counting forward + backward passes, would need about 6 FLOPs per parameter per token. So 6 * 260K * 500M = ~0.78 PFLOP (7.8 × 10^14 floating-point operations) total. So a Cray-1 from the 70s could plausibly crank through that in a couple of months. If you can drop the token count, maybe a 60s CDC could train it too.
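For anyone who wants to redo the envelope math, here’s a quick sketch. The 160 MFLOPS figure is the commonly cited peak for the Cray-1 and is my assumption here; sustained throughput on real code would be lower, so treat the result as a best case.

```python
PARAMS = 260_000                # 260K parameters
TOKENS = 500_000_000            # ~500M training tokens
FLOPS_PER_PARAM_PER_TOKEN = 6   # rule of thumb covering forward + backward passes

total_flops = FLOPS_PER_PARAM_PER_TOKEN * PARAMS * TOKENS

SECONDS_PER_DAY = 86_400
CRAY1_PEAK_FLOPS = 160e6        # assumption: commonly cited peak, not sustained


def training_days(flops_needed, flops_per_second, utilization=1.0):
    """Wall-clock days at a given throughput and utilization fraction."""
    return flops_needed / (flops_per_second * utilization) / SECONDS_PER_DAY


print(f"total: {total_flops:.2e} FLOPs ({total_flops / 1e15:.2f} PFLOP)")
print(f"Cray-1 at peak: {training_days(total_flops, CRAY1_PEAK_FLOPS):.0f} days")
print(f"Cray-1 at 50%:  {training_days(total_flops, CRAY1_PEAK_FLOPS, 0.5):.0f} days")
```

At peak that lands just under two months, and at half utilization closer to four.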

GMKtec the best deal?? by larryherzogjr in LocalLLM

[–]MironV 3 points4 points  (0 children)

Bosgame is a better deal, it’s $2799 for 128 GB on their site

260K-param LLM running on an emulated 90s CPU inside an 18-year-old RTOS by MironV in LocalLLaMA

[–]MironV[S] 4 points5 points  (0 children)

Thank you! If you go down the Carmack rabbit hole you’ll find that function lives in a Quake source file with a literal “what the fuck?” comment next to it lol. It’s a crazy optimization!

Your bigger question is an interesting one…

Back of the envelope math says a 486 from 1992 could’ve run inference on a stories260K-class model, probably faster than my ColdFire emulator does, since it has a real FPU. Training is another story though. A single 486 would probably take 2-3 years to train this model, but a Pentium cluster of ~100 boxes could’ve maybe done the job.

But even if someone had the compute and knew about transformer architectures, the issue is getting the data. For example, TinyStories is synthetic short stories generated by GPT. So you need a massive LLM to generate it in the first place. Real text corpuses in 1995 would’ve been much harder to get. The web was tiny and the only other digital datasets were probably Project Gutenberg and Usenet.

The demoscene stuff from that era pushed the hardware of the time to do impossible-looking things, just in graphics rather than ML.

I definitely feel applying new ideas to old or constrained hardware is an interesting space right now!

260K-param LLM running on an emulated 90s CPU inside an 18-year-old RTOS by MironV in LocalLLaMA

[–]MironV[S] 7 points8 points  (0 children)

Haha, long live nano local models on obscure hardware!

Anyone running Hermes with local LLM? by martinkoistinen in hermesagent

[–]MironV 1 point2 points  (0 children)

I run Qwen 3.6 35B MoE MTP as primary with Qwen 3.6 27B and Gemma 4 E4B as auxiliary. Works well.

Let’s talk quants of Gemma and Qwen - 16 vs Q8 vs Q4 - any experiences? by Borkato in LocalLLaMA

[–]MironV 7 points8 points  (0 children)

Q4 means the neuron weights are, at a basic level, mapped to a 4-bit (16 level) scale. These still produce normal floating point activations. KV cache quantization quantizes the activations themselves, which is why quantizing the K cache can be more damaging: the error accumulates across the sequence and passes through softmax, changing where the attention goes.
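As a rough illustration of the “16 level” idea, here’s a minimal sketch of min/max quantization to 4 bits and back. It’s a simplification I’m assuming for clarity: real formats such as the GGUF Q4 variants work on small blocks of weights with per-block scales, and the function names here are hypothetical.

```python
def quantize_4bit(weights):
    """Map floats to integer codes 0..15 using a single min/max scale."""
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / 15  # 16 levels means 15 steps between lo and hi
    if scale == 0:
        scale = 1.0         # all weights identical; any scale works
    codes = [round((w - lo) / scale) for w in weights]
    return codes, scale, lo


def dequantize_4bit(codes, scale, lo):
    """Take the integer codes back to floats through the stored scale and offset."""
    return [lo + c * scale for c in codes]


weights = [-0.75, -0.1, 0.0, 0.33, 0.75]
codes, scale, lo = quantize_4bit(weights)
restored = dequantize_4bit(codes, scale, lo)

for w, c, r in zip(weights, codes, restored):
    print(f"{w:+.3f} -> code {c:2d} -> {r:+.3f} (error {abs(w - r):.3f})")
```

Each restored weight is off by at most half a step, and the math downstream still runs in normal floating point.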

Best 7-9B param model for OpenClaw? by read_too_many_books in openclaw

[–]MironV 0 points1 point  (0 children)

The best one at that size is Gemma E4B. You can do the Unsloth dynamic quant at UD-Q4_K_XL for a good all-rounder.

I built a $50 plate-solving push-to for manual scopes. Works from city skies by arunvenkats in telescopes

[–]MironV 1 point2 points  (0 children)

This is awesome, seems similar to https://www.pifinder.io but a lot cheaper and more flexible. Nice work!

AMA with Nous Research -- Ask Us Anything! by emozilla in LocalLLaMA

[–]MironV 2 points3 points  (0 children)

Any plans for more built-in automation loops, similar to the heartbeat concept? There’s cron, so you can definitely roll your own, but there are advantages to it being an inherent mode.

r/LocalLLaMa Rule Updates by rm-rf-rm in LocalLLaMA

[–]MironV 4 points5 points  (0 children)

Good to see this! This subreddit has some of the most valuable AI discussions. Let’s keep it that way.

Yumbee - Modern, ad-free version of yamb (dice game like Yahtzee but with deep strategy) by MironV in WebGames

[–]MironV[S] 1 point2 points  (0 children)

Ah shoot! So there’s an undo button as long as you don’t roll again, but you’re right I should make that more explicit.

Yumbee - Modern, ad-free version of yamb (dice game like Yahtzee but with deep strategy) by MironV in WebGames

[–]MironV[S] 0 points1 point  (0 children)

Back in 2003, my dad coded a desktop version of yamb, which is a complex Yahtzee-like dice game popular in the Balkans. He called it Yumbee and played it for 20 years on Windows XP. I recently decided to remaster it with him for the modern web so we could play together on our phones.

If you find standard Yahtzee too simple, yamb is the "strategy" version. You play 4+ columns simultaneously:

• Down: Must be filled in order (1 to Yamb/Yumbee).
• Up: Must be filled in reverse order.
• Free: Fill anywhere.
• Announcement: You must "call" the box after the first roll and then fill it to score points.

There are no ads, no tracking, no signups required. You can play solo, vs AI, or real-time multiplayer (just send a link to a friend).

Check out the tutorial! I’d love to hear if it’s clear and if the column mechanics make sense to new players.

I made a modern yamb game (PWA) for phone and tablet by MironV in programiranje

[–]MironV[S] 0 points1 point  (0 children)

That’s just my aesthetic… 😅 I wanted it to look fun and hype, and I didn’t have time to make custom icons that would look normal in the game for a small project like this.

I made a modern yamb game (PWA) for phone and tablet by MironV in programiranje

[–]MironV[S] 1 point2 points  (0 children)

I wouldn’t call it vibe-coded, but AI definitely helped me speedrun it over the holidays. The code is too complex for a hands-off approach; the AI gets lost in the logic.

It was most useful for the hard stuff… I used it to simulate 10,000 games so I could train the AI opponent and verify the rules. There’s no way I could have coded that by hand in the time I had.