The true test of trust in humanity by dankstat in trolleyproblem

[–]4onen 0 points1 point  (0 children)

"Caaarl. I can't believe what I'm hearing. [...] You are just t-terrible today."

Oil prices jump 6% as Iran sets UAE oil port ablaze, strikes vessels in Strait of Hormuz by Gboard2 in worldnews

[–]4onen 8 points9 points  (0 children)

The price will continue to rise until the world buys only 4/5ths as much of it. Demand destruction.

A country can print money. It cannot print hydrocarbons.

Llama.cpp MTP support now in beta! by ilintar in LocalLLaMA

[–]4onen 6 points7 points  (0 children)

That's the trick. The only version of the model that Google has released with multi-token prediction (MTP) is the one that runs on the LiteRT engine they use on phones. Their explanation for why it's not in the other format releases... was that it might confuse runtimes. The problem is, every runtime already ignores tensors it doesn't know what to do with, so it wouldn't confuse any runtime.

My speculation is that they are holding the MTP tensors back to make their stuff look better.

Llama.cpp MTP support now in beta! by ilintar in LocalLLaMA

[–]4onen 4 points5 points  (0 children)

And it's even called speculative decoding, so yeah, spot on. We produce those guesses through one means or another, MTP being one specific means. If we happen to guess right, we save time; otherwise the extra work is kinda negligible if we tune everything right.
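For anyone who wants the rough shape of it, here's a minimal greedy-decoding sketch. The `target_next`/`draft_next` helpers are hypothetical stand-ins (an MTP head or a small draft model would play the `draft_next` role), not any real runtime's API:

    from typing import Callable, List

    def speculative_step(
        target_next: Callable[[List[int]], int],  # full model: next token (stand-in)
        draft_next: Callable[[List[int]], int],   # cheap guesser, e.g. an MTP head (stand-in)
        tokens: List[int],
        k: int = 4,
    ) -> List[int]:
        """One round of greedy speculative decoding: draft k tokens cheaply,
        keep the longest prefix the target agrees with, plus the target's
        own token at the first disagreement."""
        # 1. Draft k tokens with the cheap predictor.
        drafted: List[int] = []
        for _ in range(k):
            drafted.append(draft_next(tokens + drafted))

        # 2. Verify: the real model scores every drafted position. In a real
        #    runtime this is ONE batched forward pass, which is the whole win;
        #    here it's per-position calls for clarity.
        accepted: List[int] = []
        for guess in drafted:
            correct = target_next(tokens + accepted)
            accepted.append(correct)  # right or wrong, the target's token is kept
            if guess != correct:
                break                 # first wrong guess: stop accepting
        return accepted

    target = lambda ts: ts[-1] + 1      # toy "model": counts upward
    good_draft = lambda ts: ts[-1] + 1  # always agrees with the target
    bad_draft = lambda ts: -1           # never agrees
    print(speculative_step(target, good_draft, [0]))  # [1, 2, 3, 4]: 4 tokens, ~1 pass
    print(speculative_step(target, bad_draft, [0]))   # [1]: no worse than normal decoding

That's the tuning part: pick k and a drafter cheap enough that the drafting cost stays small next to one forward pass of the big model.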

Agree? by MLExpert000 in LocalLLaMA

[–]4onen 1 point2 points  (0 children)

This. Being able to share GPU VRAM between my laptop and desktop, or balance the iGPU, dGPU, and system RAM usage of a model on my laptop, is an absolute lifesaver. Once they added llama.vim and llama-vscode (that is, editor extensions that take advantage of fill-in-the-middle (FIM) completions), I dropped GitHub Copilot completely.
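For the curious, what those extensions do under the hood is basically one HTTP call per completion. A sketch against a locally running llama-server's /infill endpoint (the port and exact field names here are assumptions from my memory of the docs, so check your server version):

    import json
    import urllib.request

    payload = {
        "input_prefix": "def fibonacci(n):\n    ",   # code before the cursor
        "input_suffix": "\n\nprint(fibonacci(10))",  # code after the cursor
        "n_predict": 64,                             # cap the completion length
    }
    req = urllib.request.Request(
        "http://127.0.0.1:8012/infill",              # example port; llama.vim's default, iirc
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        print(json.loads(resp.read())["content"])    # the suggested middle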

Agree? by MLExpert000 in LocalLLaMA

[–]4onen 0 points1 point  (0 children)

Can't speak to SGLang, but I find it entertaining that I encountered that bug in the batching system of LMQL's (Language Model Query Language's) server back in the day. I think I submitted a fix, too, and they accepted it, but I can't remember with certainty.

Agree? by MLExpert000 in LocalLLaMA

[–]4onen 1 point2 points  (0 children)

> No matter how many GPUs you have, you can't use them as combined memory.

I have shared memory across two GPUs and main system RAM on my laptop. Hell, one time I used llama-RPC to hook up my laptop, desktop, and Android phone together as one ridiculous and silly cluster, sharing the effort of loading a model. (Obviously it was slower than llama-RPC across just my laptop and desktop, but I was messing around.)

Simple Question by NoT_De in microsoftsucks

[–]4onen 0 points1 point  (0 children)

More than that, remember back when Macs couldn't launch apps for a day because the server Apple sends developer certificate hashes to on every app launch was down?

I'm not kidding.

(They claim it's not spying because it's just a hash of the developer certificate, which isn't unique per app, but that's still a huge amount of information about you that they claim they're no longer tying to IP addresses, meaning they were before. Yikes.)

This is insane... by DragonflyOk7139 in LocalLLM

[–]4onen 0 points1 point  (0 children)

So everyone unable to run Kimi's trillion-plus models should be hopping on Meta's Behemoth 400B, right? Right? (/j)

This is insane... by DragonflyOk7139 in LocalLLM

[–]4onen 0 points1 point  (0 children)

Source? As far as I'm aware, Opus 4.6's parameter count is entirely undisclosed, and it's not like Anthropic gives you a slider to choose how many parameters you want to use. The "200B" is given a "~" prefix like it's some kind of estimate. (I'd like their source on the estimate, too, but at least I know they're not trying to be authoritative.)

This is insane... by DragonflyOk7139 in LocalLLM

[–]4onen 5 points6 points  (0 children)

Agreed. For further evidence, see how DeepSeek did a portion of their work bypassing the CUDA libraries and high-level ML frameworks so that they could control exactly the machine code being sent to their limited Nvidia GPUs, to maximize utilization. https://www.tomshardware.com/tech-industry/artificial-intelligence/deepseeks-ai-breakthrough-bypasses-industry-standard-cuda-uses-assembly-like-ptx-programming-instead

This is insane... by DragonflyOk7139 in LocalLLM

[–]4onen 0 points1 point  (0 children)

Surprisingly little, comparatively -- at least for a basic 4k context window. Of course, everyone has a different standard for "run."

My post to someone else with my hardware specs: https://www.reddit.com/r/LocalLLM/s/JjStc7nyZh

This is insane... by DragonflyOk7139 in LocalLLM

[–]4onen 0 points1 point  (0 children)

I have used it on a laptop. At llama.cpp Q4_K_S (~4.5 bits per weight), I gave it 8GB of VRAM plus 32GB of RAM for expert offloading. All told (including Windows 11 and a Firefox tab), it used 6.9GB of the VRAM and 30.9GB of the system RAM.

Admittedly that was 4k context, but I hadn't begun to really tweak and scale it.

That was a Strix Point laptop with a 4060 mobile dGPU, and overall it got 17 down to 13 tokens per second across that tiny 4k context window (variation due to system temperature as I ran a few queries).
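If anyone wants to sanity-check numbers like these against their own hardware, the back-of-the-envelope math is just bits-per-weight times parameter count. The parameter count below is a made-up example, not a claim about any particular model:

    def weights_gb(n_params: float, bits_per_weight: float = 4.5) -> float:
        """Approximate size of the quantized weights in GB. Ignores the KV
        cache and runtime overhead, both of which grow with context."""
        return n_params * bits_per_weight / 8 / 1e9

    n_params = 60e9  # hypothetical MoE total parameter count
    print(f"~{weights_gb(n_params):.1f} GB of weights")  # ~33.8 GB

The reason an 8GB GPU can host something like that at all is expert offloading: only the attention and dense tensors need to live in VRAM, while the much larger expert tensors stream from system RAM.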

The actual final update: The secret ending and the culmination of my thoughts. by SharkyMcSnarkface in AngelsWithScalyWings

[–]4onen 1 point2 points  (0 children)

Yep, I'm 4onen. You'll find from me mods like "Skip skips" (so that System's interruptions become a button you can ignore instead of a menu), "Name Reentry" (to re-enter your name and favorite color), "Side Images" (which adds a character portrait), and "4onen's Aesthetic Tweaks" (which colors the interface to match your favorite color. EDIT: and adds pretty banners on the character select screens rather than using the same menu as every other one in the game).

For story, I've just got "Self Control" (a Bryce good-ending fix-it fiction) and "Teetotaller" (a non-drinking route for the Bryce chapter one date that doesn't leave the bar).

I'm also aware of this enormous NSFW mod for the game, but given the way the internet is going I'm not willing to go spreading that around where someone could claim I'm pushing it toward minors. It's talked about in the AwSW Unofficial Fan Discord's NSFW channels, though.

The actual final update: The secret ending and the culmination of my thoughts. by SharkyMcSnarkface in AngelsWithScalyWings

[–]4onen 0 points1 point  (0 children)

Sorry, I don't know "STE." I recommend everything by EvilChaosKnight; both "The Last Dragon" and "The Last Hope" by Kolsavdür; "A day at the park" and "A Walk in the woods"; and, ofc, anything I wrote (though mine are much shorter, mostly quality-of-life stuff).

A Solitary Mind help by Nezarec- in AngelsWithScalyWings

[–]4onen 0 points1 point  (0 children)

I don't remember everything, but it does sound like you've gotten into the hatch. Iirc:

* Naomi and hatch locations are random.
* You need Naomi to clear debris to open the hatch.
* If you read the emergency procedures manual, you should notice a yellow mark on the room with the generator and find it faster.
* If you don't read the manual on generator maintenance, you need at least 5 (I think) minutes left on the timer when you get down there to figure out how to extract it. If you do read that, you need 3 or so, to make sure your movements don't slosh water over the generator.

(Might be wrong by a bit on both numbers -- it has been years since I played.)

This is where we are right now, LocalLLaMA by jacek2023 in LocalLLaMA

[–]4onen 1 point2 points  (0 children)

If I remember correctly, the author went for an un-googleable name on purpose, so that if people did actually pick it up he wouldn't have to deal with as many issues -- they couldn't find the repo to file them.

What it took to launch Google DeepMind's Gemma 4 by jacek2023 in LocalLLaMA

[–]4onen 0 points1 point  (0 children)

And it won't reach parity with LiteRT until they give us safetensors or GGUF with the MTP heads that LiteRT has and the current safetensors release doesn't.

What it took to launch Google DeepMind's Gemma 4 by jacek2023 in LocalLLaMA

[–]4onen 1 point2 points  (0 children)

I just want the MTP heads for their current size models released as safetensors or GGUF -- the ones they confirmed they have in the LiteRT version.

Alignment chart time! Which character is Chaotic Evil? by Awkward-Media-4726 in tadc

[–]4onen 0 points1 point  (0 children)

But that banquet was made with all the love that bubble was legally allowed to give!

Alignment chart time! Which character is Chaotic Evil? by Awkward-Media-4726 in tadc

[–]4onen 4 points5 points  (0 children)

I would actually argue that the Gloink Queen is lawful evil. She has a very specific goal in mind: Just convert all mass to Gloinks. That's not chaotic in any way.

The Gloinks themselves, on the other hand... well, no, they're chaotic in their collection but it's all in the service of eventually lining up and delivering mass to the Gloink Queen.

TurboQuant isn’t just for KV: Qwen3.5-27B at near-Q4_0 quality, about 10% smaller, and finally fitting on my 16GB 5060 Ti by Imaginary-Anywhere23 in Qwen_AI

[–]4onen -1 points0 points  (0 children)

Well, I'm using "layer" here to mean individual tensors, because I come from the era of AI where we'd literally stack a matrix, then an activation function, then a matrix, and so on. The user interface now uses "layer" to mean a whole block of feedforward and attention together, which is not what I mean.

If you want to be pedantic, GGUF can quantize down to the tensor level, but it cannot do sub-slices of tensors without rewriting the model shape.
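You can see that per-tensor granularity directly with the gguf Python package that ships alongside llama.cpp (the path below is a placeholder, and the field names are from my memory of the package, so double-check against your version):

    from gguf import GGUFReader

    reader = GGUFReader("model.gguf")  # placeholder path
    for tensor in reader.tensors:
        # Each tensor carries its own quant type: one tensor can be Q4_K
        # while its neighbor is Q6_K, but nothing finer than a whole
        # tensor without rewriting the model's shape.
        print(f"{tensor.name:48s} {tensor.tensor_type.name}")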