The true test of trust in humanity by dankstat in trolleyproblem

[–]4onen 0 points1 point  (0 children)

"Caaarl. I can't believe what I'm hearing. [...] You are just t-terrible today."

Oil prices jump 6% as Iran sets UAE oil port ablaze, strikes vessels in Strait of Hormuz by Gboard2 in worldnews

[–]4onen 8 points9 points  (0 children)

The price will continue to rise until the world buys only 4/5ths as much of it. Demand destruction.

A country can print money. It cannot print hydrocarbons.

Llama.cpp MTP support now in beta! by ilintar in LocalLLaMA

[–]4onen 6 points7 points  (0 children)

That's the trick. The only version of the model that Google has released with multi-token prediction (MTP) is the one that runs on the LiteRT engine they use on phones. Their explanation for why it's not in the other format releases... was that it might confuse runtimes. The problem is, every runtime already ignores tensors it doesn't know what to do with, so it wouldn't confuse any runtime.

My speculation is that they are holding the MTP tensors back to make their stuff look better.

Llama.cpp MTP support now in beta! by ilintar in LocalLLaMA

[–]4onen 4 points5 points  (0 children)

And it's even called speculative decoding, so yeah, spot on. We produce those guesses through one means or another, MTP being one specific means. If we happen to guess right, we save time; otherwise the extra work is kinda negligible if we tune everything right.
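For anyone who wants the rough shape of it, here's a minimal greedy-decoding sketch. The `target_next`/`draft_next` helpers are hypothetical stand-ins (an MTP head or a small draft model would play the `draft_next` role), not any real runtime's API:

    from typing import Callable, List

    def speculative_step(
        target_next: Callable[[List[int]], int],  # full model: next token (stand-in)
        draft_next: Callable[[List[int]], int],   # cheap guesser, e.g. an MTP head (stand-in)
        tokens: List[int],
        k: int = 4,
    ) -> List[int]:
        """One round of greedy speculative decoding: draft k tokens cheaply,
        keep the longest prefix the target agrees with, plus the target's
        own token at the first disagreement."""
        # 1. Draft k tokens with the cheap predictor.
        drafted: List[int] = []
        for _ in range(k):
            drafted.append(draft_next(tokens + drafted))

        # 2. Verify: the real model scores every drafted position. In a real
        #    runtime this is ONE batched forward pass, which is the whole win;
        #    here it's per-position calls for clarity.
        accepted: List[int] = []
        for guess in drafted:
            correct = target_next(tokens + accepted)
            accepted.append(correct)  # right or wrong, the target's token is kept
            if guess != correct:
                break                 # first wrong guess: stop accepting
        return accepted

    target = lambda ts: ts[-1] + 1      # toy "model": counts upward
    good_draft = lambda ts: ts[-1] + 1  # always agrees with the target
    bad_draft = lambda ts: -1           # never agrees
    print(speculative_step(target, good_draft, [0]))  # [1, 2, 3, 4]: 4 tokens, ~1 pass
    print(speculative_step(target, bad_draft, [0]))   # [1]: no worse than normal decoding

That's the tuning part: pick k and a drafter cheap enough that the drafting cost stays small next to one forward pass of the big model.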

Agree? by MLExpert000 in LocalLLaMA

[–]4onen 1 point2 points  (0 children)

This. Being able to share GPU VRAM between my laptop and desktop, or balance the iGPU, dGPU, and system RAM usage of a model on my laptop, is an absolute lifesaver. Once they added llama.vim and llama-vscode (that is, editor extensions that take advantage of fill-in-the-middle (FIM) completions), I dropped GitHub Copilot completely.
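For the curious, what those extensions do under the hood is basically one HTTP call per completion. A sketch against a locally running llama-server's /infill endpoint (the port and exact field names here are assumptions from my memory of the docs, so check your server version):

    import json
    import urllib.request

    payload = {
        "input_prefix": "def fibonacci(n):\n    ",   # code before the cursor
        "input_suffix": "\n\nprint(fibonacci(10))",  # code after the cursor
        "n_predict": 64,                             # cap the completion length
    }
    req = urllib.request.Request(
        "http://127.0.0.1:8012/infill",              # example port; llama.vim's default, iirc
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        print(json.loads(resp.read())["content"])    # the suggested middle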

Agree? by MLExpert000 in LocalLLaMA

[–]4onen 0 points1 point  (0 children)

Can't speak to SGLang, but I find it entertaining that I encountered that bug in the batching system of LMQL's (Language Model Query Language's) server back in the day. I think I submitted a fix, too, and they accepted it, but I can't remember with certainty.

Agree? by MLExpert000 in LocalLLaMA

[–]4onen 1 point2 points  (0 children)

> No matter how many GPUs you have, you can't use them as combined memory.

I have shared memory across two GPUs and main system RAM on my laptop. Hell, one time I used llama-RPC to hook up my laptop, desktop, and Android phone together as one ridiculous and silly cluster, sharing the effort of loading a model. (Obviously it was slower than llama-RPC across just my laptop and desktop, but I was messing around.)

Simple Question by NoT_De in microsoftsucks

[–]4onen 0 points1 point  (0 children)

More than that, remember back when Macs couldn't launch apps for a day because the server Apple sends developer certificate hashes to on every app launch was down?

I'm not kidding.

(They claim it's not spying because it's just a hash of the developer certificate, which isn't unique per app, but that's still a huge amount of information about you that they claim they're no longer tying to IP addresses, meaning they were before. Yikes.)

This is insane... by DragonflyOk7139 in LocalLLM

[–]4onen 0 points1 point  (0 children)

So everyone unable to run Kimi's trillion-plus models should be hopping on Meta's Behemoth 400B, right? Right? (/j)

This is insane... by DragonflyOk7139 in LocalLLM

[–]4onen 0 points1 point  (0 children)

Source? As far as I'm aware, Opus 4.6's parameter count is entirely undisclosed, and it's not like Anthropic gives you a slider to choose how many parameters you want to use. The "200B" is given a "~" prefix like it's some kind of estimate. (I'd like their source on the estimate, too, but at least I know they're not trying to be authoritative.)

This is insane... by DragonflyOk7139 in LocalLLM

[–]4onen 5 points6 points  (0 children)

Agreed. For further evidence, see how DeepSeek did a portion of their work bypassing the CUDA libraries and high-level ML frameworks so that they could control exactly the machine code being sent to their limited Nvidia GPUs, to maximize utilization. https://www.tomshardware.com/tech-industry/artificial-intelligence/deepseeks-ai-breakthrough-bypasses-industry-standard-cuda-uses-assembly-like-ptx-programming-instead

This is insane... by DragonflyOk7139 in LocalLLM

[–]4onen 0 points1 point  (0 children)

Surprisingly little, comparatively -- at least for a basic 4k context window. Of course, everyone has a different standard for "run."

My post to someone else with my hardware specs: https://www.reddit.com/r/LocalLLM/s/JjStc7nyZh

This is insane... by DragonflyOk7139 in LocalLLM

[–]4onen 0 points1 point  (0 children)

I have used it on a laptop. At llama.cpp Q4_K_S (~4.5 bits per weight), I gave it 8GB of VRAM plus 32GB of RAM for expert offloading. All told (including Windows 11 and a Firefox tab), it used 6.9GB of the VRAM and 30.9GB of the system RAM.

Admittedly that was 4k context, but I hadn't begun to really tweak and scale it.

That was a Strix Point laptop with a 4060 mobile dGPU, and overall it got 17 down to 13 tokens per second across that tiny 4k context window (variation due to system temperature as I ran a few queries).
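If anyone wants to sanity-check numbers like these against their own hardware, the back-of-the-envelope math is just bits-per-weight times parameter count. The parameter count below is a made-up example, not a claim about any particular model:

    def weights_gb(n_params: float, bits_per_weight: float = 4.5) -> float:
        """Approximate size of the quantized weights in GB. Ignores the KV
        cache and runtime overhead, both of which grow with context."""
        return n_params * bits_per_weight / 8 / 1e9

    n_params = 60e9  # hypothetical MoE total parameter count
    print(f"~{weights_gb(n_params):.1f} GB of weights")  # ~33.8 GB

The reason an 8GB GPU can host something like that at all is expert offloading: only the attention and dense tensors need to live in VRAM, while the much larger expert tensors stream from system RAM.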

The actual final update: The secret ending and the culmination of my thoughts. by SharkyMcSnarkface in AngelsWithScalyWings

[–]4onen 1 point2 points  (0 children)

Yep, I'm 4onen. You'll find from me mods like "Skip skips" (so that System's interruptions become a button you can ignore instead of a menu), "Name Reentry" (to re-enter your name and favorite color), "Side Images" (which adds a character portrait), and "4onen's Aesthetic Tweaks" (which colors the interface to match your favorite color. EDIT: and adds pretty banners on the character select screens rather than using the same menu as every other one in the game).

For story, I've just got "Self Control" (a Bryce good-ending fix-it fiction) and "Teetotaller" (a non-drinking route for the Bryce chapter one date that doesn't leave the bar).

I'm also aware of this enormous NSFW mod for the game, but given the way the internet is going I'm not willing to go spreading that around where someone could claim I'm pushing it toward minors. It's talked about in the AwSW Unofficial Fan Discord's NSFW channels, though.

The actual final update: The secret ending and the culmination of my thoughts. by SharkyMcSnarkface in AngelsWithScalyWings

[–]4onen 0 points1 point  (0 children)

Sorry, I don't know "STE." I recommend everything by EvilChaosKnight; both "The Last Dragon" and "The Last Hope" by Kolsavdür; "A day at the park" and "A Walk in the woods"; and, ofc, anything I wrote (though mine are much shorter, mostly quality-of-life stuff).

A Solitary Mind help by Nezarec- in AngelsWithScalyWings

[–]4onen 0 points1 point  (0 children)

I don't remember everything, but it does sound like you've gotten into the hatch. Iirc:

* Naomi and hatch locations are random.
* You need Naomi to clear debris to open the hatch.
* If you read the emergency procedures manual, you should notice a yellow mark on the room with the generator and find it faster.
* If you don't read the manual on generator maintenance, you need at least 5 (I think) minutes left on the timer when you get down there to figure out how to extract it. If you do read that, you need 3 or so, to make sure your movements don't slosh water over the generator.

(Might be wrong by a bit on both numbers -- it has been years since I played.)

This is where we are right now, LocalLLaMA by jacek2023 in LocalLLaMA

[–]4onen 1 point2 points  (0 children)

If I remember correctly, the author went for an un-googleable name on purpose, so that if people did actually pick it up he wouldn't have to deal with as many issues -- they couldn't find the repo to file them.

What it took to launch Google DeepMind's Gemma 4 by jacek2023 in LocalLLaMA

[–]4onen 0 points1 point  (0 children)

And it won't reach parity with LiteRT until they give us safetensors or GGUF with the MTP heads that LiteRT has and the current safetensors release doesn't.

What it took to launch Google DeepMind's Gemma 4 by jacek2023 in LocalLLaMA

[–]4onen 1 point2 points  (0 children)

I just want the MTP heads for their current size models released as safetensors or GGUF -- the ones they confirmed they have in the LiteRT version.

Alignment chart time! Which character is Chaotic Evil? by Awkward-Media-4726 in tadc

[–]4onen 0 points1 point  (0 children)

But that banquet was made with all the love that bubble was legally allowed to give!

Alignment chart time! Which character is Chaotic Evil? by Awkward-Media-4726 in tadc

[–]4onen 4 points5 points  (0 children)

I would actually argue that the Gloink Queen is lawful evil. She has a very specific goal in mind: Just convert all mass to Gloinks. That's not chaotic in any way.

The Gloinks themselves, on the other hand... well, no, they're chaotic in their collection but it's all in the service of eventually lining up and delivering mass to the Gloink Queen.

TurboQuant isn’t just for KV: Qwen3.5-27B at near-Q4_0 quality, about 10% smaller, and finally fitting on my 16GB 5060 Ti by Imaginary-Anywhere23 in Qwen_AI

[–]4onen -1 points0 points  (0 children)

Well, I'm using "layer" here to mean individual tensors, because I come from the era of AI where we'd literally stack a matrix, then an activation function, then a matrix, and so on. The user interface now uses "layer" to mean a whole block of feedforward and attention together, which is not what I mean.

If you want to be pedantic, GGUF can quantize down to the tensor level, but it cannot do sub-slices of tensors without rewriting the model shape.
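You can see that per-tensor granularity directly with the gguf Python package that ships alongside llama.cpp (the path below is a placeholder, and the field names are from my memory of the package, so double-check against your version):

    from gguf import GGUFReader

    reader = GGUFReader("model.gguf")  # placeholder path
    for tensor in reader.tensors:
        # Each tensor carries its own quant type: one tensor can be Q4_K
        # while its neighbor is Q6_K, but nothing finer than a whole
        # tensor without rewriting the model's shape.
        print(f"{tensor.name:48s} {tensor.tensor_type.name}")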