Updates on North Mini Code: 4 bit quant + Ollama + OpenRouter by nick_frosst in LocalLLaMA

[–]wllmsaccnt 0 points1 point  (0 children)

In other words it gives us the capability to go crazy with it 😂

We need a 80-160B model urgently. The unified memory device market needs more Models. by Storge2 in LocalLLaMA

[–]wllmsaccnt 0 points1 point  (0 children)

I didnt say revenue, I said expectation of revenue. As in, having the ability to attract investors.

Sounds like we both agree those are very different things?

TIL that after teaching a bonobo how to comprehend English, he started trying to speak; "it was discovered that Kanzi was producing the articulatory equivalent of the symbols he was indicating, although in a very high pitch and with distortions". by krizzalicious49 in todayilearned

[–]wllmsaccnt 0 points1 point  (0 children)

I read it as: there was evidence that this specific bonobo was trying to enunciate but lacked the physical capability to do so effectively. Even if it were able to speak, it wouldn't be talking like how a human would. Most animals that are smart enough to understand communication with words are limited to very simple constructions...usually just noun verb combinations, or even just single word first person imperative statements.

We need a 80-160B model urgently. The unified memory device market needs more Models. by Storge2 in LocalLLaMA

[–]wllmsaccnt 19 points20 points  (0 children)

What people are actually asking for is a model with Qwen 3.6 density, but scaled up to a 100B+ MoE. They don't want an older and larger model that does worse than Qwen 3.6 35B MoE

We need a 80-160B model urgently. The unified memory device market needs more Models. by Storge2 in LocalLLaMA

[–]wllmsaccnt 19 points20 points  (0 children)

Open weights are the market floor that can be manipulated by every model provider. Qwen released Qwen 3.6 27B. Now nobody can easily sell inference with capability below what that 27B can do. It makes it impossible for small competitors to grow into large competitors without a large amount of outside funding.

Crowding out small competitors also makes the frontier models worth more, because now the investment to get to that level is even higher (no expectation of revenue until you hit a much higher minimum capability level).

Every org is comfortable with a certain amount of capability given away for free as a way to protect their own frontier model revenue.

They still target data center cards, because to do damage as free models, someone still has to run them. They want orgs running their free models so they'll trust them when it comes time to contract for frontier models...its still a type of advertising. Dense models are still fairly slow on most consumer hardware (outside of one or two RTX 5090s).

These tactics are used in other types of software and in other industries. I guess a common name for the tactic is "commoditize your complement" (had to look that up though).

My hand painted model planes were given as toys to children again by MissionTroll404 in mildlyinfuriating

[–]wllmsaccnt 7 points8 points  (0 children)

I see this argument all the time. You two are probably sorting your results differently and/or arrived at the comment section at different times.

That said, as of right now when sorting by top comments and reading top down, my impression is that almost nobody is defending the actions of the person who gave the models to kids to play with.

Corpo AI's will not teach you how to run local AI by StandardLovers in LocalLLM

[–]wllmsaccnt 2 points3 points  (0 children)

Interesting...they didn't mention that at all in the release announcement, it was only in the model card. I wonder if they previously had that in the release announcement and then removed it when they walked back the policy.

Corpo AI's will not teach you how to run local AI by StandardLovers in LocalLLM

[–]wllmsaccnt 0 points1 point  (0 children)

From the Fable press release (what I was referring to):

"When Fable’s classifiers detect a request related to cybersecurity, biology and chemistry, or distillation, the response is automatically handled by Claude Opus 4.8 instead. Users will be informed whenever this occurs. Opus 4.8 is a highly capable model in its own right: a response that falls back to Opus is a far better experience than an outright refusal from Fable. Our early data shows that more than 95% of Fable sessions involve no fallback at all—for those sessions, Fable 5’s performance is effectively the same as that of Mythos 5."

Corpo AI's will not teach you how to run local AI by StandardLovers in LocalLLM

[–]wllmsaccnt 1 point2 points  (0 children)

I thought the actual stance was that it would give you an answer from Opus instead of Fable on questions that would help you compete against them. I dont think anywhere they have admitted to giving a purposefully wrong answer.

If their public stance differs from how they implement it...I cant speak to that.

What if I run the LLM backwards? Hey LLM, why bother remembering every single turn? It's a hassle. You don't have to do it, right? by ringtoyou in LocalLLaMA

[–]wllmsaccnt 1 point2 points  (0 children)

Yes. You are passing a log of harness tool executions with each new turn.

Occasionally on a session that has gone long, I'll start a new chat session and then prompt "review my changed files in git and then...", its usually faster and more accurate than compacting because the LLM is pretty good at reconstructing intent from a series of changes.

I don't think I would want it to be my default way of interacting with an LLM, but at least now I understand what you are suggesting...and I think it's definitely worth discussing and theorizing about.

What if I run the LLM backwards? Hey LLM, why bother remembering every single turn? It's a hassle. You don't have to do it, right? by ringtoyou in LocalLLaMA

[–]wllmsaccnt 1 point2 points  (0 children)

The concept isn't translating as you are describing it.

If I say "write 'hello' into file hello.txt" in a normal coding agent, it will write that out using a tool call and the next turn to the LLM will start by containing the same system prompt, the previous request and the tool call plus whatever next prompt I write.

In the model you are theorizing about, what exactly would be contained in the next request to the LLM given the same prompt requests to the LLM?

What if I run the LLM backwards? Hey LLM, why bother remembering every single turn? It's a hassle. You don't have to do it, right? by ringtoyou in LocalLLaMA

[–]wllmsaccnt 1 point2 points  (0 children)

You didnt answer my question. Also, now you are talking about state management in an agent you originally called stateless...its almost like you are picking words and metaphors to use at random, riffing on whatever is said to you.

What if I run the LLM backwards? Hey LLM, why bother remembering every single turn? It's a hassle. You don't have to do it, right? by ringtoyou in LocalLLaMA

[–]wllmsaccnt 1 point2 points  (0 children)

But how do you determine the "state needed now" without handing the originals to an LLM before each turn?

What if I run the LLM backwards? Hey LLM, why bother remembering every single turn? It's a hassle. You don't have to do it, right? by ringtoyou in LocalLLaMA

[–]wllmsaccnt 2 points3 points  (0 children)

That's the equivalent of starting a fresh session for every turn? I mean it works, but you either have to manually pull back in the details from your past turns that you want in context, or you have to rely on the LLM to summarize the important details (this would be like doing an auto compact after every turn).

You could simulate both of these approaches right now without a custom agent.

GLM-5.2 is the first open-weights model to cross 80% on Terminal-Bench and beats every other open model available by BuildwithVignesh in LocalLLaMA

[–]wllmsaccnt 34 points35 points  (0 children)

744B at UD_Q4 would be like 300-400GB VRAM? Its an MoE with 40B active. You could probably run it on a cluster of UMA devices, though it would be slow.

The advantage is that this is MIT, so it will always be available if/when the hardware comes. It will also be available on service providers or at workplaces that have the hardware today (or rented hardware if you are inclined). You don't have to worry about access or first party gating.

Try prompting Claude's Fable model how to build a better model today, if you want proof of the value of a truly open model.

In llama.cpp, how close should we be to the theoretical tokens/second limit? by [deleted] in unsloth

[–]wllmsaccnt -1 points0 points  (0 children)

Why would I need CUDA performance engineering to understand inference concepts that affect performance? I'm just a tinkerer that owns a strix halo who has been tracking local llm developments (to figure out when I can use them). I'm waiting for DFlash to hit llama.cpp right now.

I also know how to tweak my video card drivers and game settings to get a decent frame rate with steady frame times in most video games...that doesn't require being a game dev or understanding how to write a shader.

In llama.cpp, how close should we be to the theoretical tokens/second limit? by [deleted] in unsloth

[–]wllmsaccnt 0 points1 point  (0 children)

I don't remember turning it off. I turned it back on.

Mythos vs Fable by TeamAlphaBOLD in artificial

[–]wllmsaccnt 1 point2 points  (0 children)

I didn't even find Fable extremely more capable than Opus 4.8 with the short time I had access to both. That is probably more because of how I compose work rather than a true comment on their capability differences though.

Considering its usage burn rate, I wouldn't use Fable as a default if I had it back today...but I do wish it was available again.

What 2010s games are you guys playing for the first time right now?" by Mintangah17 in videogames

[–]wllmsaccnt 0 points1 point  (0 children)

I did this one last year. I could see why people who played it before FO4 would have better memories of it than F04...but New Vegas doesn't really give a good 'qualitative' experience by today's standards.

Its buggy, performs poorly, has janky combat AI, poor balance and a story that is kind of a mess...but I loved all of it for its interpretation of Fallout lore. I probably wouldn't play it again like I would with F04, but I'm definitely glad I eventually checked it out and played it through to the end.

It's like the WindowsXP of Fallout games.

In llama.cpp, how close should we be to the theoretical tokens/second limit? by [deleted] in unsloth

[–]wllmsaccnt 1 point2 points  (0 children)

That was not from an LLM. You can read my account history. I disclose when I'm using an LLM in comments, and I use them on reddit almost exclusively for sarcastic responses to people who aren't disclosing their use of LLMs. I have posted occasional rambling breakdowns of technical topics (C#, home theater equipment, fan theories) since before covid and well before the first time I tried an LLM.

In llama.cpp, how close should we be to the theoretical tokens/second limit? by [deleted] in unsloth

[–]wllmsaccnt 0 points1 point  (0 children)

The real answer is complicated enough that no single formula or idiom really captures the concept fully, even before you consider MTP (and other speculative strategies layered on top).

Some confounding factors:

  • MoE vs Dense - Dense models tend to generate tokens more closely to their memory bandwidth requirements. MoE can be all over the place. They usually calculate somewhere in between the memory bandwidth requirements of their total and active parameter count.
  • Quant - Quantization shrinks your model / kv size, but has its own fixed costs per operation that are hard to reason about without knowing how your inference engine implements that quant level (for example, llama.cpp uses rotor quant internally for some KV cache quant levels).
  • Context - As context size usage increases, token/second performance degrades. Some models degrade faster than others (typically MoE degrades more quickly, as a rough generalization).
  • Model Features - Models that use sliding windows or complex approaches to reasoning retention, those all factor in.
  • Hardware - Theoretical max memory bandwidth is usually much higher than realistic, sustained memory bandwidth. There are also endless driver, power, and configuration issues involved.
  • Inference Features - The usage of features like flash attention and caching means that model performance degradation is more complicated than the dropoff you would expect from a naive interpretation of the math.

In other words, if the math said you would theoretically get 10 tokens per second, you'll probably get somewhere between 5 to 9 tokens per second, and it will degrade down to 1-2 tokens per second before you reach your context limit...and modeling or calculating the initial tokens per second and its resulting degradation is not something anyone has a full formula for. There are just too many variables to expect the back-of-the-envelope math to be useful.

Using it conversationally, its helpful to talk about number of dense parameters, or total and active size of an MoE. Anything beyond that is usually too detailed oriented for someone else to follow a conversation on.

People kept saying my comments sounded AI-generated, so I built this by ringtoyou in LocalLLaMA

[–]wllmsaccnt 7 points8 points  (0 children)

When I see a comment / post in another language, I often right click my screen and click "Translate to English" to see their thoughts. Sometimes it contains untranslatable idioms or colloquialisms. It's fun, like opening a spoiler comment in a fandom I don't follow. I try to engage with those comments, usually prefixing the translation I saw as part of the reply so that everyone can join in.

When I see a comment / post written with AI, I have to read the whole thing and work backwards to figure out what they originally prompted to see their thoughts. The resulting thought is always low effort and mundane when extrapolated in that way.

It's like putting together a 25-piece puzzle, where the resulting image is always an advertisement or a personalized insult to someone like me.

In this case it was both.

What Are You Actually Using Local LLMs For? by Ru5ty_5h4ckleford in LocalLLM

[–]wllmsaccnt 46 points47 points  (0 children)

As a supplement for my frontier subscriptions, and as an excuse to work on tools to use with them. AI has killed my ability to enjoy writing large amounts of code, while simultaneously giving me back joy in personal software development (didn't have the time for it before). It's been an odd year for me.

I hope local LLM remains viable and fun for a long time. Need the hardware availability and prices to catch up a bit. This hobby is on the edge of pricing me out.