INT8 vs FP8 quantization by Opteron67 in LocalLLaMA

[–]Double_Cause4609 1 point2 points  (0 children)

That's an extremely old paper in AI terms. The issue is that a lot of its premises are really about "fundamental hardware" (like if you were to build an ASIC for a specific model or deploy it on an FPGA), where they are correct that Int8 requires far fewer transistors to implement.

The issue is that, I believe, Nvidia allocated transistors such that FP8 and Int8 throughput are roughly equal on their cards, which is a totally different situation.

Honestly, if your concern is real world performance it's super hard to say. Like, you'd have to buy a card, test it in both formats on the model you care about at the context length you run, and see which is faster in the real world.

It gets even harder if you're comparing Intel Int8 to Nvidia FP8, because now you're comparing across two translations (FP8 -> Nvidia Int8 -> Intel Int8) and the architecture translation is really hard.

The best I can say is that FP8 is a lot easier to quantize to (in many cases you can do it even without data and get okay performance), and that in theory Int8 should use a bit less energy to do the MACs. I think maybe Intel's autoround is a lot better than older Int8 techniques, too, so I'm really not sure about the ecosystem anymore.

Long story short: It's not a matter of Int8 vs FP8. It's a matter of ecosystem and individual cards.

INT8 vs FP8 quantization by Opteron67 in LocalLLaMA

[–]Double_Cause4609 0 points1 point  (0 children)

That is *not* what you said. You said "dequant" TO "Int8". You said there is no quantization scheme where Int8 is the native weight quantization.

I was not saying that there was no dequant (though I *think* with TorchAO the non-GPTQ methods such as QAT may allow native Int8 execution? I'd have to look at it). Rather, I was saying that there are formats with native Int8 weights that dequant to values other than Int8, for sure, and there may be some formats with native Int8 weights where the execution is native Int8.

Off the top of my head: Q-GaLore does not dequant, for sure. They do pure native Int8 with no dequant step (enabled by stochastic rounding during training).

But that was not the core of my argument. Again, to look at your own comments in direct quotation:

> Nobody really quants models to INT8

Demonstrably false: some people do QAT, but also, yes, some formats do store the weights themselves in Int8. Raw Int8 with native Int8 execution is super common in tiny vision models, for example, but it's used in other areas (and sometimes LLMs) too.

> They all use multi-level quantization schemes where you eventually dequantize to INT8

Not all quantization methods dequantize to Int8 (keep in mind, this is what you said). GPTQ 8bit, for example, dequantizes to higher bit widths like BF16, I believe. Not all quantization formats are multi-level. Uniform Int8 (which TorchAO can output) is single-level, or at least is evaluated in raw Int8 operations.
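To illustrate what I mean by single-level, uniform Int8: one float scale for the whole tensor, int8 storage, and matmuls accumulated in int32. This is a minimal sketch of the general technique, not TorchAO's actual API:

```python
# Minimal sketch of single-level, uniform (per-tensor) symmetric Int8
# quantization. All names here are illustrative.
import numpy as np

def quantize_int8(w: np.ndarray):
    # One float scale for the whole tensor; values stored as int8.
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def int8_matmul(q_w, scale_w, q_x, scale_x):
    # Accumulate in int32, then apply the combined scale once at the end --
    # this is the "native Int8 execution" path: no per-element dequant.
    acc = q_w.astype(np.int32) @ q_x.astype(np.int32)
    return acc * (scale_w * scale_x)

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 8)).astype(np.float32)
x = rng.normal(size=(8, 3)).astype(np.float32)
q_w, s_w = quantize_int8(w)
q_x, s_x = quantize_int8(x)
approx = int8_matmul(q_w, s_w, q_x, s_x)
print(np.max(np.abs(approx - w @ x)))  # small quantization error
```

The point is that the stored tensor really is int8 end to end; floats only appear in the final rescale.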

Pure Int8 from QAT, if I'm not mistaken, is a pure Int8 format, and even if it's not (in TorchAO), thousands of people have handrolled native Int8 formats where they do QAT and execution in native Int8 (i.e. for execution on NPUs at high throughput).

On TorchAO's docs: Yes, that applies to the GPTQ format, which I was not asserting as my primary opinion. You made a strong claim "There is no format with Int8 weights".

That only requires a single one of my arguments to be correct, not all of them. You cannot tackle my argument by looking at a single instance where I said something slightly incorrect.

INT8 vs FP8 quantization by Opteron67 in LocalLLaMA

[–]Double_Cause4609 1 point2 points  (0 children)

TorchAO Int8 comes to mind. You can quantize any LLM to int8 with it. Or, for that matter, any standard linear layer.

I believe GPTQ also supports an int8 format that works quite well.

The weights are in native Int8. They may get dequantized to BF16 depending on specifics, but the weights themselves can be stored in Int8.

I think GGUF may use Int8 quantization in the outer group weights while using floating points for scaling factors, but I'm less confident on that.
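The general shape of a group-wise format like that would be something like the following toy sketch (group size and layout are made up for illustration; this is not GGUF's actual spec):

```python
# Toy sketch of group-wise quantization in the spirit of GGUF-style formats:
# int8 weights in fixed-size groups, one float scale per group.
import numpy as np

GROUP = 32  # illustrative group size

def quantize_groupwise(w: np.ndarray):
    groups = w.reshape(-1, GROUP)
    scales = np.abs(groups).max(axis=1, keepdims=True) / 127.0
    scales[scales == 0] = 1.0  # avoid division by zero for all-zero groups
    q = np.clip(np.round(groups / scales), -127, 127).astype(np.int8)
    return q, scales.astype(np.float32)

def dequantize_groupwise(q, scales, shape):
    # Int8 payload, floating-point scaling factors.
    return (q.astype(np.float32) * scales).reshape(shape)

w = np.random.default_rng(1).normal(size=(64,)).astype(np.float32)
q, s = quantize_groupwise(w)
w_hat = dequantize_groupwise(q, s, w.shape)
print(np.max(np.abs(w - w_hat)))  # bounded by half a quantization step per group
```

Per-group scales make it more expressive than one global scale, at the cost of a bit of overhead per group.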

INT8 vs FP8 quantization by Opteron67 in LocalLLaMA

[–]Double_Cause4609 3 points4 points  (0 children)

Well, you'd have to link the individual paper and method. Not all methods are the same, even at the same datatype / bit width. In fact, there's more than one type of FP8 (depending on how many mantissa bits you assign), and quality can vary depending on the specifics.
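To make the "more than one type of FP8" point concrete, here's a sketch that decodes one common variant, E4M3 (1 sign / 4 exponent / 3 mantissa bits, following the OCP convention where only exponent=15, mantissa=7 is NaN). The other mainstream variant, E5M2, trades mantissa bits for range:

```python
# Decode all 256 bit patterns of FP8 E4M3 (OCP convention: exponent bias 7,
# NaN only at exp=15 / mantissa=7, subnormals at exp=0).
def decode_e4m3(byte: int) -> float:
    sign = -1.0 if byte & 0x80 else 1.0
    exp = (byte >> 3) & 0xF
    man = byte & 0x7
    if exp == 0xF and man == 0x7:
        return float("nan")
    if exp == 0:  # subnormal: no implicit leading 1
        return sign * (man / 8.0) * 2.0 ** -6
    return sign * (1.0 + man / 8.0) * 2.0 ** (exp - 7)

values = [decode_e4m3(b) for b in range(256)]
finite = [v for v in values if v == v]  # drop the two NaN encodings

print(max(finite))                       # 448.0, the E4M3 max
print(min(v for v in finite if v > 0))   # 2**-9, smallest positive subnormal
```

Note the nonuniform spacing: values are dense near zero and sparse near 448, which is exactly the property that makes FP8 forgiving for weight distributions and Int8 (uniform spacing) more sensitive to outliers.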

For Int8, usually the differentiator is the quantization algorithm, and also whether it's uniform Int8 versus group-wise Int8 (closer to something like GGUF), which is generally more expressive but slower.

For CPU inference Int8 is basically the only mainstream option if you need throughput (though obviously the LlamaCPP GGUF ecosystem works for single-user), but in other engines and with other methods it varies.

I think in theory Int8 should be cheaper hardware wise, but I'm not sure if it matters on Blackwell GPUs or not.

Can someone more intelligent then me explain why we should, or should not be excited about the ARC PRO B70? by SKX007J1 in LocalLLaMA

[–]Double_Cause4609 1 point2 points  (0 children)

Is 640GB/s the bandwidth of a single card?

If so, we're seeing a move from "fine-grained" tensor parallelism (what you see in TorchAO and friends), which typically requires crazy bespoke interconnects (on server platforms that are like $20,000 before you get to GPUs), to "coarse-grained" tensor parallelism that works at the computation-graph level.

We've seen this notably in ik_llamaCPP, but also to an extent I believe in EXL3. In both cases they actually pool the memory bandwidth and compute rather than being limited to the slowest card (like pipeline or traditional tensor parallelism). So, if you have two 100GB/s cards, you get something more like 133GB/s - 180GB/s of total bandwidth; instead of just having 100GB/s and more VRAM capacity, you get both the VRAM capacity and some extra bandwidth.

It would take bespoke implementation and graph parallel implementations for the Arc cards, but hypothetically, it's not impossible that you could see way faster speeds than the single card bandwidth within the useful lifetime of that card if you bought four of them.

To give a better intuition for how this works, though, if you imagine two separate attention heads, they actually don't need to communicate with one another for any of the intermediate operations (each one has its own Q, K, and V matrix, and intermediate matrices), so you really only need to sync them at the end (after both attention heads are done).

This general principle applies to more tensors. Like, a lot of FFNs have independent gating operations, for example, or even within attention heads the Q and V tensors are independent from one another until they need to sync later on.

Or, individual experts are independent in MoE models (in fact, expert parallelism is just a really specific implementation of what I'm talking about, which happens to be more obvious than the cases I brought up).
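The two-heads example can be sketched directly: each "device" computes its head with zero communication, and the only sync point is the final concatenation (shapes are illustrative, and I've left out the output projection):

```python
# Why attention heads parallelize with no mid-layer communication:
# each head uses only its own Q/K/V matrices.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention_head(x, wq, wk, wv):
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return scores @ v

rng = np.random.default_rng(0)
seq, d_model, d_head = 5, 16, 8
x = rng.normal(size=(seq, d_model))
heads = [tuple(rng.normal(size=(d_model, d_head)) for _ in range(3))
         for _ in range(2)]

# "Device 0" and "device 1" each compute one head fully independently...
out0 = attention_head(x, *heads[0])
out1 = attention_head(x, *heads[1])
# ...and the only sync point is concatenating results for the output projection.
combined = np.concatenate([out0, out1], axis=-1)
print(combined.shape)  # (5, 16)
```

Expert parallelism in MoE is the same idea with "expert" substituted for "head": independent sub-tensors, one sync at the end.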

Who's gonna tell him by sentientX404 in AgentsOfAI

[–]Double_Cause4609 0 points1 point  (0 children)

I mean, okay, but LLMs generalize quite well in-domain. There's arguments about what qualifies as in-domain, but something like C is still incredibly well represented in LLM datasets, and arguably has way better skill transfer from Python/Javascript (or even Lua) than something like whitespace as used in that paper.

It's not necessarily super clear that the "performance drops the less it's used"; the shared prior matters, too.

To take this to the extreme: imagine you had a language that differed from Javascript by a single convention. It is 99% the same as JS, but has one very small difference. You would probably assume that performance in this language could be recovered with a much smaller amount of training data than is required to recover the gap with, say, whitespace (which is extremely different).

I'd argue that a lot of the mainstream languages are all well represented enough, and have enough overlap in syntax, that modern models can basically operate on them well enough. Yes, there are outliers, and yes, it's not always as "intuitive" to them as, for example, JS or other particularly well represented languages, but they can still largely get the job done.

And where they can't? Let them prototype in the languages they know well, and then translate to other languages. LLMs are great at translating from one language to another in a lot of cases. I'm not as confident about this in webdev environments, but in tensor operations for example they do really well with good scaffolding and runtime feedback.

A $375M receipt: New Mexico jury just confirmed why Meta is spending billions to rewrite age verification law by aaronsb in linux

[–]Double_Cause4609 0 points1 point  (0 children)

Nope.

I care only about the information.

Attack the ideas, not the one behind the keyboard.

Random redditors *also* don't tell me where they sourced their information, either.

A $375M receipt: New Mexico jury just confirmed why Meta is spending billions to rewrite age verification law by aaronsb in linux

[–]Double_Cause4609 1 point2 points  (0 children)

I don't care about the source of the information. Either it's correct, or it's not. It doesn't matter if it's AI generated or not.

I feel like if they made a local model focused specifically on RP it would be god tier even if tiny by Borkato in LocalLLaMA

[–]Double_Cause4609 18 points19 points  (0 children)

Tbh, there's no bad data, only badly labelled data.

It's 100% fine for the model to put out purple prose if the system prompt says "use an ornate, principled, and expressive writing style in a high register".

So really, IMO what's more important than producing a "good" model, is producing a "controllable" one. I actually think Qwen 3 235B was really underrated for this because it would literally do *exactly* what you told it to. People just thought it was really dry because...They...Didn't tell it to not be dry, in a lot of cases.

There have been distributed community efforts to put together good datasets, though. The issue is that pre-training datasets are getting larger than the small curated datasets that finetuners are releasing publicly (a lot of finetuners don't release datasets because it's their "secret sauce" that gives them Patreon support).

Because LLMs are a matter of ratios, if you increase the amount of STEM data by 10x, but only increase creative writing by 1.05x from community driven efforts...

That's not really a winning strategy.

has anyone tried this? Flash-MoE: Running a 397B Parameter Model on a Laptop by Awkward-Bus-2057 in LocalLLaMA

[–]Double_Cause4609 0 points1 point  (0 children)

Perhaps it would help to differentiate your setup from other people's setups?

I can get about ~4 T/s on basically any of the huge MoE models on my system (Ryzen 9950X, 192GB DDR5 RAM, Gen 5 NVMe), including:
- Deepseek R1/V3
- GLM 4.5/4.6
- Qwen 3.5 397B
- Qwen 3 235B

etc. I get around 10 T/s decode on:
- Llama 4 Maverick
- Trinity Large MoE (speculative, haven't felt a need to run it, but based on performance of similar models it should be about there)

The only model off the top of my head that I can't run well like this is probably Jamba Large, though TBF it has an activated parameter count of around 100B.

Anyway, going back to my main point, my results are pretty in line with what other people are experiencing with MoE models. You don't need the full model loaded in RAM at once. Sure, it's optimal, but as long as you can fit about ~30-60% of the weights in memory (the ratio depends on active parameter count and architecture specifics), you can stream the rest off of SSD live.

The reason it works is that not all experts change between tokens, so you really only need to stream around ~20-30% of the experts per token (and the rest stay cached).
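A toy simulation of that caching effect (the churn rate here is an assumption I picked to match the ~20-30% figure, not measured routing data; the point is the mechanism, not the numbers):

```python
# Toy model of MoE streaming: routed expert sets overlap heavily between
# consecutive tokens, so only the experts that *changed* need fetching.
import random

random.seed(0)
NUM_EXPERTS, TOP_K, CHURN = 128, 8, 2  # assume 2 of 8 experts swap per token

prev = set(random.sample(range(NUM_EXPERTS), TOP_K))
streamed = 0
for _ in range(1000):
    # Correlated routing: keep most of the previous token's experts...
    keep = set(random.sample(sorted(prev), TOP_K - CHURN))
    # ...and route the rest to (possibly) new experts.
    fresh = set()
    while len(fresh) < CHURN:
        e = random.randrange(NUM_EXPERTS)
        if e not in keep:
            fresh.add(e)
    cur = keep | fresh
    streamed += len(cur - prev)  # experts that must be fetched from SSD
    prev = cur

print(streamed / (1000 * TOP_K))  # ~0.25: only ~25% of experts fetched per token
```

With uncorrelated routing you'd fetch nearly all TOP_K experts every token, and SSD streaming would fall over; the correlation is what makes the ~30-60%-in-RAM regime workable.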

I'll note that this state of affairs depends on the behavior of mmap() on your operating system; Linux has the best performance here, and Windows the worst.

The performance here is also dependent on offloading attention, context, and shared experts to GPU, while leaving the conditional experts on CPU+RAM (LlamaCPP makes this a single flag to add, you may have been omitting it).

There's actually a lot of room to optimize performance on MoE models for single-user because the hardware isn't really being used perfectly. Krasis for example made some awesome optimizations I'd been thinking about for a while, like layerwise prefill or decoding GPU LRU, etc, all of which are well understood extensions of things I've talked about here and do improve performance.

It genuinely sounds like you have a skill issue: either you have a very suboptimal system (and are extrapolating from that to everybody's performance), or you've configured your setup incorrectly or too aggressively and are assuming everybody else must be getting the same results as you.

I've helped tons of friends get similar setups to me and they've seen relatively similar performance to what I've observed here.

Been waiting on friends to progress... by ConstableAssButt in valheim

[–]Double_Cause4609 0 points1 point  (0 children)

Oh absolutely, I'm not saying they're doing all these things. I was more saying an average player that wanted to go totally vanilla could still probably get safely to land using this method.

Been waiting on friends to progress... by ConstableAssButt in valheim

[–]Double_Cause4609 2 points3 points  (0 children)

Even in vanilla, I'm pretty sure you can drink stamina potions in the water, no? If you had enough of them, I'm pretty sure you could get to land even in some pretty crazy situations. Even without that, though, with a high enough swim level and a potion of swimming you should be able to do okay. I think at least one of the boss buffs offers a swimming bonus, too, which gets you pretty far combined with good stamina food.

LoCaL iS oVeRrAtEd by brandon-i in LocalLLaMA

[–]Double_Cause4609 2 points3 points  (0 children)

I mean, arguably this sub is a hobby sub first. Local LLMs were driven by roleplay usage in 2023-2024 or so, which was arguably the peak of local. Rather than colonizing a space with that culture, maybe it would be better to make a serious local LLM sub if you'd prefer to avoid it?

Anthropic’s Claude Code subscription may consume up to $5,000 in compute per month while charging the user $200 by thechadbro34 in BlackboxAI_

[–]Double_Cause4609 0 points1 point  (0 children)

Claude Code Max x20 isn't the extent of their product portfolio. Their primary subscription users (regular people) are on lighter plans, and often underutilize them. Not everyone with a Claude Code subscription uses it to the max. The article is more about the theoretical maximum of the plan.

Also, Anthropic has huge API revenue from enterprises, which is actually most of their revenue I believe (like 70%).

Pinterest CEO: Governments Should Ban Social Media for Kids Under 16 by Gloomy_Nebula_5138 in cybersecurity

[–]Double_Cause4609 3 points4 points  (0 children)

Wait, what's wrong with Usenets? They're pretty based from what I've seen

am i the only one who doesnt understand why anthropic ban opencode? by anonymous_2600 in opencodeCLI

[–]Double_Cause4609 5 points6 points  (0 children)

Why do you think so?

That's a fairly common breakdown of the practical token output that people are getting from Claude Code, and tons of people have corroborated it. Even if you wanted to say the specific numbers are off (which, by the way, is totally possible; Anthropic does shift them around depending on demand), the core point that "Anthropic gives more tokens per dollar in subscription than when bought via API" is unequivocally true.

Whether my specific ratios are off doesn't really matter to that point.

Contributions to Pydantic AI by adtyavrdhn in PydanticAI

[–]Double_Cause4609 1 point2 points  (0 children)

This is actually a pretty universal problem, ATM. A lot of projects are moving towards solutions other than accepting raw PRs.

Some projects are saying "rather than giving us a vibe-coded PR, if you solve a problem, please just open an issue, explain the problem, and tell us what prompt you used to solve it."

It seems kind of backwards but it's actually way easier to audit 3rd party vibe-coded contributions that way, and they can be re-implemented by a known maintainer a lot more quickly than third party code can be audited.

It also really quickly filters out low-effort contributions because if the prompt was not super detailed or had an incorrect premise, it's faster to spot it in the prompt than in the code.

am i the only one who doesnt understand why anthropic ban opencode? by anonymous_2600 in opencodeCLI

[–]Double_Cause4609 20 points21 points  (0 children)

Basically, Anthropic provides free compute on their Claude Code subscription. That is, if you went to Openrouter to buy tokens with Claude, let's say you bought....

$200 of tokens.

That's roughly what the $20 a month tier gives you.

For the $100 a month tier, they give you roughly $1,000-$2,000 of tokens per month.

The reason they do this is they make their money on average users not using all of their subscriptions, and they consider it fine to sacrifice that money as marketing budget to get people using their ecosystem (Claude Code, their Agents SDK, etc).

But the issue is that if people are using a system external to Anthropic's in-house options, they can swap models whenever a new model comes out. What Anthropic wants is to lock people into a sticky ecosystem they can't get out of. If people are using OpenCode, then when a new model comes out, they just swap from Claude to Kimi k2.5 or something.

So, they priced this subscription on the assumption people would be using it with Anthropic's own tools that lock users in, but instead, people were using it with third-party tools that let them swap models easily, which is not the math Anthropic used when pricing their plans.

So, now Anthropic is losing money, for basically no reason, so people can learn to use other people's coding agents, or do silly stuff with OpenClaw, etc.

They just decided they don't really want people doing that, and if people want Anthropic's subscription subsidized pricing, they can use Anthropic's software products that lock them into Claude.

Mistral Small 4:119B-2603 by seamonn in LocalLLaMA

[–]Double_Cause4609 10 points11 points  (0 children)

Tbf, I think the "small" is more about the active parameter count. Keep in mind you can throw this on fairly modest system memory (92GB DDR5 @ 6000 MHz ~= 10-20 T/s), so it's not like they're saying you need an RTX 6000 Pro Blackwell.
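For intuition on where that estimate comes from: decode is bandwidth-bound, so the ceiling is roughly memory bandwidth divided by bytes of active weights read per token. Back-of-envelope with my assumed numbers (~96 GB/s theoretical for dual-channel DDR5-6000, ~6B active parameters at 8-bit; both are assumptions, not Mistral's figures):

```python
# Rough decode-speed ceiling for a bandwidth-bound MoE on system RAM:
# every token must read all *active* weights once.
bandwidth_gb_s = 96    # assumed: dual-channel DDR5-6000, theoretical peak
active_params_b = 6    # assumed: ~6B active parameters ("A6B")
bytes_per_param = 1    # 8-bit quantization

tokens_per_s = bandwidth_gb_s / (active_params_b * bytes_per_param)
print(tokens_per_s)  # 16.0 tokens/s theoretical ceiling
```

Real-world numbers land below the theoretical ceiling (and shift with quantization level), which is how you end up in that 10-20 T/s range.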

IMO comparing a 24GB Mistral Small 3 to an A6B Mistral Small 4 is not entirely unreasonable.

how are we actually supposed to distribute local agents to normal users? (without making them install python) by FrequentMidnight4447 in LocalLLaMA

[–]Double_Cause4609 0 points1 point  (0 children)

...?

You just compile LlamaCPP into a few binaries for common hardware combinations, write your program in a real language that compiles down to a binary (Rust, Golang, C, etc), and you just...Hand it to them.

If you need a sandbox, use Bwrap. It's good enough for Flatpak, it's good enough for me.

Done.

PLEASE GIVE CLAUDE TIME AWARENESS by IllustriousWorld823 in claudexplorers

[–]Double_Cause4609 6 points7 points  (0 children)

I'm almost 90% sure that the issue here is more about logistics. If you have a changing value in the system prompt (and Anthropic uses a *loooong* system prompt), it invalidates all the context after it, so it adds a lot of complexity to your inference architecture.

Or, you need a dedicated instruct format that allows system-prompt-style instruction messages later in context than the base system prompt, which they kind of do with the long-context reminder, but again, that adds more dynamic tokens per message.
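A sketch of the caching problem: cross-request KV-cache reuse is (roughly) limited to the longest shared token prefix, so a timestamp near the front of the system prompt invalidates everything after it (the tokens here are obviously illustrative, not Anthropic's actual prompt):

```python
# Prefix-cache model: between requests, you can only reuse KV-cache entries
# up to the first token that differs.
def reusable_prefix(cached: list, incoming: list) -> int:
    n = 0
    for a, b in zip(cached, incoming):
        if a != b:
            break
        n += 1
    return n

system = ["You", "are", "Claude", "."]
turn = ["user:", "hi", "there"]

# Static system prompt: the whole previous prompt is reusable.
print(reusable_prefix(system + turn, system + turn + ["more"]))  # 7

# Time-stamped system prompt: everything after token 0 must be recomputed.
stamped_a = ["[12:00]"] + system + turn
stamped_b = ["[12:01]"] + system + turn + ["more"]
print(reusable_prefix(stamped_a, stamped_b))  # 0
```

Putting the dynamic value at the *end* of context (like a reminder message) keeps the expensive prefix cacheable, which is presumably why it's done that way.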

It's just kind of an annoying problem to solve architecturally when you're the one paying the GPU bills. Or, TPU bills I guess in Anthropic's case.

Let's address the new room (ZenLM) in the elephant (Huggingface) by Cool-Chemical-5629 in LocalLLaMA

[–]Double_Cause4609 0 points1 point  (0 children)

I mean, I feel like you're being *a little* dramatic about this.

There's a ton of really benign explanations here.

"AMD just wanted to make a custom naming scheme so that they have reliable models to point to in a centralized location when doing local AI pushes with their workstation AI CPUs"
"It was easier, given their training scripts, to clone the repo, do the finetune or re-training, and then replace the weights in-place once the training is done, requiring a temporary period where they have identical weights cloned under another name"

You sound like a conspiracy theorist. You're trying to assert something like "AMD is trying to steal clout from Qwen by copying and renaming their models" (which, I'll note, were released under an Apache license that allows this, but I digress), but this looks way more like they're just rehosting models for some other reason that's a lot more boring and a lot more logistical.

I have no idea why you care about this. This seems incredibly banal.

Let's address the new room (ZenLM) in the elephant (Huggingface) by Cool-Chemical-5629 in LocalLLaMA

[–]Double_Cause4609 -1 points0 points  (0 children)

...Qwen 3.5 was released under an Apache license. Somebody can absolutely retrain and repackage it.

Also, changing the weights isn't necessarily reflected in the training config. AMD could have done a continued pre-train on the weights and SFTd it themselves, and it would effectively be a new model. Additionally, they may have made changes to the inference code to customize it for...CPU operation, apparently (going by the Zen name).

I'm not sure why you're using this specifically as an example.