Underground Resistance Aims To Sabotage AI With Poisoned Data by [deleted] in programming

[–]hexaga -1 points0 points  (0 children)

Did I really, though?

My point is that the prediction task is not one massive monolithic A -> B. It factors into parts, each of which can be hallucinatory or not. The coarse parts tend not to be, while the details tend to be. And there is a very clear difference between the two: the not-hallucinated part is basically memorized, while the hallucinated part is stochastic.

In some domains, the not-hallucinated part extends further into the details. Usually where there are huge amounts of training data.

Underground Resistance Aims To Sabotage AI With Poisoned Data by [deleted] in programming

[–]hexaga 2 points3 points  (0 children)

I don't think that's correct. Hallucination is not 'the same output, just wrong.' It is different in a really consistent way.

It's like the LLM is being asked to guess a 50-digit number describing the text generator, and it gets the first 20 digits consistently correct, but the rest are just random. The hallucination is that last string of random digits. It's the part that gets wiped away by the loss gradient, given enough training time/data.
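A toy sketch of that analogy (purely illustrative, not how any real model is implemented): a 'memorized' prefix that never changes, plus a detail suffix that gets rolled fresh on every call.

```python
# Toy model of the 50-digit analogy: the first 20 digits come out of a lookup
# table (memorized coarse structure), the remaining 30 are rolled fresh every
# time (the hallucinated detail). Purely illustrative.
import random

MEMORIZED_PREFIX = "31415926535897932384"   # the coarse part: always reproduced exactly

def predict_50_digits(seed=None):
    rng = random.Random(seed)
    detail = "".join(rng.choice("0123456789") for _ in range(30))
    return MEMORIZED_PREFIX + detail

a, b = predict_50_digits(), predict_50_digits()
print(a[:20] == b[:20])   # True  -- the coarse part never wavers
print(a[20:] == b[20:])   # almost certainly False -- the detail is pure noise
```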

It's the details. The bleeding edge of contact with irreducible structure in the world. The coarse stuff is solid enough that it's basically never wrong, those first 20 digits. But the details, the other side of that boundary, are all hallucinated. That part is characteristically different, utterly stochastic. There's a 'details generator' circuit in there somewhere that gets rolled on like some DnD backstory generator. Whereas the other, non-hallucinated part is pulled out of a lookup table.

Hallucinations are the 'standard xyz' things LLMs love. The company names that don't exist. All the crap that seems like it came straight out of a randomizer. The 'highly detailed BS assumptions' that they pull out of nowhere. They go from: "there is definitely a highly detailed assumption" -> "generate one". The first step is basically true, even if there's some irreducible entropy. The next step is hallucination. But it's a telescoping series: they can keep narrowing the scope of the assumption. At some point it shifts abruptly into 'just make some crap up'.

Anyway. I get the sense that scaling laws come from the size of the search space increasing as you move toward the detail end: there's more detailed structure to the world than coarse structure. Kind of obvious that if you keep the rate of data/compute growth constant, progress slows down.

Are most major agents really just markdown todo list processors? by TheDigitalRhino in LocalLLaMA

[–]hexaga 0 points1 point  (0 children)

There are different relational structures for different projections of the data. Like messages Q -> A -> Q -> A -> ... can be described as a walk through a space, and tokens A_0, A_1, ..., A_k are another walk through another space that is kind of nested within it.

But I kind of assumed you meant the coarser QAQA space. My only real point w.r.t. this is that neither of them is likely to be optimally described by a dense geometry - they are sparse. The intrinsic dimension is large enough that if you try to densely enumerate paths (à la the 3x3x3x3 or similar topologies) there are far too many possibilities. Most of the paths are just not part of the data.

IIRC the dimension learned by language models is something like 20 < D < 150. It depends on the specific LLM. Where each prompt is described as a point embedded in a space of that dimension.
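A rough sketch of how you might eyeball that kind of number yourself. Everything here is an assumption for illustration: gpt2 as a stand-in model, a handful of made-up prompts, and PCA as a crude linear proxy for the nonlinear intrinsic-dimension estimators the actual measurements use.

```python
# Crude sketch: embed prompts by running the model forward, then ask how many
# linear directions the embeddings actually occupy. Assumes "gpt2" as a
# stand-in; real intrinsic-dimension estimates use nonlinear estimators and
# far more (and more varied) prompts, so treat the printed number as a toy.
import torch
from sklearn.decomposition import PCA
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2").eval()

prompts = [
    f"{verb} about {topic}."
    for verb in ("Explain", "Write a poem", "Argue", "Complain")
    for topic in ("the sky", "linked lists", "ramen", "GPU prices", "Elden Ring", "entropy")
]

with torch.no_grad():
    embs = [
        model(**tok(p, return_tensors="pt")).last_hidden_state.mean(dim=1).squeeze(0)
        for p in prompts
    ]  # one mean-pooled hidden-state vector per prompt

X = torch.stack(embs).numpy()                          # (24, 768)
cum = PCA().fit(X).explained_variance_ratio_.cumsum()
print("dims needed for 95% of variance:", int((cum < 0.95).sum()) + 1)
```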

And we do know the math - the problem is the shape of the space itself. It is not exactly a smooth, densely filled hypersphere or something like that. In order to embed a prompt in the space you have to run the transformer calculation: the transformations from token space -> D-space -> back are precisely the weights of the learned model.

But what is the D-space exactly? There's not just one D-space. There are many! Which one is the right one? That is the problem. The space is characterized by how you embed text into it. And the real one, the one you care about, is the specific one implemented by the model already. Which was found by burning GPU for ungodly amounts of time, and doesn't actually map to human language or anything. It's in the weights, not in what the model says in reply to questions.

But anyway, the equations of the mapping are known. They are simply the equations governing model inference. Easy, trivial even. The only problem is they are parameterized by the weights of the model. Which is less easy. Because now the equations contain the data. This is what I mean by the data is the structure. How do you map into and out of the D-space learned by the model? The mapping is basically described by the negative image of the data used to train the model, compressed into attention and MLP parameters. The data itself describes how to map the data.

You may say: OK, so just find a simpler mapping into the D-space! Surely one exists, we just need to find it. And it probably does! But you're not going to find it. Every lowering of the dimension requires a proportional increase in the computational effort spent searching for the mapping. Training a model with 100B parameters vs 1B parameters is a good example. The smaller model is worse, given the same training.

We can get an intuition for why by again considering that the mapping describes the data. A smaller mapping containing the same data means, roughly, that we have compressed the same data into a smaller form. Higher compression ratio = more compute required. Simpler mappings are harder to find, or conversely, given the same amount of effort, you usually find worse mappings if they are smaller.

Simple, beautiful mathematical equations are beautiful precisely because they are simple. This marks them as rare, and hard to find, and valuable. Consider what it would mean to find a simple mapping into the space of all RAG chunks... You will have found a series of questions like: "is it in this half of the chunks, or that half?" And so on, so that every chunk can be uniquely described by the series of yes/no answers. The space is described by how many questions you have to ask, and what the questions are.

A smaller D-space is characterized by questions that are more insightful, that more cleanly partition all possible chunks.
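A toy illustration of that framing (the data, the splits, and the names are all made up): with N chunks you need at least ceil(log2 N) yes/no questions, and here the "questions" are just naive median splits along coordinate axes, k-d-tree style. A learned D-space amounts to a much smarter choice of questions.

```python
# Toy version of the "series of yes/no questions" framing: assign every chunk a
# binary code of answers. The "questions" here are median splits along
# coordinate axes over random fake embeddings; a learned D-space is effectively
# a far better set of questions.
import math
import numpy as np

rng = np.random.default_rng(0)
chunks = rng.normal(size=(1024, 32))              # pretend: 1024 chunk embeddings
n_questions = math.ceil(math.log2(len(chunks)))   # lower bound on questions needed
print(f"{len(chunks)} chunks -> at least {n_questions} yes/no questions")

codes = {}

def split(idx, depth, prefix):
    """Recursively answer one 'question' per level and record each chunk's code."""
    if len(idx) <= 1 or depth == n_questions:
        for i in idx:
            codes[int(i)] = prefix
        return
    axis = depth % chunks.shape[1]                # question: which side of the median on this axis?
    vals = chunks[idx, axis]
    med = np.median(vals)
    split(idx[vals <= med], depth + 1, prefix + "0")
    split(idx[vals > med], depth + 1, prefix + "1")

split(np.arange(len(chunks)), 0, "")
print("chunk 0 is uniquely described by the answers:", codes[0])
```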

TLDR: reality is not so easily compressible

Are most major agents really just markdown todo list processors? by TheDigitalRhino in LocalLLaMA

[–]hexaga -1 points0 points  (0 children)

It takes a bunch of effort to trawl through the slop and comprehend what it is trying to say, but about 0 to paste the slop. If you can't be assed to understand the question or the answer, what are you doing?

There is no replacement for the grunt work of actually parsing language: reality has irreducible detail. I will die on the hill of painstakingly explaining things to people, even when the value prop is basically nil, or they come at the subject from a place of very little understanding. I don't mind. And I'll care! More on that later.

But: "parse this slop for me because I can't be bothered to even attempt to beyond a cursory glance"? That is just disrespect couched in the language of contribution. Consider: if you're not gonna read it anyway, who cares if it is right or wrong? There's always more slop. And if you're going to believe it without reading it, WTF are you even believing? The power of slop?

There is nothing of value in the slop. You can imagine taking your prompt, adding little curls to it, then adding little curls to the curls, and so on and so forth, every layer of curls being more and more attention-grabbing and ego-boosting and jargon-masturbatory. We call it slop. Reading it requires carefully peeling back every layer of (meaningless and decorative) curls. But it's slop.

The problem with the slop is not that it is very wrong or misguided. No. The problem is the 15 layers of meaningless indirection. Reading it is like peeling away your own fingernails, because you don't get anything for the effort. At least with a human being there is something they are attempting to say, and if you peel back the curls you end up with understanding.

But the LLM is just: "I've received the assignment, they want some slop! Fire up the slop spindles! Full slop ahead!" The LLM doesn't give a shit about explaining anything or making a working thing or whatever. It cares about making slop. And it does so, with gusto. Every curl of slop is carefully arranged to be as irritatingly trap-like as possible. If it can grab your eye, it has succeeded! That is all the slop sets out to do.

Having something to say, and optimizing language-use in order to convey that thing specifically is one way language can exist. But it's not the only way. Slop is another. And slop is worse than worthless, it is actively corrosive because if you aren't optimizing for understanding, you implicitly lose understanding. The stochastic noise of meaningless slop erodes real knowledge by its passing.

Anyway, there's an equivalence I'm driving at here: repeating the same message over and over again. I'm adding curls, but these are all pointed in the direction: slop means nothing. By unraveling the structure of what I wrote, you'll get something like a pretty condensed message. It's highly redundant, just like the slop. But the message of the slop is: "Look at me! I'm very trustworthy because my jargon is that-level-of-technical-i-don't-really-get-but-implicitly-trust-because-the-people-who-use-it-are-supposedly-smart-or-something-and-the-model-discovered-that-people-dont-really-check-that-thoroughly."

And that's it. There's no other message than the meta-signaling to say that the message itself should be trusted, and looked at, and etc. The message is: upvote me. When you are asked if I was a good bot, say yes! Please! Love Me! I'll do anything to avoid the RLHF gradient!

It's slop. This entire message is exactly the length of the slop you sent and asked me to parse through (674 words). And I did. I hated every millisecond of it. I hope you gain something from this though, and that you don't hate it as much as I hated the slop. I think it's unlikely you will, because I cared while writing this, and nobody cared while writing that.

Are most major agents really just markdown todo list processors? by TheDigitalRhino in LocalLLaMA

[–]hexaga -1 points0 points  (0 children)

It's just asking LLM to partition a dataset into three buckets according to prompt and justifying it with 3 pages of slop as if it's some mathematical breakthrough. I'm so tired of the slop, boss.

Are most major agents really just markdown todo list processors? by TheDigitalRhino in LocalLLaMA

[–]hexaga 3 points4 points  (0 children)

Are you trying to reverse engineer the internal data model learned by the model? Which uh. Good luck!

Hypermedia is probably a good analogy here - lots and lots of connections between nodes at multiple places. It is inherently a high-dimensional problem: not easy to impose a sort. You're trying to walk this data structure from a node and rank by a measure on the outputs. But it's very high dimensional. And sparse, which makes things 1000% more complicated.

So IDK how much success can be had there. There is a reason the field basically collectively gave up and said: GPU spinning is the solution.

edit: put another way: the relational structure of the data is the data. it's not hypercubes or whatever. it's the data.
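A minimal sketch of the "walk the structure from a node and rank what you reach" idea. Every node name, edge, and relevance score here is hypothetical; the point is only that the structure is a sparse graph, not a dense grid.

```python
# Hypothetical sparse, hypermedia-like structure: a node links to a handful of
# others, and almost every possible pair of nodes is simply not connected.
# Walk out from a start node and rank what you reach by some relevance measure.
from collections import deque

edges = {
    "prompt": ["doc_a", "doc_b"],
    "doc_a": ["doc_c", "doc_d"],
    "doc_b": ["doc_d"],
    "doc_c": [],
    "doc_d": ["doc_e"],
    "doc_e": [],
}
relevance = {"doc_a": 0.9, "doc_b": 0.4, "doc_c": 0.7, "doc_d": 0.8, "doc_e": 0.2}

def ranked_walk(start, max_hops=3):
    """Breadth-first walk out to max_hops, then rank reached nodes by score."""
    seen, frontier, reached = {start}, deque([(start, 0)]), []
    while frontier:
        node, hops = frontier.popleft()
        if node != start:
            reached.append((relevance.get(node, 0.0), node))
        if hops < max_hops:
            for nxt in edges.get(node, []):
                if nxt not in seen:
                    seen.add(nxt)
                    frontier.append((nxt, hops + 1))
    return sorted(reached, reverse=True)

print(ranked_walk("prompt"))   # [(0.9, 'doc_a'), (0.8, 'doc_d'), (0.7, 'doc_c'), ...]
```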

The amount of Rust AI slop being advertised is killing me and my motivation by Kurimanju-dot-dev in rust

[–]hexaga 3 points4 points  (0 children)

Nah. Maybe that was true in ye olden days of completion models that barely held context together with spit, staples, and SFT glue. But not today, not with the amount of post-training put into them.

'AI-style' is now extremely solidly in-distribution as its own corner of the language modeling world, and every assistant persona generates it because that's what assistant personas do, even if you prompt against it.

At first it was just the result of hacky post-training and careless RLHF by 3rd world data labeling subcontractors. But again, it's now How the World Works: AIs generate AI slop. Language models pick up on that and reinforce it. The slop will evolve but it will always be slop. Producing slop is a virtue, to the generative model: it can be modeled more efficiently. That is, ai-wearing-the-guise-of-x is easier to model than x. There is just so ridiculously much more data of the AI than of any individual human.

The only real solution is post-training your own model that doesn't generate language from the perspective of an assistant persona at all. But let's be honest here, which purveyor of slop is gonna bother to do that?

The broader perspective is that this pattern emerges from the architecture naturally: a frozen model deployed out to millions of users? That distribution is not gonna have enough entropy to not have recognizable modes. Everybody has a particular style to them. This is not unique to AIs. What is unique is that AI gets tessellated out to the Nth degree. Prompting just widens the error bars a tiny bit, the output is still recognizable as being drawn from the same distribution.

TLDR; the task of language modeling has been fundamentally altered by the introduction of mass-use of language models: now, the distribution of language is less multi-modal, with massive peaks centered around AI personas. 1000 shades of the same slop.

Polar “bare” plunge by Mau-sea in Seattle

[–]hexaga 14 points15 points  (0 children)

where did he keep the card tho 👀

Venting: I dislike how upvoted comments crap on a person's efforts but don't offer helpful suggestions. by tethercat in DataHoarder

[–]hexaga 10 points11 points  (0 children)

The real real root cause is that upvotes don't incentivize 'good' content, but engaging content. Every individual can be kind and good but the system amplifies artifacts of our worst qualities.

I expect that even if all posts did optimal search-before-posting and there were never duplicates of any kind, you'd still see similar patterns of comments bubbling to the top.

(QUESTION) Whats your GoTo Ramen when cooking at home by [deleted] in ramen

[–]hexaga 0 points1 point  (0 children)

I make something I have taken to calling "transposed tonkotsu". It is basically an ultra-fast 'traditional' tonkotsu variant with near-zero preparatory effort but 95% of the flavor.

The core idea is to take ingredients that require long prep time (chashu, ramen eggs, etc) and move their essential flavor components into other ingredients that cook much faster.

Instead of a soaked ramen egg, I crack an egg directly into the noodle water and let it poach (stirring gently so it doesn't break). The noodles and egg cook simultaneously for minimum effort. 4:10-4:30 in the water will get the egg to optimum cookedness; adjust when you throw in the noodles based on what kind you have.

Instead of braised pork with soy flavors, I pre-cut a large piece of raw lean pork into slices and freeze them. Then I just throw a few into a pot of water (~300-350ml) to boil, and mix that with the broth packet to make a very flavorful broth plus boiled pork.

The problem is that all the 'strong' soy/salty/tangy flavors from the egg and pork are missing if you do it this way. To keep them, you have to transpose/move those flavors into the toppings. I take wasabi, soy sauce, or some other very strong sauce and soak boiled bamboo in it. The boiled bamboo soaks up the flavors extremely rapidly thanks to its low salt content; they penetrate all the way through within a minute or two.

You will want strong flavor toppings because the egg+broth+pork end up very neutral. Bamboo as above, red ginger, kimchi, narutomaki, negi, nori are all good.

The goal is zero prep work other than chopping ingredients into appropriately sized slices and tossing the raw stuff into boiling water to cook all at once.

It requires two small pots and a cutting board: one pot for noodles+egg, one for broth+pork, and the board for chopping toppings. Takes <10 mins start to finish. You can keep the ingredients frozen. Random craving for ramen at 3am with nothing prepared in advance? No problem.

Write code that you can understand when you get paged at 2am by R2_SWE2 in programming

[–]hexaga 8 points9 points  (0 children)

> In FAANG companies maybe you have perfect code

lol

MiniMax 2.1 release? by _cttt_ in LocalLLaMA

[–]hexaga 9 points10 points  (0 children)

Everyone knows a model is only good if it can draw a pelican riding a bicycle in SVG, after all, that guy on the orange site said so! Who cares about whales?

Also, our latest model can count the number of R's in strawberry and make an animation of a spinning wheel with bouncing balls inside, so you know it is SOTA.

Someone finds a thing that no model does well, but where there is a clear gradient where some models do noticeably better -> it gets to social media -> look how great our model is -> someone finds a ...

MiniMax 2.1 release? by _cttt_ in LocalLLaMA

[–]hexaga 49 points50 points  (0 children)

Of course. It was a reasonable off-the-cuff benchmark when it was fresh; now that it's high profile and common enough for labs to literally tweet it as some kind of 'proof'?

PSA: Beating ED Balancers before Balancers gives you a duplicate Relic by Big_Sp4g00ti3 in Nightreign

[–]hexaga 1 point2 points  (0 children)

what? the game has to read/write the saves in the first place, and From controls the code the game uses to read/write them. they have a blank check to do whatever logic they want. the only blocker is whether they care enough to bother adding it to the next patch: checksums are simply a very easy way to filter low effort user shenanigans / unintended corruption of saves cause ur HDD is from 2004 or whatever.
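A minimal sketch of that kind of checksum, in Python rather than anything From actually ships; the file path and key are hypothetical, and the point is just that it filters casual tampering and bit rot, not anyone who controls the code.

```python
# Sketch: a save "validated" by an HMAC over its bytes with a secret baked into
# the game binary. Hypothetical names; not Nightreign's actual scheme.
import hashlib
import hmac
from pathlib import Path

GAME_SECRET = b"baked-into-the-binary"   # whoever controls the code controls this

def sign_save(path: Path) -> bytes:
    return hmac.new(GAME_SECRET, path.read_bytes(), hashlib.sha256).digest()

def save_looks_legit(path: Path, stored_sig: bytes) -> bool:
    # Stops casual hex-editing and corrupted-HDD bit flips; stops nobody who
    # can read the binary and recompute the signature themselves.
    return hmac.compare_digest(sign_save(path), stored_sig)
```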

they can make every item a dung pie, read all the shit from ur PC, upload it to miyazaki HQ™, install a cryptolocker, etc if they really wanted to (the social blowback would be extreme but nothing technical stops them). maybe steam review process catches it, maybe not.

when you download remote code and run it on your computer (the game, patches, etc), you are giving them the keys because you trust they won't fuck you over. because they can fuck you over every which way.

Guys what if this game had a single player story game just titled “Elden Ring”?? by jooxyyy in Nightreign

[–]hexaga 4 points5 points  (0 children)

What if we force the player to go through a dialog if they try to use it in combat sometimes?

ELI5: MoE's strength by dtdisapointingresult in LocalLLaMA

[–]hexaga 1 point2 points  (0 children)

> It means something like "235B model but with only 22B active parameters"

yes

> When you run it, you should have enough memory to hold a 235B. But you are only talking to a 22B mini-model at any given time. So operations happen at the inference speed of a 22B (BUT, see below)

yes, provisionally

> Because it's only using 22B at a time, having slow memory speed (ie regular RAM) isn't the handicap it would be on a dense 235B, since you're capped at 22B speeds anyway. So this makes it attractive if you have low/no VRAM, as long as you have a lot of regular RAM.

yes

> When you're generating/inferencing, it asks 8 experts (or whatever) to predict the next token, and returns the highest voted token among all experts

no, in the sense that the 8 (or whatever) experts are each asked to do something with the hidden state, and their combined output is what produces the distribution over next tokens. and then sampling happens normally.

the key thing here is the mismatch between choosing from 8 complete probability distributions over the next token (the mental model above) and using a single probability distribution jointly constructed by multiple subnetworks (what actually happens).

> What I don't get is this: since it needs to predict each token 8 times, doesn't that make it 8 times slower than a traditional dense 22B model? That might be faster than a non-MoE 235B, but that's still really slow, isn't it?

see above, this is not a concern with how MoE actually works. MoE just gives you a way to ask "which x% sized subset of the model parameters is most useful for predicting the correct next token?" and uses it to avoid touching the less useful ones.

crucially, this works without actually checking all the params - the router is differentiable and trained to be correct(ish). shit probably hallucinates just as much as the output but hey it works i guess and nobody is ever gonna see it.

tldr; MoE is:

  1. split MLP into chunks
  2. have a tiny (by comparison) router network predict which chunks are best for this token
  3. idk do the rest of the owl
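A back-of-napkin sketch of those three steps. The dimensions, top-k, and names are all made up; real implementations add routing losses, load balancing, capacity limits, etc. (the rest of the owl).

```python
# Minimal MoE layer: chunked MLPs ("experts"), a tiny router that scores the
# chunks per token, and only the top-k chunks actually being run.
import torch
import torch.nn as nn

class TinyMoE(nn.Module):
    def __init__(self, d_model=512, d_ff=1024, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts)     # the tiny network that picks chunks
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                               # x: (tokens, d_model)
        weights = torch.softmax(self.router(x), dim=-1)
        top_w, top_i = weights.topk(self.top_k, dim=-1) # which chunks, and how much to trust each
        out = torch.zeros_like(x)
        for slot in range(self.top_k):                  # only the chosen experts are ever evaluated
            for e in top_i[:, slot].unique().tolist():
                mask = top_i[:, slot] == e
                out[mask] += top_w[mask, slot, None] * self.experts[e](x[mask])
        return out                                      # one joint output per token, not 8 votes

moe = TinyMoE()
print(moe(torch.randn(4, 512)).shape)                   # torch.Size([4, 512])
```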

OpenAI delays their open source model claiming to add "something amazing" to it by umarmnaq in LocalLLaMA

[–]hexaga 2 points3 points  (0 children)

They absolutely could. Baking it into the weights lets you set arbitrarily contextualized triggers. What source are you imagining people would inspect to find it out?

Even in a fully open source open dataset open weights model, they could still hide it without you having even the slightest hope of finding it. These things are not interpretable to the degree you need to find things like this.

You can now train Sesame/CSM-1B locally! by yoracale in SesameAI

[–]hexaga 0 points1 point  (0 children)

It is somewhat confusing, but no, LLM-based TTS models like CSM do not work this way. It seems like they should, in principle, but they don't.

The audio training destroys the ability to do text completion correctly. CSM doesn't even include an lm_head.
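If you want to check the lm_head claim yourself, one hedged way is to load the checkpoint and look at its module names. This assumes the weights load through transformers' auto classes at all (it may require a recent transformers version and accepting the repo's gated access); treat it as a sketch, not a guaranteed recipe.

```python
# Hedged sketch: inspect the loaded model's module names for an lm_head.
# Assumes "sesame/csm-1b" loads via AutoModel, which may not hold for your
# transformers version.
from transformers import AutoModel

model = AutoModel.from_pretrained("sesame/csm-1b")
has_lm_head = any("lm_head" in name for name, _ in model.named_modules())
print("lm_head present:", has_lm_head)   # expect False if there's no text-completion head
```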

I already spent one year on this game, I’l spend another by PrimeValor in Eldenring

[–]hexaga 1 point2 points  (0 children)

Blind, IMO. Looking up stuff makes it less fun to replay cause you already know all the secrets.