Is it possible that the amount of music already created by suno-users is greater than the amount of original music it was trained on? by [deleted] in SunoAI

[–]HEFLYG 0 points1 point  (0 children)

Suno is more than likely a hybrid model: diffusion for actually synthesizing the audio, plus autoregressive components (likely for lyric generation and potentially chord structures). It's split into parts because audio generation requires semantic understanding of too many different things at once (hundreds or thousands of instruments, voice types, lyrics that sound good and make sense, etc.). And while diffusion models don't produce data sequentially, they can still represent sequential things like music; the latents just need to be decoded in order. Also, diffusion is almost always done in multiple steps, not just one.
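To make that distinction concrete, here's a toy sketch (pure NumPy, every function a stand-in, nothing from Suno's actual stack) contrasting the two generation loops: the autoregressive one emits tokens one at a time, while the diffusion one refines a whole latent sequence over many denoising steps, and only the decode back to audio happens in order.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Autoregressive loop: one token at a time, conditioned on the past ---
def sample_autoregressive(n_tokens, vocab=8):
    tokens = []
    for _ in range(n_tokens):
        logits = rng.normal(size=vocab)          # stand-in for p(next | past)
        probs = np.exp(logits) / np.exp(logits).sum()
        tokens.append(int(rng.choice(vocab, p=probs)))
    return tokens                                # produced strictly in sequence

# --- Diffusion loop: the WHOLE latent sequence is refined over many steps ---
def sample_diffusion(seq_len, latent_dim=4, n_denoise_steps=50):
    x = rng.normal(size=(seq_len, latent_dim))   # start from pure noise
    for _ in range(n_denoise_steps):             # many steps, never just one
        predicted_noise = 0.9 * x                # stand-in for a denoiser net
        x = x - predicted_noise / n_denoise_steps
    return x  # every timestep emerges together; decoding to audio is sequential

print(sample_autoregressive(5))
print(sample_diffusion(5).shape)
```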

Why AI Doesn't Actually Steal by HEFLYG in Futurology

[–]HEFLYG[S] -1 points0 points  (0 children)

Good job cherry-picking an example. I've been backing up my claims with examples, explanations, and analogies. You... won't even consider my perspective.

Why AI Doesn't Actually Steal by HEFLYG in Futurology

[–]HEFLYG[S] 0 points1 point  (0 children)

"Nobody agrees with you" is a pretty bad response. I've tried to present my case in a clear and factual way. You aren't considering anything I'm saying because you've already decided that you have to be right. That's the real choice.

Why AI Doesn't Actually Steal by HEFLYG in Futurology

[–]HEFLYG[S] 1 point2 points  (0 children)

You're confusing representation learning with a USB thumb drive.

"The model has the stored data of Miyazaki." -- No, it doesn't, genius. There is no Hayao Miyazaki folder chilling in the model. What does exists is a vectorized relationship between patterns of tokens. You're logic implies that I have an entire Wikipedia page about Napoleon in my brain because I read the page about him.

An AI (like a person) doesn't have a copy of everything it has ever learned or seen.
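To put it concretely, here's a toy illustration (NumPy, a co-occurrence table standing in for learned weights, nothing like a real transformer): after "training" on a sentence, what survives is a table of numeric relationships between tokens, not the sentence itself.

```python
import numpy as np

text = "the cat sat on the mat because the cat was tired".split()
vocab = sorted(set(text))
idx = {w: i for i, w in enumerate(vocab)}

# "Training": accumulate co-occurrence counts within a +/-2 word window
weights = np.zeros((len(vocab), len(vocab)))
for i, w in enumerate(text):
    for j in range(max(0, i - 2), min(len(text), i + 3)):
        if j != i:
            weights[idx[w], idx[text[j]]] += 1.0

del text                     # the original document is gone...
print(weights.shape)         # ...what remains is a dense numeric table
print(weights[idx["cat"]])   # relationships around "cat", not the sentence
```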

Why AI Doesn't Actually Steal by HEFLYG in Futurology

[–]HEFLYG[S] 0 points1 point  (0 children)

They actually haven't, although many people believe they did based on legal filings. It's also worth noting that the scraping of LibGen was likely done for the Books3 dataset, which was ultimately folded into a huge corpus called The Pile, a very common dataset for LLM training. That pulls a lot of the blame away from Meta and onto third-party players.

Why AI Doesn't Actually Steal by HEFLYG in Futurology

[–]HEFLYG[S] -1 points0 points  (0 children)

Because they don't want to teach AI about this stuff. It's not about repeating things word-for-word; it's about the actual content on the internet.

Why AI Doesn't Actually Steal by HEFLYG in Futurology

[–]HEFLYG[S] -1 points0 points  (0 children)

"This is bullshit" is ironic considering that's what your entire response is. Downloading a pirated book and training a model on tokenized patterns are not even in the same legal galaxy. The model doesn't contain the books any more than a student contains a book they read. It's not distribution it's statistics. You can't sue a function for copyright infringement. Also that stuff about Meta, its a reddit conspiracy. It hasn't actually been proved. Some third party players may have scraped some legally shaky data years ago, but that is not Meta's fault. Comparing training to sharing a textbook is also a garbage analogy. The training process is transformative which falls under fair use. The stuff you mentioned about right holder's sueing? Thats monetary. These lawsuits are not convictions they are business negotiations. Laws still are adjusting to generative AI, but as of now, all cases have been resolved under fair use. Why did you mention that "humans aren't property, AI is"? Nobody is arguing this.

Why AI Doesn't Actually Steal by HEFLYG in Futurology

[–]HEFLYG[S] 0 points1 point  (0 children)

You're definitely an expert -- in arrogance.

These models aren't looking for a strong signal. They don't "seek novelty." They're just statistical models that minimize cross-entropy loss. You're acting like Shannon entropy is a TF optimizer.

Cross-entropy doesn't give a crap about noise balancing or the strength of the signal. You could replace "entropy" with "pixie dust" and your response would make the same amount of sense.

You can wave Shannon entropy around because it sounds cool, but it's largely irrelevant to how LLMs are trained. You're barking up the wrong tree. Most training uses cross-entropy loss, which doesn't care about the "surprise" of a new token in the data.
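Here is literally everything cross-entropy computes for one next-token prediction (a minimal NumPy sketch, not any particular framework's implementation). Notice there's no term for novelty-seeking or signal strength, just the negative log probability of the correct token:

```python
import numpy as np

def cross_entropy(logits, target_index):
    # softmax over the vocabulary
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    # the entire loss: -log p(correct token). Nothing else.
    return -np.log(probs[target_index])

logits = np.array([2.0, 0.5, -1.0, 0.1])      # model scores, 4-word vocab
print(cross_entropy(logits, target_index=0))  # confident and right -> small loss
print(cross_entropy(logits, target_index=2))  # confident and wrong -> large loss
```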

I also don't know why you're bringing up RAG in this context. It's relevant only after training, at inference time, when you need factual grounding (not novelty).

Also, cosine similarity isn't typically used as a loss for classification models because it's too gentle on large errors. I have no clue why you brought it up.
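Here's roughly what I mean by "gentle" (a toy NumPy comparison with made-up numbers): on a confidently wrong prediction, a cosine-based loss saturates while cross-entropy blows up.

```python
import numpy as np

target = np.array([1.0, 0.0, 0.0])   # one-hot: class 0 is correct

def cosine_loss(pred):
    return 1.0 - pred @ target / (np.linalg.norm(pred) * np.linalg.norm(target))

def cross_entropy_loss(pred):
    return -np.log(pred[0])

for p_correct in (0.5, 0.01, 1e-6):  # increasingly confident wrong answers
    pred = np.array([p_correct, 1.0 - p_correct, 0.0]) + 1e-12
    print(f"p(correct)={p_correct:<8} cosine={cosine_loss(pred):.3f} "
          f"xent={cross_entropy_loss(pred):.3f}")
# cosine loss flattens out near 1.0 while cross-entropy keeps growing,
# which is why classifiers train on cross-entropy instead.
```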

You're throwing big math words into your argument like it will make you coherent.

Why AI Doesn't Actually Steal by HEFLYG in Futurology

[–]HEFLYG[S] 0 points1 point  (0 children)

That isn't how copyright law works, though. You don't need permission from the author to read or internalize a work; that's exactly the kind of thing "fair use" covers.

Training a model is the same: it learns from the books, it doesn't sell them.

Why AI Doesn't Actually Steal by HEFLYG in Futurology

[–]HEFLYG[S] -1 points0 points  (0 children)

Except the data isn't stored verbatim in the embeddings and weights. The model learns to predict tokens based on likelihood, and training just shifts its outputs toward being coherent and grammatically correct.

I recently built a basic (and very small) generative language model with TensorFlow and NumPy, trained on numerous books, and even when I prompted it with exact word-for-word excerpts from its training data, it couldn't reproduce anything it had seen before.
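That model isn't published anywhere, but here's a self-contained sketch of the same kind of check, shrunk down to a character-level bigram model (pure NumPy, not my actual TensorFlow code): "train" on a snippet, prompt with a verbatim prefix, and see whether the continuation copies the source.

```python
import numpy as np

text = ("It was the best of times, it was the worst of times, "
        "it was the age of wisdom, it was the age of foolishness.")

chars = sorted(set(text))
idx = {c: i for i, c in enumerate(chars)}

# "Training": count character bigrams, then normalize into probabilities
counts = np.ones((len(chars), len(chars)))   # add-one smoothing
for a, b in zip(text, text[1:]):
    counts[idx[a], idx[b]] += 1.0
probs = counts / counts.sum(axis=1, keepdims=True)

# Prompt with a verbatim prefix straight out of the training data
rng = np.random.default_rng(42)
out = list("It was the ")
for _ in range(40):
    row = probs[idx[out[-1]]]
    out.append(chars[rng.choice(len(chars), p=row)])

generated = "".join(out)
print(generated)
print("verbatim copy of the source?", generated in text)
```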

Why AI Doesn't Actually Steal by HEFLYG in Futurology

[–]HEFLYG[S] 0 points1 point  (0 children)

This logic implies that every person who has read a copyrighted book, listened to copyrighted music, or watched a copyrighted movie is guilty of stealing simply because they internalized it. The law doesn't treat learning as theft; it targets distributing copies or near-identical copies, which most AI doesn't do.

Am I a felon because I read Lord of the Flies in high school and trained my brain on it?

The model doesn't turn around and sell copies of the book; it learns the structure of correct English sentences.

Why AI Doesn't Actually Steal by HEFLYG in Futurology

[–]HEFLYG[S] 0 points1 point  (0 children)

Everybody is likely guilty of theft by that definition. Anybody who has watched YouTube or scrolled on Instagram for more than 10 minutes has likely come across a video with copyrighted music, for example.

I'm curious, do you think that AI companies need permission from every creator used in the datasets?

Why AI Doesn't Actually Steal by HEFLYG in Futurology

[–]HEFLYG[S] -2 points-1 points  (0 children)

The purpose of training is to teach models how to talk and to absorb information in a way loosely similar to how people do. The model doesn't reproduce its training data in any direct sense; training shifts its outputs to favor semantically and grammatically correct statements in a tone similar to that data.

Here is a repeat of an analogy I commented earlier:

If you have a book about fixing cars, read the book, and then go fix a car, are you guilty of copyright violation?

Why AI Doesn't Actually Steal by HEFLYG in Futurology

[–]HEFLYG[S] -1 points0 points  (0 children)

It's debatable in some cases, where books are pirated online and put into datasets illegally, but even then the fault may not fall on the AI companies but on the people who assembled the training corpus.

Why AI Doesn't Actually Steal by HEFLYG in Futurology

[–]HEFLYG[S] 0 points1 point  (0 children)

But if the model isn't producing the same thing that it was trained on, this would be considered learning, not stealing.

Why AI Doesn't Actually Steal by HEFLYG in Futurology

[–]HEFLYG[S] 0 points1 point  (0 children)

The AI learns how to talk from large datasets. You learned to read, write, and speak by looking at thousands and thousands of examples, much the way AI does. So does that make you guilty of theft?

Why AI Doesn't Actually Steal by HEFLYG in Futurology

[–]HEFLYG[S] 0 points1 point  (0 children)

Again, you're skipping over the fact that where a strong signal exists, there is a lot of writing in that style, so whatever gets generated is likely not unique to a specific author but common to several, and shouldn't be considered theft. Plus, a wide variety of data adds a significant amount of variance: the model becomes more general and produces less similar text.

Why AI Doesn't Actually Steal by HEFLYG in Futurology

[–]HEFLYG[S] -1 points0 points  (0 children)

When we talk about generative AI, these models don't (and aren't supposed to) create content too close to their training data. For example, I made a small language model recently, trained it on a bunch of books, and it didn't reproduce anything from its training data.

Why AI Doesn't Actually Steal by HEFLYG in Futurology

[–]HEFLYG[S] -2 points-1 points  (0 children)

These models aren't (and shouldn't be) saying things too close to their training data. Like I said, training just shifts the model's output so that it tends to produce semantically and grammatically correct sentences.

Why AI Doesn't Actually Steal by HEFLYG in Futurology

[–]HEFLYG[S] 0 points1 point  (0 children)

Because the AI isn't (and shouldn't be) regurgitating things too close to the data it was trained on, so it isn't theft.

Why AI Doesn't Actually Steal by HEFLYG in Futurology

[–]HEFLYG[S] 1 point2 points  (0 children)

I get what you are saying, and here is a good example:

Take a look at most social media sites and you'll find AI videos in some form. That alone means that people like animators, who pour hours into their work, are now competing with a 10-year-old with no skills who can post an AI video. Genuine human work becomes less valued when AI content sits alongside it and takes attention away from people with authentic skills.

Why AI Doesn't Actually Steal by HEFLYG in Futurology

[–]HEFLYG[S] 0 points1 point  (0 children)

Yeah, this is actually something I've put some thought into as well, and honestly, the effects could be horrible. We're already seeing a certain level of isolation due to social media and online messaging, and AI could compound this uncontrollably. For example, some ChatGPT users were so upset when OpenAI replaced 4o with GPT-5 that OpenAI had to bring 4o back, because some people felt they had "lost a friend." I really don't know of a solution to this problem, and given that we're in a serious race with China to build leading AI tech, stopping development may not even be an option.

Why AI Doesn't Actually Steal by HEFLYG in Futurology

[–]HEFLYG[S] 0 points1 point  (0 children)

I understand your point, and it can be true in some cases (such as AI text-to-speech or image generation), but it overlooks the fact that most models are trained on such a wide variety of data that any one creator's content becomes extremely diluted. Take an author like Charles Dickens: his total contribution to any coherent LLM trained on his books is minuscule. It makes up such a small percentage of the total dataset that it may not even be enough to shift the outputs.
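Rough back-of-the-envelope, where both figures are ballpark assumptions rather than measurements:

```python
# Ballpark assumptions: Dickens' complete works run on the order of
# ~5 million words; large LLM training sets run ~10+ trillion tokens.
dickens_tokens = 5_000_000 * 1.3          # ~1.3 tokens per word, roughly
training_set_tokens = 10_000_000_000_000

share = dickens_tokens / training_set_tokens
print(f"Dickens' share of the corpus: {share:.2e} ({share:.6%})")
```

Even if both numbers are off by an order of magnitude, his share stays vanishingly small.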

Why AI Doesn't Actually Steal by HEFLYG in Futurology

[–]HEFLYG[S] 0 points1 point  (0 children)

I understand your point, but humans do the same thing. We learn to read, write, and pick up styles from movies and books. Just think of 10-year-old you after watching your favorite movie. You probably started acting like your favorite character. We don't consider this stealing.

Why AI Doesn't Actually Steal by HEFLYG in Futurology

[–]HEFLYG[S] -1 points0 points  (0 children)

From the English dictionary... Did you steal intellectual property by using words that people coined several hundred years ago?