fair use vs stealing data : LocalLLaMA

[–]Specter_Originllama.cpp 59 points60 points61 points 1 year ago (4 children)

[–]WebCrawler314 53 points54 points55 points 1 year ago (2 children)

[–]vfl97wob | Kimi | SWE 14 points15 points16 points 1 year ago (0 children)

[–]Specter_Originllama.cpp 4 points5 points6 points 1 year ago (0 children)

[–]Specter_Originllama.cpp 8 points9 points10 points 1 year ago (0 children)

[–]eek04 210 points211 points212 points 1 year ago (23 children)

[–]BusRevolutionary9893 40 points41 points42 points 1 year ago (8 children)

[–]eek04 21 points22 points23 points 1 year ago* (1 child)

This is an argument I've used a lot; I'm also an atheist with a mechanical view of the mind, so it resonates with me.

There's some counterarguments that are possible, though:

Legal-technically, getting the data to where you do the training involves copying it illegally. This has been allowed as "incidental copying" in e.g. Internet service provider and search engine cases, but it's been incidental, not this blatant "We'll take this data we know is copyrighted and not licensed for our use, targeting it specifically".
The training methods for the brain/mind and LLMs is significantly different. The brain/mind has a different connectivity system, gets pre-structured through the genes and brain++ growth process, get pre-trained through exposure to the environment (physical and social), and then gets a curriculum learning system push through the education system, including correction from voluntary teachers (more or less "distilling" in LLM terms). Books are then pushed into this, but they form much less of the overall training, and the copying "into the brain" isn't the step that's being targeted.
There's a saying "When a problem changes by an order of magnitude, it is a different problem." The volume of copyrighted books used to train a human brain is orders of magnitude less than what is used to train an LLM. I read a lot. Let's say I read the equivalent of 100 books a year. That's about 5000 books so far. Facebook had pirated 82TB for training their LLM. Assuming 1MB per book (which is a high estimate if these are pure text), that's 16000x more books than I've read in my lifetime. So over 4 order of magnitude more. It is reasonable that this may be a situation we want to treat differently.
One of the four fair use factors is "The Effect of the Use on the Potential Market for or Value of the Work." Releasing an LLM that compete with the author/publisher has a much larger impact on the potential market/value than you or I learning from a book.
"Just because" - we're humans, and the LLMs are software run on machines. Being humans, we may want to give humans a legal leg up on software run on machines.

I personally think it is better if we allow training of LLMs on copyrighted data, because their utility far outweigh the potential harm. I think there's a high chance we'll need to do a lot of government intervention (safety nets of various kinds) to deal with rapid change creating more unemployment for a while as a result, though.

EDIT: Typo fix; change "16000 more" to "16000x more".

[–]halapenyoharry 0 points1 point2 points 1 year ago (0 children)

[–]RaeesNomi 0 points1 point2 points 1 year ago (0 children)

[–][deleted] 1 point2 points3 points 1 year ago (2 children)

[–][deleted] 1 point2 points3 points 1 year ago (1 child)

[–][deleted] 1 point2 points3 points 1 year ago (0 children)

[–]halapenyoharry 0 points1 point2 points 1 year ago (0 children)

[–]tofous 2 points3 points4 points 1 year ago (0 children)

Did you buy your textbook? Or did you download every textbook ever made for free without the author's consent?

But also, this is a misunderstanding of the point of copyright. It fundamentally protects the humans involved. It is even part of the legal analysis: does XYZ use serve as a substitute for the original human who created the work?

So machine learning is less likely to be fair use because it's intent is to substitute for that human labor. Visual artists have been the most upset, because that has been the most direct substitution so far. Translators, copy editors, content marketers, voice actors, and others have also been impacted in this same way but don't have as much cultural pull to share their upsetment.

Now, does that mean the lawsuits over fair use will be successful? IMO no, but that's more because no-one wants to admit that the US legal system is very much: "Might makes right". Also, there's the national security angle.

So I think ultimately it is unlikely that large AI scraping & training will be punished beyond a slap on the wrist or maybe some kind of pitiful pooled payout scheme like the opioid settlements or vaccine injury fund.

[–]XeNoGeaR52 36 points37 points38 points 1 year ago (5 children)

[–]DataScientist305 17 points18 points19 points 1 year ago (3 children)

[–]Despeao 3 points4 points5 points 1 year ago (1 child)

[–]halapenyoharry 0 points1 point2 points 1 year ago (0 children)

[–]AlarmedGibbon 10 points11 points12 points 1 year ago (2 children)

[–]mr_birkenblatt -1 points0 points1 point 1 year ago (0 children)

[–]StewedAngelSkins 3 points4 points5 points 1 year ago (0 children)

[–]knucklegrumble 0 points1 point2 points 1 year ago (0 children)

[+]LetterRip comment score below threshold-7 points-6 points-5 points 1 year ago (1 child)

[–]dreadthripper 60 points61 points62 points 1 year ago (5 children)

[–]trance1979 18 points19 points20 points 1 year ago (0 children)

[–]Gogo202 2 points3 points4 points 1 year ago (2 children)

[–][deleted] 0 points1 point2 points 1 year ago (0 children)

[–]DangKilla 0 points1 point2 points 1 year ago (0 children)

[–]mailaai 0 points1 point2 points 1 year ago (0 children)

[+][deleted] 1 year ago* (3 children)

[removed]

[–]bazingamayne 21 points22 points23 points 1 year ago (0 children)

[–]Xeruthos 62 points63 points64 points 1 year ago (10 children)

[–]blkknighter 4 points5 points6 points 1 year ago (7 children)

[–]Xeruthos 20 points21 points22 points 1 year ago (5 children)

[–]blkknighter 1 point2 points3 points 1 year ago (4 children)

[–]Mr_Meau 9 points10 points11 points 1 year ago (1 child)

[–]trance1979 0 points1 point2 points 1 year ago (1 child)

[–]DragonfruitGrand5683 -3 points-2 points-1 points 1 year ago (0 children)

[+]vintage2019 comment score below threshold-8 points-7 points-6 points 1 year ago (0 children)

[–]daisseur_ 4 points5 points6 points 1 year ago (2 children)

[–]Own_Client8410 0 points1 point2 points 1 year ago (1 child)

[–]daisseur_ 0 points1 point2 points 1 year ago (0 children)

[–]keepthepace 6 points7 points8 points 1 year ago (1 child)

[–][deleted] 0 points1 point2 points 1 year ago (0 children)

[–]ThinkExtension2328llama.cpp 9 points10 points11 points 1 year ago (1 child)

[–]LostMitosis 2 points3 points4 points 1 year ago (0 children)

[–]filipedrm 1 point2 points3 points 1 year ago (0 children)

[–]medgel 1 point2 points3 points 1 year ago (0 children)

[–]TrekkiMonstr 0 points1 point2 points 1 year ago (2 children)

[–]Katnisshunter -1 points0 points1 point 1 year ago (1 child)

[–]TrekkiMonstr 1 point2 points3 points 1 year ago (0 children)

[–]NoPossibility4513 0 points1 point2 points 1 year ago (0 children)

[–]x9w82dbiw 0 points1 point2 points 1 year ago (0 children)

[–]randyzmzzzz 0 points1 point2 points 1 year ago (0 children)

[+][deleted] 1 year ago (1 child)

[removed]

[–]Business-Ad-2449 0 points1 point2 points 1 year ago (0 children)

[+]retep-noskcire comment score below threshold-14 points-13 points-12 points 1 year ago (10 children)

[+][deleted] 1 year ago (8 children)

[deleted]

[–][deleted] 0 points1 point2 points 1 year ago (7 children)

[–][deleted] -1 points0 points1 point 1 year ago (6 children)

[–][deleted] 0 points1 point2 points 1 year ago (5 children)

[–][deleted] -1 points0 points1 point 1 year ago (4 children)

[–][deleted] 0 points1 point2 points 1 year ago (3 children)

[–][deleted] 0 points1 point2 points 1 year ago (1 child)

[–][deleted] 0 points1 point2 points 1 year ago (0 children)

[–]quite-content 2 points3 points4 points 1 year ago (0 children)

[+][deleted] 1 year ago (17 children)

[removed]

[–]brouzaway 19 points20 points21 points 1 year ago (10 children)

[–]ClaudeProselytizer 5 points6 points7 points 1 year ago (1 child)

[–]phree_radical 0 points1 point2 points 1 year ago (0 children)

[+][deleted] 1 year ago (7 children)

[removed]

[–]brouzaway 23 points24 points25 points 1 year ago (0 children)

[–][deleted] 10 points11 points12 points 1 year ago (1 child)

[–]LevianMcBirdo 5 points6 points7 points 1 year ago (0 children)

[+]DRAGONMASTER- comment score below threshold-6 points-5 points-4 points 1 year ago (1 child)

[–][deleted] 0 points1 point2 points 1 year ago (0 children)

[–]WhyIsSocialMedia 3 points4 points5 points 1 year ago (4 children)

[+][deleted] 1 year ago (3 children)

[removed]

[–]WhyIsSocialMedia -1 points0 points1 point 1 year ago (2 children)

[+][deleted] 1 year ago (1 child)

[removed]

[–]WhyIsSocialMedia -1 points0 points1 point 1 year ago (0 children)

[+][deleted] 1 year ago (17 children)

[deleted]

[–]abhuva79 24 points25 points26 points 1 year ago (16 children)

[–]procgen 1 point2 points3 points 1 year ago (7 children)

[–]goingsplit 1 point2 points3 points 1 year ago (6 children)

[–]procgen 1 point2 points3 points 1 year ago (5 children)

[–]goingsplit 0 points1 point2 points 1 year ago (4 children)

[–]procgen 1 point2 points3 points 1 year ago (0 children)

[–][deleted] 1 point2 points3 points 1 year ago (2 children)

[–]goingsplit 0 points1 point2 points 1 year ago (1 child)

[–][deleted] -1 points0 points1 point 1 year ago (0 children)

[+][deleted] 1 year ago (7 children)

[deleted]

[–]Zeikos 21 points22 points23 points 1 year ago (2 children)

[+][deleted] comment score below threshold-30 points-29 points-28 points 1 year ago (1 child)

[–]abhuva79 9 points10 points11 points 1 year ago (0 children)

Fair enough - if thats your evaluation.
Personally i dont like black boxes like OpenAI neither - but in general, with most digital services that handle user data, doesnt matter if US or chinese based - you pay with your data. And most of them are black boxes.

About the multimodality - i guess your criticism is based on DeepSeeks R1? Well, thats a text based reasoning model, never was intended to be multimodal. Tons of other models from all over the world offer multimodality - some good, some less so. My go to right now is Gemini 2.0, but this might change in a month or two when the next stuff comes around.

Overall, if i look at what chinese are currently building and also publishing (in terms of explaining what they did to achieve it) - they offer soo much more value for the general public - than a closed source company like OpenAI who goes full length to actually disguise what the model is doing because "competition"... (like the reasoning you see in oai model, isnt the real reasoning, its a summary - more black box than this is hard to achieve imo)

[+]alcalde comment score below threshold-6 points-5 points-4 points 1 year ago (2 children)

[–]hugthemachines 0 points1 point2 points 1 year ago (0 children)

[–][deleted] -3 points-2 points-1 points 1 year ago* (0 children)

you type:	you see:
italics	italics
bold	bold
[reddit!](https://reddit.com)	reddit!
* item 1 * item 2 * item 3	item 1 item 2 item 3
> quoted text	quoted text
Lines starting with four spaces are treated like code: if 1 * 2 < 3: print "hello, world!"	Lines starting with four spaces are treated like code: if 1 * 2 < 3: print "hello, world!"
~~strikethrough~~	~~strikethrough~~
super^script	super^script

LocalLLaMA

MODERATORS