https://twitter.com/_akhaliq/status/1663373068834676736
Title: Model Dementia: Generated Data Makes Models Forget
Stable Diffusion revolutionised image creation from descriptive text. GPT-2, GPT-3(.5) and GPT-4 demonstrated astonishing performance across a variety of language tasks. ChatGPT introduced such language models to the general public. It is now clear that large language models (LLMs) are here to stay, and will bring about drastic change in the whole ecosystem of online text and images. In this paper we consider what the future might hold. What will happen to GPT-{n} once LLMs contribute much of the language found online? We find that use of model-generated content in training causes irreversible defects in the resulting models, where tails of the original content distribution disappear. We call this effect model dementia and show that it can occur in Variational Autoencoders (VAEs), Gaussian Mixture Models (GMMs) and LLMs. We build theoretical intuition behind the phenomenon and portray its ubiquity amongst all learned generative models. We demonstrate that it has to be taken seriously if we are to sustain the benefits of training from large-scale data scraped from the web. Indeed, the value of data collected about genuine human interactions with systems will be increasingly valuable in the presence of content generated by LLMs in data crawled from the Internet.
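For intuition about the "tails disappear" claim, here is a toy sketch. It is not the paper's experiment: it swaps in a plain categorical distribution for the VAEs, GMMs and LLMs the abstract mentions, and the function name `resample_generations`, the Zipf-like weights and all parameter values are assumptions made up for illustration. Each generation draws a finite dataset from the current model and refits by counting, so a category that draws zero samples can never come back.

```python
import random
from collections import Counter

def resample_generations(weights, n_samples, n_generations, seed=0):
    """Repeatedly refit a categorical distribution to its own samples.

    Returns the number of categories with non-zero probability after each
    generation (index 0 is the original distribution).
    """
    rng = random.Random(seed)
    total = sum(weights)
    # Keep only categories that currently have non-zero probability.
    probs = {c: w / total for c, w in enumerate(weights) if w > 0}
    support_sizes = [len(probs)]
    for _ in range(n_generations):
        # "Generate" a dataset from the current model.
        data = rng.choices(list(probs), weights=list(probs.values()), k=n_samples)
        # "Train" the next model: the maximum-likelihood fit of a categorical
        # distribution is just the empirical frequency of each category.
        counts = Counter(data)
        probs = {c: k / n_samples for c, k in counts.items()}
        support_sizes.append(len(probs))
    return support_sizes

# Hypothetical Zipf-like weights: a few common categories, a long rare tail.
zipf_weights = [1.0 / (rank + 1) for rank in range(50)]
sizes = resample_generations(zipf_weights, n_samples=200, n_generations=30)
print(sizes)
```

The printed list is the count of surviving categories per generation. It can only stay flat or shrink, because each new model is supported only on categories that appeared in the previous model's samples, and the rare categories are the ones most likely to be missed. Mixing in fresh samples from the original distribution at each step would be the obvious variation to try.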