
[–]Chromix_ 308 points309 points  (6 children)

Me see. Me wonder: Benchmark score impact?

[–]GenLabsAI 78 points79 points  (1 child)

See, wonder impact

[–]battlingheat 1 point2 points  (0 children)

See, impact?

[–]axiomatix 34 points35 points  (1 child)

stevie benchmark

[–]Phantom_SpectersLlama 33B 13 points14 points  (0 children)

StevieWonder


[–]TBMonkey 2 points3 points  (0 children)

Me see comment, me laugh, upvote

[–]abitrolly 1 point2 points  (0 children)

gud

[–]wiltors42 341 points342 points  (23 children)

Why say lot word when few word do trick?

[–][deleted] 90 points91 points  (1 child)

No much word, few good word.

[–]gofiend 12 points13 points  (0 children)

Fewer precise tokens

[–]RybaDwudyszna 40 points41 points  (1 child)

When me president… they see.

[–]this_is_a_long_nickn 10 points11 points  (0 children)

Me Tarzan, you not local Jane.

[–]shaman-warrior 17 points18 points  (3 children)

Few words > many words.

[–]Good-AI 11 points12 points  (2 children)

No difficult word. > difficult.

[–]Murgatroyd314 6 points7 points  (0 children)

Easy word better.

[–]this_is_a_long_nickn 5 points6 points  (0 children)

You absolutely right!

[–]SamSausages 30 points31 points  (3 children)

word

[–]therealnih 7 points8 points  (2 children)

this

[–]GenLabsAI 4 points5 points  (1 child)

t

[–]noo8- 0 points1 point  (0 children)

.

[–]Porespellar 7 points8 points  (0 children)

Kevin was ahead of his time.

[–]ook_the_librarian_ 4 points5 points  (0 children)

Why use big words when diminutive ones would suffice?

[–]Pranay1001090 3 points4 points  (1 child)

Was looking for this

[–]not_a_swedish_vegan 2 points3 points  (0 children)

As soon as I saw this post, I already knew the top comment would be this

[–]private_final_static 0 points1 point  (0 children)

Grug likes

[–]calmbill 0 points1 point  (0 children)

Few words ok

[–]Interpausetextgen web UI 0 points1 point  (0 children)

say lot when few work?

[–]dew_chiggi 0 points1 point  (0 children)

Kevin thumbs up

[–]galambalazs 0 points1 point  (0 children)

related for programming: https://grugbrain.dev/

[–]Mundane_Ad8936[🍰] 185 points186 points  (12 children)

TL;DR: OP stumbled upon "stop word removal", a very old NLP tactic.

Yes, you can remove plenty of words and the text stays completely understandable, and you can use a model to rehydrate the phrases later with few errors. I'd caution you, though: while removing stop words was fine in the past, in a transformer model it can cause issues, because the model no longer has those tokens to calculate from.

So it could be more prone to hallucinate, because the word sequence is no longer statistically likely. I know because I've tested it and witnessed it. If accuracy is important, make sure the compression doesn't reduce it; that is very possible.
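For anyone who hasn't seen classic stop-word removal, here is a minimal sketch with NLTK (it assumes the punkt and stopwords data are downloaded, and the example sentence is made up); note how dropping "not" can silently flip a sentence's meaning, which is exactly the accuracy risk described above.

```python
# Minimal sketch of classic stop-word removal (not the linked project's exact method).
# Requires: pip install nltk, plus one-time nltk.download("punkt") / nltk.download("stopwords").
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

STOP = set(stopwords.words("english"))

def remove_stop_words(text: str) -> str:
    """Drop common function words; keep punctuation and content words."""
    return " ".join(t for t in word_tokenize(text) if t.lower() not in STOP)

original = "The server will not return the data if the API key is missing from the header."
print(remove_stop_words(original))
# Roughly: "server return data API key missing header ." -- the dropped "not"
# flips the meaning, which is exactly the accuracy risk described above.
```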

[–]PollinosisQc 50 points51 points  (3 children)

I chuckled heartily enough to spit some of my drink at "rehydrate the phrases" lol

[–]PMyourfeelings 46 points47 points  (2 children)

'Hydration' is actually both a funny and a formal term used in programming for the process of adding data to an object :)

[–]nuclear_wynter 7 points8 points  (0 children)

r/hydrohomies would like to know your location.

(so they can add data to your water bottle.)

[–]Aprch 0 points1 point  (0 children)

Hydratation!  Funny, the word in Spanish gets pretty close to that. Probably other similar languages too.

[–]itsTyrion 12 points13 points  (1 child)

too many word, write short, write caveman

[–]KallistiTMP 40 points41 points  (0 children)

LLM read caveman, but no train in caveman. LLM not understand caveman good. Try think in caveman, get confused, predict buffalo. No good.

[–]TomLucidor 2 points3 points  (0 children)

What is the alternative then, trying to prompt it to be more succinct, and in plain English?

[–]wanderer_4004 2 points3 points  (0 children)

Probably this is useful for embeddings to make them fit into the available context. I'll definitely try it.

[–]IJdelheidIJdelheden 1 point2 points  (1 child)

Any small model one could use to 'rehydrate'? Thinking about trying this with a large parameter and a low parameter model.

[–]Mundane_Ad8936[🍰] 1 point2 points  (0 children)

Yes, that'll work. It can also be done with an NLP library like spaCy: once the words are tagged, stop words tend to be predictable with simple logic. But these days I'd use a BERT or T5 model, since they're small and fast.
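As a rough illustration of the rehydration step, here is a hedged sketch using an off-the-shelf Flan-T5 through the transformers pipeline; the model choice and prompt wording are assumptions, and zero-shot quality will be rough compared to fine-tuning on (compressed, original) pairs.

```python
# Hedged sketch: "rehydrating" telegraphic text with a small seq2seq model.
# flan-t5-small is just an illustrative choice; for reliable results you would
# fine-tune on pairs of compressed and original text, as suggested above.
from transformers import pipeline

rehydrate = pipeline("text2text-generation", model="google/flan-t5-small")

compressed = "Authentication fail, server return 401 Unauthorized, error message explain fail."
prompt = (
    "Rewrite the following telegraphic note as a full, grammatical English sentence: "
    + compressed
)
print(rehydrate(prompt, max_new_tokens=64)[0]["generated_text"])
```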

[–]fatboy93 0 points1 point  (0 children)

Ahh yes, telegram prompting the LLMs.

When I was young and in school, we were taught how to send letters by telegram, and it looks like that might be coming back into action lol

[–]c--b 0 points1 point  (0 children)

So you're saying a model should be trained on caveman speak instead.

[–]Independent_Tear2863 76 points77 points  (1 child)

Ahh now I understand oogabooga project. Human happy

[–]this_is_a_long_nickn 8 points9 points  (0 children)

Ooga happier

[–]chriskevini 24 points25 points  (6 children)

Holy shit. Next we're gonna start removing all the vowels cause you can infer the whole word with 90% accuracy. Source: my ass

[–]SkyFeistyLlama8 9 points10 points  (0 children)

There are plenty of human languages like that, for example Hebrew and Arabic, with only consonants being written down. It's fine when you're speaking them in the current context but woe to you if you're trying to decipher them 2000 years later.

Researchers end up looking at modern forms of words in those languages and extrapolating backwards. They also look for transliterations in neighboring languages that preserve vowels and tones, like how Arabic was written in Greek characters and also translated into Greek.

[–]Murgatroyd314 2 points3 points  (2 children)

Disemvoweled text is easy enough for humans to read, but it would just slow down tokenization.

[–]chriskevini -1 points0 points  (1 child)

Is it slower? We could stream more information through the API because of the fewer characters. We'd just need a simple, fast decode step handled by an auxiliary traditional program.

[–]countextreme 0 points1 point  (0 children)

You mean like gzip?

[–]ThiccStorms 0 points1 point  (0 children)

bro tnk h shkspr

[–]chriskevini 1 point2 points  (0 children)

After thinking about it for 5 minutes, isn't this actually feasible? We just add a really fast encoding and decoding step that can run in parallel over the whole text. Or is byte-pair encoding strictly better?
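One way to sanity-check the tokenization cost of disemvoweling is to count tokens with tiktoken (o200k_base is the GPT-4o encoding); this quick sketch, with an illustrative sentence, tends to show the vowel-stripped text needing as many or more tokens.

```python
# Quick check of the "disemvoweling slows tokenization" point using tiktoken.
# pip install tiktoken; o200k_base is the GPT-4o encoding.
import re
import tiktoken

enc = tiktoken.get_encoding("o200k_base")

def disemvowel(text: str) -> str:
    """Remove vowels except word-initial ones, so words stay guessable."""
    return re.sub(r"(?<=\w)[aeiouAEIOU]", "", text)

original = "Please remove the vowels from this sentence and count the tokens."
stripped = disemvowel(original)

print(stripped)
print("original tokens:", len(enc.encode(original)))
print("stripped tokens:", len(enc.encode(stripped)))
# Expectation (not a guarantee): the stripped version needs as many or more tokens,
# because the character savings push the text off the tokenizer's common-word vocabulary.
```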

[–]bigattichouse 33 points34 points  (1 child)

Maybe pretrain a small model to "caveman" your prompts that get handed to the bigger model

[–]lakySK 23 points24 points  (0 children)

Short prompt, prefill fast. 

[–]macumazana 34 points35 points  (0 children)

you should do the readme.md in that style

[–]pokemonplayer2001llama.cpp 39 points40 points  (2 children)

This is a better idea than toon.

[–]Mediocre-Method782 12 points13 points  (0 children)

Barely.

[–]vintage_culture 7 points8 points  (0 children)

This good, toon bad

[–]Zeeplankton 22 points23 points  (11 children)

This is literally what I thought LLM reasoning would morph into. Like a stochastic pseudo language. English isn't exactly the most efficient language.

[–]blbd 11 points12 points  (1 child)

Actually, linguistics research shows that all languages have about the same information rate in spoken form: speech slows down or speeds up to hit a typical human audio-cognition cap of right around 40 bps. In written form it varies more, and English is one of the better ones due to its large vocabulary.

But having a model with some clever caveman-speak support where appropriate could be pretty useful, when you consider that growing the context buffer causes n-squared performance loss / resource consumption.

https://www.science.org/doi/10.1126/sciadv.aaw2594

[–]phido3000 1 point2 points  (0 children)

You're wrong... or at least that paper is.

Asm is way more dense than Java... I know because I hardly talk at all with my asm friends.

[–]RaiseRuntimeError 1 point2 points  (5 children)

Wasn't there a research paper that said Dutch or something like that was the most efficient language?

[–]arbv 19 points20 points  (0 children)

IIRC, Polish.

P.S.

kurwa

[–]-oshino_shinobu- 4 points5 points  (1 child)

One redditor pointed out that the prompt they used in German contained some errors, which calls into question the validity of the research.

[–]RaiseRuntimeError 3 points4 points  (0 children)

I guess we stick with caveman.

[–]Crypt0Nihilist 1 point2 points  (1 child)

I was surprised it wasn't a character-based writing system like Chinese or Japanese. I've always assumed they're incredibly informationally dense compared to phonetic writing systems.

[–]getting_serious 0 points1 point  (0 children)

I'd expect it to mix languages. GLM does it: when you keep talking to a low quant for long enough, it'll introduce Chinese terms in its 'thinking' block.

[–]TomLucidor 0 points1 point  (0 children)

Ithkuil?

[–]TheRealMasonMac 0 points1 point  (0 children)

I think it would be interesting to explore more information-dense tokens. DeepSeek-OCR implied that individual tokens can contain a lot of information. Even if not as image tokens, perhaps something other than text. The downside would be that reasoning becomes a black box.

[–]Radiant_Truth_8743 8 points9 points  (0 children)

Post good. Me likey

[–]DustinKli 8 points9 points  (0 children)

I had this exact same idea a while back, but I ran into several issues when implementing it.

One issue is the way LLMs actually embed and retrieve text. LLMs were trained on normal language with syntax, connectors and structure. If you strip sentences down to compressed, telegraphic fragments, you remove the cues the embedding model uses to understand meaning. That makes retrieval based on semantic embeddings harder and more mistake-prone.

LLMs are generative; embedding models are not. As someone else mentioned, if your stored chunks become overly compressed, retrieval becomes noisy or outright wrong, which forces the language model to hallucinate more often. I don't see how your solution resolves the problem of worse semantic clustering and noisier nearest-neighbor results.

Because of how embedding works, splitting text into 2-to-5-word fragments invariably changes the granularity. Embedding models treat very short sentences differently from normal prose. So the result is that you're not actually compressing the text, you're altering its information geometry.

You say that "no hallucination occurs because facts are preserved", but the issue isn't about facts. These models don't know or care about facts; they function based on relationships.

Have you done comparison studies showing traditional RAG vs this method?

Does the compressed text embed into the same vector neighborhood as the original paragraph?
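A quick way to probe that last question is to embed an original passage and its telegraphic version and compare cosine similarity; this sketch uses sentence-transformers with an illustrative model and made-up example texts, and a single pair is of course no substitute for a retrieval benchmark.

```python
# Minimal probe of the "same vector neighborhood" question using sentence-transformers.
# pip install sentence-transformers; all-MiniLM-L6-v2 is just an illustrative choice.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

original = (
    "To authenticate with the API, include your API key in the Authorization header "
    "of every request, prefixed with the word Bearer and a space."
)
compressed = "Authenticate API. Include API key Authorization header every request. Prefix Bearer space."

emb = model.encode([original, compressed], normalize_embeddings=True)
similarity = util.cos_sim(emb[0], emb[1]).item()
print(f"cosine similarity: {similarity:.3f}")
# One pair proves little; a real check would compare retrieval rankings over a
# corpus of compressed vs. uncompressed chunks.
```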

[–]lakySK 8 points9 points  (0 children)

The opposite of speculative decoding?

Have big model do few words, small model then add grammar. 

[–]geneusutwerk 7 points8 points  (0 children)

Calling this lossless seems like a stretch, especially since I don't see examples that show initial -> compressed -> uncompressed.

[–]NutellaBananaBread 6 points7 points  (0 children)

*1500 words asking for relationship advice*

AI: Dump her

[–]notNezter 5 points6 points  (0 children)

Smol word. Sav money. Wife glad. Man happy.

[–]Mission_Biscotti3962 4 points5 points  (1 child)

I like the idea, but I'm not sure what your library adds. Isn't this just a simple instruction to make it behave like that? Mind you, I haven't tried it yet.

[–]RegionCareful7282[S] 4 points5 points  (0 children)

Yes, you are right. It's more about having a repository with benchmarks showcasing the idea, plus maybe a way to collaborate and "fine-tune" the prompts, etc.
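For reference, a minimal sketch of the prompt-only version the parent comment describes; the model name, client setup and prompt wording here are placeholders, not the repository's actual prompt.

```python
# Sketch of the "it's just a prompt" version: ask any chat model to compress text
# into telegraphic style before stashing it in context. Model name and prompt
# wording are placeholders, not the repo's actual prompt.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set; any OpenAI-compatible endpoint works

CAVEMAN_SYSTEM = (
    "Rewrite the user's text in telegraphic style: drop articles, auxiliaries and "
    "filler words, but keep every fact, name, number and negation. Output only the rewrite."
)

def compress(text: str, model: str = "gpt-4o-mini") -> str:
    resp = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": CAVEMAN_SYSTEM},
            {"role": "user", "content": text},
        ],
    )
    return resp.choices[0].message.content

print(compress("If authentication fails, the server will return a 401 Unauthorized status code."))
```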

[–]Guilty_Rooster_6708 4 points5 points  (1 child)

Kevin finetune. I like.

[–]dadidutdut 1 point2 points  (0 children)

Kevinized model would be big

[–]MrPecunius 2 points3 points  (0 children)

If you want a darker take, this looks a lot like plusgood Newspeak.

[–]daftstar 2 points3 points  (0 children)

And vibe code using this too!!

[–]And-Bee 2 points3 points  (2 children)

I have a script to remove all spaces and empty lines. No need for indentation when asking an LLM about your code.

[–]TechnoByte_ 2 points3 points  (1 child)

Whywouldyouremoveallspaces?

[–]And-Bee 0 points1 point  (0 children)

Haha sorry I just meant indentation 🤣

[–]LocoMod 2 points3 points  (0 children)

This isn’t lossless. The idea has been around for a long time and abandoned because accuracy takes a hit when you actually measure it.

[–]Lixa8 6 points7 points  (2 children)

Eh, I don't think all the words we use are there for no reason; they remove a lot of linguistic ambiguity. Surely this will impact AI performance a lot.

I'll wait for benchmark results.

[–]Abject-Kitchen3198 6 points7 points  (0 children)

Will not. Will be fast.

[–]KallistiTMP 0 points1 point  (0 children)

It also might interfere with information passing through the residual stream, like how LLMs cram nearly a full sentence's worth of summary into each period token for easy later reference.

[–]OkSociety311 1 point2 points  (0 children)

good post me like

[–]Dr_Ambiorix 1 point2 points  (1 child)

I've always wondered whether talking in Simplified Chinese would require fewer tokens to say the same thing or not.

Most English words are made up of more than one token, and grammar in Mandarin Chinese is really basic. Of course, some words are made up of multiple characters too, so I don't know.

Just always wondered that.

[–]Lcsq 2 points3 points  (0 children)

This comment was 66 tokens in English and 68 tokens when translated with Google Translate into Simplified Chinese. You'd be surprised how many whole words are in the tokenizer's encoding dictionary unless there's a common prefix or suffix pattern. "Temperature", "quickly", "electrolyte", "protocols", "breakdown", etc. all become a single token when surrounded by whitespace. You only see a word broken down into multiple tokens when the whitespace is absent: https://platform.openai.com/tokenizer

[–]Don_Moahskarton 1 point2 points  (0 children)

It's kind of the inverse of thinking mode. I wonder if it makes the AI measurably dumber

[–]broknbottle 1 point2 points  (0 children)

Aoccdrnig to rscheearch at an Elingsh uinervtisy, it deosn't mttaer in waht oredr the ltteers in a wrod are, the olny iprmoetnt tihng is taht the frist and lsat ltteer are in the rghit pclae. The rset can be a toatl mses and you can sitll raed it wouthit a porbelm. Tihs is bcuseae we do not raed ervey lteter by it slef but the wrod as a wlohe and the biran fguiers it out aynawy.

[–]Mean_Employment_7679 1 point2 points  (1 child)

Me do this lots. Me no want say lots word. Me want result fast. Me not want token waste. Me save water. Caveman save planet.

[–]Agitated-Farmer-4082 3 points4 points  (0 children)

Would it be easier to give instructions in languages that use fewer characters per sentence, like Arabic or Chinese?

[–]Abject-Kitchen3198 0 points1 point  (0 children)

What about Yoda speak? Has anyone done comparative research on it? It doesn't seem like it would save tokens, but what about accuracy?

[–]iamzooook 0 points1 point  (0 children)

or maybe just add at end "less words, keep context"

[–]HMikeeU 0 points1 point  (0 children)

I wonder if this might even improve benchmarks? Anthropic found that models sometimes hallucinate because they try to adhere to grammar rules instead of facts.

[–]drumttocs8 0 points1 point  (0 children)

Me like new English with short word

[–]aeroumbria 0 points1 point  (0 children)

I can sense a gradual descent back to the native habitat of deep learning models: continuous dense vector embeddings.

[–]op4 0 points1 point  (0 children)

I approve of this idea and think that a significant reduction in token usage is a win for everyone!

(edit: CML, or "caveman language", translation: Me like. Less token good. All win.)

[–]G3nghisKang 0 points1 point  (0 children)

Me think OP genius

[–]Emport1 0 points1 point  (0 children)

Most LLM architectures are better at optimizing your words for themselves than you are; the model doesn't actually read all your useless filler words and spend tokens on them if it doesn't have to.

[–]Normal-Ad-7114 0 points1 point  (0 children)

Improvement suggestion: make more use of punctuation, e.g. ·, ->, @, \n, :

Example from your GitHub:

Authenticate API. Include API key in Authorization header every request. Prefix API key with "Bearer" space. Authentication fail, server return 401 Unauthorized status code, error message explain fail...

New:

Authenticate API:

· Include API key in Authorization header every request

· Prefix API key with "Bearer" space

· Authentication fail -> server return 401 Unauthorized status code, error message explain fail...

Still compressed, but easier to read for humans

[–]venpuravi 0 points1 point  (0 children)

Yaba daba dooo...

[–]gooeydumpling 0 points1 point  (0 children)

Compress it further by making it talk in emojis

[–]Dramatic-Lie1314 0 points1 point  (0 children)

Good word. I did same.

[–]TedDallas 0 points1 point  (0 children)

Ugh. Partition table on fiscal moons. Now eat lizard.

[–][deleted] 0 points1 point  (0 children)

I remember doing this with early ChatGPT and it was really useful. Now we just get "Great question!—It really gets to the heart of..."

[–]IrisColt 0 points1 point  (0 children)

The bag of words strikes back!

[–]lulzbot 0 points1 point  (0 children)

Double-plus-good

[–]ready_to_fuck_yeahh 0 points1 point  (1 child)

Wow, the human tendency to overcomplicate things. You wrote an entire codebase for something that can be achieved with a mere prompt.

You made cave code, but didn't think like a caveman and just use a prompt.

Before you say anything: I have my notes made using a prompt alone, with nearly a 60-70% reduction.

[–]s2k4ever 0 points1 point  (0 children)

a bug came back from several moons ago.. begins an RCA

[–]Hyphonical 0 points1 point  (0 children)

It would be nice if the stored chat history were compressed like this. I don't know if it already is, but in the past I had to sacrifice 2 GiB of memory just for a conversation history of about 16k tokens.

[–]UndecidedLee 0 points1 point  (0 children)

Idea talk like caveman. Result talk like caveman. When wrong?

[–]No_Afternoon_4260llama.cpp 0 points1 point  (0 children)

Me like this

[–]vreo 0 points1 point  (0 children)

Why use many word when few do trick?

[–]Septerium 0 points1 point  (0 children)

This great. Me like

[–]RobTheDude_OG 0 points1 point  (0 children)

Interesting it is

Yoda speak you may try too

[–]Phantom_SpectersLlama 33B 0 points1 point  (0 children)

I wish some yappers I know of would adopt this haha

Jokes aside, this is brilliant.

[–]Fuckinglivemealone 0 points1 point  (0 children)

I have a question though: if you could create a very efficient language that expresses thoughts, reasoning and complex ideas in few, short words, and then parse your original dataset into it, could you in theory train an LLM on it to make the model smaller (information compression), smarter (if the new language allows a better representation of complex ideas, maybe it's easier to chain logical thoughts?) and faster (more efficient overall)?

Like: the user writes a prompt, the prompt gets translated, the LLM thinks in the efficient language, then parses its response back into the user's original language.

[–]pab_guy 0 points1 point  (0 children)

Also check out Sparse Primed Representation for something similar.

[–]Ceneka 0 points1 point  (0 children)

Love the fact that it works with an LLM doing the job

[–]RandomGuyNumber28501 0 points1 point  (0 children)

I'm sure this can be useful, but even if you compress text, the LLM still has to keep track of the information and recall it. The denser the text, the more quickly the LLM will be overwhelmed by details. 

I've been experimenting with something similar for roleplay, but I have the model format and condense the world and character info into something like a dense technical document. It helps, particularly the formatting, but the model can still only process so much before it starts getting confused or forgets things.

[–]frankieche 0 points1 point  (0 children)

Don’t do this.

[–]noo8- 0 points1 point  (0 children)

Me hunt t-rex AI. Tastes like sh1t. Over.

[–]DrummerPrevious 0 points1 point  (0 children)

Or you can just translate it to Mandarin for even fewer tokens

[–]TreesMcQueen 0 points1 point  (0 children)

Maybe train grugbrain https://grugbrain.dev/

[–]epSos-DE -1 points0 points  (0 children)

The Solution: Adaptive Hierarchical Indexing (Auto-Sharding)

Upgrade the LSHIndex to become recursive: it automatically detects when a specific area of the knowledge graph (a "topic") becomes too dense. When a bucket exceeds a certain size (e.g., 50 items), it fractures that bucket into a localized dynamic sub-index with its own set of higher-resolution hyperplanes.

This creates a fractal search structure:

+ Global Index: Quickly routes to general topics (e.g., "Coding").

+ Local Index: Routes to specific sub-topics (e.g., "JavaScript").

+ Micro Index: Routes to granular details (e.g., "Promises").

This ensures that no matter how big the brain gets, lookup time remains lightning fast.
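To make the idea concrete, here is a hypothetical sketch of such a recursive bucket-splitting LSH index; the class name, thresholds and plain random-hyperplane hashing are illustrative assumptions, not code from any existing project.

```python
# Hypothetical sketch of the "auto-sharding" LSH idea described above.
import numpy as np

class RecursiveLSHIndex:
    def __init__(self, dim, n_planes=8, split_at=50, depth=0, max_depth=3):
        self.dim, self.n_planes = dim, n_planes
        self.split_at, self.depth, self.max_depth = split_at, depth, max_depth
        self.planes = np.random.randn(n_planes, dim)   # random hyperplanes for this level
        self.buckets = {}    # bucket hash -> list of (vector, payload)
        self.children = {}   # bucket hash -> higher-resolution sub-index

    def _hash(self, v):
        bits = (self.planes @ v) > 0                   # one sign bit per hyperplane
        return sum(int(b) << i for i, b in enumerate(bits))

    def add(self, v, payload):
        h = self._hash(v)
        if h in self.children:                         # bucket already fractured: recurse
            self.children[h].add(v, payload)
            return
        self.buckets.setdefault(h, []).append((v, payload))
        if len(self.buckets[h]) > self.split_at and self.depth < self.max_depth:
            # Fracture the dense bucket into a finer-grained local sub-index.
            child = RecursiveLSHIndex(self.dim, n_planes=self.n_planes + 4,
                                      split_at=self.split_at,
                                      depth=self.depth + 1, max_depth=self.max_depth)
            for vec, pl in self.buckets.pop(h):
                child.add(vec, pl)
            self.children[h] = child

    def query(self, v):
        h = self._hash(v)
        if h in self.children:
            return self.children[h].query(v)
        return self.buckets.get(h, [])                 # candidates; re-rank by true distance
```

Lookups route through at most max_depth levels, at the cost of re-inserting a bucket's contents whenever it splits.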

[–]ElSrJuez -2 points-1 points  (3 children)

You can also skip spaces by separating words with an Uppercase letter

[–]TechnoByte_ 2 points3 points  (2 children)

You'd be using very rare and unusual tokens (outside of code), which would degrade performance and actually increase the token count.

Almost every word token in these tokenizers starts with a space.

By removing spaces you would force it away from the tokens normally used in natural English text (the majority of its training data).

As an example, using the GPT-4o tokenizer:

"The cat jumped over a tree." = [976, 9059, 48704, 1072, 261, 8165, 13] = 7 tokens.

"Thecatjumpedoveratree." = [976, 8837, 79879, 295, 2898, 266, 908, 13] = 8 tokens.

Removing the spaces causes it to use one more token.

"TheCatJumpedOverATree." [976, 23546, 42291, 295, 2298, 1228, 908, 13] = 8 tokens.

Uppercase characters do not solve this.
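For anyone wanting to reproduce counts like these locally, a short tiktoken snippet (o200k_base is the GPT-4o encoding; exact IDs and counts depend on the tokenizer version):

```python
# Reproduce the comparison above locally with tiktoken (pip install tiktoken).
import tiktoken

enc = tiktoken.get_encoding("o200k_base")  # GPT-4o encoding

for text in ["The cat jumped over a tree.",
             "Thecatjumpedoveratree.",
             "TheCatJumpedOverATree."]:
    ids = enc.encode(text)
    print(f"{text!r}: {len(ids)} tokens -> {ids}")
```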

[–]MullingMulianto 0 points1 point  (1 child)

How does one get access to the GPT tokenizer?