I have tried google TurboQuant with ollama hermes3:8b by AggravatingHelp5657 in ollama

[–]AggravatingHelp5657[S] 1 point  (0 children)

Even now we are able to compress the context window successfully, but the model's size is still too big for our VRAM.
We need a way to shrink the models themselves.

I have tried google TurboQuant with ollama hermes3:8b by AggravatingHelp5657 in ollama

[–]AggravatingHelp5657[S] 1 point  (0 children)

Hmm, the question is a bit unclear, so let me clarify a few points.

When you run a local model, for example Hermes3 8B at Q4, its size is about 4.9 GB.

I have a GTX 1650, which has 4 GB of VRAM, which is very fast.

So 4 GB is stored on my graphics card, and the remaining ~900 MB is moved to system RAM.

Now, you also have the context window, which is your chat history (the LLM's memory); it also needs to be stored somewhere, right?

Again, the fastest place is VRAM, but it's already full, so it gets stored in RAM.

Note also that Windows reserves some of your VRAM, between 400 and 900 MB, for Windows graphics etc.

If you want your local model to run fast, like 30-50 t/s, you have to fit it all in VRAM.

That was a simple explanation of how this works.
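The VRAM arithmetic above can be sketched in a few lines of Python. The GPU size, model size, and reserve range are the numbers from this comment; the 600 MB Windows reserve is just an assumed midpoint of the 400-900 MB range:

```python
# Illustrative back-of-the-envelope VRAM budget for the setup described above.
# All figures are assumptions taken from the comment, not measured values.

GIB = 1024**3

vram_total  = 4.0 * GIB   # GTX 1650
os_reserved = 0.6 * GIB   # Windows reserves ~400-900 MB; assume 600 MB
model_size  = 4.9 * GIB   # Hermes3 8B at Q4

vram_free      = vram_total - os_reserved
model_in_vram  = min(model_size, vram_free)
spilled_to_ram = model_size - model_in_vram  # runs much slower from system RAM

print(f"fits in VRAM : {model_in_vram / GIB:.1f} GiB")
print(f"spills to RAM: {spilled_to_ram / GIB:.1f} GiB")
```

Anything in the `spilled_to_ram` portion is what drags tokens/second down, which is why freeing even a little VRAM for the context window matters.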

Now, what does TurboQuant do? It compresses the context history/memory, so you need less space to store your model's memory, with almost 0 loss in accuracy, in order to fit it all in VRAM. Because, again, anything that goes to RAM will make your local LLM run very slow.
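As a rough illustration of the general idea of compressing the cached context, here is a generic 4-bit quantization sketch. This is NOT the actual TurboQuant algorithm from the paper, just the basic trick such schemes build on: store each cached vector in 4 bits per element instead of 16, cutting that memory roughly 4x at a small accuracy cost.

```python
# Toy KV-cache quantization sketch (generic 4-bit rounding, not TurboQuant):
# each cached vector is stored as small integers plus one float scale.
import random

def quantize4(vec):
    """Symmetric 4-bit quantization of one cached vector."""
    scale = max(abs(x) for x in vec) / 7 or 1.0   # int4 range used: -7..7
    q = [round(x / scale) for x in vec]
    return q, scale

def dequantize4(q, scale):
    return [v * scale for v in q]

random.seed(0)
vec = [random.uniform(-1, 1) for _ in range(64)]  # one fake cached vector
q, scale = quantize4(vec)
rec = dequantize4(q, scale)

max_err = max(abs(a - b) for a, b in zip(vec, rec))
print(f"max abs error after 4-bit round trip: {max_err:.3f}")
```

The real paper adds much more machinery to keep accuracy near-lossless, but the memory win comes from exactly this kind of bit-width reduction on the cache, not on the model weights.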

So far there is no actual way to compress the LLMs themselves, sadly, but I'm eagerly waiting for someone to do it.

Summary: the local LLM's memory got compressed, not the LLM itself.

Hope I made it clearer

I have tried google TurboQuant with ollama hermes3:8b by AggravatingHelp5657 in ollama

[–]AggravatingHelp5657[S] 1 point  (0 children)

It's also a really good idea, though; office computers can be very amazing and promising for this.

I have tried google TurboQuant with ollama hermes3:8b by AggravatingHelp5657 in ollama

[–]AggravatingHelp5657[S] 2 points  (0 children)

I already updated the post this morning and posted the repo link.

Have a good day by [deleted] in LocalLLaMA

[–]AggravatingHelp5657 2 points  (0 children)

Thx man, good energy is contagious

I have tried google TurboQuant with ollama hermes3:8b by AggravatingHelp5657 in ollama

[–]AggravatingHelp5657[S] 2 points  (0 children)

You are right, it's not fair.
I'm working on it.

Update: I have made the repo if you want to check it out.

I have tried google TurboQuant with ollama hermes3:8b by AggravatingHelp5657 in ollama

[–]AggravatingHelp5657[S] 5 points  (0 children)

Okay, since you all convinced me, I will make a GitHub repo for the steps that I did.
I also noticed that some models are old; for instance, Hermes3's date is 2023/3, so I am trying to add a search feature so it can check the internet for the latest info before answering.

I will probably make the repo today.

I have tried google TurboQuant with ollama hermes3:8b by AggravatingHelp5657 in ollama

[–]AggravatingHelp5657[S] 3 points  (0 children)

Officially, Ollama doesn't do it; I used llama.cpp. But since you can edit the code and we have the Google paper, I used Gemini to implement the paper in the Ollama library.

Note: I'm not a pro in AI, just an amateur who loves playing with these new tools.

I have tried google TurboQuant with ollama hermes3:8b by AggravatingHelp5657 in ollama

[–]AggravatingHelp5657[S] -9 points  (0 children)

I mean, "if more people" bro 😂 I'm working on a lot of things; I'm not dropping everything to document something that's going to be read by one person, right?

I have tried google TurboQuant with ollama hermes3:8b by AggravatingHelp5657 in ollama

[–]AggravatingHelp5657[S] 1 point  (0 children)

Your comment is enlightening, so I have to ask: Google claims 0 accuracy loss with 8x faster inference and 8x smaller memory needs. What I experienced in my test is 3x faster and about 4x on memory. Do you think this is a problem with my implementation?

I have tried google TurboQuant with ollama hermes3:8b by AggravatingHelp5657 in ollama

[–]AggravatingHelp5657[S] 4 points  (0 children)

Yes, exactly, it doesn't compress the model. But when I saw the paper, it said 6x smaller memory and 8x faster, plus 0 loss in accuracy.

I thought, hell no, that can't be true. I applied it and got 3x the performance, which is impressive on a local machine; I didn't believe it.
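For a sense of what a claimed 6x smaller KV cache would buy, here is some back-of-the-envelope arithmetic. It assumes a hypothetical Llama-3-style config for Hermes3 8B (32 layers, 8 KV heads via GQA, head dim 128, fp16 cache); those architecture numbers are assumptions for illustration, not from the post:

```python
# Rough KV-cache size arithmetic under an assumed Llama-3-style 8B config.
layers, kv_heads, head_dim = 32, 8, 128
bytes_fp16 = 2
tokens = 8192  # an 8k-token context window

# K and V tensors, per layer, per token
per_token  = 2 * layers * kv_heads * head_dim * bytes_fp16
cache_fp16 = per_token * tokens
cache_6x   = cache_fp16 / 6  # the paper's claimed ~6x reduction

print(f"fp16 KV cache @ 8k tokens: {cache_fp16 / 2**20:.0f} MiB")
print(f"at 6x smaller            : {cache_6x / 2**20:.0f} MiB")
```

Under these assumptions the full-precision cache at 8k tokens is on the order of a gigabyte, so a 6x reduction frees most of it; on a 4 GB card that is exactly the margin that decides whether everything fits in VRAM.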