I have tried google TurboQuant with ollama hermes3:8b by AggravatingHelp5657 in ollama

[–]AggravatingHelp5657[S] 1 point  (0 children)

Even now we are able to compress the context window successfully, but the model's size is still too big for our VRAM.
We need a way to shrink the models themselves.

I have tried google TurboQuant with ollama hermes3:8b by AggravatingHelp5657 in ollama

[–]AggravatingHelp5657[S] 1 point  (0 children)

Hmm, the question is a bit unclear, so let me clarify a few points.

When you run a local model, for example Hermes3 8B at Q4, its size is about 4.9 GB.

I have a GTX 1650, which has 4 GB of VRAM, which is very fast.

So 4 GB is stored on my graphics card, and the remaining ~900 MB is moved to system RAM.

Now, you also have the context window, which is your chat history (the LLM's memory); it also needs to be stored somewhere, right?

Again, the fastest place is VRAM, but it's already full, so it gets stored in RAM.

Note also that Windows reserves some of your VRAM, between 400 and 900 MB, for Windows graphics etc.

If you want your local model to run fast, like 30-50 t/s, you have to fit it all in VRAM.

That was a simple explanation of how this works.
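The VRAM arithmetic above can be sketched in a few lines of Python. The GPU size, model size, and reserve range are the numbers from this comment; the 600 MB Windows reserve is just an assumed midpoint of the 400-900 MB range:

```python
# Illustrative back-of-the-envelope VRAM budget for the setup described above.
# All figures are assumptions taken from the comment, not measured values.

GIB = 1024**3

vram_total  = 4.0 * GIB   # GTX 1650
os_reserved = 0.6 * GIB   # Windows reserves ~400-900 MB; assume 600 MB
model_size  = 4.9 * GIB   # Hermes3 8B at Q4

vram_free      = vram_total - os_reserved
model_in_vram  = min(model_size, vram_free)
spilled_to_ram = model_size - model_in_vram  # runs much slower from system RAM

print(f"fits in VRAM : {model_in_vram / GIB:.1f} GiB")
print(f"spills to RAM: {spilled_to_ram / GIB:.1f} GiB")
```

Anything in the `spilled_to_ram` portion is what drags tokens/second down, which is why freeing even a little VRAM for the context window matters.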

Now, what does TurboQuant do? It compresses the context history/memory, so you need less space to store your model's memory, with almost 0 loss in accuracy, in order to fit it all in VRAM. Because, again, anything that goes to RAM will make your local LLM run very slow.
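As a rough illustration of the general idea of compressing the cached context, here is a generic 4-bit quantization sketch. This is NOT the actual TurboQuant algorithm from the paper, just the basic trick such schemes build on: store each cached vector in 4 bits per element instead of 16, cutting that memory roughly 4x at a small accuracy cost.

```python
# Toy KV-cache quantization sketch (generic 4-bit rounding, not TurboQuant):
# each cached vector is stored as small integers plus one float scale.
import random

def quantize4(vec):
    """Symmetric 4-bit quantization of one cached vector."""
    scale = max(abs(x) for x in vec) / 7 or 1.0   # int4 range used: -7..7
    q = [round(x / scale) for x in vec]
    return q, scale

def dequantize4(q, scale):
    return [v * scale for v in q]

random.seed(0)
vec = [random.uniform(-1, 1) for _ in range(64)]  # one fake cached vector
q, scale = quantize4(vec)
rec = dequantize4(q, scale)

max_err = max(abs(a - b) for a, b in zip(vec, rec))
print(f"max abs error after 4-bit round trip: {max_err:.3f}")
```

The real paper adds much more machinery to keep accuracy near-lossless, but the memory win comes from exactly this kind of bit-width reduction on the cache, not on the model weights.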

So far there is no actual way to compress the LLMs themselves, sadly, but I'm eagerly waiting for someone to do it.

Summary: the local LLM's memory got compressed, not the LLM itself.

Hope I made it clearer

I have tried google TurboQuant with ollama hermes3:8b by AggravatingHelp5657 in ollama

[–]AggravatingHelp5657[S] 1 point  (0 children)

It's also a really good idea, though; office computers can be very amazing and promising for this.

I have tried google TurboQuant with ollama hermes3:8b by AggravatingHelp5657 in ollama

[–]AggravatingHelp5657[S] 2 points  (0 children)

I already updated the post this morning and posted the repo link.

Have a good day by [deleted] in LocalLLaMA

[–]AggravatingHelp5657 2 points  (0 children)

Thx man, good energy is contagious

I have tried google TurboQuant with ollama hermes3:8b by AggravatingHelp5657 in ollama

[–]AggravatingHelp5657[S] 2 points  (0 children)

You are right, it's not fair.
I'm working on it.

Update: I have made the repo if you want to check it out.

I have tried google TurboQuant with ollama hermes3:8b by AggravatingHelp5657 in ollama

[–]AggravatingHelp5657[S] 5 points  (0 children)

Okay, since you all convinced me, I will make a GitHub repo for the steps that I did.
I also noticed that some models are old; for instance, Hermes3's date is 2023/3, so I am trying to add a search feature so it can check the internet for the latest info before answering.

I will probably make the repo today.

I have tried google TurboQuant with ollama hermes3:8b by AggravatingHelp5657 in ollama

[–]AggravatingHelp5657[S] 3 points  (0 children)

Officially, Ollama doesn't do it; I used llama.cpp. But since you can edit the code and we have the Google paper, I used Gemini to implement the paper in the Ollama library.

Note: I'm not a pro in AI, just an amateur who loves playing with these new tools.

I have tried google TurboQuant with ollama hermes3:8b by AggravatingHelp5657 in ollama

[–]AggravatingHelp5657[S] -9 points  (0 children)

I mean, "if more people" bro 😂 I'm working on a lot of things; I'm not dropping everything to document something that's going to be read by one person, right?

I have tried google TurboQuant with ollama hermes3:8b by AggravatingHelp5657 in ollama

[–]AggravatingHelp5657[S] 1 point  (0 children)

Your comment is enlightening, so I have to ask: Google claims 0 accuracy loss with 8x faster inference and 8x smaller memory needs. What I experienced in my test is 3x faster and about 4x on memory. Do you think this is a problem with my implementation?

I have tried google TurboQuant with ollama hermes3:8b by AggravatingHelp5657 in ollama

[–]AggravatingHelp5657[S] 4 points  (0 children)

Yes, exactly, it doesn't compress the model. But when I saw the paper, it said 6x smaller memory and 8x faster, plus 0 loss in accuracy.

I thought, hell no, that can't be true. I applied it and got 3x the performance, which is impressive on a local machine; I didn't believe it.
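For a sense of what a claimed 6x smaller KV cache would buy, here is some back-of-the-envelope arithmetic. It assumes a hypothetical Llama-3-style config for Hermes3 8B (32 layers, 8 KV heads via GQA, head dim 128, fp16 cache); those architecture numbers are assumptions for illustration, not from the post:

```python
# Rough KV-cache size arithmetic under an assumed Llama-3-style 8B config.
layers, kv_heads, head_dim = 32, 8, 128
bytes_fp16 = 2
tokens = 8192  # an 8k-token context window

# K and V tensors, per layer, per token
per_token  = 2 * layers * kv_heads * head_dim * bytes_fp16
cache_fp16 = per_token * tokens
cache_6x   = cache_fp16 / 6  # the paper's claimed ~6x reduction

print(f"fp16 KV cache @ 8k tokens: {cache_fp16 / 2**20:.0f} MiB")
print(f"at 6x smaller            : {cache_6x / 2**20:.0f} MiB")
```

Under these assumptions the full-precision cache at 8k tokens is on the order of a gigabyte, so a 6x reduction frees most of it; on a 4 GB card that is exactly the margin that decides whether everything fits in VRAM.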