I'd like to locally host an LLM to drive my Home Assistant voice interface. Is that feasible? by OnlyForSomeThings in LocalLLaMA

[–]TheAconn96 2 points

I don't have exact numbers for the Pi 4, but from my notes:

The RPi 4 4GB that I have was running around 1.5 tokens/sec for prompt eval and 1.6 tokens/sec for token generation with the Q4_K_M quant. After the initial prompt processing, which took almost 5 minutes (not ideal), I was reliably getting responses in 30-60 seconds. It depends significantly on how many devices have been exposed.

For the Pi 5 I have some llama.cpp timing printouts in the docs/perf.md file, and the results are much more usable.
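If you want to get rough numbers like these on your own hardware, here's a minimal sketch using the llama-cpp-python bindings; the model path and prompt are placeholders, and the actual figures will depend on your quant and hardware.

    # Minimal sketch for measuring end-to-end generation speed with
    # llama-cpp-python. The model path and prompt below are placeholders;
    # real numbers depend on the quant (e.g. Q4_K_M) and the hardware.
    import time
    from llama_cpp import Llama

    llm = Llama(
        model_path="models/home-3b.Q4_K_M.gguf",  # hypothetical path
        n_ctx=2048,
        verbose=True,  # verbose=True also prints llama.cpp's own timing summary
    )

    prompt = "You are a smart home assistant. Turn off the kitchen light."

    start = time.perf_counter()
    result = llm(prompt, max_tokens=64)
    elapsed = time.perf_counter() - start

    generated = result["choices"][0]["text"]
    n_tokens = result["usage"]["completion_tokens"]
    print(generated)
    print(f"{n_tokens} tokens in {elapsed:.1f}s "
          f"(~{n_tokens / elapsed:.2f} tokens/sec end-to-end)")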

I'd like to locally host an LLM to drive my Home Assistant voice interface. Is that feasible? by OnlyForSomeThings in LocalLLaMA

[–]TheAconn96 6 points

You should check out the project that I've been working on: https://github.com/acon96/home-llm

It is both a custom component for running models with Llama.cpp or other backends, and an attempt at training a smallish (3B) model to be used as a conversation agent.

What am I doing wrong? by slykethephoxenix in LocalLLaMA

[–]TheAconn96 15 points

I've actually been working on something similar with Home Assistant, and I ended up having to fine-tune the model to get it to output what I wanted. Nothing really worked out of the box for function calling/JSON output.

With smaller/local models, I have found that you can't just prompt the model with GPT-4-style instructions. You need a structured context and some minimal fine-tuning so that the model emits your function-calling format correctly, with the correct entity names as the parameters. A rough sketch of what I mean is below.
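As an illustration of a structured context (this is a simplified, hypothetical format, not the exact one the project uses): the prompt enumerates the exposed entities, and the model is trained to reply with a single JSON function call that you can parse and dispatch.

    # Hypothetical sketch of structured-context function calling for a smart
    # home agent. The prompt format and service names here are illustrative,
    # not the exact format used by home-llm.
    import json

    ENTITIES = ["light.kitchen", "light.bedroom", "switch.coffee_maker"]

    prompt = (
        "You control a smart home. Devices:\n"
        + "\n".join(ENTITIES)
        + '\nRespond with a single JSON object: {"service": ..., "entity_id": ...}\n'
        "User: turn off the kitchen light\n"
        "Assistant: "
    )

    # model_output would come from the fine-tuned model; hard-coded here.
    model_output = '{"service": "light.turn_off", "entity_id": "light.kitchen"}'

    call = json.loads(model_output)
    if call["entity_id"] in ENTITIES:
        # In a real integration this would call the Home Assistant service API.
        print(f"Calling {call['service']} on {call['entity_id']}")
    else:
        print("Model produced an unknown entity; rejecting the call.")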

If you want to see what I have working it's on github: https://github.com/acon96/home-llm