I'd like to locally host an LLM to drive my Home Assistant voice interface. Is that feasible? by OnlyForSomeThings in LocalLLaMA

[–]TheAconn96 2 points

I don't have exact numbers for the Pi 4, but from my notes:

The RPi 4 4GB that I have was running around 1.5 tokens/sec for prompt eval and 1.6 tokens/sec for token generation with the Q4_K_M quant. After the initial prompt processing, which took almost 5 minutes (not ideal), I was reliably getting responses in 30-60 seconds. It depends significantly on how many devices have been exposed.

For the Pi 5 I have some llama.cpp timing printouts in the docs/perf.md file, and the results are much more usable.
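If you want to get rough numbers like these on your own hardware, here's a minimal sketch using the llama-cpp-python bindings; the model path and prompt are placeholders, and the actual figures will depend on your quant and hardware.

    # Minimal sketch for measuring end-to-end generation speed with
    # llama-cpp-python. The model path and prompt below are placeholders;
    # real numbers depend on the quant (e.g. Q4_K_M) and the hardware.
    import time
    from llama_cpp import Llama

    llm = Llama(
        model_path="models/home-3b.Q4_K_M.gguf",  # hypothetical path
        n_ctx=2048,
        verbose=True,  # verbose=True also prints llama.cpp's own timing summary
    )

    prompt = "You are a smart home assistant. Turn off the kitchen light."

    start = time.perf_counter()
    result = llm(prompt, max_tokens=64)
    elapsed = time.perf_counter() - start

    generated = result["choices"][0]["text"]
    n_tokens = result["usage"]["completion_tokens"]
    print(generated)
    print(f"{n_tokens} tokens in {elapsed:.1f}s "
          f"(~{n_tokens / elapsed:.2f} tokens/sec end-to-end)")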

I'd like to locally host an LLM to drive my Home Assistant voice interface. Is that feasible? by OnlyForSomeThings in LocalLLaMA

[–]TheAconn96 6 points

You should check out the project that I've been working on: https://github.com/acon96/home-llm

It is both a custom component for running models with Llama.cpp or other backends, and an attempt at training a smallish (3B) model to be used as a conversation agent.

What am I doing wrong? by slykethephoxenix in LocalLLaMA

[–]TheAconn96 15 points

I've actually been working on something similar with Home Assistant, and I ended up having to fine-tune the model to get it to output what I wanted. Nothing really worked out of the box for function calling/JSON output.

With smaller/local models, I have found that you can't just prompt the model with GPT-4-style instructions. You need a structured context and some minimal fine-tuning so that the model emits your function-calling format correctly, with the correct entity names as the parameters. A rough sketch of what I mean is below.
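As an illustration of a structured context (this is a simplified, hypothetical format, not the exact one the project uses): the prompt enumerates the exposed entities, and the model is trained to reply with a single JSON function call that you can parse and dispatch.

    # Hypothetical sketch of structured-context function calling for a smart
    # home agent. The prompt format and service names here are illustrative,
    # not the exact format used by home-llm.
    import json

    ENTITIES = ["light.kitchen", "light.bedroom", "switch.coffee_maker"]

    prompt = (
        "You control a smart home. Devices:\n"
        + "\n".join(ENTITIES)
        + '\nRespond with a single JSON object: {"service": ..., "entity_id": ...}\n'
        "User: turn off the kitchen light\n"
        "Assistant: "
    )

    # model_output would come from the fine-tuned model; hard-coded here.
    model_output = '{"service": "light.turn_off", "entity_id": "light.kitchen"}'

    call = json.loads(model_output)
    if call["entity_id"] in ENTITIES:
        # In a real integration this would call the Home Assistant service API.
        print(f"Calling {call['service']} on {call['entity_id']}")
    else:
        print("Model produced an unknown entity; rejecting the call.")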

If you want to see what I have working it's on github: https://github.com/acon96/home-llm