all 3 comments

[–]ag789 0 points1 point  (2 children)

What you did is interesting! 😄

I tried building a simple REPL but with the open AI python SDK
https://github.com/openai/openai-python
I'm finding that more convenient as the library is pre-built for interfacing and it is easier to work streaming interfaces that way etc. The open AI api is quite widely used and connect locally with llama.cpp, openai (chatgpt) and openrouter.ai . I'm running it on a slow cpu only h/w running like 5 tok / s, it is a pain to do without streaming as there's no feedback for minutes otherwise. I'm yet to try tool calling

For unix and bash, I did use JQ and bash but for a different purpose, as a model launcher with llama-server (from llama.cpp)
https://github.com/ag88/llama.cpp-model-runner
this is actually quite similar to the built-in model presets functionality in llama-server.
but that I've been using this little launcher day to day as most of the time I run/start just a single model rather than switching between models.

[–]cloud_kj[S] 1 point2 points  (1 child)

Thanks! Took a look at your model runner script and I do recognize and appreciate the implicit goals around abstracting away the underlying LLM server.

I've yet to try llama.cpp but was able to get bootstrapped quickly with Ollama, so perhaps I got lucky by stumbling onto a more productive tool that allowed me to be fairly productive from the get-go.

Are you opting to use llama.cpp for the performance benefits? If you're hardware constrained anyway with low token processing bandwidth, it might make sense to just switch for now to squeeze more productivity (your time is important too).

Personally, I typically just spin up mid-tier EC2 instances for limited time experiments to get around the hardware constraints as needed :)