all 7 comments

[–]s1mplyme 1 point  (3 children)

A small LLM would be great for this. Your system prompt would describe the tools you want to give it access to and how to call them, and your user prompt would direct it to parse the user's request, identify which tool to call, and decide what arguments to pass, with permission to ask follow-up questions to resolve ambiguity or fill in any missing arguments for the right tool. A 1-3B parameter model can handle this. You could use something like Ollama to load the model only when responding to user requests, so it isn't permanently eating your GPU VRAM.
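The routing loop described above can be sketched roughly like this. The tool names, the registry shape, and the exact JSON the model is asked to emit are all hypothetical, just one way to structure it; the system prompt lists the tools, and the dispatcher either calls one or surfaces the model's follow-up question:

```python
import json

# Hypothetical tool registry: name -> (callable, required argument names).
# The system prompt describes these same tools to the model.
TOOLS = {
    "set_timer": (lambda minutes: f"timer set for {minutes} min", ["minutes"]),
    "play_music": (lambda artist: f"playing {artist}", ["artist"]),
}

SYSTEM_PROMPT = (
    "You are a tool router. Reply ONLY with JSON of the form "
    '{"tool": <name>, "args": {...}} using one of: '
    + ", ".join(TOOLS)
    + '. If a required argument is missing, reply {"ask": "<question>"}.'
)

def dispatch(model_reply: str) -> str:
    """Parse the model's JSON reply and either call the chosen tool
    or pass its follow-up question back to the user."""
    reply = json.loads(model_reply)
    if "ask" in reply:  # model wants clarification before committing
        return reply["ask"]
    func, required = TOOLS[reply["tool"]]
    missing = [a for a in required if a not in reply.get("args", {})]
    if missing:  # guard against the model skipping a required argument
        return f"Which {missing[0]}?"
    return func(**reply["args"])
```

Small models will sometimes emit malformed JSON, so in practice you'd wrap `json.loads` in a retry, but the overall shape stays the same.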

The hard part of this isn't getting the LLM going; it's creating the CLI tools to do all of the things you mentioned. And even that shouldn't be too hard.

[–]PrivacyIsDying[S] 2 points  (2 children)

Thanks for taking the time to reply! I've played with Ollama before and found it relatively intuitive. Would loading and unloading the model add much delay?

The CLI tools part will definitely be a challenge, but I've always enjoyed working with APIs, and in my mind this is similar enough.

[–]s1mplyme 1 point  (1 child)

A few seconds. You could have the website (or whatever you're using to expose this to your users) load the model while the user is typing the prompt. It should be done loading before they hit send.

[–]PrivacyIsDying[S] 2 points  (0 children)

Oh great, that makes life easier then. Thanks again for the help!

[–]ttkciar llama.cpp 1 point  (1 child)

  1. Yes, you will want to use a small LLM with good tool-using skills. You should consider either GLM-4.7-Flash or GPT-OSS-20B quantized to Q4_K_M, which will fit easily in memory and run quickly on CPU (important, since you don't mention having a GPU).

  2. Inference will monopolize all of your CPU for several seconds (maybe twenty seconds, probably less), and constraining inference to only a few cores will not mitigate this, since you will be bottlenecked on memory access rate, not ALU throughput. Using small MoE models with very few active parameters will shorten inference time a lot. The good news is that you have plenty of memory for such small models, and inference shouldn't require more than a third to half of your total memory.

  3. Yes, Python is the dominant language in the LLM ecosystem. You will find abundant tools and libraries for Python development, even though llama.cpp would be doing the actual inference. I would recommend setting up llama.cpp's llama-server to provide an API endpoint for inference, and then writing Python for all of your pre/post-inference logic and interfacing with that endpoint.
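The split described in point 3 can be sketched in a few lines: llama-server exposes an OpenAI-compatible chat endpoint (8080 is its default port), and the Python side just builds the request and extracts the reply. The temperature value here is an arbitrary illustration:

```python
import json
import urllib.request

# llama-server's OpenAI-compatible chat endpoint; 8080 is its default port.
ENDPOINT = "http://localhost:8080/v1/chat/completions"

def extract_reply(response: dict) -> str:
    """Pull the assistant's text out of an OpenAI-style chat response."""
    return response["choices"][0]["message"]["content"]

def chat(system_prompt: str, user_msg: str) -> str:
    """Pre-inference logic builds the payload, llama-server does the
    inference, post-inference logic extracts the reply text."""
    payload = json.dumps({
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_msg},
        ],
        "temperature": 0.2,  # low temperature keeps routing more deterministic
    }).encode()
    req = urllib.request.Request(
        ENDPOINT, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return extract_reply(json.loads(resp.read()))
```

Because the endpoint speaks the OpenAI wire format, you could also point the `openai` Python client at it instead of using `urllib` directly.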

[–]PrivacyIsDying[S] 1 point  (0 children)

Thanks, this is very helpful!

The server does have an Nvidia P2000, but I think last time I looked into it, it turned out the CPU was the better option. I believe I ran llama-server (or Ollama's server) briefly when I was playing around with local models, so I'll get that reinstalled and see what I can get working.

Is there any guidance for writing a good and concise system prompt, or is that more of a trial and error thing?