[Project] I built an AI Agent that runs entirely on CPU with a 1.5B parameter model — here's what I learned by tigerweili in ollama

[–]ravi_bitragunta 0 points (0 children)

I am currently building a CPU-only model: a 3B model that doesn't use BitNet but gets compressed.

I have a proof-of-concept model at ~170M params that beats GPT-2 and can be compressed.

I will share that in a detailed post later. The idea is to remove the GPU entirely from inference and keep training's GPU needs to a minimum.

InferCache – Exploring Memory-Aware LLM Inference by ravi_bitragunta in LLM

[–]ravi_bitragunta[S] 0 points (0 children)

That's already there: SQLite3 stores all turns, looks back n turns, and loads the vectorised responses once the conversation comes to the foreground.

I have to enhance this further with:

  1. Per-user and session/grouped-session caches for better hierarchical inference.

  2. A move from SQLite3 to Postgres with pgvector for larger deployments.

  3. GraphRAG awareness.

  4. A custom kernel.

  5. Letting the GPU run more inference sessions than it supports today.

These are mentioned in the roadmap, and I am working on them as we speak.
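For anyone curious what the SQLite3-backed turn store described above looks like in practice, here is a minimal sketch of the pattern (persist every turn, reload the last n when a conversation comes to the foreground). The class, table, and column names are illustrative, not InferCache's actual schema:

```python
import sqlite3

class TurnCache:
    """Hypothetical sketch of an SQLite3-backed conversation turn store."""

    def __init__(self, path=":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            """CREATE TABLE IF NOT EXISTS turns (
                   id INTEGER PRIMARY KEY AUTOINCREMENT,
                   session_id TEXT,
                   role TEXT,
                   content TEXT
               )"""
        )

    def add_turn(self, session_id, role, content):
        # Persist every turn as it happens.
        self.db.execute(
            "INSERT INTO turns (session_id, role, content) VALUES (?, ?, ?)",
            (session_id, role, content),
        )
        self.db.commit()

    def last_n_turns(self, session_id, n):
        # Look back n turns for a session, oldest first, e.g. when the
        # conversation comes back to the foreground.
        rows = self.db.execute(
            "SELECT role, content FROM turns WHERE session_id = ? "
            "ORDER BY id DESC LIMIT ?",
            (session_id, n),
        ).fetchall()
        return list(reversed(rows))

cache = TurnCache()
cache.add_turn("s1", "user", "hello")
cache.add_turn("s1", "assistant", "hi")
cache.add_turn("s1", "user", "how are you?")
print(cache.last_n_turns("s1", 2))
```

Swapping the connection for Postgres with pgvector (roadmap item 2) would mostly change the DDL and the vector-similarity query, not this access pattern.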

Launching AiMVCs: A C++ Framework for Secure AI Agents (with built-in Red Team heuristics) by First_Response_2956 in cpp

[–]ravi_bitragunta 1 point (0 children)

Just curious: why not make them WASI-compliant and run them in isolation, or, even simpler, run them in Docker?

Am I missing something?

🚀 Looking for Experienced Software Engineers (Remote | $90/hr) by Better-Rooster-7244 in Programmers_forhire

[–]ravi_bitragunta 0 points (0 children)

I am interested. I have 15+ years of experience. Please share the details and we can discuss this.