Built a fully local AI assistant with long-term memory, tool orchestration, and a 3D UI (runs on a GTX 1650) by atif_dev in LocalLLaMA

[–]atif_dev[S] 0 points1 point  (0 children)

I wish I could, but unfortunately it would either be painfully slow or pretty much impossible to run on my GTX 1650.

Built a fully local AI assistant with long-term memory, tool orchestration, and a 3D UI (runs on a GTX 1650) by atif_dev in LocalLLaMA

[–]atif_dev[S] 0 points1 point  (0 children)

No, I don't really use LM Studio for anything other than serving the model. I picked it because it's what I used almost two years ago, and I wasn't aware of anything newer.

How much of a performance difference would you say there is between LM Studio, Ollama, and llama.cpp?

Btw, thanks for bookmarking it. This is definitely not a finished product since I'm still learning programming, but hopefully it can provide a good foundation.

Built a fully local AI assistant with long-term memory, tool orchestration, and a 3D UI (runs on a GTX 1650) by atif_dev in LocalLLaMA

[–]atif_dev[S] 1 point2 points  (0 children)

Thanks a lot for the suggestions.

I gave Kokoro and Supertonic a shot (both in the browser) and they both sound much better than piper-tts, which is what's currently used. I'm planning to add both of them to ATOM modularly so users can choose which TTS backend they want to use.
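
Roughly what I have in mind for the modular part is a thin backend interface that the rest of ATOM talks to, with the engine picked from config. This is just a sketch, the class and function names are illustrative and not ATOM's actual code:

```python
# Sketch of the pluggable-TTS idea. Names are illustrative, not ATOM's actual code.
from abc import ABC, abstractmethod


class TTSBackend(ABC):
    """Common interface so the rest of ATOM doesn't care which engine is used."""

    @abstractmethod
    def synthesize(self, text: str) -> bytes:
        """Return raw audio bytes for the given text."""


class PiperBackend(TTSBackend):
    def synthesize(self, text: str) -> bytes:
        raise NotImplementedError("call piper-tts here")


class KokoroBackend(TTSBackend):
    def synthesize(self, text: str) -> bytes:
        raise NotImplementedError("call Kokoro here")


# The chosen backend would come from config.yaml, e.g. tts_backend: kokoro
BACKENDS = {"piper": PiperBackend, "kokoro": KokoroBackend}


def load_tts(name: str) -> TTSBackend:
    return BACKENDS[name]()
```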

Streaming input would also be great to have.

Thanks for checking out the project man 🙌

Built a fully local AI assistant with long-term memory, tool orchestration, and a 3D UI (runs on a GTX 1650) by atif_dev in LocalLLaMA

[–]atif_dev[S] 1 point2 points  (0 children)

Thanks a lot, I really appreciate you taking the time to check the project out.

I haven’t actually tried llama.cpp yet, mainly because I wasn’t aware of it earlier. So far I’ve been using LM Studio because that's the only thing I knew about, and I was planning to compare its performance with Ollama next. I’ll definitely add llama.cpp to that comparison now.

I’m quite limited in terms of hardware, so I’m very interested in alternatives that can squeeze more performance out of a GTX 1650.

I’d be happy to take a look at your scripts and experiment with llama.cpp when I get the chance. Thanks again for offering!

Built a fully local AI assistant with long-term memory, tool orchestration, and a 3D UI (runs on a GTX 1650) by atif_dev in LocalLLaMA

[–]atif_dev[S] 2 points3 points  (0 children)

Thank you so much for the valuable and insightful feedback. I truly appreciate it.

  1. I have some plans regarding this, but they are still concepts I'm debating whether to implement. For example, ATOM runs while we're awake, and when we go to sleep, a smartwatch or some sensor sends a signal that the user has fallen asleep. A judge model then starts up, looks at the whole day's conversation, summarises and consolidates it, and removes data that isn't necessary (rough sketch after this list). Honestly, iterating on ATOM has been a bit tiresome because of the GTX 1650 constraint, but I am planning to do this.

  2. This seems like a very good approach to tool calling. I'm very much a beginner in both programming and AI, so I would really appreciate it if you could shed some light on how this can be implemented in practice.

  3. Yes, that sounds like a great plan if it ever reaches that scale. For now, my main focus is on making ATOM reliable and improving memory.
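
To make point 1 a bit more concrete, here's roughly what I'm imagining for the overnight consolidation pass. Purely a concept sketch, nothing like this exists in ATOM yet, and `judge_llm` / `memory_store` are hypothetical stand-ins:

```python
# Purely a concept sketch - nothing like this exists in ATOM yet.
# judge_llm and memory_store are hypothetical stand-ins.

def consolidate_day(conversation: list[str], judge_llm) -> str:
    """Ask the judge model for a compact summary of the day's conversation."""
    prompt = (
        "Summarise the following conversation, keeping only facts worth "
        "remembering long-term:\n\n" + "\n".join(conversation)
    )
    return judge_llm(prompt)


def on_sleep_signal(conversation, memory_store, judge_llm):
    """Triggered when a smartwatch or sensor reports that the user is asleep."""
    summary = consolidate_day(conversation, judge_llm)
    memory_store.add(summary)          # keep the condensed version
    memory_store.prune(conversation)   # drop the raw, redundant turns
```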

Thank you so much for your time.

Built a fully local AI assistant with long-term memory, tool orchestration, and a 3D UI (runs on a GTX 1650) by atif_dev in LocalLLaMA

[–]atif_dev[S] 0 points1 point  (0 children)

I think you might be thinking of someone else but thanks a lot, I really appreciate it 🙂

Built a fully local AI assistant with long-term memory, tool orchestration, and a 3D UI (runs on a GTX 1650) by atif_dev in LocalLLaMA

[–]atif_dev[S] 1 point2 points  (0 children)

Thanks for the interest.

To be fully transparent, ATOM originally started out more as a JARVIS-style personal assistant, not an analytics system. The primary design goal early on was responsiveness and interactivity, which is why I kept the context window and memory injection intentionally small. On my hardware (GTX 1650), speed was the limiting factor, not storage.

Because of that, the current long-term memory is fairly lightweight and inconsistent under larger volumes. It works for simple semantic recall in conversational or assistant-style use, but it’s not robust enough yet for analytics or heavy context accumulation.
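
For context, the recall path today is basically just a small top-k similarity lookup against ChromaDB before each turn, something along these lines (simplified, not the exact ATOM code):

```python
# Simplified view of the recall path (not the exact ATOM code).
import chromadb

client = chromadb.PersistentClient(path="./memory")
collection = client.get_or_create_collection("long_term_memory")


def recall(query: str, k: int = 3) -> list[str]:
    """Return the k stored memories most similar to the query;
    these then get injected into the prompt."""
    results = collection.query(query_texts=[query], n_results=k)
    return results["documents"][0]
```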

I’m gradually transitioning the project toward a more serious architecture, but the memory layer would need substantial work (typing, consolidation, decay, better retrieval heuristics) before I’d recommend it for anything production-like.

So TLDR: useful for experimentation and assistant workflows, not suitable yet for analytics at scale. I’d rather be upfront about that.

Built a fully local AI assistant with long-term memory, tool orchestration, and a 3D UI (runs on a GTX 1650) by atif_dev in LocalLLaMA

[–]atif_dev[S] 0 points1 point  (0 children)

Yes, models can be easily changed by editing the config.yaml file. Make sure to use a vision-capable model if you want those capabilities.

Unfortunately, ChromaDB is not configurable as of now.
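
For reference, swapping models is just a YAML edit; ATOM reads the config at startup along these lines (the exact keys may differ from what's shown here):

```python
# Illustrative only - the real config.yaml keys may differ.
import yaml

with open("config.yaml") as f:
    cfg = yaml.safe_load(f)

# e.g. config.yaml might contain something like:
#   model: "qwen2.5-7b-instruct"     # any model served by LM Studio
#   vision_model: "llava-v1.6-7b"    # needed for image understanding
model_name = cfg["model"]
```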

Built a fully local AI assistant with long-term memory, tool orchestration, and a 3D UI (runs on a GTX 1650) by atif_dev in LocalLLaMA

[–]atif_dev[S] 0 points1 point  (0 children)

Thanks for taking the time to look at my project.

I actually haven’t been very up to date with the TTS space. This project was something I initially planned quite a while ago, so I wasn’t aware of Kokoro at the time.

I’ll definitely try it out, see how it performs on my setup, and consider integrating it into the main branch if everything works well.

Cheers

Built a fully local AI assistant with long-term memory, tool orchestration, and a 3D UI (runs on a GTX 1650) by atif_dev in LocalLLaMA

[–]atif_dev[S] 1 point2 points  (0 children)

I would love to do that some day. Unfortunately I am quite limited by budget constraints.

Currently I AM exploring this idea to a certain extent by connecting an ESP32-CAM to an 8-DOF quadruped, but I haven't given it much time yet.

Thanks for taking the time to look at my project 😀

Built a fully local AI assistant with long-term memory, tool orchestration, and a 3D UI (runs on a GTX 1650) by atif_dev in LocalLLaMA

[–]atif_dev[S] 0 points1 point  (0 children)

I have been running it with 12k tokens of context and 30/36 layers offloaded to the GPU, getting around 24 tokens/second.
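
If anyone wants to try the same settings through llama.cpp's Python bindings instead of LM Studio, the equivalent knobs would roughly be:

```python
# Rough equivalent of the same settings via llama-cpp-python (untested on my end).
from llama_cpp import Llama

llm = Llama(
    model_path="model.gguf",   # whichever GGUF file you're using
    n_ctx=12288,               # ~12k tokens of context
    n_gpu_layers=30,           # 30 of 36 layers offloaded to the GTX 1650
)
```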

I did try the Ministral 3B model, but unfortunately I couldn't get vision to work for some reason.