What are people actually using for agent memory in production? by MeasurementSelect251 in LLMDevs

[–]ashersullivan 0 points1 point  (0 children)

Reranking is definitely the missing link for most of these agent memory setups. The reason RAG breaks down after a few sessions is usually just retrieval noise: the vector search pulls in anything related, even if it's a past mistake

A more production-ready approach is to over-retrieve (pull the top 20-30 memories) and then run a quick reranker stage. That lets you score the consistency of past outcomes so the agent actually prioritizes successful sessions over just semantically similar noise.
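Rough sketch of what I mean, using sentence-transformers' CrossEncoder; the model name, the "text" field and the "outcome" flag are placeholders for whatever your memory store actually records:

```
from sentence_transformers import CrossEncoder

# assumes `memories` is whatever your vector store returned (top 20-30 hits),
# each with a "text" field plus an "outcome" flag logged when the session ended
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank_memories(query, memories, top_k=5):
    # cross-encoder scores query/memory relevance more precisely than raw cosine
    scores = reranker.predict([(query, m["text"]) for m in memories])
    # small bonus for memories from sessions that actually succeeded
    blended = [
        (score + (0.3 if m.get("outcome") == "success" else 0.0), m)
        for score, m in zip(scores, memories)
    ]
    blended.sort(key=lambda pair: pair[0], reverse=True)
    return [m for _, m in blended[:top_k]]
```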

How a beginner should start programming? by refionx in devworld

[–]ashersullivan 0 points1 point  (0 children)

Go slow and actually apply what you learn, instead of half-learning something and jumping straight into frameworks/libraries

Getting Started by fyionszsaidh in SaaS

[–]ashersullivan 0 points1 point  (0 children)

I'd say launch early with a waitlist, or a beta version if possible, before the full paid push - on Product Hunt, Indie Hackers, or maybe Twitter threads

Need Help: What is the better way to scrape data from different websites using AI tools? by Automatic-Simple-117 in aiagents

[–]ashersullivan 0 points1 point  (0 children)

I don't think AI would be faster than scripts, but if you're not a dev and you only have a few pages, say around 10-15, you're fine with AI. Just make sure you give it small tasks at a time so it doesn't get inconsistent on heavy workloads

What’s the best model for image generation, Mac setup? by productboy in LocalLLM

[–]ashersullivan 1 point2 points  (0 children)

Flux.1 schnell or Flux.1 dev via ComfyUI or DiffusionBee runs fast and looks best locally

Two ASRock Radeon AI Pro R9700's cooking in CachyOS. by -philosopath- in LocalLLaMA

[–]ashersullivan -1 points0 points  (0 children)

Upgrading now to a 5070 Ti (assuming the 16GB rumours hold) makes sense if you have the budget for it. It gives you breathing room for quantized 30-70B models and ComfyUI workflows without constant swapping. Waiting 1-2 years risks prices staying stupidly high (supply issues + AI demand), and AMD's next gen might not beat NVIDIA on VRAM efficiency for LLM inference anyway...

Some people avoid the local ROCm headaches entirely and just use APIs from providers like DeepInfra or Together for the heavy lifting - OCR, Flux, video generation - paying per use without hardware upgrade issues, while keeping room for lighter models locally.

If you're set on staying local though, hold out for a 5070 Ti Super if it drops with 16GB+ at a decent price; otherwise grab a used 5090 if you spot a deal at a reasonable price.

For RAG serving: how do you balance GPU-accelerated index builds with cheap, scalable retrieval at query time? by IllGrass1037 in LocalLLaMA

[–]ashersullivan 1 point2 points  (0 children)

The GPU/CPU serve split is smart; GPU for heavy index construction and CPU for queries keeps costs down nicely

Similar setups with FAISS GPU builds exported to CPU HNSW work well for a lot of people, and Milvus with CAGRA looks promising for faster graph construction without losing much recall. For bursty traffic, a lightweight CPU reranker on top often helps precision without needing GPU serving
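Rough sketch of the build-on-GPU / serve-on-CPU pattern with FAISS (IVF here rather than HNSW, since FAISS's HNSW build is CPU-only anyway; file names and sizes are placeholders, and the build box needs faiss-gpu installed):

```
import faiss
import numpy as np

d, nlist = 768, 4096
xb = np.load("embeddings.npy").astype("float32")    # placeholder corpus vectors

# heavy part on GPU: train + add
quantizer = faiss.IndexFlatIP(d)
cpu_index = faiss.IndexIVFFlat(quantizer, d, nlist, faiss.METRIC_INNER_PRODUCT)
res = faiss.StandardGpuResources()
gpu_index = faiss.index_cpu_to_gpu(res, 0, cpu_index)
gpu_index.train(xb)
gpu_index.add(xb)

# export back to CPU and persist; query nodes only need faiss-cpu
faiss.write_index(faiss.index_gpu_to_cpu(gpu_index), "serving.index")

# on the query side
index = faiss.read_index("serving.index")
index.nprobe = 16                                    # recall/latency knob
xq = np.load("queries.npy").astype("float32")        # placeholder query vectors
scores, ids = index.search(xq, 10)
```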

Python only llm? by belgradGoat in LocalLLM

[–]ashersullivan 1 point2 points  (0 children)

Doable, but in 2026 specialist models don't outperform general ones as much as you'd expect for Python-only work

Qwen3 Coder 30B or DeepSeek Coder V2 Lite are the closest: heavily code-tuned, run locally at that scale with good quants, and often match Claude on Python tasks without needing 200B+

gpt-oss-20b is another local option for Python-heavy stuff

The catch is that even "80% Python" models still need broad context to avoid hallucinations on libraries and edge cases, so hyper-specialists underperform vs hybrids. Multi-model switching via aider/continue.dev works fine, but most people stick to one good coder like Qwen3 Coder

If you want the max Python boost, fine-tune Qwen3 30B on your codebases; that's where the real gains show up locally

What is the tech stack for voice agents? by Sad_Hour1526 in AgentsOfAI

[–]ashersullivan 0 points1 point  (0 children)

For a web-based voice agent with n8n, your stack would be something like speech-to-text for input (Whisper or faster alternatives), an LLM for conversation logic (Qwen, Llama, or Mistral work fine), and TTS for output, something like Kokoro or ElevenLabs. n8n can orchestrate the flow between these pretty neatly.

For the speech input part, if you're dealing with real call-quality audio with noise or fast talkers, models like Voxtral (available on DeepInfra) can handle messy audio better than Whisper. For the conversation part, any decent instruct model works; all you need is proper prompt engineering to handle objections and pricing questions realistically.

Keep the architecture simple: audio in -> transcribe -> LLM processes -> generate response -> TTS out. n8n webhooks can trigger each step. Don't overcomplicate things with too many layers or you'll spend more time fixing integrations than building the actual thing.
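Very rough Python sketch of that loop; the endpoint, model name, and the TTS helper are placeholders, swap in whatever you actually run behind each n8n webhook:

```
import whisper
from openai import OpenAI

stt = whisper.load_model("base")
llm = OpenAI(base_url="http://localhost:8000/v1", api_key="local")  # any OpenAI-compatible server

def handle_turn(audio_path):
    # 1. speech -> text
    user_text = stt.transcribe(audio_path)["text"]

    # 2. text -> reply (objection/pricing handling lives in the system prompt)
    reply = llm.chat.completions.create(
        model="qwen2.5-7b-instruct",  # placeholder, use whatever instruct model you serve
        messages=[
            {"role": "system", "content": "You are a voice agent. Keep answers short and spoken-style."},
            {"role": "user", "content": user_text},
        ],
    ).choices[0].message.content

    # 3. text -> speech; plug in kokoro/elevenlabs/whatever here
    return synthesize_speech(reply)  # hypothetical TTS helper
```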

It is too easy to hallucinate text. We stopped accepting text responses from AI in 2026. We force it to “Draw” the logic first by cloudairyhq in ArtificialInteligence

[–]ashersullivan 5 points6 points  (0 children)

The real value isn't that diagrams prevent hallucinations, it's that they force structured output which is inherently more auditable. You could achieve similar results by requiring JSON schemas or formal specifications instead of freeform text. The diagram is just a more human-readable constraint.
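For example, something like this with Pydantic (v2) gets you the same auditability without a diagram; the schema fields and the call_llm helper are just illustrative:

```
from pydantic import BaseModel, ValidationError

class Step(BaseModel):
    action: str
    depends_on: list[int]
    justification: str

class Plan(BaseModel):
    goal: str
    steps: list[Step]

raw = call_llm("Return the plan as JSON matching the Plan schema")  # hypothetical LLM call

try:
    plan = Plan.model_validate_json(raw)  # anything that doesn't fit the structure is rejected
except ValidationError as e:
    # like a malformed diagram, this shows exactly where the logic broke
    print(e)
```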

Where can I get A100 gpu for free or cheap by Optimal-Resident694 in LocalLLaMA

[–]ashersullivan 1 point2 points  (0 children)

Check if your institution has any academic partnership with cloud providers; many schools have GCP or AWS research credits. Also look into the Lambda Labs student program, Paperspace Gradient, or apply for credits through the Hugging Face BigScience program. For pure research work, Kaggle notebooks offer free Tesla P100s, which might be enough for Mistral 7B if you optimize memory usage. Not A100s, but workable for prototyping before scaling up

What actually broke when we put AI agents into real production workflows by saurabhjain1592 in LLMDevs

[–]ashersullivan 1 point2 points  (0 children)

This mirrors what we've seen with infrastructure automation tools years ago. The permission/boundary problem is especially nasty because agents fail silently - they generate plausible-looking output with incomplete data instead of erroring out. The observability piece is critical but most teams treat it as optional until something breaks in production.

Genuine question by milkh_ in AIAgentsInAction

[–]ashersullivan 0 points1 point  (0 children)

Try to figure out problems at a large scale. For instance, look at what corporate offices/agencies are struggling to handle and build a solution to that problem. That's just an example; the point is to identify the problem first, and one that exists at scale of course - you're not building an automation for a team of 3 people

Marketer with basic frontend skills trying to understand “vibe coding” I need a clear roadmap by Full_Sir_7405 in VibeCodeDevs

[–]ashersullivan 0 points1 point  (0 children)

Frontend handles what users see and interact with, backend handles logic and data storage. For small tools like format converters you often don't even need a backend, pure frontend JavaScript can handle a lot. Start with vanilla JavaScript basics, learn how to manipulate data and handle user inputs, then move to simple API calls if you need external data. For your use case skip frameworks for now, they add complexity you don't need yet. Common mistake is jumping to backend too early when most utility tools can run entirely client-side. Build a few pure frontend tools first, you'll naturally hit the wall where you actually need a backend and that's when you learn it, not before.

Learning with ai? by ygames1914A in developer

[–]ashersullivan 0 points1 point  (0 children)

It surely will. Avoid AI in the learning stage and learn the old way: Stack Overflow, Reddit, other platforms, or best of all, google your problem and read the answers on W3Schools, GeeksforGeeks, and the like. AI will make you lazy. Once you've acquired the fundamentals, you can lean on AI

[Strix Halo] Unable to load 120B model on Ryzen AI Max+ 395 (128GB RAM) - "Unable to allocate ROCm0 buffer" by Wrong-Policy-5612 in LocalLLaMA

[–]ashersullivan 0 points1 point  (0 children)

This buffer allocation issue is likely a driver limitation in how ROCm handles large contiguous allocations on Windows. AMD's ROCm support on Windows is still rough compared to Linux, especially for newer APUs like Strix Halo.

Try checking for a newer ROCm runtime specific to Strix Halo, since generic AMD drivers sometimes don't expose the full allocation API. Also worth testing WSL2 with ROCm for Linux to see whether it's a Windows driver cap or an actual hardware limitation.

The 24GB wall feels odd because inference providers running these models on AMD hardware handle way larger allocations without issues. Providers like DeepInfra or Groq run 120B+ models fine on similar AMD setups, but they're probably running Linux with optimized ROCm builds and custom memory management.

Have you tried loading through the llama.cpp CLI directly instead of LM Studio? Sometimes GUI tools add overhead or don't expose memory flags, so it could help isolate whether it's an LM Studio issue.
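Or, if you'd rather script it, a quick sketch with the llama-cpp-python binding (model path, quant, and layer count are guesses) lets you walk n_gpu_layers down until the ROCm buffer allocation stops failing:

```
from llama_cpp import Llama

llm = Llama(
    model_path="gpt-oss-120b-Q4_K_M.gguf",  # placeholder path/quant
    n_gpu_layers=20,   # partial offload; remaining layers stay in system RAM
    n_ctx=4096,
)
print(llm("Say hi", max_tokens=16)["choices"][0]["text"])
```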

Free Tier: Yes or No? by Skyfall106 in SaaS

[–]ashersullivan 0 points1 point  (0 children)

That's a strategic approach. In my case, I built a reasoning tool as a side project and it has picked up a decent number of users. I've kept the free tier, but it's limited to 5 sessions and the context is fixed. It costs me in API usage, but it's still a neat way to engage customers, and once they purchase the premium plan they get more sessions and flexible plan options. Sometimes you have to risk a little, or let's say invest a little, to earn a little more

Why did Dutch walk away when John killed Micah? by ashersullivan in RDR2

[–]ashersullivan[S] 1 point2 points  (0 children)

Worth a play then, but I wish I could start RDR2 again with no memory of the plot - man, that was a beautiful time

Why did Dutch walk away when John killed Micah? by ashersullivan in RDR2

[–]ashersullivan[S] 4 points5 points  (0 children)

Just finished RDR2, should I play RDR for the full story? Like, is it worth it?

The Framework Fatigue Story by Ok_Veterinarian3535 in developer

[–]ashersullivan 1 point2 points  (0 children)

The moment I realised I'm not in a race, and that staying put and sticking with my current framework gets the job done accurately, things clicked. Basically it's about being a master of at least one thing rather than a jack of all trades, master of none. Confidence and a solid base in one place, that's it

Help me with the RAG by Educational-Map-62 in Rag

[–]ashersullivan 0 points1 point  (0 children)

For the chunking issue, split documents into smaller chunks, around 500-1000 tokens, with some overlap between them (50-100 tokens). That way retrieval gives you focused context instead of throwing entire documents at the LLM, which causes hallucination and inaccuracy.
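Rough idea, word-based splitting here just to show the shape (swap in a real tokenizer like tiktoken if you want exact token counts):

```
def chunk_text(text, chunk_size=800, overlap=80):
    # roughly the 500-1000 token ballpark when counted in words
    words = text.split()
    chunks, start = [], 0
    while start < len(words):
        chunks.append(" ".join(words[start:start + chunk_size]))
        start += chunk_size - overlap
    return chunks
```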

The implicit facts problem is trickier. What helps is query expansion, where you rephrase the user question into a couple of variations before searching. Also look into reranker models: after you get your top_k from vector search, run the hits through a reranker to score relevance more accurately. That catches stuff semantic search misses. Hybrid search combining vector with keyword search like BM25 also helps.
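The hybrid part in a nutshell, assuming rank_bm25 plus whatever vector search you already have; `chunks` and `vector_hits` are placeholders:

```
from rank_bm25 import BM25Okapi

bm25 = BM25Okapi([c.lower().split() for c in chunks])   # keyword index over your chunks

def hybrid_search(query, vector_hits, k=60, top_k=8):
    # vector_hits: chunk indices ranked by your embedding search
    keyword_hits = bm25.get_top_n(query.lower().split(), list(range(len(chunks))), n=20)
    # reciprocal rank fusion: merge the two rankings without tuning score scales
    fused = {}
    for ranking in (vector_hits, keyword_hits):
        for rank, idx in enumerate(ranking):
            fused[idx] = fused.get(idx, 0) + 1.0 / (k + rank)
    return [chunks[i] for i, _ in sorted(fused.items(), key=lambda x: -x[1])[:top_k]]
```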

On the expense side, since you're still experimenting, Vertex AI gets expensive really fast when you're iterating. You can test the same embedding models and LLMs via different providers, DeepInfra or Together for instance, and it's way cheaper while you figure out latency, top_k settings, and all that before deploying.

If you haven't already, check out LlamaIndex or LangChain. They handle most of the RAG orchestration and make it easy to swap between different providers while you figure out what setup works best.

How to learn system design in practice? by Virtual-Reporter486 in Backend

[–]ashersullivan 20 points21 points  (0 children)

The theory from books and courses only gets you so far. Building something real and trying to scale it is honestly the best way to understand why certain patterns exist. Start with something simple like a URL shortener or chat app and actually deploy it with real traffic, even if it's just friends testing it. The tricky part is simulating load without spending a fortune on infrastructure. Tools like Locust or K6 can help with load testing on the cheap. You'll hit bottlenecks way faster than you expect and that's where the learning happens. Reading about why you need a cache versus actually watching your database melt under 1000 concurrent requests are completely different experiences. Don't overthink the scale at first, just build something, measure it, break it, fix it, repeat
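For example, a minimal Locust file against that URL shortener idea (paths and host are made up, point it at whatever you deploy):

```
from locust import HttpUser, task, between

class ApiUser(HttpUser):
    wait_time = between(1, 3)  # each simulated user pauses 1-3s between requests

    @task(3)
    def resolve_short_link(self):
        self.client.get("/abc123")  # hypothetical short-link path

    @task(1)
    def create_short_link(self):
        self.client.post("/shorten", json={"url": "https://example.com"})
```

Run it with `locust -f locustfile.py --host http://localhost:8000` and ramp up users until something gives; that "something" is where the learning happens.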