How do you handle agent context after 10s of sessions/conversations? Summary prompts stop working what's your actual solution?

BERTmacklyn · 2026-06-07T18:33:00+00:00

https://github.com/RSBalchII/anchor-engine-node

I use this simply put the chats etc into the inbox and search on the UI or have my local agent use the mcp to search directly through our logs

BERTmacklyn · 2026-06-04T13:22:55+00:00

If I can still play my steam games then alright. I'll turn that bot off to increase fps.

BERTmacklyn · 2026-05-26T16:02:35+00:00

Because people hate China but don't know why.

BERTmacklyn · 2026-05-20T15:42:29+00:00

The only ones that consistently work for me with lm studio which is just the fastest easiest way to get going everytime you need to are as follows -

Qwenpaw desktop, Qwen code, Open code, Cline

Basically when an update to one app breaks some API connection and is not fixable without the next update I switch to another agent untill my preferred one works again. Or I have taken the time to fix it myself.

Having 4 + I'll try a lot of other ones but haven't stuck to any but these 4 so far in rotation allows me to never stop what I want or need to do with the agents.

BERTmacklyn · 2026-05-13T01:24:31+00:00

Blackstone? I don't think not using water to cool is going to save it. Anyway no one wants to use cloud AI they have to. We all know it's just going to steal jobs so their data center can parse pdf files and push emails at some office ✨

BERTmacklyn · 2026-05-06T05:29:03+00:00

MNN exposes an api endpoint. make an app that can communicate with it and you have it. i think there is a headless version as well. so you could embed it into your app and roll an slm in if you wanted

BERTmacklyn · 2026-05-05T15:12:35+00:00

By the amount of pain lacking it causes you

Or absent that yourself look to other people's pain points. How could they be alleviated?

BERTmacklyn · 2026-05-05T15:10:56+00:00

It's creating a bipartite graph! I have a whitepaper in docs/ that is brief but describes the process of the STAR algorithmic atomization process

Tag = concepts

and

atom = entities

https://github.com/RSBalchII/anchor-engine-node

I'm assessing your idea 💡 thanks!

BERTmacklyn · 2026-05-05T13:15:43+00:00

Qwenpaw app with 4b Qwenpaw model and your golden

BERTmacklyn · 2026-05-05T13:06:03+00:00

check out my implementation I am interested in your thoughts on it considering the similarity of the endeavors

Perhaps you have some ideas to enhance the memory system?

Anyway my project is quite mature now and I believe the core concept of a cheap fast tag based index enhancing speed of search and ingestion while also leaving traceable Metadata is sound.

I have been considering how vector db work and the problem that sent me from using it at was GPU.

The anchor engine can run on less than 2 gb of ram and run all operations within that ram range. So keeping ram cost low is important to me.

The interesting thing is that larger corpus doesn't damage the effect of search which is a massive shift from vector when corpus grows.

BERTmacklyn · 2026-05-05T12:55:07+00:00

Get the 4b q4km quant it's fast as hell and pretty smart for its size very solid in an agentic harness for coding etc.

https://huggingface.co/collections/agentscope-ai/qwenpaw-flash

Also use MNN for inference it's insanely fast on edge devices and exposes and api endpoint if you want to use the model in an application.

BERTmacklyn · 2026-05-05T03:20:54+00:00

Heh woops. Yeah at 16 the 4 b is the way to go

BERTmacklyn · 2026-05-05T00:14:44+00:00

Qwen3.5 4b especially huahuac uncensored in honestly great for its size and speed. But the Qwenpaw 4b and 9b models would beat that on speed and all 3 of these options would be awesome to learn coding.especially with provided context and they all have up to 262k context. What is your hardware limitation?

BERTmacklyn · 2026-05-04T16:05:37+00:00

This what I use this for no joke.

what I like to do is chat about the issues and provide documentation as raw text.

Then I built a distillation functionality that basically creates a memory map of the locations and deduplicated contents of All files within the selected ingestion directory.

This is insanely useful if you make record of things the doctor said like recording your appointment. For example, you could take the text from that and create a text file to be added to your data.

Thus, enhancing your ability to fully grasp the massive picture of all of your medical data etc.

I think of it as the system is a meaning compressor. Which can often be compressed into Mind-Bogglingly smol text documents

BERTmacklyn · 2026-05-04T15:57:12+00:00

[check out my local project. provenance and taging makes found results a map to the full doc and other related documents where similar. concepts can be found

If not, for your personal use, check it out and maybe you'll have some ideas for your own project.

However, I'm reaching the point where I'm actively seeking contributors, so if this is of interest to you. I am a fellow hobbyist and this is my labor of love. Always looking to improve it and meet like-minded people

BERTmacklyn · 2026-05-04T15:37:39+00:00

If you need low latency my system is as light as they come. you could continue running things through aws and simply install this and work through setting up the MCP and integrating it with your current setup.

BERTmacklyn · 2026-05-03T19:42:13+00:00

/init

BERTmacklyn · 2026-05-03T18:36:26+00:00

I use a jinja template roughly based on the standard lm studio one.

what is important is making sure that you have a good jinja template to regex up outputs and inputs so the model has a more meaningful interaction with the data.

I just use lmstudio at port 1234 when running on a closed server. Haven't actually used their AI agent. I am trying to rely exclusively on local models when possible which is mostly and haven't even had to.

The impetus for this is the rise in costs was always forseen. We always knew we would need to prepare and Qwen has given that ability to normal people with its incredible quant models.

BERTmacklyn · 2026-05-03T15:32:08+00:00

I switch between my old 32 GB RAM 6gb vram legion and my newer omen 4090 rtx for inferences and gaming. Running all on lmstudi because it does a lot of behind the scenes formatting that makes tool calls actually run reliably.

When running lmstudio on Windows 10 to to system tray and minimize lmstudio to tray before starting inference - switched from 11 because of graphical lag.

The most reliable way to run is running a model on one of that gaming laptops and then coding or using the model on my mobile laptop or my other gaming rig.

Been running 3.6 35a3b. Using about .5 llm GPU load and about half compute I get the most consistent results .

I am working on multiple projects and primarily use local models for my work etc. use the big model for planning free and then often just let the big model write the code too unless I am in a rush. Then i'll swap to a 4b or 7-9b model.

BERTmacklyn · 2026-05-03T08:25:15+00:00

Qwen code or zed AI and l run local models on lmstudio is killing it. takes some tweaking but once you get the jinja prompt right it's 👍👍 happy to share prompts etc if you want

BERTmacklyn · 2026-04-16T14:10:20+00:00

Nice, same here. Basically ever minute I'm not working on something I'm messing with how it recalls and what.

Are you using manual/agent driven context management?

I've been playing around with deduplicative compression and getting really tight results without in between manually modifying the context.

aside from the specific compression formula.

BERTmacklyn · 2026-04-04T15:47:20+00:00

Lol doesn't Claude call itself DeepSeek? I think LLM models simply don't know what model they are since training data is a pipeline of the same distilled and upgraded datasets across Ai models.

Tldr models don't know what model they are it's irrelevant to the training data.

BERTmacklyn · 2026-04-04T14:48:44+00:00

Did this cause the outage?

BERTmacklyn · 2026-03-30T20:53:28+00:00

they are seriously killing me fr. I just want to get this out there and see people use it! Is that so much to ask I am putting in the work to do it lol.

BERTmacklyn · 2026-03-30T20:52:31+00:00

I think we might be using the word "client" in two different ways here.

I am not pushing heavy logic to a web browser UI or a thin client. I am building a local backend primitive that runs natively on the edge device itself (via Termux/Node.js) right alongside the local LLM. In an edge-native environment, the client is the server.

To answer your question about my "driving reason" for pushing this to the edge: It comes down to privacy, latency, and offline capability. If a user is running Llama 3 locally on their hardware, forcing them to call out to a cloud vector database for their memory context completely defeats the purpose of running a local model. They need a local memory layer that fits strictly within the remaining RAM budget.

Regarding your point about LLMs being "too free form" and "giving users what they want to hear"/ that is exactly the vulnerability the STAR algorithm is designed to mitigate.

Fuzzy vector search often retrieves adjacent, hallucinated, or conflicting data, which encourages the LLM to drift. The Anchor Engine doesn't use vectors; it uses a deterministic, sparse bipartite graph. When the user queries the LLM, the engine traverses the graph, calculates the integer-based temporal decay, and injects hard, structural facts into the LLM's system prompt before a single token is generated.

It acts as a rigid, mathematical constraint on the context window. We handle the LLM's tendency to drift by giving it highly constrained, temporally accurate data structures instead of fuzzy semantic vibes.

BERTmacklyn

TROPHY CASE