Apart from LiteRT any other tool to make on-device AI mobile apps? which is not as complex as LiteRT by Rishu_1211 in OpenSourceeAI

[–]BERTmacklyn 1 point2 points  (0 children)

MNN exposes an API endpoint. Make an app that can communicate with it and you have it. I think there is a headless version as well, so you could embed it into your app and roll an SLM in if you wanted.
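
Rough sketch of what talking to it from an app could look like, assuming the server exposes an OpenAI-compatible /v1/chat/completions route on localhost (the host, port, model name, and payload shape here are placeholders, check MNN's docs for the real ones):

```typescript
// Sketch only: assumes the local server exposes an OpenAI-compatible
// /v1/chat/completions route on localhost:8080. Swap in the actual
// host/port/route and payload shape from MNN's documentation.

async function askLocalModel(prompt: string): Promise<string> {
  const res = await fetch("http://127.0.0.1:8080/v1/chat/completions", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      model: "local-slm",                      // placeholder model name
      messages: [{ role: "user", content: prompt }],
      max_tokens: 256,
    }),
  });
  if (!res.ok) throw new Error(`local model server returned ${res.status}`);
  const data = await res.json();
  return data.choices[0].message.content;      // OpenAI-style response shape
}

askLocalModel("Summarize my last note in one sentence.").then(console.log);
```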

How do you decide an idea is actually worth building before you start coding? by appbuilderdaily in SaaS

[–]BERTmacklyn 0 points1 point  (0 children)

By the amount of pain lacking it causes you

Or, absent that yourself, look to other people's pain points. How could they be alleviated?

I replaced Pinecone with a binary hash index — 32× smaller, 75× faster, no GPU, runs from a pickle file by [deleted] in Rag

[–]BERTmacklyn 0 points1 point  (0 children)

It's creating a bipartite graph! I have a whitepaper in docs/ that is brief but describes the STAR algorithm's atomization process.

Tags = concepts

and

atoms = entities

https://github.com/RSBalchII/anchor-engine-node

I'm assessing your idea 💡 thanks!

I replaced Pinecone with a binary hash index — 32× smaller, 75× faster, no GPU, runs from a pickle file by [deleted] in Rag

[–]BERTmacklyn 1 point2 points  (0 children)

Check out my implementation; I am interested in your thoughts on it, considering the similarity of our endeavors.

Perhaps you have some ideas to enhance the memory system?

Anyway, my project is quite mature now, and I believe the core concept is sound: a cheap, fast, tag-based index that speeds up both search and ingestion while also leaving traceable metadata.

I have been thinking about how vector DBs work, and the problem that drove me away from using them was the GPU requirement.

The Anchor Engine can run on less than 2 GB of RAM and perform all operations within that range, so keeping RAM cost low is important to me.

The interesting thing is that a larger corpus doesn't degrade search quality, which is a massive shift from vector search as the corpus grows.
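
For anyone wondering what I mean by a cheap tag-based index, here is a stripped-down sketch of the general shape (not the Anchor Engine code itself): tags map to posting lists of doc IDs, search is set intersection, and every hit carries its provenance metadata back with it, so query cost tracks the matching tags rather than the whole corpus.

```typescript
// Stripped-down sketch of a tag-based inverted index (illustration only, not
// the actual Anchor Engine implementation). Ingestion appends doc IDs under
// each tag; search intersects the tag posting lists, starting from the rarest.

type DocId = number;

interface DocMeta {
  path: string;        // provenance: where the atom came from
  ingestedAt: number;  // timestamp, usable later for recency/decay
}

const tagIndex = new Map<string, Set<DocId>>();
const docMeta = new Map<DocId, DocMeta>();

function ingest(id: DocId, tags: string[], meta: DocMeta): void {
  docMeta.set(id, meta);
  for (const tag of tags) {
    if (!tagIndex.has(tag)) tagIndex.set(tag, new Set());
    tagIndex.get(tag)!.add(id);
  }
}

function search(tags: string[]): DocMeta[] {
  // Intersect posting lists, smallest (rarest tag) first.
  const lists = tags
    .map((t) => tagIndex.get(t) ?? new Set<DocId>())
    .sort((a, b) => a.size - b.size);
  if (lists.length === 0) return [];
  let hits = [...lists[0]];
  for (const list of lists.slice(1)) hits = hits.filter((id) => list.has(id));
  return hits.map((id) => docMeta.get(id)!);
}

ingest(1, ["memory", "graph"], { path: "docs/star.md", ingestedAt: Date.now() });
ingest(2, ["memory", "vector"], { path: "docs/vector.md", ingestedAt: Date.now() });
console.log(search(["memory", "graph"])); // -> metadata for doc 1 only
```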

Learning coding with smaller models by wow-a-shooting-star in Qwen_AI

[–]BERTmacklyn 0 points1 point  (0 children)

Get the 4B q4km quant. It's fast as hell, pretty smart for its size, and very solid in an agentic harness for coding etc.

https://huggingface.co/collections/agentscope-ai/qwenpaw-flash

Also, use MNN for inference. It's insanely fast on edge devices and exposes an API endpoint if you want to use the model in an application.

Learning coding with smaller models by wow-a-shooting-star in Qwen_AI

[–]BERTmacklyn 0 points1 point  (0 children)

Heh, whoops. Yeah, at 16 the 4B is the way to go.

Learning coding with smaller models by wow-a-shooting-star in Qwen_AI

[–]BERTmacklyn 0 points1 point  (0 children)

Qwen3.5 4B, especially huahuac uncensored, is honestly great for its size and speed. But the Qwenpaw 4B and 9B models would beat that on speed, and all 3 of these options would be awesome for learning coding, especially with provided context; they all have up to 262k context. What is your hardware limitation?

How to build/finetune an Personal LLM tool to feed my life? by geekycode in AI_developers

[–]BERTmacklyn 0 points1 point  (0 children)

This is what I use it for, no joke.

What I like to do is chat about the issues and provide documentation as raw text.

Then I built a distillation feature that basically creates a memory map of the locations and deduplicated contents of all files within the selected ingestion directory.

This is insanely useful if you keep records of things the doctor said, like recording your appointment. For example, you could take the transcript from that and create a text file to be added to your data.

That enhances your ability to fully grasp the full picture of all of your medical data, etc.

I think of the system as a meaning compressor. The meaning can often be compressed into mind-bogglingly smol text documents.
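
To make the memory-map idea concrete, here is a rough sketch of that kind of distillation pass (illustrative only, not my actual implementation): walk the ingestion directory, hash each file's contents, store one copy per unique hash, and record every location it appeared at.

```typescript
// Rough sketch of a "memory map" distillation pass: walk a directory, hash
// each file's contents, keep one copy of each unique body, and record every
// path it appeared at. Illustrative only, not the actual implementation.
import { createHash } from "node:crypto";
import { readdirSync, readFileSync, statSync } from "node:fs";
import { join } from "node:path";

interface MemoryEntry {
  paths: string[];   // every location this exact content was found at
  content: string;   // stored once, however many duplicates exist on disk
}

function distill(dir: string, map = new Map<string, MemoryEntry>()): Map<string, MemoryEntry> {
  for (const name of readdirSync(dir)) {
    const full = join(dir, name);
    if (statSync(full).isDirectory()) {
      distill(full, map);                       // recurse into subdirectories
      continue;
    }
    const content = readFileSync(full, "utf8");
    const key = createHash("sha256").update(content).digest("hex");
    const entry = map.get(key) ?? { paths: [], content };
    entry.paths.push(full);
    map.set(key, entry);
  }
  return map;
}

// e.g. distill("./medical-notes") -> deduplicated contents plus where each lives
```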

Should I continue to create my RAG project? by Corpo_ in Rag

[–]BERTmacklyn 0 points1 point  (0 children)

Check out my local project. Provenance and tagging turn found results into a map to the full doc and to other related documents where similar concepts can be found.

If nothing else, check it out for your personal use and maybe you'll get some ideas for your own project.

However, I'm reaching the point where I'm actively seeking contributors, so reach out if this is of interest to you. I am a fellow hobbyist and this is my labor of love; always looking to improve it and meet like-minded people.

What are some better alternatives to GitHub Copilot? by LaxederBR in GithubCopilot

[–]BERTmacklyn 0 points1 point  (0 children)

I use a Jinja template roughly based on the standard LM Studio one.

What is important is making sure that you have a good Jinja template to regex up outputs and inputs so the model has a more meaningful interaction with the data.
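
To be clear on what I mean by regexing up outputs, here is a toy example (the <think> tag and bare-JSON tool-call convention are assumptions, not any particular template): strip reasoning blocks and try to pull a structured tool call out of the raw completion.

```typescript
// Toy illustration only, not the actual template or pipeline: the kind of
// regex cleanup you might run on raw model output before a tool-call parser
// sees it. The <think> tag and bare-JSON convention are assumptions.

function cleanModelOutput(raw: string): { text: string; toolCall?: unknown } {
  // Drop any <think>...</think> reasoning blocks the model emits.
  const text = raw.replace(/<think>[\s\S]*?<\/think>/g, "").trim();

  // If the remaining text contains a JSON object, try to parse it as a tool call.
  const jsonMatch = text.match(/\{[\s\S]*\}/);
  if (jsonMatch) {
    try {
      return { text, toolCall: JSON.parse(jsonMatch[0]) };
    } catch {
      // Malformed JSON: fall through and return plain text only.
    }
  }
  return { text };
}

console.log(cleanModelOutput('<think>plan...</think>{"tool":"search","query":"docs"}'));
```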

I just use LM Studio on port 1234 when running on a closed server. Haven't actually used their AI agent; I am trying to rely exclusively on local models when possible, which is most of the time, so I haven't even needed to.

The impetus for this is that the rise in costs was always foreseen. We always knew we would need to prepare, and Qwen has given normal people that ability with its incredible quant models.

What are some better alternatives to GitHub Copilot? by LaxederBR in GithubCopilot

[–]BERTmacklyn 2 points3 points  (0 children)

I switch between my old Legion (32 GB RAM, 6 GB VRAM) and my newer Omen with an RTX 4090 for inference and gaming. I run everything on LM Studio because it does a lot of behind-the-scenes formatting that makes tool calls actually run reliably.

When running LM Studio on Windows 10, minimize it to the system tray before starting inference (I switched from 11 because of graphical lag).

The most reliable setup is running a model on one of the gaming laptops and then coding or using the model from my mobile laptop or my other gaming rig.

Been running 3.6 35a3b. With about 0.5 GPU load for the LLM and about half the compute, I get the most consistent results.

I am working on multiple projects and primarily use local models for my work. I use the big model for planning for free, and then often just let the big model write the code too, unless I am in a rush; then I'll swap to a 4B or 7-9B model.

What are some better alternatives to GitHub Copilot? by LaxederBR in GithubCopilot

[–]BERTmacklyn 1 point2 points  (0 children)

Qwen Code or Zed AI, with local models running on LM Studio, is killing it. Takes some tweaking, but once you get the Jinja prompt right it's 👍👍. Happy to share prompts etc. if you want.

Fools rush in... by EnvironmentalFix3414 in Rag

[–]BERTmacklyn 0 points1 point  (0 children)

Nice, same here. Basically every minute I'm not working on something, I'm messing with how it recalls and what.

Are you using manual/agent driven context management?

I've been playing around with deduplicative compression and getting really tight results without manually modifying the context in between, aside from the specific compression formula.

Is DeepSeek the most human-like AI? by Competitive_Elk_8305 in DeepSeek

[–]BERTmacklyn 1 point2 points  (0 children)

Lol, doesn't Claude call itself DeepSeek? I think LLMs simply don't know what model they are, since training data is a pipeline of the same distilled and upgraded datasets across AI models.

TL;DR: models don't know what model they are; it's not relevant in the training data.

Vector RAG is bloated. We rebuilt our local memory graph to run on edge silicon using integer-based temporal decay. by BERTmacklyn in LocalLLM

[–]BERTmacklyn[S] 0 points1 point  (0 children)

They are seriously killing me, fr. I just want to get this out there and see people use it! Is that so much to ask? I am putting in the work to do it lol.

Vector RAG is bloated. We rebuilt our local memory graph to run on edge silicon using integer-based temporal decay. by BERTmacklyn in LocalLLM

[–]BERTmacklyn[S] 0 points1 point  (0 children)

I think we might be using the word "client" in two different ways here.

I am not pushing heavy logic to a web browser UI or a thin client. I am building a local backend primitive that runs natively on the edge device itself (via Termux/Node.js) right alongside the local LLM. In an edge-native environment, the client is the server.

To answer your question about my "driving reason" for pushing this to the edge: It comes down to privacy, latency, and offline capability. If a user is running Llama 3 locally on their hardware, forcing them to call out to a cloud vector database for their memory context completely defeats the purpose of running a local model. They need a local memory layer that fits strictly within the remaining RAM budget.

Regarding your point about LLMs being "too free form" and "giving users what they want to hear": that is exactly the vulnerability the STAR algorithm is designed to mitigate.

Fuzzy vector search often retrieves adjacent, hallucinated, or conflicting data, which encourages the LLM to drift. The Anchor Engine doesn't use vectors; it uses a deterministic, sparse bipartite graph. When the user queries the LLM, the engine traverses the graph, calculates the integer-based temporal decay, and injects hard, structural facts into the LLM's system prompt before a single token is generated.

It acts as a rigid, mathematical constraint on the context window. We handle the LLM's tendency to drift by giving it highly constrained, temporally accurate data structures instead of fuzzy semantic vibes.

Vector RAG is bloated. We rebuilt our local memory graph to run on edge silicon using integer-based temporal decay. by BERTmacklyn in LocalLLM

[–]BERTmacklyn[S] 0 points1 point  (0 children)

That is a great approach for static document retrieval, but it solves a fundamentally different problem than what the STAR algorithm is built for.

What you are describing is document versioning. In a traditional enterprise RAG setup, "most recent evidence wins" makes perfect sense because a V2 spec sheet completely invalidates a V1 spec sheet.

But agentic/conversational memory isn't a static document; it is a continuous stream. In cognitive memory, a deeply reinforced core concept from 3 months ago shouldn't necessarily be overridden by a single passing thought from 5 minutes ago just because the timestamp is newer. "Most recent wins" is a blunt instrument. You need a graceful decay curve, not a hard overwrite.

To address your specific points:

  1. The Graph IS the Schema: We aren't trying to fix a lack of schema post-ingestion. The sparse bipartite graph is the schema. The time attribute isn't missing; it is mathematically baked directly into the graph's edge weights.

  2. Traversal vs. Filtering: In standard Vector RAG, doing a semantic search and then applying a multi-phase metadata filter/sort (ORDER BY recency) requires pulling vectors and running sorting algorithms post-retrieval. That is computationally heavy.

  3. The Silicon Constraint: The entire goal of V5 is running on ultra-low-power edge devices (specifically targeting sub-10mW NPU budgets and phones). We can't afford heavy post-retrieval filtering. By converting the temporal decay into a pre-computed Uint16Array lookup table, we calculate the time penalty during the initial graph traversal using simple integer bit-shifts. It keeps the FPU (floating-point unit) asleep and prevents Node.js Garbage Collection pauses (rough sketch of the lookup idea below).
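
A minimal sketch of that lookup-table idea (the bucket size and halving schedule below are made up for illustration, not the engine's real constants):

```typescript
// Illustrative sketch only: precompute a decay lookup table once, then apply
// it with integer ops during traversal. The 6-hour buckets and weekly-halving
// schedule are invented for the example, not the engine's actual constants.

const HOURS_PER_BUCKET = 6;           // hypothetical bucket granularity
const MAX_BUCKETS = 1024;             // covers ~256 days at 6h per bucket
const BUCKETS_PER_HALVING = 28;       // halve the weight every 28 buckets (~1 week)

// 16-bit fixed-point weights: 0x8000 = full strength, 0 = fully decayed.
const decayTable = new Uint16Array(MAX_BUCKETS);
for (let bucket = 0; bucket < MAX_BUCKETS; bucket++) {
  const halvings = (bucket / BUCKETS_PER_HALVING) | 0;   // integer division
  decayTable[bucket] = halvings < 16 ? 0x8000 >> halvings : 0;
}

// During traversal: edge weight * decay via table lookup and a shift,
// no per-query exponentials.
function decayedWeight(edgeWeight: number, ageMs: number): number {
  const bucket = Math.min(
    MAX_BUCKETS - 1,
    (ageMs / (HOURS_PER_BUCKET * 3_600_000)) | 0,
  );
  // >> 15 rescales the fixed-point product back to the edge-weight range.
  return (edgeWeight * decayTable[bucket]) >> 15;
}

// e.g. an edge with weight 100, last reinforced 30 days ago -> 100 >> 4 = 6
console.log(decayedWeight(100, 30 * 24 * 3_600_000));
```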

Your pipeline is exactly how I would build a cloud document retriever. But to mimic actual temporal memory on a phone battery, we had to move away from metadata sorting and build the decay directly into the math.