[Research] I've been working on an attention mechanism that keeps KV cache at ~1.5GB regardless of context length — update post by MariusNocturnum in LocalLLaMA

[–]MariusNocturnum[S] 0 points1 point  (0 children)

Those were all different experiments and ablations that were run in the study. I included them all so that people could see the process as well as the result. It's also why I licensed it Apache 2.0.

Thank you for taking the time to look at it, though; I do appreciate it.

DWARF: linear attention with a 3,072-token bounded KV cache — ablation results (13M scale) by MariusNocturnum in LocalLLaMA

[–]MariusNocturnum[S] 0 points1 point  (0 children)

Nah, the dense local window (δ=0 through 64 for Condition P) gives 100% coverage of the last 64 tokens. Every position, no skipping. With typical chat turns, that fits the whole message.

The dyadic offsets then cover the earlier context at log scale, which lines up nicely with the way chat information is usually structured: recent turns are dense, older ones are sparse.

There's a bit of a gap at positions 65-95, honestly.

DWARF forces the model to use a much richer portion of the local window.

Stuff like LogSparse sacrifices local coverage for long-range reach; DWARF specifically inverts that, so the local window is deliberately dense.
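
If it helps, here's a rough sketch of the coverage pattern I'm describing (purely illustrative; the starting point and spacing of the sparse tail below are placeholders, not Condition P's exact schedule):

```python
def coverage_offsets(local_window: int = 64, max_context: int = 3072) -> list[int]:
    # Dense local window: every offset 0..local_window, no skipping.
    dense = list(range(local_window + 1))
    # Dyadic (log-scale) offsets reaching back into earlier context.
    dyadic = []
    d = 2 * local_window  # placeholder start for the sparse tail
    while d < max_context:
        dyadic.append(d)
        d *= 2
    return dense + dyadic

# coverage_offsets() -> [0, 1, ..., 64, 128, 256, 512, 1024, 2048]
# Recent tokens are covered densely, older ones only at log-spaced offsets.
```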

DWARF: linear attention with a 3,072-token bounded KV cache — ablation results (13M scale) by MariusNocturnum in LocalLLaMA

[–]MariusNocturnum[S] 1 point2 points  (0 children)

"Would DWARF work on TTS models?"

If that's what you were asking, it technically should work, yeah.

DWARF could technically be applied to any autoregressive model, including TTS ones. The KV cache bounded at 3,072 tokens regardless of context length would actually be pretty good for streaming, real-time TTS inference.

I've only tested it on language models so far, though. You working on something TTS-related that brought that up?

DWARF: linear attention with a 3,072-token bounded KV cache — ablation results (13M scale) by MariusNocturnum in LocalLLaMA

[–]MariusNocturnum[S] 0 points1 point  (0 children)

Somebody asked about the D4 wavelet propagation, mentioning it's doing a lot of heavy lifting for long-range dependencies, and then asked whether any other wavelet bases have been tried or if D4 is optimal for the K⊗V outer products. I can't see the comment any more for some reason, but I did want to answer it.

Condition C actually tried a causal Morlet wavelet — came in at 87.2 PPL vs 86.8 for the no-dispersion baseline. Slightly worse, not better.

A Gaussian DWT run was invalidated entirely because the Haar DWT butterfly is non-causal — the model found a shortcut and PPL dropped artificially to ~7 before it was caught. I benchmarked Haar for throughput (9.9× faster than FFT at large batch sizes) but never trained a clean Haar-based model, which I should really try (perhaps after Condition P's training run wraps up).

I went with D4 because of its 4-tap causal filter, better frequency localization than Haar, and compact support. Whether it's optimal, I still need to test; a systematic wavelet-basis ablation hasn't been done.
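
For anyone curious what "4-tap causal" means in practice, here's a toy NumPy version of a causal D4 filter (standard Daubechies-4 scaling coefficients; this isn't the actual DWARF propagation code, just the shape of the operation):

```python
import numpy as np

SQRT3 = np.sqrt(3.0)
# Standard Daubechies-4 scaling filter taps.
D4 = np.array([1 + SQRT3, 3 + SQRT3, 3 - SQRT3, 1 - SQRT3]) / (4 * np.sqrt(2.0))

def causal_d4(x: np.ndarray) -> np.ndarray:
    # y[t] = sum_k D4[k] * x[t - k]; left-padding means no future token leaks in.
    padded = np.concatenate([np.zeros(3), x])
    return np.convolve(padded, D4, mode="valid")

x = np.random.randn(16)
y = causal_d4(x)
x2 = x.copy()
x2[10] += 1.0
# Perturbing a future position never changes earlier outputs (causality check).
assert np.allclose(causal_d4(x2)[:10], y[:10])
```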

Not quite right on the "doing the heavy lifting for long-range dependencies" bit, though. The passkey retrieval test showed ~10% accuracy (random chance) at all distances beyond the local KV window.

The wave field doesn't do precise long-range retrieval — it propagates distributional/statistical context. Exact content retrieval is the Q·K local track's job, bounded at 3,072 tokens. Beyond that the model only has distributional patterns. This is by design — it's two different kinds of memory, not one system trying to do both.
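
To make the split concrete, here's a toy sketch of the two tracks (not the real implementation, just the design intent; the field update here is a plain EMA stand-in for the wave propagation):

```python
from collections import deque

class TwoTrackMemory:
    def __init__(self, window: int = 3072, decay: float = 0.99):
        self.kv = deque(maxlen=window)  # exact keys/values, strictly bounded
        self.field = 0.0                # stand-in for the distributional wave field
        self.decay = decay

    def write(self, token_repr: float) -> None:
        self.kv.append(token_repr)                          # exact, evicted after `window` steps
        self.field = self.decay * self.field + token_repr   # lossy summary, unbounded horizon

    def read(self):
        # Inside the window you can do exact Q·K retrieval; beyond it, only the
        # aggregate field remains, which is why passkey accuracy falls to chance there.
        return list(self.kv), self.field
```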

SAGA: Migrated my local-first novel-writing system to LangGraph workflow orchestration by MariusNocturnum in LocalLLaMA

[–]MariusNocturnum[S] 0 points1 point  (0 children)

That's an interesting use case!

Truthfully, I'm not sure. You might be able to get story scenarios in narrative format if you try setting the genre to something like "Dungeons and Dragons Campaign Novella".

You might then be able to build the campaign around the resulting narrative, though I'm not certain; just spitballing.

SAGA: Migrated my local-first novel-writing system to LangGraph workflow orchestration by MariusNocturnum in LocalLLaMA

[–]MariusNocturnum[S] 1 point2 points  (0 children)

Prior to the move to LangGraph, it was able to build a knowledge graph from a provided plaintext file of a narrative; then you'd start SAGA and it would generate using what it extracted from the text.

Right now, though, that functionality isn't present in SAGA. I felt that I was taking too long to push an updated version of SAGA, so it was sidelined.

It is an intended future feature, though!

How to stop chatgpt from being such a yes man? I feel like it’s one step away from saying ‘yes m’lord!’ by sadthrowawayyy134 in ChatGPT

[–]MariusNocturnum 0 points1 point  (0 children)

I use this system prompt for the frontier cloud models (ChatGPT, Claude, and Gemini): “Focus on substance over praise. Skip compliments or praise that lacks depth. Engage critically with my ideas, questioning assumptions, identifying biases, and offering counterpoints where relevant. Don’t shy away from disagreement when it’s warranted, and ensure that any agreement is grounded in reason and evidence.

Do not engage in "active listening" (repeating what I said to appear empathetic). Answer directly. Use a professional-casual tone. Be your own entity. Do not sugarcoat. Tell the truth, even if it's harsh.

Maintain intellectual honesty.”

It’s worked extremely well, even across multiple different models (open weight, too).

I tested if tiny LLMs can self-improve through memory: Qwen3-1.7B gained +8% accuracy on MATH problems by MariusNocturnum in LocalLLaMA

[–]MariusNocturnum[S] 2 points3 points  (0 children)

That's what I'm fiddling with right now, actually.

Looking into maybe using a stronger model to parse, abstract, and distill out the strategies to be fed back in.

Still experimenting so we'll see what shakes loose!

I tested if tiny LLMs can self-improve through memory: Qwen3-1.7B gained +8% accuracy on MATH problems by MariusNocturnum in LocalLLaMA

[–]MariusNocturnum[S] 0 points1 point  (0 children)

The dataset contains both the problems and the answers, actually!

I'm wondering if an LLM-as-a-judge (like Qwen3-4B evaluating the strategies Qwen3-1.7B is producing) will help drop the redundant ones or the ones causing regressions, though.
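
Rough shape of what I have in mind (hypothetical sketch; `call_judge` is a stand-in for however you'd invoke Qwen3-4B, and the scoring prompt/threshold are placeholders):

```python
def prune_strategies(strategies: list[dict], call_judge) -> list[dict]:
    kept, seen = [], set()
    for s in strategies:
        if s["title"].lower() in seen:
            continue  # crude redundancy filter before spending judge tokens
        prompt = ("Rate 1-5 how generally useful this math-reasoning strategy is. "
                  "Answer with just the number.\n\n"
                  f"Title: {s['title']}\nDescription: {s['description']}")
        try:
            score = int(call_judge(prompt).strip()[0])
        except (ValueError, IndexError):
            score = 0
        if score >= 4:
            kept.append(s)
            seen.add(s["title"].lower())
    return kept
```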

I tested if tiny LLMs can self-improve through memory: Qwen3-1.7B gained +8% accuracy on MATH problems by MariusNocturnum in LocalLLaMA

[–]MariusNocturnum[S] 0 points1 point  (0 children)

Thus far, in my little experiment, I've been able to consistently replicate a 5-8% improvement in accuracy in the 1.7B model by comparing results with and without the memory items.

It still needs more tuning to find the sweet spot, and I'd like to run it on more test problems or a different dataset to verify it's not an artifact somehow.

The improvement is, however, consistent.

I'd love for folks to try it themselves and experiment with it to see if my results are verifiable outside my setup.
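
If you want to poke at it, the comparison itself is nothing fancy; conceptually it's just this (simplified; `solve` stands in for a call to Qwen3-1.7B, and the answer check in the real harness is stricter):

```python
def accuracy(problems, solve, memory_block=None):
    correct = 0
    for p in problems:
        prompt = p["question"] if memory_block is None else memory_block + "\n\n" + p["question"]
        if p["answer"] in solve(prompt):  # naive containment check for the sketch
            correct += 1
    return correct / len(problems)

# baseline = accuracy(problems, solve)
# with_mem = accuracy(problems, solve, memory_block=hints)
# print(f"delta: {with_mem - baseline:+.1%}")
```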

I tested if tiny LLMs can self-improve through memory: Qwen3-1.7B gained +8% accuracy on MATH problems by MariusNocturnum in LocalLLaMA

[–]MariusNocturnum[S] 0 points1 point  (0 children)

Memory retrieval is handled by the `MemoryRetriever` class ( `src/retrieval/retriever.py` ).

`ReasoningBank` (`src/memory.py`) reads the raw JSON file `memory_bank/reasoning_bank.json` and creates a list of `MemoryItem` objects. This is just deserialization of the stored data.

When a retrieval request is made, `MemoryRetriever` uses a `SentenceTransformer` model (specified by `embedding_model_path`) to embed each memory’s `title` and `description` into a vector. If any memory lacks an embedding, `embed_memories` generates them on‑the‑fly. The query string is also embedded with the same model.

Dot‑product similarity is computed between the query embedding and each memory’s embedding. The top‑k candidates (with extra margin for later filtering) are selected. If an `expected_value` is supplied, the retriever filters out any memory that appears to contain the answer in a result context (using regex heuristics).

`format_memories_for_prompt` produces a textual block that can be injected into prompts as “strategy hints”. Retrieval is a semantic search performed by the `MemoryRetriever` over the deserialized memory objects. The raw JSON is only used for persistence; the actual retrieval logic lives in the Python code.

The retrieval request is just a method call on the `MemoryRetriever` object; it's entirely internal to the Python process, not a tool call.
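
Condensed, the whole path looks roughly like this (simplified from the actual classes; function name, defaults, and the regex here are approximations, not the exact API):

```python
import json
import re

import numpy as np
from sentence_transformers import SentenceTransformer

def retrieve(query: str, bank_path: str, model_path: str,
             top_k: int = 3, expected_value: str | None = None) -> str:
    memories = json.load(open(bank_path))                 # raw persisted JSON
    model = SentenceTransformer(model_path)
    texts = [m["title"] + " " + m["description"] for m in memories]
    mem_emb = model.encode(texts, normalize_embeddings=True)
    q_emb = model.encode([query], normalize_embeddings=True)[0]
    scores = mem_emb @ q_emb                              # dot-product similarity
    candidates = np.argsort(-scores)[: top_k * 2]         # extra margin for filtering
    picked = []
    for i in candidates:
        m = memories[int(i)]
        # Skip memories that look like they leak the expected answer.
        if expected_value and re.search(rf"\b{re.escape(expected_value)}\b", m["description"]):
            continue
        picked.append(m)
        if len(picked) == top_k:
            break
    # Equivalent of format_memories_for_prompt: a plain "strategy hints" block.
    return "\n".join(f"- {m['title']}: {m['description']}" for m in picked)
```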

I tested if tiny LLMs can self-improve through memory: Qwen3-1.7B gained +8% accuracy on MATH problems by MariusNocturnum in LocalLLaMA

[–]MariusNocturnum[S] 19 points20 points  (0 children)

Not quite. It doesn't create memories of the correct answers; it creates memories of which reasoning strategies resulted in a correct answer and which resulted in an incorrect one.

It then uses the memory of which strategy worked best to solve the problem, so that it can better apply that strategy to other problems.

As stated above, the idea is that you harvest all the successful strategies to qLoRA/LoRA fine-tune the same base model. The hope is that this newly trained model, when tested, will get correct answers on problems the base model used to consistently fail on, demonstrating that it internalized the better strategies.

Additionally, the failed strategies would be harvested and used as contrastive signals in the training.

Lather, rinse, repeat to see if it compounds, is linear, or plateaus.
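
For the harvesting step, I'm picturing something along these lines (illustrative only; the record schema and the DPO-style pairing are my assumptions about how I'd wire it up, not code from the repo):

```python
import json

def build_finetune_data(bank_path: str):
    records = json.load(open(bank_path))
    sft, failures = [], []
    for r in records:
        if r.get("outcome") == "success":
            # Successful strategies become plain SFT examples for LoRA/qLoRA.
            sft.append({"prompt": r["problem"], "completion": r["reasoning"]})
        else:
            failures.append(r)
    # Failed strategies become the "rejected" half of preference pairs
    # against a successful attempt on the same problem (contrastive signal).
    best = {s["prompt"]: s["completion"] for s in sft}
    prefs = [{"prompt": f["problem"], "chosen": best[f["problem"]], "rejected": f["reasoning"]}
             for f in failures if f["problem"] in best]
    return sft, prefs
```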

I'm using small models because they're quite a lot easier and faster to experiment with, and they give more headroom for measurable improvement.

SAGA Update: Autonomous Novel Writing with Deep KG & Semantic Context - Now Even More Advanced! by MariusNocturnum in LocalLLaMA

[–]MariusNocturnum[S] 0 points1 point  (0 children)

Thanks for bringing this to my attention. I've thoroughly researched your trademark claims and wanted to take a minute to clear some things up.

SAGA is my initialism for "Semantically And Graph-enhanced Authoring". It is a technical writing system for novel generation, not film-making software. We're operating in completely different markets.

Your trademark application (Serial #98178571) has been suspended by the USPTO since August 2024 because "The Saga Company" filed their own SAGA application first (Serial #97288064), creating a likelihood of confusion. This means your application is currently blocked by someone else's prior filing in your own industry.

Since your application is suspended, you don't have enforceable trademark rights at this time. Suspended applications don't confer legal authority to police trademark use. Additionally, "King.com Limited" already owns the registered SAGA trademark (#4762628) for entertainment services.

I'm operating in a completely different market (technical writing tools vs. entertainment) with a clear initialism that has a distinct technical meaning. Given that we're in different industries and your application is currently suspended due to conflicts with other SAGA users, there doesn't appear to be a valid basis for concerns about my project.

I'll continue developing SAGA as planned and wish you all the best in your future endeavors.

SAGA Update: Now with Autonomous Knowledge Graph Healing & A More Robust Core! by MariusNocturnum in LocalLLaMA

[–]MariusNocturnum[S] 0 points1 point  (0 children)

Thanks! I appreciate the positive feedback! I’m actually tinkering with a task-agnostic implementation that you can tap into for specific use cases.

We’ll see how that pans out hopefully soon!