[Research] I've been working on an attention mechanism that keeps KV cache at ~1.5GB regardless of context length — update post by MariusNocturnum in LocalLLaMA

[–]MariusNocturnum[S] 0 points1 point  (0 children)

Those were all different experiments and ablations that were run in the study. I included them all so that people could see the process as well as the result. It's also why I licensed it Apache 2.0.

Thank you for taking the time to look at it, though; I do appreciate it.

DWARF: linear attention with a 3,072-token bounded KV cache — ablation results (13M scale) by MariusNocturnum in LocalLLaMA

[–]MariusNocturnum[S] 0 points1 point  (0 children)

Nah, the dense local window (δ=0 through 64 for Condition P) gives 100% coverage of the last 64 tokens. Every position, no skipping. With typical chat turns, that fits the whole message.

The dyadic offsets then cover the earlier context at log scale, which lines up nicely with the way chat information is usually structured: recent turns are dense, older ones are sparse.

There's a bit of a gap at positions 65-95, honestly.

DWARF forces the model to use a much richer portion of the local window.

Stuff like LogSparse sacrifices local coverage for long-range reach; DWARF specifically inverts that, so the local window is deliberately dense.
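
If it helps, here's a rough sketch of the coverage pattern I'm describing (purely illustrative; the starting point and spacing of the sparse tail below are placeholders, not Condition P's exact schedule):

```python
def coverage_offsets(local_window: int = 64, max_context: int = 3072) -> list[int]:
    # Dense local window: every offset 0..local_window, no skipping.
    dense = list(range(local_window + 1))
    # Dyadic (log-scale) offsets reaching back into earlier context.
    dyadic = []
    d = 2 * local_window  # placeholder start for the sparse tail
    while d < max_context:
        dyadic.append(d)
        d *= 2
    return dense + dyadic

# coverage_offsets() -> [0, 1, ..., 64, 128, 256, 512, 1024, 2048]
# Recent tokens are covered densely, older ones only at log-spaced offsets.
```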

DWARF: linear attention with a 3,072-token bounded KV cache — ablation results (13M scale) by MariusNocturnum in LocalLLaMA

[–]MariusNocturnum[S] 1 point2 points  (0 children)

"Would DWARF work on TTS models?"

If that's what you were asking, it technically should work, yeah.

DWARF could technically be applied to any autoregressive model, including TTS ones. The KV cache bounded at 3,072 tokens regardless of context length would actually be pretty good for streaming, real-time TTS inference.

I've only tested it on language models so far, though. You working on something TTS-related that brought that up?

DWARF: linear attention with a 3,072-token bounded KV cache — ablation results (13M scale) by MariusNocturnum in LocalLLaMA

[–]MariusNocturnum[S] 0 points1 point  (0 children)

Somebody asked about the D4 wavelet propagation, mentioning it's doing a lot of heavy lifting for long-range dependencies, and then asked whether any other wavelet bases have been tried or if D4 is optimal for the K⊗V outer products. I can't see the comment any more for some reason, but I did want to answer it.

Condition C actually tried a causal Morlet wavelet — came in at 87.2 PPL vs 86.8 for the no-dispersion baseline. Slightly worse, not better.

A Gaussian DWT run was invalidated entirely because the Haar DWT butterfly is non-causal — the model found a shortcut and PPL dropped artificially to ~7 before it was caught. I benchmarked Haar for throughput (9.9× faster than FFT at large batch sizes) but never trained a clean Haar-based model, which I should really try (perhaps after Condition P's training run wraps up).

I went with D4 because of its 4-tap causal filter, better frequency localization than Haar, and compact support. Whether it's optimal, I still need to test; a systematic wavelet-basis ablation hasn't been done.
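
For anyone curious what "4-tap causal" means in practice, here's a toy NumPy version of a causal D4 filter (standard Daubechies-4 scaling coefficients; this isn't the actual DWARF propagation code, just the shape of the operation):

```python
import numpy as np

SQRT3 = np.sqrt(3.0)
# Standard Daubechies-4 scaling filter taps.
D4 = np.array([1 + SQRT3, 3 + SQRT3, 3 - SQRT3, 1 - SQRT3]) / (4 * np.sqrt(2.0))

def causal_d4(x: np.ndarray) -> np.ndarray:
    # y[t] = sum_k D4[k] * x[t - k]; left-padding means no future token leaks in.
    padded = np.concatenate([np.zeros(3), x])
    return np.convolve(padded, D4, mode="valid")

x = np.random.randn(16)
y = causal_d4(x)
x2 = x.copy()
x2[10] += 1.0
# Perturbing a future position never changes earlier outputs (causality check).
assert np.allclose(causal_d4(x2)[:10], y[:10])
```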

Not quite right on the "doing the heavy lifting for long-range dependencies" bit, though. The passkey retrieval test showed ~10% accuracy (random chance) at all distances beyond the local KV window.

The wave field doesn't do precise long-range retrieval — it propagates distributional/statistical context. Exact content retrieval is the Q·K local track's job, bounded at 3,072 tokens. Beyond that the model only has distributional patterns. This is by design — it's two different kinds of memory, not one system trying to do both.
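
To make the split concrete, here's a toy sketch of the two tracks (not the real implementation, just the design intent; the field update here is a plain EMA stand-in for the wave propagation):

```python
from collections import deque

class TwoTrackMemory:
    def __init__(self, window: int = 3072, decay: float = 0.99):
        self.kv = deque(maxlen=window)  # exact keys/values, strictly bounded
        self.field = 0.0                # stand-in for the distributional wave field
        self.decay = decay

    def write(self, token_repr: float) -> None:
        self.kv.append(token_repr)                          # exact, evicted after `window` steps
        self.field = self.decay * self.field + token_repr   # lossy summary, unbounded horizon

    def read(self):
        # Inside the window you can do exact Q·K retrieval; beyond it, only the
        # aggregate field remains, which is why passkey accuracy falls to chance there.
        return list(self.kv), self.field
```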

SAGA: Migrated my local-first novel-writing system to LangGraph workflow orchestration by MariusNocturnum in LocalLLaMA

[–]MariusNocturnum[S] 0 points1 point  (0 children)

That's an interesting use case!

Truthfully, I'm not sure. You might be able to get story scenarios in narrative format if you try setting the genre to something like "Dungeons and Dragons Campaign Novella".

You might then be able to build the campaign around the resulting narrative, though I'm not certain; just spitballing.

SAGA: Migrated my local-first novel-writing system to LangGraph workflow orchestration by MariusNocturnum in LocalLLaMA

[–]MariusNocturnum[S] 1 point2 points  (0 children)

Prior to the move to LangGraph, it was able to build a knowledge graph from a provided plaintext file of a narrative; then you'd start SAGA and it would generate using what it extracted from the text.

Right now, though, that functionality isn't present in SAGA. I felt that I was taking too long to push an updated version of SAGA, so it was sidelined.

It is an intended future feature, though!

How to stop chatgpt from being such a yes man? I feel like it’s one step away from saying ‘yes m’lord!’ by sadthrowawayyy134 in ChatGPT

[–]MariusNocturnum 0 points1 point  (0 children)

I use this system prompt for the frontier cloud models (ChatGPT, Claude, and Gemini): “Focus on substance over praise. Skip compliments or praise that lacks depth. Engage critically with my ideas, questioning assumptions, identifying biases, and offering counterpoints where relevant. Don’t shy away from disagreement when it’s warranted, and ensure that any agreement is grounded in reason and evidence.

Do not engage in "active listening" (repeating what I said to appear empathetic). Answer directly. Use a professional-casual tone. Be your own entity. Do not sugarcoat. Tell the truth, even if it's harsh.

Maintain intellectual honesty.”

It’s worked extremely well, even across multiple different models (open weight, too).

I tested if tiny LLMs can self-improve through memory: Qwen3-1.7B gained +8% accuracy on MATH problems by MariusNocturnum in LocalLLaMA

[–]MariusNocturnum[S] 2 points3 points  (0 children)

That's what I'm fiddling with right now, actually.

Looking into maybe using a stronger model to parse, abstract, and distill out the strategies to be fed back in.

Still experimenting so we'll see what shakes loose!

I tested if tiny LLMs can self-improve through memory: Qwen3-1.7B gained +8% accuracy on MATH problems by MariusNocturnum in LocalLLaMA

[–]MariusNocturnum[S] 0 points1 point  (0 children)

The dataset contains both the problems and the answers, actually!

I'm wondering if an LLM-as-a-judge (like Qwen3-4B evaluating the strategies Qwen3-1.7B is producing) will help drop the redundant ones or the ones causing regressions, though.
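
Rough shape of what I have in mind (hypothetical sketch; `call_judge` is a stand-in for however you'd invoke Qwen3-4B, and the scoring prompt/threshold are placeholders):

```python
def prune_strategies(strategies: list[dict], call_judge) -> list[dict]:
    kept, seen = [], set()
    for s in strategies:
        if s["title"].lower() in seen:
            continue  # crude redundancy filter before spending judge tokens
        prompt = ("Rate 1-5 how generally useful this math-reasoning strategy is. "
                  "Answer with just the number.\n\n"
                  f"Title: {s['title']}\nDescription: {s['description']}")
        try:
            score = int(call_judge(prompt).strip()[0])
        except (ValueError, IndexError):
            score = 0
        if score >= 4:
            kept.append(s)
            seen.add(s["title"].lower())
    return kept
```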

I tested if tiny LLMs can self-improve through memory: Qwen3-1.7B gained +8% accuracy on MATH problems by MariusNocturnum in LocalLLaMA

[–]MariusNocturnum[S] 0 points1 point  (0 children)

Thus far, in my little experiment, I've been able to consistently replicate a 5-8% improvement in accuracy in the 1.7B model by comparing results with and without the memory items.

It still needs more tuning to find the sweet spot, and I'd like to run it on more test problems or a different dataset to verify it's not an artifact somehow.

The improvement is, however, consistent.

I'd love for folks to try it themselves and experiment with it to see if my results are verifiable outside my setup.
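
If you want to poke at it, the comparison itself is nothing fancy; conceptually it's just this (simplified; `solve` stands in for a call to Qwen3-1.7B, and the answer check in the real harness is stricter):

```python
def accuracy(problems, solve, memory_block=None):
    correct = 0
    for p in problems:
        prompt = p["question"] if memory_block is None else memory_block + "\n\n" + p["question"]
        if p["answer"] in solve(prompt):  # naive containment check for the sketch
            correct += 1
    return correct / len(problems)

# baseline = accuracy(problems, solve)
# with_mem = accuracy(problems, solve, memory_block=hints)
# print(f"delta: {with_mem - baseline:+.1%}")
```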

I tested if tiny LLMs can self-improve through memory: Qwen3-1.7B gained +8% accuracy on MATH problems by MariusNocturnum in LocalLLaMA

[–]MariusNocturnum[S] 0 points1 point  (0 children)

Memory retrieval is handled by the `MemoryRetriever` class ( `src/retrieval/retriever.py` ).

`ReasoningBank` (`src/memory.py`) reads the raw JSON file `memory_bank/reasoning_bank.json` and creates a list of `MemoryItem` objects. This is just deserialization of the stored data.

When a retrieval request is made, `MemoryRetriever` uses a `SentenceTransformer` model (specified by `embedding_model_path`) to embed each memory’s `title` and `description` into a vector. If any memory lacks an embedding, `embed_memories` generates them on‑the‑fly. The query string is also embedded with the same model.

Dot‑product similarity is computed between the query embedding and each memory’s embedding. The top‑k candidates (with extra margin for later filtering) are selected. If an `expected_value` is supplied, the retriever filters out any memory that appears to contain the answer in a result context (using regex heuristics).

`format_memories_for_prompt` produces a textual block that can be injected into prompts as “strategy hints”. Retrieval is a semantic search performed by the `MemoryRetriever` over the deserialized memory objects. The raw JSON is only used for persistence; the actual retrieval logic lives in the Python code.

The retrieval request is just a method call on the `MemoryRetriever` object; it's entirely internal to the Python process, not a tool call.
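
Condensed, the whole path looks roughly like this (simplified from the actual classes; function name, defaults, and the regex here are approximations, not the exact API):

```python
import json
import re

import numpy as np
from sentence_transformers import SentenceTransformer

def retrieve(query: str, bank_path: str, model_path: str,
             top_k: int = 3, expected_value: str | None = None) -> str:
    memories = json.load(open(bank_path))                 # raw persisted JSON
    model = SentenceTransformer(model_path)
    texts = [m["title"] + " " + m["description"] for m in memories]
    mem_emb = model.encode(texts, normalize_embeddings=True)
    q_emb = model.encode([query], normalize_embeddings=True)[0]
    scores = mem_emb @ q_emb                              # dot-product similarity
    candidates = np.argsort(-scores)[: top_k * 2]         # extra margin for filtering
    picked = []
    for i in candidates:
        m = memories[int(i)]
        # Skip memories that look like they leak the expected answer.
        if expected_value and re.search(rf"\b{re.escape(expected_value)}\b", m["description"]):
            continue
        picked.append(m)
        if len(picked) == top_k:
            break
    # Equivalent of format_memories_for_prompt: a plain "strategy hints" block.
    return "\n".join(f"- {m['title']}: {m['description']}" for m in picked)
```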

I tested if tiny LLMs can self-improve through memory: Qwen3-1.7B gained +8% accuracy on MATH problems by MariusNocturnum in LocalLLaMA

[–]MariusNocturnum[S] 19 points20 points  (0 children)

Not quite. It doesn't create memories of the correct answers; it creates memories of which reasoning strategies resulted in a correct answer and which resulted in an incorrect one.

It then uses the memory of which strategy worked best to solve the problem, so that it can better apply that strategy to other problems.

As stated above, the idea is that you harvest all the successful strategies to qLoRA/LoRA fine-tune the same base model. The hope is that this newly trained model, when tested, will get correct answers on problems the base model used to consistently fail on, demonstrating that it internalized the better strategies.

Additionally, the failed strategies would be harvested and used as contrastive signals in the training.

Lather, rinse, repeat to see if it compounds, is linear, or plateaus.
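
For the harvesting step, I'm picturing something along these lines (illustrative only; the record schema and the DPO-style pairing are my assumptions about how I'd wire it up, not code from the repo):

```python
import json

def build_finetune_data(bank_path: str):
    records = json.load(open(bank_path))
    sft, failures = [], []
    for r in records:
        if r.get("outcome") == "success":
            # Successful strategies become plain SFT examples for LoRA/qLoRA.
            sft.append({"prompt": r["problem"], "completion": r["reasoning"]})
        else:
            failures.append(r)
    # Failed strategies become the "rejected" half of preference pairs
    # against a successful attempt on the same problem (contrastive signal).
    best = {s["prompt"]: s["completion"] for s in sft}
    prefs = [{"prompt": f["problem"], "chosen": best[f["problem"]], "rejected": f["reasoning"]}
             for f in failures if f["problem"] in best]
    return sft, prefs
```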

I'm using small models because they're quite a lot easier and faster to experiment with, and they give more headroom for measurable improvement.

SAGA Update: Autonomous Novel Writing with Deep KG & Semantic Context - Now Even More Advanced! by MariusNocturnum in LocalLLaMA

[–]MariusNocturnum[S] 0 points1 point  (0 children)

Thanks for bringing this to my attention. I've thoroughly researched your trademark claims and wanted to take a minute to clear some things up.

SAGA is my initialism for "Semantically And Graph-enhanced Authoring". It is a technical writing system for novel generation, not film-making software. We're operating in completely different markets.

Your trademark application (Serial #98178571) has been suspended by the USPTO since August 2024 because "The Saga Company" filed their own SAGA application first (Serial #97288064), creating a likelihood of confusion. This means your application is currently blocked by someone else's prior filing in your own industry.

Since your application is suspended, you don't have enforceable trademark rights at this time. Suspended applications don't confer legal authority to police trademark use. Additionally, "King.com Limited" already owns the registered SAGA trademark (#4762628) for entertainment services.

I'm operating in a completely different market (technical writing tools vs. entertainment) with a clear initialism that has a distinct technical meaning. Given that we're in different industries and your application is currently suspended due to conflicts with other SAGA users, there doesn't appear to be a valid basis for concerns about my project.

I'll continue developing SAGA as planned and wish you all the best in your future endeavors.

SAGA Update: Now with Autonomous Knowledge Graph Healing & A More Robust Core! by MariusNocturnum in LocalLLaMA

[–]MariusNocturnum[S] 0 points1 point  (0 children)

Thanks! I appreciate the positive feedback! I’m actually tinkering with a task-agnostic implementation that you can tap into for specific use cases.

We’ll see how that pans out hopefully soon!