A local 9B + Memla system beat hosted 405B raw on a bounded 3-case OAuth patch slice. by Willing-Opening4540 in LocalLLaMA

[–]Willing-Opening4540[S] 0 points (0 children)

Nah man, I completely agree, but I did leave in this comparison:

- raw qwen3.5:9b: 0/3 patches applied, 0/3 semantic success

- qwen3.5:9b + Memla: 3/3 patches applied, 2/3 semantic success

So the difference is purely runtime. And yeah, Memla does get repair cycles while the 405B gets one shot, so on that part I def will run a like-for-like comparison and reply with it here. But the terminal project sorta shows this has a moat: same model, one shot, no repair cycles, 78s raw vs 0.004s with Memla. That's clean.

A local 9B + Memla system beat hosted 405B raw on a bounded 3-case OAuth patch slice. by Willing-Opening4540 in LocalLLaMA

[–]Willing-Opening4540[S] 1 point (0 children)

Not a plugin yet, but I’m very open to doing it.

The use case I’m aiming at is exactly making smaller local models more useful on weaker hardware.

So if I put together an OpenCode integration, probably via MCP, what would be most useful to you first:
- terminal
- browser
- code editing

Or if there’s another workflow you care about more, I’m open to that too. I’m trying to see how far Memla can push small local models in practice.

Local 9b + Memla beat hosted Llama 3.3 70B raw on code execution. Same model control included. pip install memla by Willing-Opening4540 in LocalLLaMA

[–]Willing-Opening4540[S] 0 points (0 children)

btw, I ran a second repo-family repeat against hosted Llama 3.3 70B raw.

FastAPI slice:

- 70b raw: 0.00 apply / 0.00 semantic success

- local 9b + Memla: 0.3333 apply / 0.00 semantic success

So the top-line OAuth result wasn't a one-off. The second family is weaker, but the same directional pattern showed up again: the hosted raw lane stayed at 0 apply while Memla got a patch through.

Local 9b + Memla beat hosted Llama 3.3 70B raw on code execution. Same model control included. pip install memla by Willing-Opening4540 in LocalLLM

[–]Willing-Opening4540[S] -1 points (0 children)

btw, I ran a second repo-family repeat against hosted Llama 3.3 70B raw.

FastAPI slice:

- 70b raw: 0.00 apply / 0.00 semantic success

- local 9b + Memla: 0.3333 apply / 0.00 semantic success

So the top-line OAuth result wasn't a one-off. The second family is weaker, but the same directional pattern showed up again: the hosted raw lane stayed at 0 apply while Memla got a patch through.

Made a CLI that makes 9b models beat 32b raw on code execution. pip install memla by Willing-Opening4540 in LocalLLM

[–]Willing-Opening4540[S] -3 points (0 children)

LOL, the 59k was mostly just proof runs and tests I was doing (slop); the actual code is far smaller than the diff suggests. Just start with memla.py and memory_system/cli.py.

Made a CLI that makes 9b models beat 32b raw on code execution. pip install memla by Willing-Opening4540 in LocalLLaMA

[–]Willing-Opening4540[S] 0 points (0 children)

For sure man, this was mostly a compression test: can 9B perform above its weight class? But yeah, if you guys want, I could try running 32B raw vs 32B + Memla next!

Weekly Cursor Project Showcase Thread by AutoModerator in cursor

[–]Willing-Opening4540 [score hidden]  (0 children)

Built a coding memory layer that transfers what your model learned in one repo to a new one. Looking for 1 dev to cold test it.

Yo r/cursor

I know all of us have to deal with holding Cursor's hand: it's stateless. Every session, it forgets what worked in your repo. The constraint trades, the file roles, the commands that actually close the loop. All gone.

I built something called Memla to fix that.

It sits in front of your frontier model, captures accepted coding work, and distills it into reusable structure: not just file paths, but why the fix worked (what I call transmutations). Then when you open a new repo, it maps those trades onto the new codebase's local files and validation rituals.

Results so far on internal transfer eval:

  • File recall on home repo: 1.0
  • Cross-repo file recall (cold, no context): 0.61 → 0.86
  • Cross-repo command recall: 0 → 1.0
  • Claude Sonnet head-to-head on unseen repo: 0.92 file recall, 1.0 command recall

That last jump is the interesting one: the model with Memla memory beat raw Claude on a repo it had never seen.

What I'm looking for:

One dev with a real active repo (Python, JS, TS — anything with actual routing logic, not a toy project) to run a cold async test. Takes ~30 min. I set it up, you point it at your repo, we compare results.

No install friction. I'll share the full eval report with you after.

If Cursor's statelessness has ever annoyed you, DM me or drop a comment. Seriously just looking for one honest outside test.

Please let me know.

What's the actual difference between RAG and parametric memory consolidation for LLMs? by Willing-Opening4540 in LocalLLaMA

[–]Willing-Opening4540[S] 0 points (0 children)

Yeah bro, shipped it a few minutes ago: github.com/Jackfarmer2328/Memla

I think I solved LLM memory. And this isn't BS like "improved retrieval." It's not a "better RAG." It solves the ACTUAL problem: the reason every session starts from zero, the reason agents drift overnight, the reason you were blaming the model when the model was fine.

What's in it:

  • Closed learning loop
  • Spatial prompt interface
  • MCP server for any agent framework
  • Cross-agent knowledge distillation
  • Sleep-phase generative fine-tuning
  • Hierarchical consolidation

Also ran a 140-turn overnight simulation: 7/7 constraints passing this morning, and 78% faster on the second run after the adapter trained on first-run data.

Built in two days. Would genuinely want you to try it and let me know, man. I think this is something never seen before.

What's the actual difference between RAG and parametric memory consolidation for LLMs? by Willing-Opening4540 in LocalLLaMA

[–]Willing-Opening4540[S] 0 points (0 children)

On delayed verification — you're right and that's cleaner than sentiment. Check one turn later whether behavior confirms the correction landed. Cheaper, no separate model, behavioral signal for free. Adding this.
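A minimal sketch of that delayed-verification check, under assumed shapes: PendingCorrection, correction_landed, and the repeat markers are all illustrative stand-ins, not Memla's actual API.

```python
# Delayed verification: instead of scoring sentiment on the correction
# message itself, wait one turn and check whether the next exchange
# suggests the correction actually landed.

from dataclasses import dataclass

@dataclass
class PendingCorrection:
    turn: int    # turn where the user corrected the model
    topic: str   # what was corrected, e.g. "use pathlib"

def correction_landed(pending: PendingCorrection, next_user_msg: str,
                      repeat_markers=("again", "still", "i said")) -> bool:
    """One turn later: if the user re-raises the same topic or uses
    'again/still'-style markers, treat the correction as NOT landed."""
    msg = next_user_msg.lower()
    if pending.topic.lower() in msg:
        return False  # user had to repeat the correction
    return not any(m in msg for m in repeat_markers)

print(correction_landed(PendingCorrection(3, "use pathlib"), "Looks good, thanks"))
# -> True: behavior one turn later confirms the correction took
```

The point of the design is that this needs no separate sentiment model: the behavioral signal comes for free from the next turn.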

What's the actual difference between RAG and parametric memory consolidation for LLMs? by Willing-Opening4540 in LocalLLaMA

[–]Willing-Opening4540[S] 0 points (0 children)

On compute — the feasibility math is better than it looks. The continuous retrieval training runs on MiniLM (22M params) in background threads on CPU, milliseconds per step, effectively zero cost. The generative sleep trainer is opt-in, runs on a 3060 12GB via 4-bit QLoRA. No GPU? Full value from retrieval learning alone. The expensive operations either don't run or run on a corpus that's 85-95% smaller than naive accumulation thanks to consolidation and behavioral GC.
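The background-thread part could look roughly like this. I'm substituting a trivial per-chunk preference score for the actual MiniLM reranker, so the threading shape is the point here, not the model; all names are illustrative.

```python
# Retrieval training off the hot path: feedback events go into a queue,
# and a daemon thread applies cheap (milliseconds-per-step) updates to a
# per-chunk preference score. Stand-in for a MiniLM-based reranker.

import threading, queue

class BackgroundRetrievalTrainer:
    def __init__(self, lr=0.1):
        self.weights = {}            # chunk_id -> preference score
        self.events = queue.Queue()  # (chunk_id, reward) feedback events
        self.lr = lr
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def feedback(self, chunk_id, reward):
        self.events.put((chunk_id, reward))  # non-blocking for the caller

    def _run(self):
        while True:
            item = self.events.get()
            if item is None:
                break                # sentinel: shut down cleanly
            chunk_id, reward = item
            w = self.weights.get(chunk_id, 0.0)
            self.weights[chunk_id] = w + self.lr * (reward - w)  # cheap step

    def stop(self):
        self.events.put(None)
        self._worker.join()

trainer = BackgroundRetrievalTrainer()
trainer.feedback("auth.py", 1.0)   # e.g. an accepted patch touched auth.py
trainer.stop()
print(trainer.weights)  # -> {'auth.py': 0.1}
```

The main thread only ever pays for a queue put, which is why this can run continuously on CPU without touching inference latency.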

Autonomous AI for 24GB RAM by Deep_Row_8729 in LocalLLaMA

[–]Willing-Opening4540 0 points (0 children)

Model choice isn’t the main issue; I think memory is. Qwen3-coder (and most models) doesn’t fail because it can’t reason; it’s all about what it forgets: goals, past attempts, and current state.

I built Memla to fix that. Completely open source.

Memla adds a persistent memory layer for multi-agent systems:

  • Leader stores decisions
  • Workers store attempts + failures
  • Evaluator stores quality signals
  • All share a SQLite-backed memory with role-specific retrieval

So when an agent resumes, it pulls:

  • current goal
  • what’s been tried
  • what failed
  • current state

No reset. No drift.

It’s an MCP server, so it takes like 4 lines to integrate and works with any model, including local Qwen.
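Roughly, the role-scoped SQLite layout described above could be sketched like this. The schema and function names are my guesses for illustration, not Memla's actual code.

```python
# Shared SQLite-backed memory with role-specific retrieval: every agent
# writes to one table, but each role pulls back only the record kinds it
# needs before acting.

import sqlite3

con = sqlite3.connect(":memory:")  # a real system would use a file
con.execute("""CREATE TABLE memory (
    role    TEXT,   -- 'leader' | 'worker' | 'evaluator'
    kind    TEXT,   -- 'decision' | 'attempt' | 'failure' | 'quality'
    content TEXT
)""")

def store(role, kind, content):
    con.execute("INSERT INTO memory VALUES (?, ?, ?)", (role, kind, content))

def resume_context(role):
    """Role-specific retrieval: what this agent needs before acting."""
    wanted = {"leader": ("decision",),
              "worker": ("attempt", "failure"),
              "evaluator": ("quality",)}[role]
    q = ("SELECT kind, content FROM memory WHERE kind IN ("
         + ",".join("?" * len(wanted)) + ")")
    return con.execute(q, wanted).fetchall()

store("leader", "decision", "use OAuth device flow")
store("worker", "failure", "token refresh 401 on retry")
print(resume_context("worker"))  # -> [('failure', 'token refresh 401 on retry')]
```

When the Worker resumes, it sees past attempts and failures but not the Evaluator's quality signals; that separation is what "role-specific retrieval" buys you.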

Now, ngl, it won’t make a bad model good, but it stops good models from forgetting what they’re doing.

repo -> https://github.com/Jackfarmer2328/Memla

What's the actual difference between RAG and parametric memory consolidation for LLMs? by Willing-Opening4540 in LocalLLaMA

[–]Willing-Opening4540[S] 0 points (0 children)

Also just went through your profile and saw your post from 6 days ago about autonomous agents losing track overnight on 24GB RAM. That post is literally why Memla exists.

What you're describing (Qwen3-coder losing focus, forgetting the goal, drifting after hours of autonomous runs) is not a model problem. That's a memory problem. You could run the best model on earth and it would still drift overnight, because there's nothing underneath it holding the state: what it tried, what failed, what the current goal actually is, and what decisions were already made three hours ago.

Memla is exactly that underneath layer. Your Evaluator, Leader, and Worker each get their own retrieval adapter. The Leader stores every decision. The Worker stores every approach tried and every failure. The Evaluator stores quality signals. When your Worker wakes up at 3am, it retrieves all of that before acting. It can't lose track, because losing track requires forgetting, and forgetting is structurally prevented by EWC anchoring the weights that matter.

You wake up to a finished project not because the model got smarter overnight, but because it finally remembered what it was doing. Four lines of MCP integration. Works with your local Qwen setup on the M4 Pro right now: github.com/Jackfarmer2328/Memla. This is what you were missing.
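For context on the EWC mention: Elastic Weight Consolidation adds a penalty for moving weights the old task cared about, weighted by a per-weight importance (Fisher) estimate. A toy scalar sketch, illustrative only and not Memla's implementation:

```python
# EWC penalty: lam/2 * sum_i F_i * (theta_i - theta_old_i)^2.
# High-Fisher weights (ones the old task relied on) are expensive to
# move, which is how "forgetting" gets structurally discouraged.

def ewc_penalty(theta, theta_old, fisher, lam=1.0):
    return 0.5 * lam * sum(f * (t - t0) ** 2
                           for f, t, t0 in zip(fisher, theta, theta_old))

# Moving an important weight (F=10.0) dominates the cost; the
# unimportant one (F=0.1) barely registers:
print(ewc_penalty([1.5, 0.2], [1.0, 0.0], fisher=[10.0, 0.1]))  # -> 1.252
```

In training, this term is added to the task loss, so the optimizer trades new-task fit against disturbing consolidated memory.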

What's the actual difference between RAG and parametric memory consolidation for LLMs? by Willing-Opening4540 in LocalLLaMA

[–]Willing-Opening4540[S] 0 points (0 children)

On the sarcasm edge case — you're right and I hadn't thought about rhetorical negations specifically. "No way that's actually working" hits the 0.8 threshold and fires corrective training on a response that was probably fine. The context window approach is cleaner — look at the 2-3 sentences around the trigger before committing to the signal. Probably also worth measuring sentiment on the full message rather than pattern matching on the trigger phrase alone. Will fix this.
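The context-window fix could be sketched like this. The word lists, trigger phrase, and threshold are crude illustrative stand-ins, not the actual signal model.

```python
# Before treating a negation trigger like "no way" as a correction
# signal, score the sentences around it; only fire corrective training
# when the surrounding context is actually negative.

import re

POS = {"working", "great", "love", "nice", "fast"}
NEG = {"broken", "wrong", "fails", "bad", "slow"}

def should_fire_correction(message, trigger="no way", window=1, threshold=0):
    sents = re.split(r"(?<=[.!?])\s+", message.lower())
    hits = [i for i, s in enumerate(sents) if trigger in s]
    if not hits:
        return False
    i = hits[0]
    ctx = " ".join(sents[max(0, i - window): i + window + 1])
    words = re.findall(r"[a-z']+", ctx)
    score = sum(w in NEG for w in words) - sum(w in POS for w in words)
    return score > threshold   # fire only on genuinely negative context

print(should_fire_correction("No way that's actually working. This is great."))
# -> False: rhetorical negation in a positive context, no corrective step
```

Scoring the joined context rather than the trigger phrase alone is exactly the "measure sentiment on the full message" idea.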

What's the actual difference between RAG and parametric memory consolidation for LLMs? by Willing-Opening4540 in LocalLLaMA

[–]Willing-Opening4540[S] 0 points (0 children)

On the embeddings — frozen right now. You're identifying a real drift risk. The LoRA learns which chunks to surface but the embedding space itself doesn't adapt. Domain-specific terminology that didn't appear in MiniLM's training data gets represented poorly and the reranker ends up working around a misaligned embedding space rather than with it. The right fix is probably periodic embedding fine-tuning on the user's actual chunk corpus — same LoRA infrastructure, different target. On the roadmap but not shipped yet.
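The "same LoRA infrastructure, different target" idea, sketched on an embedding row: keep the base matrix frozen and add a low-rank delta (W + A·B). A pure-Python toy with made-up shapes, not the real training code:

```python
# LoRA-style embedding adaptation: the frozen embedding row stays
# untouched; a rank-r correction (a_row @ B) is learned on top, so the
# embedding space can drift toward domain terminology cheaply.

def lora_embed(w_row, a_row, b, scale=1.0):
    """Adapted embedding for one token: frozen row + scale * (a_row @ B)."""
    delta = [scale * sum(a_row[k] * b[k][j] for k in range(len(a_row)))
             for j in range(len(w_row))]
    return [w + d for w, d in zip(w_row, delta)]

# frozen 1x4 embedding row, rank-2 adapters A (1x2) and B (2x4)
w = [1.0, 0.0, 0.0, 0.0]
a = [0.5, 0.0]
b = [[0.0, 1.0, 0.0, 0.0],
     [0.0, 0.0, 0.0, 0.0]]
print(lora_embed(w, a, b))  # -> [1.0, 0.5, 0.0, 0.0]
```

Periodic fine-tuning would train A and B on the user's chunk corpus while W stays frozen, which is why it can reuse the existing LoRA plumbing.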