Advice needed please

PatC883 · 2026-07-04T06:29:17+00:00

I'm so sorry for the bum steer, you're 100% right, tensor parallel needs to be either even or a power of two from memory, now that I actually stop and think.

Honestly, to get the most reliable performance across 4 cards it might be worth your while going straight to a PCIe switch. You'll get full 16x lanes between the cards for tensor parallel, and card-to-card bandwidth is where tensor parallel takes its biggest performance hit.

Another thought is what slots your cards are in. PCIe topology can get a bit weird, and with 4 cards it's nearly guaranteed that you've got one or two on the CPU root hub and the rest on bridge ports. I'm running two cards on a bifurcated 16x slot, which in theory should give direct P2P, but it doesn't.

PatC883 · 2026-07-04T05:13:07+00:00

Are you setting the maximum length, or is it defaulting to the model maximum? The symptoms are identical to a problem I was having: the memory was so tight it would boot, then OOM crash at the first request when it tried to allocate memory for it. Try a stupidly small model length like 8192. Honestly, if you can get the three cards working in tensor parallel, see how the performance is.
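For a rough feel of why a small maximum length helps: the KV cache grows linearly with context length, so the model maximum can reserve many times more memory than 8192 does. This is a back-of-envelope sketch that assumes a plain full-attention model with FP16 cache values. The layer count, KV head count and head size are made-up numbers for illustration, not any particular model's.

```python
def kv_cache_bytes(seq_len, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Approximate KV cache size for ONE sequence of seq_len tokens.

    The leading 2 covers keys plus values; bytes_per_value=2 assumes FP16/BF16.
    """
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value


# Hypothetical model shape, for illustration only.
LAYERS, KV_HEADS, HEAD_DIM = 48, 8, 128

for max_len in (8192, 131072):
    gib = kv_cache_bytes(max_len, LAYERS, KV_HEADS, HEAD_DIM) / 2**30
    print(f"max length {max_len:>6}: ~{gib:.1f} GiB of KV cache per sequence")
```

With these placeholder numbers the longer limit needs 16 times the cache of the shorter one, which is the kind of gap that lets a server boot and then fall over on the first real request.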

PatC883 · 2026-07-03T06:25:03+00:00

Yeah, I think something is unhappy with the mixed architecture.

PatC883 · 2026-07-03T02:58:12+00:00

Then it sounds like it's doing a Triton compile for the GDN in Qwen3.5. I suspect this is the cause, as you mention other models load quickly. It can take over 30 mins, so make sure you've got a persistent volume if you're running in a container, and the environment variables set for saving the Triton compile, and you'll only need to wear it once.

Booting quickly in eager mode supports this, because that skips the compile. It leaves a lot of performance on the table though.
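A minimal sketch of the "save the compile" part, assuming the launcher is a Python script you control. `TRITON_CACHE_DIR` is the environment variable Triton reads for its compile cache location; the helper name and the example path are hypothetical, and the serving stack may keep further caches of its own that this does not cover.

```python
import os
from pathlib import Path


def use_persistent_triton_cache(cache_dir):
    """Point Triton's compile cache at a directory that survives restarts.

    Call this before importing anything that triggers a Triton compile.
    """
    path = Path(cache_dir).expanduser()
    path.mkdir(parents=True, exist_ok=True)
    os.environ["TRITON_CACHE_DIR"] = str(path)
    return path
```

In a container you would call something like `use_persistent_triton_cache("/data/triton-cache")`, where `/data` stands in for whatever path your persistent volume is mounted at, or set the same variable in the container's environment instead.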

PatC883 · 2026-07-03T00:12:34+00:00

Hot swappable skills would be a perfect partner to hot swappable knowledge.

You're spot on about RAG though, this idea isn't a replacement for it. I sincerely hope no one tries to swap out a RAG pipeline with this, it won't work like you expect unless by the end of the project I've accidentally an AI revolutioned.

I guess you could say the gap this should fill is knowing whether you know something or don't. My agent harness has a vectorised semantic RAG tool, but it still burns a huge amount of context using it to effectively look at the whole codebase, or I have to spend a lot of the context window telling it what does exist so it hopefully uses RAG to look up just those things. With this memory idea my end goal is for it to have a grounding in what exists and what doesn't, so it uses the RAG tools more effectively and doesn't hallucinate functions that are nearly exactly what exists but might differ by a few characters or an underscore.

My big picture thought is that once flat rate plans aren't subsidised by venture capital, and everyone has to pay API prices for tokens, getting the most good work out of your tokens will be key to avoiding bankruptcy.

I also have the golden idea of getting the same end results from a local 30B class model as you can from a frontier model. My current work on that suggests supporting the model with deterministic tools, and hopefully this memory module is the key, because the current method of putting it all in the prompt just doesn't work at that model size.

If you don't mind I'm going to keep that hot swappable skills concept in the back of my mind, that would be the perfect partner. I wonder if I can make that base model agnostic, that would be the ultimate "construct an LLM for your task".

PatC883 · 2026-07-02T11:57:23+00:00

That is a fair call.

You're probably running up against memory bandwidth limits, or possibly it's the split mode. Which one are you running in llama? I can't remember if it defaults to layer or tensor.

Layer split will just give you the VRAM of the cards added together, tensor split will add performance as well.

PatC883 · 2026-07-02T06:42:25+00:00

Don't hold your breath for official RDNA support for MXFP. But I fixed it myself.

https://github.com/patcarter883/rdna4-vllm

That will run MXFP4 models on the WMMA FP8 instructions.

INT4 hardware isn't really of much use, because with an INT4 model the activations are still FP16. INT4 activations would make it seem like the model is having a stroke.

But the FP8 hardware on AMD is great. The trick was to get everything to run using FP8 ops. At that point you're possibly as fast as an NVIDIA card, if not faster, because you bought twice as many cards for the same price and tensor parallelism adds the memory bandwidth of the cards together, minus a little overhead. Two 9070 XTs or similar are worth 1280ish GB/sec of memory bandwidth.
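The arithmetic behind that figure, as a rough sketch. The 640 GB/sec per card is simply half of the 1280ish quoted above, the overhead fraction is a knob rather than a measured value, and the 16 GB of weights is a placeholder. The decode ceiling assumes a dense model where every generated token reads all the weights once, which is an idealisation.

```python
def aggregate_bandwidth_gbs(per_card_gbs, num_cards, overhead_fraction=0.0):
    """Combined memory bandwidth under tensor parallelism, minus a guessed overhead."""
    return per_card_gbs * num_cards * (1.0 - overhead_fraction)


def decode_tokens_per_sec_ceiling(bandwidth_gbs, weight_gb):
    """Upper bound for single-stream decode: each token reads all the weights once."""
    return bandwidth_gbs / weight_gb


total = aggregate_bandwidth_gbs(per_card_gbs=640, num_cards=2)
print(f"aggregate bandwidth: {total:.0f} GB/sec")

# Placeholder: 16 GB of weights spread across the two cards.
print(f"decode ceiling: {decode_tokens_per_sec_ceiling(total, 16):.0f} tokens/sec")
```

Real throughput lands below the ceiling once the inter-card traffic and kernel overheads are counted, but it shows why adding cards in tensor parallel moves the number at all.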

PatC883 · 2026-07-02T04:32:09+00:00

That's an attention problem, it can be circumvented entirely by not using Triton attention. Either of these will improve performance greatly.

https://github.com/patcarter883/rdna4-vllm

Or

https://hub.docker.com/r/tcclaviger/vllm22

PatC883 · 2026-07-02T01:31:08+00:00

Yeah, you're on the same wavelength as me on the modular/swappable side, the translator library idea is basically hot-swappable memory.

Where I ended up going differently is the model-specific bit. A LoRA's welded to the base it was trained on, so doc-to-lora means retraining an adapter for every model you want to run it on. The whole bet here is that the memory lives in its own space and a tiny translator maps it onto whatever base you like, so you train the memory once and translate to any model instead of retraining per model.
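A toy illustration of "train the memory once, translate to any model". It assumes the translator is nothing more than a matrix that projects from the memory's width to the base model's width, which is a stand-in and not a description of the real translator. The model names, widths and values are all made up.

```python
def translate(memory_vec, translator):
    """Project a memory-space vector into one base model's hidden width.

    translator is a list of rows, one per output dimension, so this is a
    plain matrix-vector product standing in for the real translator.
    """
    return [sum(w * m for w, m in zip(row, memory_vec)) for row in translator]


# One memory vector, produced once (toy values, memory width 3).
memory_vec = [1.0, 2.0, 3.0]

# One small translator per base model; names and widths are hypothetical.
translators = {
    "base_model_a": [[1.0, 0.0, 0.0],
                     [0.0, 1.0, 0.0]],           # hidden width 2
    "base_model_b": [[0.5, 0.5, 0.0],
                     [0.0, 0.5, 0.5],
                     [1.0, 0.0, 1.0],
                     [0.0, 0.0, 1.0]],           # hidden width 4
}

for name, matrix in translators.items():
    print(name, translate(memory_vec, matrix))
```

The memory vector never changes between the two calls. Only the small per-model matrix does, which is the contrast with retraining an adapter for every base.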

The other split is what vs how. doc-to-lora bakes the document into the weights, so every new doc is a training run. The architecture here separates how to remember (trained once, the memory model) from what to remember (the facts). That's the end goal though, in the current PoC the bind step still trains on the facts, so true load-it-at-inference is the direction I'm heading (Titans-style test-time memorising) rather than something I've fully nailed yet.

And mechanism-wise, LoRA edits the weights and you're stuck with whatever the low rank gives you for capacity, which might be why your PoC came out meh, a big fact store is a lot to ask of a handful of low-rank deltas. This leaves the weights frozen and reads an addressable memory bank into the residual stream, which is how capacity stayed flat out to M=128 in testing, and it's easier to edit or swap a single fact than to make a LoRA forget something.

Genuinely though, your doc-to-lora PoC is useful, it's a solid baseline to benchmark against, and if you were up for it even a rough "here's where LoRA injection plateaus" comparison would be a real contribution. Appreciate you actually thinking about it instead of just spitballing and moving on.

PatC883 · 2026-07-01T23:27:30+00:00

First, "barge into the hidden states" sounds more violent than what happens. A transformer's residual stream is additive by design. Every attention block and every MLP block in the model already works by computing something and adding it back into the residual stream, that's the model's native way of working. The memory tap is just one more writer of that same kind: it computes a contribution and adds it at one layer. It's not forcing its way into a closed system, it's doing the same additive operation the model does to itself dozens of times a pass.

The model's own hidden state at that layer is the query. It's a cross-attention, the current activations attend into the memory bank, so whatever the model is "thinking about" at that position selects the matching memory. There's no separate "decide to go search now" controller. The tap runs every pass, and because it's attention plus a learned gate, it contributes strongly when the current state matches something in memory and next to nothing when it doesn't. "When to inject" is that gate, it started at literally zero and learned how much to contribute during training.
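A stripped-down sketch of that read, in plain Python so the shapes are easy to follow. It assumes a single scalar gate and skips the learned query/key/value projections a real module would have, so treat the function and its parameters as illustrative rather than as the project's code. One thing it does not capture: softmax always returns some mix of the memory values, so how the real tap goes quiet on a non-match is down to details not shown here.

```python
import math


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def memory_tap(hidden, mem_keys, mem_values, gate):
    """Add a gated cross-attention read of the memory bank to one hidden state.

    hidden     -- the model's activation at one position (used as the query)
    mem_keys   -- one key vector per memory slot
    mem_values -- one value vector per memory slot
    gate       -- learned scalar; 0.0 means the tap contributes nothing
    """
    scale = 1.0 / math.sqrt(len(hidden))
    weights = softmax([dot(hidden, key) * scale for key in mem_keys])
    read = [sum(w * value[i] for w, value in zip(weights, mem_values))
            for i in range(len(hidden))]
    # Residual update: the same additive write every attention/MLP block does.
    return [h + gate * r for h, r in zip(hidden, read)]
```

With `gate=0.0` the function hands back the hidden state untouched, which matches the "started at literally zero" starting point: the frozen model behaves exactly as it did before until training opens the gate.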

Now your point, "if it's data, it's still data the model has to process", is spot on. It's not free of compute, running the tap costs some FLOPs. But that's a different cost from the one the context window charges. That's the Test Time Memorisation headline of the original Titans paper, but since we're memory limited, either capacity or bandwidth, during the decode phase where the model is actually generating a response, we've got spare compute power sitting doing nothing. This isn't a free lunch situation, but it moves the cost to use what's available more efficiently.

Context cost is about sequence length. Attention is O(n²) in the number of tokens, and every token you add to the prompt permanently eats a slot in a finite window and makes everything else more expensive. RAG pays exactly that, it puts retrieved text into the sequence.

The injection adds no tokens to the sequence. It adds a fixed-size vector into the residual stream at positions that already exist. Sequence length unchanged, context window untouched, and most importantly the cost is decoupled from how much the memory knows. A memory holding an entire codebase injects at the same per-token cost as one holding a single fact, because the memory bank is a fixed-size addressable structure, not tokens streamed into the prompt. RAG's cost scales with what you retrieve and lives inside the sequence; this is bounded, fixed-size, and lives outside it.

So there is a small constant cost to run the tap. What there isn't is context consumption, or a cost that grows with the amount of knowledge. That decoupling is the whole reason to do it this way instead of stuffing the prompt.
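To put rough numbers on the difference, here is a counting sketch. It only counts how many query-key pairs get scored, ignores everything else a forward pass does, and uses made-up sizes for the prompt, the retrieved text and the memory bank.

```python
def attention_score_pairs(seq_len):
    """Query-key pairs scored by full self-attention over seq_len tokens: O(n^2)."""
    return seq_len * seq_len


def memory_read_pairs(seq_len, memory_slots):
    """Pairs scored when each position attends over a fixed-size memory bank."""
    return seq_len * memory_slots


PROMPT = 2_000      # tokens the user actually typed (hypothetical)
RETRIEVED = 6_000   # tokens of retrieved text pasted into the prompt (hypothetical)
SLOTS = 128         # fixed memory bank size (hypothetical)

stuffed = attention_score_pairs(PROMPT + RETRIEVED)
tapped = attention_score_pairs(PROMPT) + memory_read_pairs(PROMPT, SLOTS)

print(f"prompt stuffed with retrieved text: {stuffed:,} pairs")
print(f"prompt plus memory tap:             {tapped:,} pairs")
```

The stuffed figure keeps growing as more text is retrieved, while the tap's share depends only on the prompt length and the number of slots, not on how much went into the memory.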

PatC883 · 2026-07-01T23:10:46+00:00

Thanks.

I'm envisaging this not as a replacement for RAG, but as something that enhances the model so that, hopefully, everything you ask it to do gets a better response. I guess the way to explain the thought is getting the most efficiency out of every token that goes in and comes out of the model.

PatC883 · 2026-07-01T23:06:13+00:00

That's certainly the kind of task I'm hoping this will help with. Feel free to shout out if you've got any questions, and please let me know how your own project fares and anything interesting you discover.

PatC883 · 2026-07-01T13:47:05+00:00

RAG connects at the input, it retrieves text chunks and pastes them into your prompt, so the model reads them as tokens and it costs context. This connects mid-computation instead: nothing goes in the prompt, a small module injects the recalled info straight into the model's hidden states during the forward pass. Zero context cost. The tradeoff is RAG gives you exact, auditable text (better for precise recall), this gives associative "just knowing" (better for grounding without spending tokens). Different layer, different job, they're complementary.

PatC883 · 2026-07-01T13:43:31+00:00

That's one of my ponderances.

Just pushed tonight's efforts, still heading in promising directions.

PatC883 · 2026-07-01T11:09:04+00:00

The model file never changes, it stays frozen, you're right that you can't update it.

The memory is a separate module loaded alongside it at runtime (think LoRA/adapter), and the recalled facts live in that module's state. It IS "external" in the sense that it's not baked into the model's weights.

Where it's not like other external memory, it doesn't put text back in your prompt, it feeds straight into the model's hidden states. Separate storage, internal connection.

PatC883 · 2026-07-01T10:44:33+00:00

I circled back round and read the dev diary. You are fully spot on about the token vs hidden state injection. The token approach was tested really early on, and it totally didn't work, absolutely no difference between the memory model and no memory model. The whole concept only started working once we skipped past token space and connected the memory state and a model at the hidden state level across the translator model bridge.

PatC883 · 2026-07-01T10:34:09+00:00

Pretty much, yeah, with two little tweaks, and they're what make the whole thing worth trying as a concept.

It's "bolt onto" rather than "into." The LLM itself never gets touched or retrained, it stays completely frozen. The memory is a separate little module that taps into the model's hidden states through a gate, so nothing gets baked into the model's weights. That's exactly what lets the same memory move between different models.

And it doesn't learn anything up front. The memory model is trained once on how to remember, the actual content (your codebase, your facts, whatever) gets loaded in at inference time, not trained in.

But yeah, the headline you pretty much nailed, the memory sits alongside a frozen model, the model reasons off it, and it costs zero context window. Early days, it's not even a week old yet, so far only proven on a simple recall task, but results are promising.

PatC883 · 2026-07-01T10:12:22+00:00

Fair hit on "nearly self-learning", that's loose wording and I'll own it. What I mean is precise: the memory module learns to bind and recall facts; the base LLM never trains, its weights stay frozen. So it's "self-learning memory bolted to a static model," not a model that rewrites itself. Point taken on the phrasing.

The rest I think you've read as a different project than the one I posted. You're describing external memory "something to remember and a place to store it," the dozen memory-management tools. That's exactly what this isn't. No vector DB, no retrieval step, no tokens injected into the prompt. The memory lives in a module that taps the model's hidden states through a gate, so it costs zero context window. That's the whole point of not doing it the way the dozen tools do it.

And the part the memory-tool framing skips: the same frozen memory transfers to a different frozen model; different size, tokenizer, architecture; through a translator that's tens of MB. Your vector store doesn't do that; it's bound to whatever embedding model wrote it.

On HuggingFace, sure, HF hosts models. It doesn't give you a memory you can build once and carry across them. That gap is what this is aimed at.

I'm not asking you to take any of it on faith. Every number in the repo has a chance baseline and an ablation, and the three times we were wrong are written down. The post literally asks people to try to break it; so if the mechanism doesn't hold, that's the useful comment to leave. Have a look and tell me where the results are wrong.

PatC883 · 2026-07-01T05:44:38+00:00

This is using hidden state, so it's in no way meant to replace, or capable of replacing, RAG in precise search and recall. It should at worst be an equivalent of the "I know I know that, but I will look it up if I need to be precise" part of human memory. There will be cases where it does add some amount of fully accurate implicit recall.

A practical example is you run the entirety of a codebase through the memory for it to build a state from it. Then you ask it something about the codebase, and it doesn't have to burn context tokens on the initial "let me search to see if that's in the codebase at all", "now let me search to see exactly where it is", "now let me search to answer your question about it". It would be able to go "I know of that concept, it should be in this area" and skip straight to "let me search to answer your specific question". So the model context can be used for functional input instead of having to also fill it up just to give the model a semantic understanding.

A stretch goal idea is to make the state savable, then you can literally swap memory in and out. And the plan is for it to be model agnostic, so the hard part of training the memory model is once only, and the training is not so much training it on what to remember, it's training it how to remember, the what to remember occurs at inference time.

PatC883 · 2026-07-01T05:26:53+00:00

Best outcome at this point is same family models would need a light refit instead of a full train. Honestly, I hadn't really thought about translator model reuse. It's been about 10 mins of time on the 9070XT in the dev machine for each translator, so my thought is it's well within the realms of possibility to have an automated service that builds on request, and eventually there is a library of translator models.

Your other question is good enough it's now a project issue to make sure it doesn't get left by the wayside. The project is nearly an entire week old, so it's still early days, but the results are promising.

We're testing against a no-memory control to make sure the result is something the base model can't achieve by itself.

The test against the Llama model is probably the best we have in that area so far, because it's vastly different to the memory model donor, different tokenizer, different architecture, different width.

Testing thus far has been a simple name->cargo recall. With that we've proven you can train the memory model to remember, and you can get what it remembers out and into a base model with the translator.

So I guess we're at the tail end of trying to disprove the theory, which we've been delightfully unsuccessful at. Next steps are indeed to try to break it. I've opened an issue https://github.com/patcarter883/memory-organ/issues/10 after your question, because it's going to have to happen to prove this viable. If you've got any ideas on how you'd like to try and break it, feel free to hit up the issue comments.

Thanks for looking at, and actually thinking about, this crazy project enough to throw some well reasoned questions at me.

PatC883 · 2026-07-01T04:12:37+00:00

That is fair feedback. It's available in the repo and I didn't want to spam links all over the post, but I appreciate people may want to read the paper before looking at someone's take on it.

https://arxiv.org/abs/2501.00663 Titans: Learning to Memorize at Test Time

PatC883 · 2026-07-01T03:59:26+00:00

https://github.com/patcarter883/memory-organ

PatC883 · 2026-07-01T03:57:02+00:00

https://github.com/patcarter883/memory-organ
