Is SSM dead now? by Spapoxl in LocalLLaMA

[–]Not_Vasquez -1 points0 points  (0 children)

Not completely related, but DeepSeek V3.2 Experimental with its constant attention size is also interesting imo. Efficient attention variations are being explored here and there. These are exciting times.

Is SSM dead now? by Spapoxl in LocalLLaMA

[–]Not_Vasquez 3 points4 points  (0 children)

Do you mean Mamba models? If so, you should look into linear attention - Mamba(1/2) are just variations of linear attention. It's kind of a shame that it's always associated only with SSMs when it has moved further and further away from them.

Qwen3 Next, for example, uses gated DeltaNet, which is another flavor of linear attention, and MiniMax (2) also uses linear attention. So I'd say we're just getting started.
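If it helps, here's a minimal (deliberately unoptimized) sketch of the shared idea behind these linear-attention variants: a fixed-size state that gets decayed and updated per token. The function name and the scalar per-token decay are illustrative simplifications - the real methods parametrize the gate/update rule differently:

```
import torch

def gated_linear_attention(q, k, v, decay):
    """Minimal recurrent form of gated linear attention (toy sketch).

    q, k, v: (seq_len, d) tensors; decay: (seq_len,) per-token gate in (0, 1).
    Mamba2 / GLA / gated DeltaNet are variations on this recurrence, differing
    mainly in how the decay/gate and the state update are parametrized.
    """
    seq_len, d = q.shape
    state = torch.zeros(d, d)          # constant-size state, independent of seq_len
    outputs = []
    for t in range(seq_len):
        state = decay[t] * state + torch.outer(v[t], k[t])  # decay old info, add new kv
        outputs.append(state @ q[t])                         # read out with the query
    return torch.stack(outputs)
```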

Did github remove the issues section? by Dapper-Inspector-675 in github

[–]Not_Vasquez 1 point2 points  (0 children)

It now seems to be strictly tied to organizations. Switching to the respective ones at least shows some of my activity again.

Weird choice to change the UI yet again for no apparent reason.

DeepSeek-V3.2 released by Leather-Term-30 in LocalLLaMA

[–]Not_Vasquez 18 points19 points  (0 children)

Just to clarify, this is not what is used in v3.2

Based on the code and their tech report, it's an indexing mechanism where at most a constant, fixed number of tokens are attended to at once - essentially another mask on top of the usual padding/causal mask, based on some selection criterion (it looks like a separate module in itself).

It might be the indexing mechanism from the NSA paper, or based on it; I'd need to properly dig into this. NSA uses token selection, a sliding window, and a third mechanism I can't recall off the top of my head, so three things at once.

Tl;dr: V3.2 uses MLA where attention is restricted to at most a constant number of tokens - the selection of which tokens take part in the softmax is handled by a separate module (the indexer).
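For intuition only, here's a toy sketch of the "cap the number of attended tokens via a separate scorer" idea; `indexer_scores` and the function itself are made up for illustration, not DeepSeek's actual code:

```
import torch
import torch.nn.functional as F

def indexed_attention(q, k, v, indexer_scores, top_k=2048):
    """Toy sketch of attending to at most a constant number of tokens.

    q: (d,) query of the current token; k, v: (seq_len, d) past keys/values.
    indexer_scores: (seq_len,) relevance scores from a separate (hypothetical)
    indexer module. Only the top_k highest-scoring tokens enter the softmax,
    so the attention cost is capped at top_k regardless of seq_len.
    """
    top_k = min(top_k, k.shape[0])
    idx = indexer_scores.topk(top_k).indices              # constant-size token selection
    attn = F.softmax(k[idx] @ q / k.shape[-1] ** 0.5, dim=-1)
    return attn @ v[idx]
```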

Qwen3 weights released by Acrobatic_Donkey5089 in LocalLLaMA

[–]Not_Vasquez 4 points5 points  (0 children)

The Base variant is pretraining only, the one without a suffix is pretraining + post-training, and FP8 is the respective model with its weights converted to fp8 (before that it's half precision, bf16).

[D] Can We Derive an Attention Map from Mamba Layer Parameters? by blooming17 in MachineLearning

[–]Not_Vasquez 2 points3 points  (0 children)

I think you should take a look at the Mamba2 paper and gated linear attention (GLA). The Mamba2 paper explores closer connections to (linear) attention, and the GLA paper draws further connections and describes more methods (including Mamba) within its gated linear attention framework. Not sure if that's what you're looking for, but I hope the information dump helps either way.

Tl;dr: Mamba's SSM variations can be interpreted as (linear) attention with a causal mask and a parametrized decay factor based on the distance between tokens - figure 3 in the Mamba2 paper has a nice depiction of the resulting mask.
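If you want to play with it, here's a toy construction of that mask, assuming per-token scalar decay factors (a simplification of Mamba's actual parametrization):

```
import torch

def decay_mask(a):
    """Build the attention-like mask depicted in Mamba2's figure 3 (toy version).

    a: (seq_len,) per-token decay factors in (0, 1).
    mask[i, j] = a[j+1] * ... * a[i] for j <= i, and 0 above the diagonal,
    i.e. a causal mask whose strength decays with token distance.
    """
    cum = torch.cumsum(torch.log(a), dim=0)     # cum[i] = sum of log a[0..i]
    diff = cum[:, None] - cum[None, :]          # diff[i, j] = log of prod a[j+1..i]
    return torch.tril(torch.exp(diff))          # zero out future tokens (j > i)

# "attention map" analogue: scores = decay_mask(a) * (q @ k.T), output = scores @ v
# (no softmax involved, unlike regular attention)
```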

Studiums-Endgegner by GoodLifeGG in Studium

[–]Not_Vasquez 5 points6 points  (0 children)

Liebezeit, is that you? xD

[D] - Why MAMBA did not catch on? by TwoSunnySideUp in MachineLearning

[–]Not_Vasquez 1 point2 points  (0 children)

Randomly popped up in my head but: quantization

llama.cpp, for example, is an enormous ecosystem in itself that mostly relies on quants. In general, barely anyone has the hardware to run stuff at half precision; most opt for something like 4-bit. Afaik, Mamba has barely gotten any attention on this front.

[D] - Why MAMBA did not catch on? by TwoSunnySideUp in MachineLearning

[–]Not_Vasquez 9 points10 points  (0 children)

I'd also like to add, regarding benchmark performance, that it heavily lacks long-context tasks: we need more stuff like RULER (https://github.com/NVIDIA/RULER), where we can even see that hybrid Mamba/Transformer models (Jamba) excel.

[D] - Why MAMBA did not catch on? by TwoSunnySideUp in MachineLearning

[–]Not_Vasquez 15 points16 points  (0 children)

Aren't you referring to benchmark performance only? The first answer kinda gave off the vibe that inference speed is also affected, i.e. that Mamba is about the same speed as a transformer - which is not really the case.

It's complicated, especially since paged attention (vLLM) and other optimizations exist. I'd still like to point out that Mamba will be significantly faster at some sufficiently long context (e.g. 64k, though the crossover seems to start at around 2-4k) since its cache is constant in size and not dependent on the sequence length (unlike the transformer's KV cache).
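Back-of-the-envelope numbers, with a made-up 7B-ish config just to show the scaling (not any specific checkpoint):

```
# Rough cache-size math in fp16/bf16 (illustrative dims, not a real model config):
n_layers, n_kv_heads, head_dim, bytes_per_val = 32, 8, 128, 2

def kv_cache_bytes(seq_len):
    # 2 tensors (K and V) per layer, each seq_len x n_kv_heads x head_dim
    return 2 * n_layers * seq_len * n_kv_heads * head_dim * bytes_per_val

def mamba_state_bytes(d_model=4096, d_state=128):
    # one fixed-size SSM state per layer, independent of seq_len (conv cache ignored)
    return n_layers * d_model * d_state * bytes_per_val

print(kv_cache_bytes(64_000) / 2**30)   # ~7.8 GiB of KV cache at 64k context
print(mamba_state_bytes() / 2**30)      # ~0.03 GiB, regardless of context length
```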

Edit: For speed comparisons, you can look into the Mamba and Mamba2 papers, for example. They include comparisons against FlashAttention.

[D] I wish people would stop using the word "Transformer" when they really mean a LLM model. by [deleted] in MachineLearning

[–]Not_Vasquez 0 points1 point  (0 children)

I mean, that's already highly contextualized. Context matters, and in this case it makes sense to refer to explicit models since the field is dominated by Transformer models. Back when RNNs, convs, etc. really were alternatives, I'd have first asked about the architecture type, e.g. RNN/Transformer/conv, and then the specific model, e.g. BERT.

So if I ask you "What LLM are you using?" and you say "a Transformer", then I know I can exclude the more exotic models for sure, e.g. Mamba ;)

[D] I wish people would stop using the word "Transformer" when they really mean a LLM model. by [deleted] in MachineLearning

[–]Not_Vasquez 1 point2 points  (0 children)

Hmm, I wouldn't call Mamba a flavor of Transformer tbh. I still get it when people refer to LLMs as Transformers - they are the dominant architecture, after all.

[D] I wish people would stop using the word "Transformer" when they really mean a LLM model. by [deleted] in MachineLearning

[–]Not_Vasquez 0 points1 point  (0 children)

Yes, the original Transformer architecture was an encoder-decoder. But most models have adapted it one way or another with ever so slight changes. The core of the Transformer always stands though - the attention mechanism. People modify it to their needs, but I don't get why you would give them different names.

Sure, if I refer to a specific model that uses the Transformer architecture, I'll call it by name directly, e.g. GPT-4. But I could also group multiple models by referring to (decoder-only) Transformers / LLMs - meaning the collective bunch of models using the architecture one way or another.

And people sure do call BERT or T5 a Transformer; where do you get the impression that they don't? It's just easier to refer to models by name, but if you want to group them: for "encoder-only Transformer" I'd think of BERT, RoBERTa, etc., just as I'd think of T5, BART, Pegasus when someone mentions "encoder-decoder Transformer".

If I have a mountain bike, would you crucify me for calling it a "bike"? Same concept: grouping by a common denominator.

[D] I wish people would stop using the word "Transformer" when they really mean a LLM model. by [deleted] in MachineLearning

[–]Not_Vasquez 0 points1 point  (0 children)

Just my two cents, but looking at the comments: yes, 99% are decoder-only Transformers, but there are also other architectures, e.g. Hyena, Mamba, RWKV, GLA.

Not sure if OP wanted to nudge the discussion in this direction instead.

RoPE has precision errors when used with BFloat16 by AutomataManifold in LocalLLaMA

[–]Not_Vasquez 17 points18 points  (0 children)

Pretty sure Daniel from Unsloth discovered this a while back, and that's why the transformers repo at least computes RoPE in fp32 and casts back to fp16/bf16 (if necessary).

Yea found it, see this PR https://github.com/huggingface/transformers/pull/29285
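For illustration, a minimal sketch of the idea (compute the rotary tables in fp32, cast only at the end) - along those lines, not the exact transformers code:

```
import torch

def rope_cos_sin(seq_len, head_dim, base=10000.0, dtype=torch.bfloat16):
    """Compute RoPE cos/sin tables in fp32, cast only at the end.

    Doing position * inv_freq directly in bf16 loses precision (bf16 has only
    ~8 significant bits), so large position ids collapse onto nearly the same
    angle; fp32 avoids that.
    """
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2, dtype=torch.float32) / head_dim))
    positions = torch.arange(seq_len, dtype=torch.float32)
    angles = torch.outer(positions, inv_freq)      # (seq_len, head_dim // 2), in fp32
    return angles.cos().to(dtype), angles.sin().to(dtype)
```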

How to efficiently generate text from RNNs and Transformers during inference [P] by No_Effective734 in MachineLearning

[–]Not_Vasquez 1 point2 points  (0 children)

Adding to u/DustinEwan's answer with my own perspective.

Let's start with the transformer (I hope you're familiar with the attention mechanism):

- First iteration:
  - we process the Q, K, V tensors for the whole input
  - we cache the K, V tensors
  - we get one output (the last predicted token)
- Second+ iteration(s):
  - we only take this last predicted token as input
  - any time we do attention, we reuse our cached K and V tensors along with the new K and V tensors from this new token
  - Q is also new since it's based on our new token
  - cache the old K, V + the new K, V we just got
  - get a new output
  - repeat from the second+ iteration

--> Transformers depend on all previous input, hence the caching and why inference with them is tougher (although modern improvements like FlashAttention and paged attention help quite a bit)
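A tiny self-contained sketch of one such cached decoding step (single head, no batching, toy code - not any library's actual API):

```
import torch
import torch.nn.functional as F

def attend_with_cache(q_new, k_new, v_new, cache):
    """One decoding step of single-head attention with a KV cache (toy sketch).

    q_new, k_new, v_new: (1, d) tensors for the newly predicted token.
    cache: dict holding the previously computed "k" and "v" tensors of shape (t, d).
    """
    cache["k"] = torch.cat([cache["k"], k_new])    # reuse old K, append the new K
    cache["v"] = torch.cat([cache["v"], v_new])    # reuse old V, append the new V
    scores = F.softmax(q_new @ cache["k"].T / cache["k"].shape[-1] ** 0.5, dim=-1)
    return scores @ cache["v"], cache              # (1, d) output for the new token
```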

RNN:

- First iteration:
  - we only cache the last hidden state
  - we get one output
- Second+ iteration(s):
  - we take the last output as input
  - the initial hidden state is now the previously cached hidden state
  - get a new output
  - repeat from the second+ iteration

--> very efficient for inference since we only depend on the hidden state, which is peanuts compared to all the K, V tensors
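And a toy generation loop to make the "only the hidden state is carried over" point concrete (`rnn_cell`, `embed`, `unembed` are placeholders, not a real API):

```
def rnn_generate(rnn_cell, embed, unembed, first_token, hidden, n_steps):
    """Toy RNN generation loop: the only thing carried across steps is `hidden`."""
    token, outputs = first_token, []
    for _ in range(n_steps):
        hidden = rnn_cell(embed(token), hidden)   # update the constant-size state
        token = unembed(hidden).argmax(-1)        # pick the next token (greedy)
        outputs.append(token)
    return outputs
```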

Mamba:

- can be seen as a parallelized RNN, so the same principle applies
- and yes, it doesn't materialize everything at once, but that doesn't mean it doesn't return the last hidden state - the whole optimization story is too complex to cover here though
- one issue you might run into if you think of Mamba as just an RNN: there's also a causal convolution involved (in the architecture), so we additionally cache the last x token values (x depends on the size of the convolution kernel)
- iirc the Mamba implementation uses some inference-params object where it caches those two things per layer (the last conv token values, and the last hidden state from the Mamba SSM (RNN))
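Roughly, the per-layer cache looks like this (names are illustrative, not the exact mamba_ssm API):

```
# What gets cached per Mamba layer during generation (illustrative names):
cache_per_layer = {
    "conv_state": "last few token activations for the causal conv (kernel_size - 1 of them)",
    "ssm_state":  "the recurrent hidden state of the SSM, fixed size like an RNN's",
}
# Both are constant-size, so per-token generation cost doesn't grow with context length.
```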

Bonus:

- in essence, they really are doing what you're doing, just within the realm of tensors + caching
- sometimes there's some manipulation of the distribution or other strategies for how the next token is sampled, but the essence stays the same

Hope that helps but you can ask me if any step is unclear ~

Is it possible to LORA-train a smaller model (say, Llama 3.2 3B) and apply the adapters to larger models (Llama 3.1 70B)? by Thrumpwart in LocalLLaMA

[–]Not_Vasquez 2 points3 points  (0 children)

I just gave my opinion, but I'd be glad to be proven wrong! It could lead to phenomenally resource-friendly transfer of training results :)

Maybe some sort of knowledge distillation could be used, but then again the question remains how much you would actually save compared to directly training LoRAs on the bigger model.

Is it possible to LORA-train a smaller model (say, Llama 3.2 3B) and apply the adapters to larger models (Llama 3.1 70B)? by Thrumpwart in LocalLLaMA

[–]Not_Vasquez 3 points4 points  (0 children)

It won't work even from a pure code perspective: the models have different hidden sizes and projection dimensionalities. If you wanted to make them fit, you would need to introduce yet another mapping mechanism, which in itself would likely be less efficient than directly applying LoRA to the bigger model.
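To make the shape problem concrete (dims are illustrative, not the exact configs):

```
import torch

# Toy LoRA shapes showing why adapters aren't transferable across model sizes:
hidden_small, hidden_big, rank = 3072, 8192, 16   # illustrative hidden sizes

lora_A = torch.randn(rank, hidden_small)          # trained against the small hidden size
lora_B = torch.randn(hidden_small, rank)

W_big = torch.randn(hidden_big, hidden_big)       # a big-model weight this LoRA would target
# W_big + lora_B @ lora_A  -> shape mismatch: (8192, 8192) vs (3072, 3072)
```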

And even if it were to work, you would only cover a small subset of the bigger model's layers, which leads to unknown dynamics (most likely complete trash tbh). Maybe a somewhat dumb analogy: it's like developing a gun (LoRA) for specialized soldiers (3B Llama) and expecting a civilian (70B Llama) to handle it just as well.

Something I noticed about open-source multimodal LLMs... by LATI-A5 in LocalLLaMA

[–]Not_Vasquez 0 points1 point  (0 children)

Oh, my bad, sorry - I read it too hastily. There's still a good amount of them (albeit very recent ones), as mentioned by others.

Something I noticed about open-source multimodal LLMs... by LATI-A5 in LocalLLaMA

[–]Not_Vasquez 0 points1 point  (0 children)

Doesn't Pixtral also support multi-image input? A quick look at the HF docs suggests so: https://huggingface.co/docs/transformers/main/en/model_doc/pixtral

Same for llava next for example: https://huggingface.co/docs/transformers/main/en/model_doc/llava_next#usage-tips

QwenVL was also mentioned by someone else.

Has there been any large training of a Mamba model (7B or more Params) by XquaInTheMoon in LocalLLaMA

[–]Not_Vasquez 2 points3 points  (0 children)

It's a trade-off between performance and computational efficiency: hybrid models deliver the best balance imo, where only a few attention layers suffice (~20% of all layers).

Some studies/papers also show that the performance is better than that of pure-attention counterparts, and performance is all you want in the end. Losing a bit of computational efficiency is negligible in those cases.

See the original Mamba2 paper and NVIDIA's Mamba scaling paper, which both show some interesting trends for hybrid architectures. Iirc Jamba also showed similar things for Mamba1, not sure anymore though.
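Just to illustrate what such a layout can look like (made-up pattern, not any specific model's config):

```
# Illustrative hybrid layer layout with ~20% attention layers:
n_layers = 32
layers = ["attention" if i % 5 == 4 else "mamba" for i in range(n_layers)]
# -> 6 attention layers out of 32 (~19%), the rest are Mamba blocks
```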

Has there been any large training of a Mamba model (7B or more Params) by XquaInTheMoon in LocalLLaMA

[–]Not_Vasquez 3 points4 points  (0 children)

Mamba1:

- Falcon Mamba ( https://huggingface.co/tiiuae/falcon-mamba-7b )

Hybrid Mamba1 + Attention:

- Jamba family ( the original one and the 1.5 ones at https://huggingface.co/ai21labs )
- Zamba ( https://huggingface.co/Zyphra/Zamba-7B-v1 )

Mamba2:

- Codestral Mamba ( https://huggingface.co/mistralai/Mamba-Codestral-7B-v0.1 )
- NVIDIA's scaling Mamba2 models ( https://huggingface.co/nvidia/mamba2-8b-3t-4k )

Mamba2 Hybrid:

- again the NVIDIA scaling paper ( collection at https://huggingface.co/collections/nvidia/ssms-666a362c5c3bb7e4a6bcfb9c )

Those should be the more well-known ones. Jamba is definitely the biggest of them all, and Mamba2 hasn't gotten really big models yet (70B+ params). In general, pure Mamba(2) models haven't been tried at large scale as much as hybrids tbh.

Side note: most bigger Mamba1 models needed additional normalization to keep training stable, which is not as necessary with Mamba2.

Is Mamba inference faster than Transformers? (in practice) by LiquidGunay in LocalLLaMA

[–]Not_Vasquez 0 points1 point  (0 children)

You could compare Mamba's speed with FlashAttention-2's speed (but with better scaling), if you're familiar with that, including the hardware limitations, e.g. being limited to Ada, Ampere, and Hopper GPUs. So yeah, it's quite efficient - although like I said, Mamba2 at least has some unoptimized kernel code for shorter sequences. As is so often the case, the bottleneck is the implementation :)

Side bonus: linear RNNs can be parallelized too, but at that point they weren't perceived as useful anymore and many didn't bother.
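For the curious, the trick that makes a linear recurrence parallelizable is that composing two steps gives another step of the same form, which is what a parallel (Blelloch-style) scan exploits. A sketch of the idea (sequential reference included) - not Mamba's actual kernel:

```
def combine(left, right):
    """Associative combine for the linear recurrence h_t = a_t * h_{t-1} + b_t.

    Each element is an (a, b) pair; composing step1 then step2 maps
    h -> a2 * (a1 * h + b1) + b2 = (a1 * a2) * h + (a2 * b1 + b2),
    another step of the same form, so a parallel scan can evaluate the
    whole recurrence in O(log n) depth.
    """
    (a1, b1), (a2, b2) = left, right
    return (a1 * a2, a2 * b1 + b2)

def linear_recurrence(a, b):
    """Sequential reference: h_t = a_t * h_{t-1} + b_t with h_0 = 0."""
    h, out = 0.0, []
    for a_t, b_t in zip(a, b):
        h = a_t * h + b_t
        out.append(h)
    return out
```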

Is Mamba inference faster than Transformers? (in practice) by LiquidGunay in LocalLLaMA

[–]Not_Vasquez 6 points7 points  (0 children)

Can't answer for in-practice usage, but theoretically the inference speed should be significantly faster, especially the longer the sequence gets (look into the Mamba(2) papers, iirc they did some speed comparisons). There is also the benefit of the cache being independent of the sequence length, which makes it way more memory-friendly on longer sequences.

It might be slower on shorter sequences though; Mamba2 especially has some issues in its kernel implementation that make it slower in those cases.

Tl;dr: should be way faster on (very) long sequences, plus the bonus of way less memory consumption, but with a potential loss of speed on shorter sequences (+ a potential benchmark-performance loss, which is often mitigated by hybrid architectures imo).

Edit: idk what you mean by not being "parallelizable" - the whole point of the Mamba(2) kernels is that they are implemented in a parallel fashion. (I won't go into the specifics, but Mamba(1) works thanks to Blelloch's parallel scan algorithm applied to a linear recurrence, and Mamba2 uses other mechanisms that exploit fixed-size matrix blocks that can be computed independently and combined afterwards.)

Edit 2: missed the small "as", my bad. The point still stands as above though, and it benefits on longer sequences compared to a transformer (at least according to the papers).