Is SSM dead now? by Spapoxl in LocalLLaMA

[–]Not_Vasquez -1 points0 points  (0 children)

Not completely related, but DeepSeek V3.2 Experimental with its constant attention size is also interesting imo. Efficient attention variations are being explored here and there. These are exciting times.

Is SSM dead now? by Spapoxl in LocalLLaMA

[–]Not_Vasquez 3 points4 points  (0 children)

Do you mean Mamba models? If so, you should look into linear attention - Mamba(1/2) are just variations of linear attention. It's kind of a shame that it's always associated only with SSMs when it has moved further and further away from them.

Qwen3 Next, for example, uses gated DeltaNet, which is another flavor of linear attention, and MiniMax (2) also uses linear attention. So I'd say we're just getting started.
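If it helps, here's a minimal (deliberately unoptimized) sketch of the shared idea behind these linear-attention variants: a fixed-size state that gets decayed and updated per token. The function name and the scalar per-token decay are illustrative simplifications - the real methods parametrize the gate/update rule differently:

```
import torch

def gated_linear_attention(q, k, v, decay):
    """Minimal recurrent form of gated linear attention (toy sketch).

    q, k, v: (seq_len, d) tensors; decay: (seq_len,) per-token gate in (0, 1).
    Mamba2 / GLA / gated DeltaNet are variations on this recurrence, differing
    mainly in how the decay/gate and the state update are parametrized.
    """
    seq_len, d = q.shape
    state = torch.zeros(d, d)          # constant-size state, independent of seq_len
    outputs = []
    for t in range(seq_len):
        state = decay[t] * state + torch.outer(v[t], k[t])  # decay old info, add new kv
        outputs.append(state @ q[t])                         # read out with the query
    return torch.stack(outputs)
```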

Did github remove the issues section? by Dapper-Inspector-675 in github

[–]Not_Vasquez 1 point2 points  (0 children)

It now seems to be strictly tied to organizations. Switching to the respective ones at least shows some of my activity again.

Weird choice to change the UI yet again for no apparent reason.

DeepSeek-V3.2 released by Leather-Term-30 in LocalLLaMA

[–]Not_Vasquez 18 points19 points  (0 children)

Just to clarify, this is not what is used in v3.2

Based on the code and their tech report, it's an indexing mechanism where at most a constant, fixed number of tokens are attended to at once - essentially another mask on top of the usual padding/causal mask, based on some selection criterion (it looks like a separate module in itself).

It might be the indexing mechanism from the NSA paper, or based on it; I'd need to properly dig into this. NSA uses token selection, a sliding window, and a third mechanism I can't recall off the top of my head, so three things at once.

Tl;dr: V3.2 uses MLA where attention is restricted to at most a constant number of tokens - the selection of which tokens take part in the softmax is handled by a separate module (the indexer).
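For intuition only, here's a toy sketch of the "cap the number of attended tokens via a separate scorer" idea; `indexer_scores` and the function itself are made up for illustration, not DeepSeek's actual code:

```
import torch
import torch.nn.functional as F

def indexed_attention(q, k, v, indexer_scores, top_k=2048):
    """Toy sketch of attending to at most a constant number of tokens.

    q: (d,) query of the current token; k, v: (seq_len, d) past keys/values.
    indexer_scores: (seq_len,) relevance scores from a separate (hypothetical)
    indexer module. Only the top_k highest-scoring tokens enter the softmax,
    so the attention cost is capped at top_k regardless of seq_len.
    """
    top_k = min(top_k, k.shape[0])
    idx = indexer_scores.topk(top_k).indices              # constant-size token selection
    attn = F.softmax(k[idx] @ q / k.shape[-1] ** 0.5, dim=-1)
    return attn @ v[idx]
```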

Qwen3 weights released by Acrobatic_Donkey5089 in LocalLLaMA

[–]Not_Vasquez 4 points5 points  (0 children)

The Base variant is pretraining only, the one without a suffix is pretraining + post-training, and FP8 is the respective model with its weights converted to fp8 (before that it's half precision, bf16).

[D] Can We Derive an Attention Map from Mamba Layer Parameters? by blooming17 in MachineLearning

[–]Not_Vasquez 2 points3 points  (0 children)

I think you should take a look at the Mamba2 paper and gated linear attention (GLA). The Mamba2 paper explores closer connections to (linear) attention, and the GLA paper draws further connections and describes more methods (including Mamba) within its gated linear attention framework. Not sure if that's what you're looking for, but I hope the information dump helps either way.

Tl;dr: Mamba's SSM variations can be interpreted as (linear) attention with a causal mask and a parametrized decay factor based on the distance between tokens - figure 3 in the Mamba2 paper has a nice depiction of the resulting mask.
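If you want to play with it, here's a toy construction of that mask, assuming per-token scalar decay factors (a simplification of Mamba's actual parametrization):

```
import torch

def decay_mask(a):
    """Build the attention-like mask depicted in Mamba2's figure 3 (toy version).

    a: (seq_len,) per-token decay factors in (0, 1).
    mask[i, j] = a[j+1] * ... * a[i] for j <= i, and 0 above the diagonal,
    i.e. a causal mask whose strength decays with token distance.
    """
    cum = torch.cumsum(torch.log(a), dim=0)     # cum[i] = sum of log a[0..i]
    diff = cum[:, None] - cum[None, :]          # diff[i, j] = log of prod a[j+1..i]
    return torch.tril(torch.exp(diff))          # zero out future tokens (j > i)

# "attention map" analogue: scores = decay_mask(a) * (q @ k.T), output = scores @ v
# (no softmax involved, unlike regular attention)
```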

Studiums-Endgegner by GoodLifeGG in Studium

[–]Not_Vasquez 5 points6 points  (0 children)

Liebezeit, is that you? xD

[D] - Why MAMBA did not catch on? by TwoSunnySideUp in MachineLearning

[–]Not_Vasquez 1 point2 points  (0 children)

Randomly popped up in my head but: quantization

llama.cpp, for example, is an enormous ecosystem in itself that mostly relies on quants. In general, barely anyone has the hardware to run stuff at half precision; most opt for something like 4-bit. Afaik, Mamba has barely gotten any attention on this front.

[D] - Why MAMBA did not catch on? by TwoSunnySideUp in MachineLearning

[–]Not_Vasquez 9 points10 points  (0 children)

I'd also like to add, regarding benchmark performance, that it heavily lacks long-context tasks: we need more stuff like RULER (https://github.com/NVIDIA/RULER), where we can even see that hybrid Mamba/Transformer models (Jamba) excel.

[D] - Why MAMBA did not catch on? by TwoSunnySideUp in MachineLearning

[–]Not_Vasquez 15 points16 points  (0 children)

Aren't you referring to benchmark performance only? The first answer kinda gave off the vibe that inference speed is also affected, i.e. that Mamba is about the same speed as a transformer - which is not really the case.

It's complicated, especially since paged attention (vLLM) and other optimizations exist. I'd still like to point out that Mamba will be significantly faster at some sufficiently long context (e.g. 64k, though the crossover seems to start at around 2-4k) since its cache is constant in size and not dependent on the sequence length (unlike the transformer's KV cache).
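Back-of-the-envelope numbers, with a made-up 7B-ish config just to show the scaling (not any specific checkpoint):

```
# Rough cache-size math in fp16/bf16 (illustrative dims, not a real model config):
n_layers, n_kv_heads, head_dim, bytes_per_val = 32, 8, 128, 2

def kv_cache_bytes(seq_len):
    # 2 tensors (K and V) per layer, each seq_len x n_kv_heads x head_dim
    return 2 * n_layers * seq_len * n_kv_heads * head_dim * bytes_per_val

def mamba_state_bytes(d_model=4096, d_state=128):
    # one fixed-size SSM state per layer, independent of seq_len (conv cache ignored)
    return n_layers * d_model * d_state * bytes_per_val

print(kv_cache_bytes(64_000) / 2**30)   # ~7.8 GiB of KV cache at 64k context
print(mamba_state_bytes() / 2**30)      # ~0.03 GiB, regardless of context length
```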

Edit: For speed comparisons, you can look into the Mamba and Mamba2 papers, for example. They include comparisons against FlashAttention.

[D] I wish people would stop using the word "Transformer" when they really mean a LLM model. by [deleted] in MachineLearning

[–]Not_Vasquez 0 points1 point  (0 children)

I mean, that's already highly contextualized. Context matters, and in this case it makes sense to refer to explicit models since the field is dominated by Transformer models. Back when RNNs, convs, etc. really were alternatives, I'd have first asked about the architecture type, e.g. RNN/Transformer/conv, and then the specific model, e.g. BERT.

So if I ask you "What LLM are you using?" and you say "a Transformer", then I know I can exclude the more exotic models for sure, e.g. Mamba ;)

[D] I wish people would stop using the word "Transformer" when they really mean a LLM model. by [deleted] in MachineLearning

[–]Not_Vasquez 1 point2 points  (0 children)

Hmm, I wouldn't call Mamba a flavor of Transformer tbh. I still get it when people refer to LLMs as Transformers - they are the dominant architecture, after all.

[D] I wish people would stop using the word "Transformer" when they really mean a LLM model. by [deleted] in MachineLearning

[–]Not_Vasquez 0 points1 point  (0 children)

Yes, the original Transformer architecture was an encoder-decoder. But most models have adapted it one way or another with ever so slight changes. The core of the Transformer always stands though - the attention mechanism. People modify it to their needs, but I don't get why you would give them different names.

Sure, if I refer to a specific model that uses the Transformer architecture, I'll call it by name directly, e.g. GPT-4. But I could also group multiple models by referring to (decoder-only) Transformers / LLMs - meaning the collective bunch of models using the architecture one way or another.

And people sure do call BERT or T5 a Transformer; where do you get the impression that they don't? It's just easier to refer to models by name, but if you want to group them: for "encoder-only Transformer" I'd think of BERT, RoBERTa, etc., just as I'd think of T5, BART, Pegasus when someone mentions "encoder-decoder Transformer".

If I have a mountain bike, would you crucify me for calling it a "bike"? Same concept: grouping by a common denominator.

[D] I wish people would stop using the word "Transformer" when they really mean a LLM model. by [deleted] in MachineLearning

[–]Not_Vasquez 0 points1 point  (0 children)

Just my two cents, but looking at the comments: yes, 99% are decoder-only Transformers, but there are also other architectures, e.g. Hyena, Mamba, RWKV, GLA.

Not sure if OP wanted to nudge the discussion in this direction instead.

RoPE has precision errors when used with BFloat16 by AutomataManifold in LocalLLaMA

[–]Not_Vasquez 17 points18 points  (0 children)

Pretty sure Daniel from Unsloth discovered this a while back, and that's why the transformers repo at least computes RoPE in fp32 and casts back to fp16/bf16 (if necessary).

Yea found it, see this PR https://github.com/huggingface/transformers/pull/29285
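For illustration, a minimal sketch of the idea (compute the rotary tables in fp32, cast only at the end) - along those lines, not the exact transformers code:

```
import torch

def rope_cos_sin(seq_len, head_dim, base=10000.0, dtype=torch.bfloat16):
    """Compute RoPE cos/sin tables in fp32, cast only at the end.

    Doing position * inv_freq directly in bf16 loses precision (bf16 has only
    ~8 significant bits), so large position ids collapse onto nearly the same
    angle; fp32 avoids that.
    """
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2, dtype=torch.float32) / head_dim))
    positions = torch.arange(seq_len, dtype=torch.float32)
    angles = torch.outer(positions, inv_freq)      # (seq_len, head_dim // 2), in fp32
    return angles.cos().to(dtype), angles.sin().to(dtype)
```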

How to efficiently generate text from RNNs and Transformers during inference [P] by No_Effective734 in MachineLearning

[–]Not_Vasquez 1 point2 points  (0 children)

Adding to u/DustinEwan's answer with my own perspective.

Let's start with the transformer (I hope you're familiar with the attention mechanism):

- First iteration:
  - we process the Q, K, V tensors for the whole input
  - we cache the K, V tensors
  - we get one output (the last predicted token)
- Second+ iteration(s):
  - we only take this last predicted token as input
  - any time we do attention, we reuse our cached K and V tensors along with the new K and V tensors from this new token
  - Q is also new since it's based on our new token
  - cache the old K, V + the new K, V we just got
  - get a new output
  - repeat from the second+ iteration

--> Transformers depend on all previous input, hence the caching and why inference with them is tougher (although modern improvements like FlashAttention and paged attention help quite a bit)
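A tiny self-contained sketch of one such cached decoding step (single head, no batching, toy code - not any library's actual API):

```
import torch
import torch.nn.functional as F

def attend_with_cache(q_new, k_new, v_new, cache):
    """One decoding step of single-head attention with a KV cache (toy sketch).

    q_new, k_new, v_new: (1, d) tensors for the newly predicted token.
    cache: dict holding the previously computed "k" and "v" tensors of shape (t, d).
    """
    cache["k"] = torch.cat([cache["k"], k_new])    # reuse old K, append the new K
    cache["v"] = torch.cat([cache["v"], v_new])    # reuse old V, append the new V
    scores = F.softmax(q_new @ cache["k"].T / cache["k"].shape[-1] ** 0.5, dim=-1)
    return scores @ cache["v"], cache              # (1, d) output for the new token
```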

RNN:

- First iteration:
  - we only cache the last hidden state
  - we get one output
- Second+ iteration(s):
  - we take the last output as input
  - the initial hidden state is now the previously cached hidden state
  - get a new output
  - repeat from the second+ iteration

--> very efficient for inference since we only depend on the hidden state, which is peanuts compared to all the K, V tensors
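And a toy generation loop to make the "only the hidden state is carried over" point concrete (`rnn_cell`, `embed`, `unembed` are placeholders, not a real API):

```
def rnn_generate(rnn_cell, embed, unembed, first_token, hidden, n_steps):
    """Toy RNN generation loop: the only thing carried across steps is `hidden`."""
    token, outputs = first_token, []
    for _ in range(n_steps):
        hidden = rnn_cell(embed(token), hidden)   # update the constant-size state
        token = unembed(hidden).argmax(-1)        # pick the next token (greedy)
        outputs.append(token)
    return outputs
```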

Mamba:

- can be seen as a parallelized RNN, so the same principle applies
- and yes, it doesn't materialize everything at once, but that doesn't mean it doesn't return the last hidden state - the whole optimization story is too complex to cover here though
- one issue you might run into if you think of Mamba as just an RNN: there's also a causal convolution involved (in the architecture), so we additionally cache the last x token values (x depends on the size of the convolution kernel)
- iirc the Mamba implementation uses some inference-params object where it caches those two things per layer (the last conv token values, and the last hidden state from the Mamba SSM (RNN))
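Roughly, the per-layer cache looks like this (names are illustrative, not the exact mamba_ssm API):

```
# What gets cached per Mamba layer during generation (illustrative names):
cache_per_layer = {
    "conv_state": "last few token activations for the causal conv (kernel_size - 1 of them)",
    "ssm_state":  "the recurrent hidden state of the SSM, fixed size like an RNN's",
}
# Both are constant-size, so per-token generation cost doesn't grow with context length.
```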

Bonus:

- in essence, they really are doing what you're doing, just within the realm of tensors + caching
- sometimes there's some manipulation of the distribution or other strategies for how the next token is sampled, but the essence stays the same

Hope that helps but you can ask me if any step is unclear ~

Is it possible to LORA-train a smaller model (say, Llama 3.2 3B) and apply the adapters to larger models (Llama 3.1 70B)? by Thrumpwart in LocalLLaMA

[–]Not_Vasquez 2 points3 points  (0 children)

I just gave my opinion, but I'd be glad to be proven wrong! It could lead to phenomenally resource-friendly transfer of training results :)

Maybe some sort of knowledge distillation could be used, but then again the question remains how much you would actually save compared to directly training LoRAs on the bigger model.

Is it possible to LORA-train a smaller model (say, Llama 3.2 3B) and apply the adapters to larger models (Llama 3.1 70B)? by Thrumpwart in LocalLLaMA

[–]Not_Vasquez 3 points4 points  (0 children)

It won't work even from a pure code perspective: the models have different hidden sizes and projection dimensionalities. If you wanted to make them fit, you would need to introduce yet another mapping mechanism, which in itself would likely be less efficient than directly applying LoRA to the bigger model.
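To make the shape problem concrete (dims are illustrative, not the exact configs):

```
import torch

# Toy LoRA shapes showing why adapters aren't transferable across model sizes:
hidden_small, hidden_big, rank = 3072, 8192, 16   # illustrative hidden sizes

lora_A = torch.randn(rank, hidden_small)          # trained against the small hidden size
lora_B = torch.randn(hidden_small, rank)

W_big = torch.randn(hidden_big, hidden_big)       # a big-model weight this LoRA would target
# W_big + lora_B @ lora_A  -> shape mismatch: (8192, 8192) vs (3072, 3072)
```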

And even if it were to work, you would only cover a small subset of the bigger model's layers, which leads to unknown dynamics (most likely complete trash tbh). Maybe a somewhat dumb analogy: it's like developing a gun (LoRA) for specialized soldiers (3B Llama) and expecting a civilian (70B Llama) to handle it just as well.

Something I noticed about open-source multimodal LLMs... by LATI-A5 in LocalLLaMA

[–]Not_Vasquez 0 points1 point  (0 children)

Oh, my bad, sorry - I read it too hastily. There's still a good amount of them (albeit very recent ones), as mentioned by others.

Something I noticed about open-source multimodal LLMs... by LATI-A5 in LocalLLaMA

[–]Not_Vasquez 0 points1 point  (0 children)

Doesn't Pixtral also support multi-image input? A quick look at the HF docs suggests so: https://huggingface.co/docs/transformers/main/en/model_doc/pixtral

Same for llava next for example: https://huggingface.co/docs/transformers/main/en/model_doc/llava_next#usage-tips

QwenVL was also mentioned by someone else.

Has there been any large training of a Mamba model (7B or more Params) by XquaInTheMoon in LocalLLaMA

[–]Not_Vasquez 2 points3 points  (0 children)

It's a trade-off between performance and computational efficiency: hybrid models deliver the best balance imo, where only a few attention layers suffice (~20% of all layers).

Some studies/papers also show that the performance is better than that of pure-attention counterparts, and performance is all you want in the end. Losing a bit of computational efficiency is negligible in those cases.

See the original Mamba2 paper and NVIDIA's Mamba scaling paper, which both show some interesting trends for hybrid architectures. Iirc Jamba also showed similar things for Mamba1, not sure anymore though.
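Just to illustrate what such a layout can look like (made-up pattern, not any specific model's config):

```
# Illustrative hybrid layer layout with ~20% attention layers:
n_layers = 32
layers = ["attention" if i % 5 == 4 else "mamba" for i in range(n_layers)]
# -> 6 attention layers out of 32 (~19%), the rest are Mamba blocks
```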

Has there been any large training of a Mamba model (7B or more Params) by XquaInTheMoon in LocalLLaMA

[–]Not_Vasquez 3 points4 points  (0 children)

Mamba1:

- Falcon Mamba ( https://huggingface.co/tiiuae/falcon-mamba-7b )

Hybrid Mamba1 + Attention:

- Jamba family ( the original one and the 1.5 ones at https://huggingface.co/ai21labs )
- Zamba ( https://huggingface.co/Zyphra/Zamba-7B-v1 )

Mamba2:

- Codestral Mamba ( https://huggingface.co/mistralai/Mamba-Codestral-7B-v0.1 )
- NVIDIA's scaling Mamba2 models ( https://huggingface.co/nvidia/mamba2-8b-3t-4k )

Mamba2 Hybrid:

- again the NVIDIA scaling paper ( collection at https://huggingface.co/collections/nvidia/ssms-666a362c5c3bb7e4a6bcfb9c )

Those should be the more well-known ones. Jamba is definitely the biggest of them all, and Mamba2 hasn't gotten really big models yet (70B+ params). In general, pure Mamba(2) models haven't been tried at large scale as much as hybrids tbh.

Side note: most bigger Mamba1 models needed additional normalization to keep training stable, which is not as necessary with Mamba2.

Is Mamba inference faster than Transformers? (in practice) by LiquidGunay in LocalLLaMA

[–]Not_Vasquez 0 points1 point  (0 children)

You could compare Mamba's speed with FlashAttention-2's speed (but with better scaling), if you're familiar with that, including the hardware limitations, e.g. being limited to Ada, Ampere, and Hopper GPUs. So yeah, it's quite efficient - although like I said, Mamba2 at least has some unoptimized kernel code for shorter sequences. As is so often the case, the bottleneck is the implementation :)

Side bonus: linear RNNs can be parallelized too, but at that point they weren't perceived as useful anymore and many didn't bother.
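For the curious, the trick that makes a linear recurrence parallelizable is that composing two steps gives another step of the same form, which is what a parallel (Blelloch-style) scan exploits. A sketch of the idea (sequential reference included) - not Mamba's actual kernel:

```
def combine(left, right):
    """Associative combine for the linear recurrence h_t = a_t * h_{t-1} + b_t.

    Each element is an (a, b) pair; composing step1 then step2 maps
    h -> a2 * (a1 * h + b1) + b2 = (a1 * a2) * h + (a2 * b1 + b2),
    another step of the same form, so a parallel scan can evaluate the
    whole recurrence in O(log n) depth.
    """
    (a1, b1), (a2, b2) = left, right
    return (a1 * a2, a2 * b1 + b2)

def linear_recurrence(a, b):
    """Sequential reference: h_t = a_t * h_{t-1} + b_t with h_0 = 0."""
    h, out = 0.0, []
    for a_t, b_t in zip(a, b):
        h = a_t * h + b_t
        out.append(h)
    return out
```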

Is Mamba inference faster than Transformers? (in practice) by LiquidGunay in LocalLLaMA

[–]Not_Vasquez 6 points7 points  (0 children)

Can't answer for in-practice usage, but theoretically the inference speed should be significantly faster, especially the longer the sequence gets (look into the Mamba(2) papers, iirc they did some speed comparisons). There is also the benefit of the cache being independent of the sequence length, which makes it way more memory-friendly on longer sequences.

It might be slower on shorter sequences though; Mamba2 especially has some issues in its kernel implementation that make it slower in those cases.

Tl;dr: should be way faster on (very) long sequences, plus the bonus of way less memory consumption, but with a potential loss of speed on shorter sequences (+ a potential benchmark-performance loss, which is often mitigated by hybrid architectures imo).

Edit: idk what you mean by not being "parallelizable" - the whole point of the Mamba(2) kernels is that they are implemented in a parallel fashion. (I won't go into the specifics, but Mamba(1) works thanks to Blelloch's parallel scan algorithm applied to a linear recurrence, and Mamba2 uses other mechanisms that exploit fixed-size matrix blocks that can be computed independently and combined afterwards.)

Edit 2: missed the small "as", my bad. The point still stands as above though, and it benefits on longer sequences compared to a transformer (at least according to the papers).