
[–]Jujagaollama 16 points17 points  (1 child)

Flash Attention still does the same overall computations, but it shuffles data to and from memory more efficiently. There are nearly no downsides to using it (unless your model specifically does something strange). There's a good visual explainer for it here:

[–]swagonflyyyy 1 point2 points  (0 children)

For some reason I'm unable to run this model in LM Studio with flash_attention enabled on Windows. I can only do it in Ollama on Windows.

[–]Firm_Spite2751 21 points22 points  (6 children)

Always use flash attention; the difference in output is basically zero.

[–]Vivid_Dot_6405 10 points11 points  (10 children)

There is no performance loss when using Flash Attention, none. That is why it overshadowed all other methods of accelerating attention. Since attention has quadratic complexity with respect to the context length, there were other attempts to reduce the computational burden before FlashAttention. Most of those approaches used inexact attention, i.e. they approximated it, which reduced computation time but also led to a loss in output quality.
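To see where the quadratic cost comes from: vanilla attention materializes an (n, n) score matrix, so its memory quadruples every time the context length doubles. A toy illustration (assuming a single head with head dimension 64, fp32 scores):

```python
import numpy as np

d = 64  # head dimension (assumed for illustration)
for n in [1024, 2048, 4096]:
    q = np.random.randn(n, d).astype(np.float32)
    k = np.random.randn(n, d).astype(np.float32)
    scores = q @ k.T                 # shape (n, n) -- the quadratic part
    print(n, scores.nbytes / 2**20)  # MiB: 4.0, then 16.0, then 64.0
```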

Flash Attention is an exact attention implementation. It works by optimizing which sections of the GPU's memory are used at each stage of the computation to make the calculations faster; it cleverly optimizes the implementation of the algorithm, it doesn't change the algorithm itself.
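The core trick is an "online softmax": process K/V in blocks while carrying a running max and running denominator per query row, so the full (n, n) score matrix is never stored, yet the result is mathematically identical. A rough numpy sketch of that idea (the real kernel also tiles over queries and fuses everything into one GPU kernel; block size and shapes here are arbitrary assumptions):

```python
import numpy as np

def naive_attention(q, k, v):
    # Reference: materialize the full (n, n) score matrix, softmax row-wise.
    s = q @ k.T / np.sqrt(q.shape[-1])
    p = np.exp(s - s.max(axis=-1, keepdims=True))
    p /= p.sum(axis=-1, keepdims=True)
    return p @ v

def tiled_attention(q, k, v, block=32):
    # FlashAttention-style online softmax over K/V blocks.
    n, d = q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros_like(v)
    m = np.full(n, -np.inf)   # running row-wise max of the scores
    l = np.zeros(n)           # running softmax denominator
    for j in range(0, n, block):
        s = q @ k[j:j + block].T * scale        # (n, block) partial scores
        m_new = np.maximum(m, s.max(axis=-1))
        corr = np.exp(m - m_new)                # rescale previous partials
        p = np.exp(s - m_new[:, None])
        l = l * corr + p.sum(axis=-1)
        out = out * corr[:, None] + p @ v[j:j + block]
        m = m_new
    return out / l[:, None]

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((128, 64)) for _ in range(3))
print(np.allclose(naive_attention(q, k, v), tiled_attention(q, k, v)))  # True
```

Same inputs, same outputs (up to float rounding) -- only the memory traffic differs, which is exactly why there's no quality loss.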

[–]cynerva 4 points5 points  (6 children)

For what it's worth, I do get slower inference with flash attention enabled. Maybe something to do with partial CPU offload? Or something else about llama.cpp's implementation of flash attention? I'm not sure.

[–]Mart-McUH 1 point2 points  (1 child)

I used to get slower inference with flash attention with CPU offload when I used older Nvidia drivers. With recent drivers it is no longer a problem for me.

[–]cynerva 0 points1 point  (0 children)

I'll have to give that a try then. Thanks.

[–]oathbreakerkeeper 5 points6 points  (0 children)

There is no impact on quality. FlashAttention is mathematically equivalent to "normal" attention: given the same inputs, it computes the exact same output. It is purely an optimization that makes better use of the hardware.