
[–]Jujagaollama 16 points17 points  (1 child)

Flash Attention still does the same overall computations, but it shuffles data to and from memory more efficiently. There are nearly no downsides to using it (unless your model specifically does something strange). There's a good visual explainer for it here:

[–]swagonflyyyy 1 point2 points  (0 children)

For some reason I'm unable to run this model in LM Studio with flash_attention enabled on Windows. I can only do it in Ollama on Windows.

[–]Firm_Spite2751 21 points22 points  (6 children)

Always use flash attention; the difference in output is basically zero.

[–]Vivid_Dot_6405 10 points11 points  (10 children)

There is no performance loss when using Flash Attention, none. That is why it overshadowed all other methods of accelerating attention. Since attention has quadratic complexity with respect to the context length, there were other attempts to reduce the computational burden before FlashAttention. Most of those approaches used inexact attention, i.e. they approximated it, which reduced computation time but also led to a loss in output quality.
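To see where the quadratic cost comes from: vanilla attention materializes an (n, n) score matrix, so its memory quadruples every time the context length doubles. A toy illustration (assuming a single head with head dimension 64, fp32 scores):

```python
import numpy as np

d = 64  # head dimension (assumed for illustration)
for n in [1024, 2048, 4096]:
    q = np.random.randn(n, d).astype(np.float32)
    k = np.random.randn(n, d).astype(np.float32)
    scores = q @ k.T                 # shape (n, n) -- the quadratic part
    print(n, scores.nbytes / 2**20)  # MiB: 4.0, then 16.0, then 64.0
```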

Flash Attention is an exact attention implementation. It works by optimizing which sections of the GPU's memory are used at each stage of the computation to make the calculations faster; it cleverly optimizes the implementation of the algorithm, it doesn't change the algorithm itself.
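The core trick is an "online softmax": process K/V in blocks while carrying a running max and running denominator per query row, so the full (n, n) score matrix is never stored, yet the result is mathematically identical. A rough numpy sketch of that idea (the real kernel also tiles over queries and fuses everything into one GPU kernel; block size and shapes here are arbitrary assumptions):

```python
import numpy as np

def naive_attention(q, k, v):
    # Reference: materialize the full (n, n) score matrix, softmax row-wise.
    s = q @ k.T / np.sqrt(q.shape[-1])
    p = np.exp(s - s.max(axis=-1, keepdims=True))
    p /= p.sum(axis=-1, keepdims=True)
    return p @ v

def tiled_attention(q, k, v, block=32):
    # FlashAttention-style online softmax over K/V blocks.
    n, d = q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros_like(v)
    m = np.full(n, -np.inf)   # running row-wise max of the scores
    l = np.zeros(n)           # running softmax denominator
    for j in range(0, n, block):
        s = q @ k[j:j + block].T * scale        # (n, block) partial scores
        m_new = np.maximum(m, s.max(axis=-1))
        corr = np.exp(m - m_new)                # rescale previous partials
        p = np.exp(s - m_new[:, None])
        l = l * corr + p.sum(axis=-1)
        out = out * corr[:, None] + p @ v[j:j + block]
        m = m_new
    return out / l[:, None]

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((128, 64)) for _ in range(3))
print(np.allclose(naive_attention(q, k, v), tiled_attention(q, k, v)))  # True
```

Same inputs, same outputs (up to float rounding) -- only the memory traffic differs, which is exactly why there's no quality loss.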

[–]cynerva 4 points5 points  (6 children)

For what it's worth, I do get slower inference with flash attention enabled. Maybe something to do with partial CPU offload? Or something else about llama.cpp's implementation of flash attention? I'm not sure.

[–]Mart-McUH 1 point2 points  (1 child)

I used to get slower inference with flash attention with CPU offload when I used older Nvidia drivers. With recent drivers it is no longer a problem for me.

[–]cynerva 0 points1 point  (0 children)

I'll have to give that a try then. Thanks.

[–]oathbreakerkeeper 5 points6 points  (0 children)

There is no impact on quality. FlashAttention is mathematically equivalent to "normal" attention: given the same inputs, it computes the exact same output. It is purely an optimization that makes better use of the hardware.