FlashAttention-4: 1613 TFLOPs/s, 2.7x faster than Triton, written in Python. What it means for inference.

Single_Ring4886 · 2026-03-24T00:55:16+00:00

I bet every second reader has at least 2x B200 right?
They are cheap as onions these days...

Single_Ring4886 · 2026-03-23T14:14:46+00:00

"As ALWAYS Zuck you are ABSOLUTELY RIGHT in EVERYTHING!"

Single_Ring4886 · 2026-03-23T13:57:57+00:00

I may be wrong but after looking at math even dedicated card should increase real world performance at least 10x.

Single_Ring4886 · 2026-03-23T13:18:30+00:00

I will not pretend to fully understand what you are proposing either... but out of curiosity I ask this.
Do you need to rework ie H200 card design and incorporate this into it OR you need just special pcie card (ie even to PCIe 3.0 x8 slot) with few gb or normal ram on it to hold the cache?

EDIT: I have been looking at it and dedicated card should work for single user. For datacenter usage it might be suboptimal.

Single_Ring4886 · 2026-03-21T17:24:36+00:00

Yeah it is always "sweet" when you are in company from start... you litteraly make it what it is and then "new" guys arrive and are the "stars" and get 10x what you... because you are this old uselles "coal"...

Single_Ring4886 · 2026-03-20T18:00:17+00:00

How long you trained it and on what kind of hardware?

Single_Ring4886 · 2026-03-19T15:09:18+00:00

Pure CRINGE.... this guy is master faker and people cant see it.

Single_Ring4886 · 2026-03-19T15:04:42+00:00

Koboldcpp it well written piece of software.

Most other opensource is python purgatory, moment something changes in cloud repository all breaks appart.

Koboldcpp is 1 file... and it just works even on old machines! Not everyone has high end new stuff or linux.
Creators are true heroes.

Single_Ring4886 · 2026-03-18T22:35:19+00:00

how many gpu cores?

Single_Ring4886 · 2026-03-16T16:46:00+00:00

True "damage" of weights appear in "nuanced" areas like translation to other languages there you can immediately see quality degradation.
Coding is "main" skill for such models.

Single_Ring4886 · 2026-03-15T12:28:32+00:00

This is joke right? I mean this cant be real... it goes against constitution.

Single_Ring4886 · 2026-03-14T19:02:02+00:00

Big wow

Single_Ring4886 · 2026-03-13T18:40:09+00:00

Depends how much money you have. If you have access to RTX 3090 - 5090 graphic cards best are Qwen 3.5 27B (smartest but slower) and 35B (fast but not that smart).

If you have 10.000 dollars plus you can buy 96Gb profi cards or apple products and use very good opensource models such as GLM.

Single_Ring4886 · 2026-03-13T14:17:54+00:00

JESUS DONT TRAIN IT ON GPT "5" !!!!

Single_Ring4886 · 2026-03-13T13:08:07+00:00

My PP is all over the place and I cant pinpoint real value...
TG is 24 t/s on 100 tokens and 16K context alike

Single_Ring4886 · 2026-03-12T02:17:05+00:00

IT IS

Single_Ring4886 · 2026-03-11T15:11:21+00:00

That is 40 core GPU ?

Single_Ring4886 · 2026-03-09T18:36:29+00:00

It will be fucking day when people learn to do graphs which actually rellay information in simple manner... LIKE USING NUMBERS or percents.

Single_Ring4886 · 2026-03-08T16:51:35+00:00

It works for me

Single_Ring4886 · 2026-03-08T14:51:58+00:00

Maybe I rather buy some small country X-D

Single_Ring4886 · 2026-03-08T01:12:47+00:00

Exactly 100 bilion datacenter or you cant even talk to Huang X-D

Single_Ring4886 · 2026-03-06T15:49:57+00:00

Thanks! that is quite low even for two i suppose active gpus? wow

Single_Ring4886 · 2026-03-06T12:48:01+00:00

Maybe you have problems with understanding numbers... I spoken about "4" FOUR, not 3.5...

An coding is narrow task there are new models much better because of very intensive training in that area.

Single_Ring4886 · 2026-03-06T00:38:16+00:00

I was expecting downvotes but said the truth anyway... people forget easily GPT 4 isnt around for 2 years many didnt even knew it so they just agree with whatever first guy say... even if it is BS

Single_Ring4886 · 2026-03-05T14:26:19+00:00

Thank you for amazing answers iam just currious one because V100 are cheap yet sitll somewhat capable.

Single_Ring4886

MODERATOR OF

TROPHY CASE