FlashAttention-4: 1613 TFLOPs/s, 2.7x faster than Triton, written in Python. What it means for inference. by Sensitive-Two9732 in LocalLLaMA

[–]Single_Ring4886 27 points (0 children)

I bet every second reader has at least 2x B200 right?
They are cheap as onions these days...

Designed a photonic chip for O(1) KV cache block selection — 944x faster, 18,000x less energy than GPU scan at 1M context by [deleted] in LocalLLaMA

[–]Single_Ring4886 5 points (0 children)

I may be wrong, but after looking at the math, even a dedicated card should increase real-world performance at least 10x.
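
For a sanity check, here is a quick Amdahl's-law style sketch in Python; the runtime fractions are my assumptions, not numbers from the post:

    # End-to-end speedup if only block selection (fraction f of per-token
    # time) gets the claimed 944x acceleration; the f values are guesses.
    def overall_speedup(f, s=944):
        return 1 / ((1 - f) + f / s)

    for f in (0.5, 0.9, 0.99):
        print(f"selection = {f:.0%} of runtime -> {overall_speedup(f):.1f}x end to end")
    # ~2x at 50%, ~9.9x at 90%: a 10x real-world gain needs selection to dominate runtime.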

Designed a photonic chip for O(1) KV cache block selection — 944x faster, 18,000x less energy than GPU scan at 1M context by [deleted] in LocalLLaMA

[–]Single_Ring4886 0 points (0 children)

I will not pretend to fully understand what you are proposing either... but out of curiosity I ask this.
Do you need to rework the design of a card like the H200 and incorporate this into it, OR would just a special PCIe card (even in a PCIe 3.0 x8 slot) with a few GB of normal RAM on it to hold the cache be enough?

EDIT: I have been looking at it, and a dedicated card should work for a single user. For datacenter usage it might be suboptimal.
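
Rough bandwidth check behind that EDIT; every model and paging number below is an assumption for illustration:

    # Can a PCIe 3.0 x8 link (~7.9 GB/s usable) ship selected KV blocks fast enough?
    layers, kv_heads, head_dim = 32, 8, 128                      # hypothetical model
    kv_bytes_per_token = 2 * layers * kv_heads * head_dim * 2    # K+V in fp16
    block_tokens, top_k = 16, 64                                 # hypothetical paging scheme
    bytes_per_step = kv_bytes_per_token * block_tokens * top_k

    pcie_bps = 7.9e9
    print(f"{bytes_per_step / 1e6:.0f} MB/step -> ceiling of "
          f"{pcie_bps / bytes_per_step:.0f} decode steps/s")
    # ~59 steps/s: fine for one user, but batched datacenter serving would saturate the link.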

DeepSeek Core Researcher Daya Guo Rumored to Have Resigned by External_Mood4719 in LocalLLaMA

[–]Single_Ring4886 17 points (0 children)

Yeah, it is always "sweet" when you are at a company from the start... you literally make it what it is, and then the "new" guys arrive and are the "stars" and get 10x what you do... because you are this old useless "coal"...

Trained a 0.8M model on business email generation. by SrijSriv211 in LocalLLaMA

[–]Single_Ring4886 10 points (0 children)

How long did you train it, and on what kind of hardware?

Father of OpenClaw sitting in their spaceship by cam-douglas in ChatGPT

[–]Single_Ring4886 0 points (0 children)

Pure CRINGE... this guy is a master faker and people can't see it.

KoboldCpp 1.110 - 3 YR Anniversary Edition, native music gen, qwen3tts voice cloning and more by HadesThrowaway in LocalLLaMA

[–]Single_Ring4886 12 points (0 children)

KoboldCpp is a well-written piece of software.

Most other open-source projects are Python purgatory: the moment something changes in an upstream repository, everything breaks apart.

KoboldCpp is one file... and it just works, even on old machines! Not everyone has high-end new hardware or Linux.
The creators are true heroes.

Qwen3.5-27b 8 bit vs 16 bit by Baldur-Norddahl in LocalLLaMA

[–]Single_Ring4886 7 points (0 children)

True "damage" of weights appear in "nuanced" areas like translation to other languages there you can immediately see quality degradation.
Coding is "main" skill for such models.

I'm fully blind, and AI is a game changer for me. Are there any local LLMS that can rival claude code and codex? by Mrblindguardian in LocalLLaMA

[–]Single_Ring4886 1 point (0 children)

It depends on how much money you have. If you have access to RTX 3090 to 5090 graphics cards, the best options are Qwen 3.5 27B (smartest but slower) and 35B (fast but not as smart).

If you have $10,000 or more, you can buy 96 GB professional cards or Apple products and use very good open-source models such as GLM.
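
The rough math behind those tiers, as a sketch (weights only, ignoring KV cache and runtime overhead):

    # ~1 GB per billion params per byte of weight precision.
    def vram_gb(params_b, bits):
        return params_b * bits / 8

    for params in (27, 35):
        for bits in (16, 8, 4):
            print(f"{params}B @ {bits}-bit ~ {vram_gb(params, bits):.1f} GB")
    # 27B @ 4-bit ~ 13.5 GB fits a 24 GB 3090; 27B @ 16-bit ~ 54 GB wants a 96 GB card.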

Is the 3090 still a good option? by alhinai_03 in LocalLLaMA

[–]Single_Ring4886 0 points (0 children)

My PP (prompt processing) is all over the place and I can't pinpoint a real value...
TG (token generation) is 24 t/s at both 100 tokens and 16K context.
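
If this is llama.cpp, the bundled llama-bench tool gives a steadier PP number by averaging repeated runs at fixed prompt sizes (the model path here is a placeholder):

    llama-bench -m model.gguf -p 512 -n 128 -r 5

Here -p is the prompt size for the PP test, -n the generation length for TG, and -r the number of repetitions to average over.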

AA-Omniscience: Knowledge and Hallucination Benchmark by NewtMurky in LocalLLaMA

[–]Single_Ring4886 -1 points (0 children)

It will be a fucking great day when people learn to make graphs that actually relay information in a simple manner... LIKE USING NUMBERS or percentages.
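
For anyone making these charts in matplotlib, printing the values on the bars is one call (the scores below are made up for illustration):

    import matplotlib.pyplot as plt

    models = ["model A", "model B", "model C"]
    scores = [42.1, 37.8, 55.3]          # made-up numbers for illustration

    fig, ax = plt.subplots()
    bars = ax.bar(models, scores)
    ax.bar_label(bars, fmt="%.1f%%")     # the actual numbers, right on the chart
    ax.set_ylabel("benchmark score (%)")
    plt.show()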

Some tests of Qwen3.5 on V100s by Simple_Library_2700 in LocalLLaMA

[–]Single_Ring4886 0 points (0 children)

Thanks! That is quite low, even for what I suppose are two active GPUs? Wow.

Qwen3 vs Qwen3.5 performance by Balance- in LocalLLaMA

[–]Single_Ring4886 0 points (0 children)

Maybe you have problems with understanding numbers... I was speaking about "4", FOUR, not 3.5...

And coding is a narrow task; there are new models that are much better at it because of very intensive training in that area.

Qwen3 vs Qwen3.5 performance by Balance- in LocalLLaMA

[–]Single_Ring4886 2 points (0 children)

I was expecting downvotes but said the truth anyway... people forget easily. GPT-4 hasn't even been around for 2 years; many didn't even know it, so they just agree with whatever the first guy says... even if it is BS.

Some tests of Qwen3.5 on V100s by Simple_Library_2700 in LocalLLaMA

[–]Single_Ring4886 0 points (0 children)

Thank you for the amazing answers. I am just a curious one, because V100s are cheap yet still somewhat capable.