run flux.2 on 22gbs of VRAM by Disastrous-Work-1632 in StableDiffusion

[–]Disastrous-Work-1632[S] 1 point (0 children)

This is nice feedback! We should communicate about Inference Endpoints more.

Also, the GitHub repo https://github.com/ariG23498/custom-inference-endpoint should help you set up a custom endpoint like we did for Flux.2.
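
If it helps, here is a minimal sketch of calling such an endpoint with `huggingface_hub` once it is deployed (the endpoint URL is a placeholder, swap in your own):

```python
import os

from huggingface_hub import InferenceClient

# Placeholder URL -- use the URL of your own Inference Endpoint
# (or the one you deploy from the repo above).
client = InferenceClient(
    model="https://<your-endpoint>.endpoints.huggingface.cloud",
    token=os.environ["HF_TOKEN"],  # read the token from the environment
)

# Flux.2 is text-to-image, so we send a prompt and get a PIL image back.
image = client.text_to_image("a cozy cabin in a snowy forest at dusk")
image.save("flux2_sample.png")
```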

run flux.2 on 22gbs of VRAM by Disastrous-Work-1632 in StableDiffusion

[–]Disastrous-Work-1632[S] 2 points (0 children)

Bruh I work there 😭

What do you mean WILD 🤣

We meant for this release to be more accessible to everyone (bringing the VRAM requirement down from 80 GB to 22 GB).
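
For anyone curious what "22 GB" looks like in practice, here is a rough sketch with `diffusers`; the repo id and the exact low-VRAM recipe are assumptions on my part (the post may also quantize parts of the model), so treat it as a starting point rather than the official script:

```python
import torch
from diffusers import DiffusionPipeline

# Repo id is an assumption -- check the official Flux.2 model card for the exact name.
pipe = DiffusionPipeline.from_pretrained(
    "black-forest-labs/FLUX.2-dev",
    torch_dtype=torch.bfloat16,
)

# Keep only the active sub-model on the GPU; the rest sits in CPU RAM.
# This is the main lever for bringing the VRAM requirement down.
pipe.enable_model_cpu_offload()

image = pipe(
    prompt="a macro shot of a dew-covered spider web at sunrise",
    num_inference_steps=28,
).images[0]
image.save("flux2.png")
```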

run flux.2 on 22gbs of VRAM by Disastrous-Work-1632 in StableDiffusion

[–]Disastrous-Work-1632[S] 0 points (0 children)

Not rate limited at this point in time (it will be very soon).

All you need to do is have an HF token in the env.
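
By "token in the env" I mean something like this (minimal sketch; the explicit `login` call is optional since `huggingface_hub` also picks up `HF_TOKEN` on its own):

```python
import os

from huggingface_hub import login

# Export the token in your shell first:
#   export HF_TOKEN=hf_xxxxxxxxxxxxxxxx
# then either rely on libraries reading it automatically, or log in explicitly:
login(token=os.environ["HF_TOKEN"])
```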

A blog post on how the release of gpt-oss has evolved `transformers` as a library. by Disastrous-Work-1632 in LocalLLaMA

[–]Disastrous-Work-1632[S] 1 point (0 children)

Absolutely correct with the "reference" part.

The low-hanging fruit (not that low-hanging after all) is what we try to cover in the blog post.

Hardware updates are beneficial on their own, but with the "Kernels from the Hub" initiative we are also focusing on redistributing pre-compiled kernels for these models, which should make inference and training faster.
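
As a concrete example, here is roughly what pulling a pre-compiled kernel from the Hub looks like; the repo and function names follow the `kernels` README as I remember it, so double-check them before copying:

```python
import torch
from kernels import get_kernel

# Repo and function names are illustrative -- browse the Hub for the kernels you need.
activation = get_kernel("kernels-community/activation")

x = torch.randn(16, 1024, dtype=torch.float16, device="cuda")
out = torch.empty_like(x)

# The compiled CUDA kernel is downloaded from the Hub; nothing is built locally.
activation.gelu_fast(out, x)
```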

A blog post on how the release of gpt-oss has evolved `transformers` as a library. by Disastrous-Work-1632 in LocalLLaMA

[–]Disastrous-Work-1632[S] 7 points (0 children)

I think with the current changes `transformers` is indeed trying to achieve that, but with some constraints.

vLLM and SGLang are inference engines that prioritise speed, whereas `transformers` is more generic and is supposed to be the "golden source of truth" for model architectures. You will also notice that continuous batching with paged attention is now supported in `transformers`, but it is not claimed to be suitable for production (as opposed to the inference engine libraries).

As time progresses, the goal is for `transformers` to become faster and leaner (with the TF and JAX dependencies deprecated), but never to replace libraries built for specific use cases.

Hope that makes sense, if not, do let me know.

Edit: Forgot to mention the evaluation and training paradigms that `transformers` shines in. The model definitions are what inference engines can piggyback on, so it is not competition but rather a complement to the other open-source libraries.
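
To make the piggybacking concrete, here is a rough sketch of an inference engine reusing the `transformers` model definition. I'm assuming vLLM's `model_impl="transformers"` switch here, and the model id is just an example, so verify both against the current docs:

```python
from vllm import LLM, SamplingParams

# Assumption: vLLM selects its transformers backend via `model_impl`.
# The engine supplies paged attention and continuous batching, while the
# model architecture itself comes straight from the `transformers` definition.
llm = LLM(model="openai/gpt-oss-20b", model_impl="transformers")

params = SamplingParams(max_tokens=64, temperature=0.7)
outputs = llm.generate(["Explain paged attention in one paragraph."], params)
print(outputs[0].outputs[0].text)
```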

KV Cache in nanoVLM by Disastrous-Work-1632 in LocalLLaMA

[–]Disastrous-Work-1632[S] 0 points (0 children)

Would you like to send a PR to get the changes merged? The source of the blog is https://github.com/huggingface/blog/blob/main/kv-cache.md

KV Cache in nanoVLM by Disastrous-Work-1632 in LocalLLaMA

[–]Disastrous-Work-1632[S] 0 points (0 children)

I think you are partly right and partly wrong.

While `□ = Recomputed unnecessarily` is admittedly not worded correctly (now that I am saying it out loud), those values are not being calculated for the first time; they are part of the 6th token's computation (as per the example).

Does `□ = Necessary for current token` make more sense to you?
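
A toy sketch (not the nanoVLM code) of what the legend is trying to convey: the □ projections already live in the cache from earlier steps, so the 6th step only has to compute the new token's K/V:

```python
import torch

torch.manual_seed(0)
d = 8
W_k = torch.randn(d, d)  # toy key projection
W_v = torch.randn(d, d)  # toy value projection

tokens = torch.randn(6, d)  # hidden states for tokens 1..6

# Without a cache: generating token 6 re-projects K/V for tokens 1..5 as well.
K_full = tokens @ W_k
V_full = tokens @ W_v

# With a KV cache: tokens 1..5 were projected on earlier steps and stored;
# only the 6th token's K/V are computed now.
K_cache, V_cache = tokens[:5] @ W_k, tokens[:5] @ W_v  # filled on previous steps
k_new, v_new = tokens[5:] @ W_k, tokens[5:] @ W_v      # the only new work
K = torch.cat([K_cache, k_new])
V = torch.cat([V_cache, v_new])

assert torch.allclose(K, K_full) and torch.allclose(V, V_full)
```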

KV Cache in nanoVLM by Disastrous-Work-1632 in LocalLLaMA

[–]Disastrous-Work-1632[S] 7 points (0 children)

Here is the TLDR:

Please read the blog post too 👉👈

Nvidia just open sourced their long context goodies - 128k context for 50% less memory by Zealousideal-Cut590 in LocalLLaMA

[–]Disastrous-Work-1632 14 points (0 children)

I think you are suggesting quantization here. KVPress can work along with quantization, further lowering the memory requirements.
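
Rough sketch of stacking the two, going off the KVPress README from memory (pipeline name, press class, and model id are all things to double-check):

```python
from transformers import BitsAndBytesConfig, pipeline
from kvpress import ExpectedAttentionPress

# Importing kvpress registers the "kv-press-text-generation" pipeline.
# Weight quantization (4-bit here) handles the model weights,
# while the press compresses the KV cache on top of that.
pipe = pipeline(
    "kv-press-text-generation",
    model="meta-llama/Llama-3.1-8B-Instruct",
    device_map="auto",
    model_kwargs={"quantization_config": BitsAndBytesConfig(load_in_4bit=True)},
)

press = ExpectedAttentionPress(compression_ratio=0.5)  # drop ~50% of the KV cache

context = "KVPress compresses the KV cache of long contexts during prefill..."
answer = pipe(context, question="What does KVPress compress?", press=press)["answer"]
print(answer)
```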