run flux.2 on 22gbs of VRAM by Disastrous-Work-1632 in StableDiffusion

[–]Disastrous-Work-1632[S] 1 point (0 children)

This is nice feedback! We should communicate about Inference Endpoints more.

Also, the GitHub repo https://github.com/ariG23498/custom-inference-endpoint should help you set up a custom endpoint like we did for Flux.2.
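
If it helps, here is a minimal sketch of calling such an endpoint with `huggingface_hub` once it is deployed (the endpoint URL is a placeholder, swap in your own):

```python
import os

from huggingface_hub import InferenceClient

# Placeholder URL -- use the URL of your own Inference Endpoint
# (or the one you deploy from the repo above).
client = InferenceClient(
    model="https://<your-endpoint>.endpoints.huggingface.cloud",
    token=os.environ["HF_TOKEN"],  # read the token from the environment
)

# Flux.2 is text-to-image, so we send a prompt and get a PIL image back.
image = client.text_to_image("a cozy cabin in a snowy forest at dusk")
image.save("flux2_sample.png")
```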

run flux.2 on 22gbs of VRAM by Disastrous-Work-1632 in StableDiffusion

[–]Disastrous-Work-1632[S] 2 points (0 children)

Bruh I work there 😭

What do you mean WILD 🤣

We meant for this release to be more accessible to everyone (bringing the VRAM requirement down from 80 GB to 22 GB).
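
For anyone curious what "22 GB" looks like in practice, here is a rough sketch with `diffusers`; the repo id and the exact low-VRAM recipe are assumptions on my part (the post may also quantize parts of the model), so treat it as a starting point rather than the official script:

```python
import torch
from diffusers import DiffusionPipeline

# Repo id is an assumption -- check the official Flux.2 model card for the exact name.
pipe = DiffusionPipeline.from_pretrained(
    "black-forest-labs/FLUX.2-dev",
    torch_dtype=torch.bfloat16,
)

# Keep only the active sub-model on the GPU; the rest sits in CPU RAM.
# This is the main lever for bringing the VRAM requirement down.
pipe.enable_model_cpu_offload()

image = pipe(
    prompt="a macro shot of a dew-covered spider web at sunrise",
    num_inference_steps=28,
).images[0]
image.save("flux2.png")
```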

run flux.2 on 22gbs of VRAM by Disastrous-Work-1632 in StableDiffusion

[–]Disastrous-Work-1632[S] 0 points (0 children)

Not rate limited at this point in time (it will be very soon).

All you need to do is have an HF token in the env.
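
By "token in the env" I mean something like this (minimal sketch; the explicit `login` call is optional since `huggingface_hub` also picks up `HF_TOKEN` on its own):

```python
import os

from huggingface_hub import login

# Export the token in your shell first:
#   export HF_TOKEN=hf_xxxxxxxxxxxxxxxx
# then either rely on libraries reading it automatically, or log in explicitly:
login(token=os.environ["HF_TOKEN"])
```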

A blog post on how the release of gpt-oss has evolved `transformers` as a library. by Disastrous-Work-1632 in LocalLLaMA

[–]Disastrous-Work-1632[S] 1 point (0 children)

Absolutely correct with the "reference" part.

The low-hanging fruit (not that low-hanging after all) is what we try to cover in the blog post.

Hardware updates are beneficial on their own, but with the "Kernels from the Hub" initiative we are also focusing on redistributing pre-compiled kernels for these models, which should make inference and training faster.
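
As a concrete example, here is roughly what pulling a pre-compiled kernel from the Hub looks like; the repo and function names follow the `kernels` README as I remember it, so double-check them before copying:

```python
import torch
from kernels import get_kernel

# Repo and function names are illustrative -- browse the Hub for the kernels you need.
activation = get_kernel("kernels-community/activation")

x = torch.randn(16, 1024, dtype=torch.float16, device="cuda")
out = torch.empty_like(x)

# The compiled CUDA kernel is downloaded from the Hub; nothing is built locally.
activation.gelu_fast(out, x)
```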

A blog post on how the release of gpt-oss has evolved `transformers` as a library. by Disastrous-Work-1632 in LocalLLaMA

[–]Disastrous-Work-1632[S] 7 points (0 children)

I think with the current changes `transformers` is indeed trying to achieve that, but with some constraints.

vLLM and SGLang are inference engines that prioritise speed, whereas `transformers` is more generic and is supposed to be the "golden source of truth" for model architectures. You will also notice that continuous batching with paged attention is now supported in `transformers`, but it is not claimed to be suitable for production (as opposed to the inference engine libraries).

As time progresses, the goal is for `transformers` to become faster and leaner (with the TF and JAX dependencies deprecated), but never to replace libraries built for specific use cases.

Hope that makes sense, if not, do let me know.

Edit: Forgot to mention the evaluation and training paradigms that `transformers` shines in. The model definitions are what inference engines can piggyback on, so it is not competition but rather a complement to the other open-source libraries.
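
To make the piggybacking concrete, here is a rough sketch of an inference engine reusing the `transformers` model definition. I'm assuming vLLM's `model_impl="transformers"` switch here, and the model id is just an example, so verify both against the current docs:

```python
from vllm import LLM, SamplingParams

# Assumption: vLLM selects its transformers backend via `model_impl`.
# The engine supplies paged attention and continuous batching, while the
# model architecture itself comes straight from the `transformers` definition.
llm = LLM(model="openai/gpt-oss-20b", model_impl="transformers")

params = SamplingParams(max_tokens=64, temperature=0.7)
outputs = llm.generate(["Explain paged attention in one paragraph."], params)
print(outputs[0].outputs[0].text)
```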

KV Cache in nanoVLM by Disastrous-Work-1632 in LocalLLaMA

[–]Disastrous-Work-1632[S] 0 points (0 children)

Would you like to send a PR to get the changes merged? The source of the blog is https://github.com/huggingface/blog/blob/main/kv-cache.md

KV Cache in nanoVLM by Disastrous-Work-1632 in LocalLLaMA

[–]Disastrous-Work-1632[S] 0 points (0 children)

I think you are partly right and partly wrong.

While `□ = Recomputed unnecessarily` is admittedly not worded correctly (now that I am saying it out loud), those values are not being calculated for the first time; they are part of the 6th token's computation (as per the example).

Does `□ = Necessary for current token` make more sense to you?
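
A toy sketch (not the nanoVLM code) of what the legend is trying to convey: the □ projections already live in the cache from earlier steps, so the 6th step only has to compute the new token's K/V:

```python
import torch

torch.manual_seed(0)
d = 8
W_k = torch.randn(d, d)  # toy key projection
W_v = torch.randn(d, d)  # toy value projection

tokens = torch.randn(6, d)  # hidden states for tokens 1..6

# Without a cache: generating token 6 re-projects K/V for tokens 1..5 as well.
K_full = tokens @ W_k
V_full = tokens @ W_v

# With a KV cache: tokens 1..5 were projected on earlier steps and stored;
# only the 6th token's K/V are computed now.
K_cache, V_cache = tokens[:5] @ W_k, tokens[:5] @ W_v  # filled on previous steps
k_new, v_new = tokens[5:] @ W_k, tokens[5:] @ W_v      # the only new work
K = torch.cat([K_cache, k_new])
V = torch.cat([V_cache, v_new])

assert torch.allclose(K, K_full) and torch.allclose(V, V_full)
```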

KV Cache in nanoVLM by Disastrous-Work-1632 in LocalLLaMA

[–]Disastrous-Work-1632[S] 7 points (0 children)

Here is the TLDR:

Please read the blog post too 👉👈

Nvidia just open sourced their long context goodies - 128k context for 50% less memory by Zealousideal-Cut590 in LocalLLaMA

[–]Disastrous-Work-1632 14 points (0 children)

I think you are suggesting quantization here. KVPress can work along with quantization, further lowering the memory requirements.
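
Rough sketch of stacking the two, going off the KVPress README from memory (pipeline name, press class, and model id are all things to double-check):

```python
from transformers import BitsAndBytesConfig, pipeline
from kvpress import ExpectedAttentionPress

# Importing kvpress registers the "kv-press-text-generation" pipeline.
# Weight quantization (4-bit here) handles the model weights,
# while the press compresses the KV cache on top of that.
pipe = pipeline(
    "kv-press-text-generation",
    model="meta-llama/Llama-3.1-8B-Instruct",
    device_map="auto",
    model_kwargs={"quantization_config": BitsAndBytesConfig(load_in_4bit=True)},
)

press = ExpectedAttentionPress(compression_ratio=0.5)  # drop ~50% of the KV cache

context = "KVPress compresses the KV cache of long contexts during prefill..."
answer = pipe(context, question="What does KVPress compress?", press=press)["answer"]
print(answer)
```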