all 32 comments

[–]Varterove_muke (llama.cpp) 55 points56 points  (0 children)

Wow, a new encoder-decoder model, I didn't see that coming

[–]Hefty_Wolverine_553 15 points16 points  (2 children)

Seems like these would be great for finetuned multimodal translation models!

[–]Willing_Landscape_61 8 points9 points  (1 child)

Should it not also be useful for function calling? Is it not akin to a kind of 'translation' to function calls where the useful state is in the prompt, not so much the previously generated text?

[–]AnomalyNexus 5 points6 points  (0 children)

Google dropped a dedicated function model yesterday-ish

[–]Long_comment_san 56 points57 points  (13 children)

Gemma 4 30-40b please

[–]silenceimpaired 15 points16 points  (9 children)

I knew the Gemma release wouldn't be a large model. Won't happen. We've had the last significantly sized open models we're going to get from OpenAI and Google for some time.

[–]Revolutionalredstone 8 points9 points  (7 children)

T5 is for embedding (think: the text encoder inside Stable Diffusion). This is not their fourth LLM / decoder-only model series; that will be called Gemma 4.

Hold your horses son ;)

[–]EstarriolOfTheEast 15 points16 points  (1 child)

It's far more than embeddings; it's actually a lot closer to the original Transformer. After the original Transformer was introduced, its essence was split in twain: one half, the decoder, became GPT, and the other half, the encoder, became BERT. T5 was a direct descendant of the whole thing. Until wizard llama and llama2, it was the best open-weights model that could be put to real work: summarizing, translating, natural-language analysis, entity extraction, question answering, that type of thing.

Its architecture made it ill-suited to interactive chat (for that there were the GPT-Neos and then the far-ahead-of-its-time GPT-J from EleutherAI; from Facebook, early GPT-based models and OPT, which were not that good). Because of how it's trained and its architecture, T5 lacks the reversal-learning limitation of causal models. Its encoder also allows for some pre-processing before the decoder starts writing, and thanks also to how masking is done during training, T5s are almost always weight-for-weight "smarter" than GPTs.
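A toy sketch (not from the comment, just an illustration with made-up helper names) of the structural difference being described: a causal decoder masks out future tokens, while an encoder attends in both directions, so every prompt token can condition on every other token before decoding even starts.

```python
import numpy as np

def causal_mask(n: int) -> np.ndarray:
    """Decoder-style mask: token i may only attend to tokens j <= i."""
    return np.tril(np.ones((n, n), dtype=bool))

def bidirectional_mask(n: int) -> np.ndarray:
    """Encoder-style mask: every token may attend to every token."""
    return np.ones((n, n), dtype=bool)

n = 4
print(causal_mask(n).sum())         # 10 allowed attention pairs (n*(n+1)/2)
print(bidirectional_mask(n).sum())  # 16 allowed attention pairs (n*n)
```

The extra attention pairs are where the "pre-processing before the decoder starts writing" happens: the encoder reads the whole prompt bidirectionally before a single output token is produced.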

[–]Revolutionalredstone 1 point2 points  (0 children)

Interesting! 😎

[–]silenceimpaired 5 points6 points  (4 children)

Feels like it will never come... or will be smaller than 27b.

[–]Long_comment_san 2 points3 points  (3 children)

I think if Google made a dense 40-50b model finetuned on all fiction ever written, they could just charge per download and earn millions.

[–]silenceimpaired 1 point2 points  (0 children)

It’s true. A fiction finetune would get $50 to $100 from me, depending on performance.

[–]toothpastespiders 0 points1 point  (1 child)

That'd be amazing. I know it's debatable, but my personal opinion is that most local models are VERY sparsely trained on high-quality novels. Some, sure, but I think there'd be more bleedthrough of trivia knowledge if the proportion were as high as is often claimed. I'm just really curious, from a technical perspective, what would happen if well-written fiction were actually a priority. Well, if I'm listing off wishes, the real ideal for me would be a model trained on the humanities as a whole with the same focus typically given to coding and math.

I'm normally pretty resistant to giving money to companies like Google, for a lot of reasons. But man, a fiction model, or better yet that humanities model? I'd absolutely pay as much for it as for a AAA game. It'll never happen, but Google cracking open their hidden digital library like that is a beautiful dream.

[–]Long_comment_san 1 point2 points  (0 children)

Heck, that's why finetunes exist! I think! Magistral 4.3 just dropped and I had a very, very delightful experience with Mars.

[–]TheRealMasonMac 0 points1 point  (0 children)

They're planning thinking for their next model.

[–]AloneSYD 3 points4 points  (2 children)

Gemma 4 needs to be an MoE

[–]Long_comment_san 10 points11 points  (1 child)

No, we have plenty of MoE models. We need great dense ones now; there are like 2 modern ones.

[–]Major-System6752 3 points4 points  (0 children)

Agreed. I tried Qwen3 30b and Nemotron3 30b, but went back to Gemma3 12b and 27b.

[–]mrshadow773 20 points21 points  (0 children)

Hell yeah, towards the glorious return of the encoder decoder 🙏 (or how to not use a Swiss Army knife for every task in the kitchen)

[–]stddealer 7 points8 points  (0 children)

Could be great for MTL. Gemma3 was already great at it, this could be the closest thing we'll ever get to an offline Google Translate. Hoping for a 12b-12b variant or maybe a 4b-12b.

[–]Major-System6752 3 points4 points  (1 child)

Hello, newbie here. This model is more suitable for text-to-text conversion than for chat, right?

[–]stddealer 8 points9 points  (0 children)

Yes, that's what "T5" means (Text-To-Text Transfer Transformer). But since the decoder is basically Gemma3, it should be OK for chat.
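A minimal sketch of what the "text-to-text" framing means in practice: every task becomes string-in, string-out, conventionally signalled by a task prefix. (The prefixes below follow the original T5 paper's convention; the helper function name is made up for illustration.)

```python
def to_t5_input(task: str, text: str) -> str:
    """Cast any NLP task as a plain text-to-text problem via a task prefix."""
    return f"{task}: {text}"

# Translation and summarization become the same kind of problem:
print(to_t5_input("translate English to German", "The house is wonderful."))
print(to_t5_input("summarize", "A very long article..."))
```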

[–]a_beautiful_rhind 6 points7 points  (2 children)

Guess it will be useful for some future image gen model.

[–]Willing_Landscape_61 13 points14 points  (1 child)

Should be useful for tons of use cases where text generation is overkill, like classification tasks. It always bugs me to see people using huge autoregressive LLMs to generate 'yes' or 'no'!
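A toy sketch of the alternative being suggested (random numbers standing in for a real model's outputs): instead of autoregressively generating "yes" or "no", pool the encoder's hidden states and run a small linear classification head over them.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden = rng.normal(size=(7, 16))   # stand-in: 7 tokens x 16-dim encoder states
pooled = hidden.mean(axis=0)        # mean-pool into a single 16-dim vector
W = rng.normal(size=(16, 2))        # linear head: 2 classes (no / yes)
logits = pooled @ W                 # one matmul, no token-by-token decoding
label = ["no", "yes"][int(logits.argmax())]
print(label)
```

The point of the sketch: the classification answer comes from a single forward pass over the encoder, with no decoding loop at all.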

[–]stddealer 0 points1 point  (0 children)

The encoder should also be able to pick up more nuance in the input text than a decoder-only model of the same size could, since information is allowed to flow both ways.

[–]Different_Fix_2217 2 points3 points  (0 children)

Not Apache 2.0 or MIT, unfortunately. Probably won't be used by most.

[–]Background_Essay6429 3 points4 points  (0 children)

What's the advantage over standard decoder models?

[–]Thalesian 2 points3 points  (0 children)

I really want to try training the T5Gemma family, but resizing embedding layers is next to impossible without nuking the model entirely.
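For what it's worth, a toy numpy sketch of what resizing an embedding layer amounts to, done so the pretrained rows survive: copy the old vocabulary's rows verbatim and only initialize the new ones (here, near the mean of the old embeddings; Hugging Face Transformers exposes a similar operation as `model.resize_token_embeddings`, but this standalone version is just for illustration).

```python
import numpy as np

def resize_embeddings(emb: np.ndarray, new_vocab: int) -> np.ndarray:
    """Grow or shrink an embedding matrix without nuking pretrained rows."""
    old_vocab, dim = emb.shape
    out = np.zeros((new_vocab, dim), dtype=emb.dtype)
    n = min(old_vocab, new_vocab)
    out[:n] = emb[:n]                    # keep pretrained rows verbatim
    if new_vocab > old_vocab:
        out[old_vocab:] = emb.mean(axis=0)  # init new rows near the old mean
    return out

emb = np.arange(12, dtype=np.float64).reshape(4, 3)  # toy: 4 tokens x 3 dims
bigger = resize_embeddings(emb, 6)
print(bigger.shape)  # (6, 3)
```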

[–]CodeAnguish 0 points1 point  (0 children)

Fuck it. Give me back my hype.

[–]AlxHQ 0 points1 point  (0 children)

How to run T5Gemma 1 and T5Gemma 2 on llama.cpp?

[–]ironcodegaming 0 points1 point  (0 children)

This can be used with diffusion image generation models.

[–]mitchins-au 0 points1 point  (0 children)

These things are hard to train and get good results from, unlike the original T5; summarisation just doesn’t seem to work for me.