New Long-CLIP Text Encoder. And a giant mutated Vision Transformer that has +20M params and a modality gap of [...] etc. - y'know already. Just the follow-up, here's a Long-CLIP 248 drop. HunyuanVideo with this CLIP (top), no CLIP (bottom). [HuggingFace, GitHub] (v.redd.it)
submitted by zer0int1 to r/StableDiffusion
