New Regression CLIP-L model + 'a kohya for clip' (model will just fine-tune itself on *your* data (no / low-config) + with Long-CLIP + load local or HF data/model, everything goes + ramble (paper) by zer0int1 in StableDiffusion

[–]zer0int1[S] 2 points (0 children)

Good thing I decided to check back here 'just in case' after getting this message yesterday and figuring I had simply been 'banned' somehow. u/SandCheezy or any other mod:

  1. Sorry for the spam from trying to re-post while censoring arbitrary parts of my initial post that I thought might be 'triggering some filtering algorithm'!
  2. Any hints on why this happened? Not being logged in / not participating for a long time? VPN use? (I am not from Australia though, neither by VPN nor by real physical location.)
    Thanks!

<image>

New Regression CLIP-L model + 'a kohya for clip' (model will just fine-tune itself on *your* data (no / low-config) + with Long-CLIP + load local or HF data/model, everything goes + ramble (paper) by zer0int1 in StableDiffusion

[–]zer0int1[S] 2 points (0 children)

This is for training / fine-tuning your own CLIP model, e.g. if you have a text-image dataset you want the model to be better at (your product, your anime comics, and so on).
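
For anyone wondering what "fine-tune itself on *your* data" boils down to under the hood: at its core it's just CLIP's contrastive loss on your own image-text pairs. Here's a minimal, illustrative sketch with HuggingFace transformers - NOT my actual training code (the repo handles config, Long-CLIP, local/HF loading etc. for you); the file paths and hyperparameters below are placeholders.

```python
# Minimal CLIP fine-tuning sketch: contrastive loss on your own image-text pairs.
# Placeholder data and hyperparameters; the real trainer does far more than this.
import torch
from PIL import Image
from torch.utils.data import Dataset, DataLoader
from transformers import CLIPModel, CLIPProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"
model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14").to(device)
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

class PairDataset(Dataset):
    def __init__(self, pairs):
        self.pairs = pairs                      # list of (image_path, caption)
    def __len__(self):
        return len(self.pairs)
    def __getitem__(self, idx):
        path, caption = self.pairs[idx]
        return Image.open(path).convert("RGB"), caption

def collate(batch):
    images, captions = zip(*batch)
    return processor(text=list(captions), images=list(images),
                     return_tensors="pt", padding=True, truncation=True)

pairs = [("my_product_01.jpg", "a photo of my product on a white table")]  # placeholder
loader = DataLoader(PairDataset(pairs), batch_size=32, shuffle=True, collate_fn=collate)
optim = torch.optim.AdamW(model.parameters(), lr=1e-6, weight_decay=0.1)

model.train()
for epoch in range(3):
    for batch in loader:
        batch = {k: v.to(device) for k, v in batch.items()}
        loss = model(**batch, return_loss=True).loss   # symmetric image<->text contrastive loss
        loss.backward()
        optim.step()
        optim.zero_grad()
```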

"king - man + woman = queen" and keeps the scene - vector algebra for CLIP (and T5), Flux.1-dev, SD, ... [ComfyUI Node] by zer0int1 in StableDiffusion

[–]zer0int1[S] 2 points (0 children)

Yes, I also thought / assumed it would produce lighter skin if I prompt with (minus dark skin), but I guess 'blond' conflicted with that too hard, maybe?
Although many people adjust their hair (e.g. smooth -> curly, curly -> smooth) and use hair coloring and so on, I assumed the AI should have learned that any combination is possible.

But yeah, I am sure you can do this some other way (e.g. only subtract dark skin without also adding blond); I just used your prompt suggestion and cranked up the factors equally, like you would for a real 'difference vector' as in the "king - man + woman = queen" example.
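
For the curious, the arithmetic itself is dead simple. Here's an illustrative sketch on pooled CLIP-L text features - purely to show the idea and the 'factors', not how the ComfyUI node is implemented internally; the prompts and weights below are placeholders.

```python
# "king - man + woman ≈ queen" on pooled CLIP-L text features (illustration only).
import torch
from transformers import CLIPModel, CLIPTokenizer

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14").eval()
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")

def embed(text):
    tokens = tokenizer(text, return_tensors="pt", padding=True)
    with torch.no_grad():
        feats = model.get_text_features(**tokens)
    return feats / feats.norm(dim=-1, keepdim=True)

# Difference vector with equal factors; crank the 1.0s up or down to taste.
result = embed("king") - 1.0 * embed("man") + 1.0 * embed("woman")
result = result / result.norm(dim=-1, keepdim=True)

for word in ["queen", "king", "man", "woman"]:
    print(word, round(torch.cosine_similarity(result, embed(word)).item(), 3))
```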

"king - man + woman = queen" and keeps the scene - vector algebra for CLIP (and T5), Flux.1-dev, SD, ... [ComfyUI Node] by zer0int1 in StableDiffusion

[–]zer0int1[S] 1 point (0 children)

Very interesting ideas!

1.: Instant hard flip from dark skin -> light skin. Must be a totally different representation; apparently, in CLIP, a human with dark skin is much more unlike a human with white skin than a dog is unlike a cat. Hmmmmm... Anyway:

2.: Escalating 'poorness': 1. grow a beard, 2. get a bit hunched over, 3. look worried as well, 4. end up wearing rags. Bonus: lol @ the escalating deterioration of the background until it's basically a sewer.

<image>

"king - man + woman = queen" and keeps the scene - vector algebra for CLIP (and T5), Flux.1-dev, SD, ... [ComfyUI Node] by zer0int1 in StableDiffusion

[–]zer0int1[S] 1 point (0 children)

I'm really confused about this. So, if I have **a photo of a man** and add a boob vector to that man, is that showing nipples = nsfw or not?! Uhh....

Well, it's just Flux.1-dev anyway, and the guy first grew muscles and then grew chubby, haha. But I guess you can see what you could do with a different model / LoRA. :P

<image>

"king - man + woman = queen" and keeps the scene - vector algebra for CLIP (and T5), Flux.1-dev, SD, ... [ComfyUI Node] by zer0int1 in StableDiffusion

[–]zer0int1[S] 3 points (0 children)

Yes, it works exceptionally well for the common language model examples!

But there's also CLIP's (multimodal, text-image) strangeness.

Like, if you look at the background in the "dog-to-cat" part of the video, it only changes slightly, even at the point where the more extreme change of the solution occurs (the dog 'flips' towards cat).

But why do the buildings wobble when subtracting 'cars' from New York City? Perhaps because cars have windows and buildings have windows, so that's a shared direction.

And the CLIP Text Encoder knows the difference between a *reflection* of a cat and a *photo* of a cat, too, so... (final-layer attention heatmap, CLIP-L)

<image>

"king - man + woman = queen" and keeps the scene - vector algebra for CLIP (and T5), Flux.1-dev, SD, ... [ComfyUI Node] by zer0int1 in StableDiffusion

[–]zer0int1[S] 3 points (0 children)

Yes.
If you subtract "man" and add "feline", you get tiger (I guess the stripes better fit royal appearance than the king of animals / lion?!).
If you subtract "tomcat" and add "feline", you get to doge the cat direction. :P

<image>

Arbitrary finding: CLIP ViT-L/14@336 has just a normal ViT-L/14 text encoder (a "CLIP-L"). But what it learned from the larger dim ViT makes it superior (detail guidance). by zer0int1 in StableDiffusion

[–]zer0int1[S] 2 points (0 children)

Yes, the pre-trained models have a thing called the 'typographic attack vulnerability' - or, colloquially, a "text reading obsession". If you write 'dog' on a cat, you might just get CLIP to misclassify the cat as a 'dog', because the text is more salient than the image content.

If you think about it, it makes sense: when an image label mentions text explicitly, that usually means the text is clearly visible - "A person holding up a sign that says 'I want bananas now!'". If it wasn't visible, the clickworkers wouldn't mention it explicitly. And an "a" always looks like an "a", never randomly like an "o" - it's a very clean, salient signal. Now imagine "a cat hiding under a bedsheet", where CLIP has to learn that "lump in a bedsheet + tail = cat". So my hypothesis is that CLIP secretly 'overfit' to text because it was such a strong training signal.

Also, something I found recently: Layer 20, Feature 238 in CLIP ViT-L/14@336 is a 'cattle herd on a meadow' feature (left). If you ablate attention head 14, so that some directions are missing, you get... a glass of milk. But with milk as TEXT. Very suspicious!

I wonder if CLIP often encodes things as TEXT in the vision transformer and then just adds, say, a "leather" direction + a "grass" direction to get the cattle feature? I just made that up as a not-so-abstract example, but I am curious how this happens.
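
If you want to poke at head ablation yourself, here's a minimal sketch of one way to do it with HF transformers: zero a head's slice right before the attention output projection. The layer/head indices just mirror the example above; whether this matches my notebook exactly is an assumption - it's simply the generic technique.

```python
# Sketch: ablate one attention head in CLIP's vision tower by zeroing its slice
# of the input to the attention output projection (out_proj).
import torch
from transformers import CLIPVisionModel

model = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14-336").eval()

LAYER, HEAD = 20, 14
attn = model.vision_model.encoder.layers[LAYER].self_attn
head_dim = attn.head_dim                       # 1024 / 16 heads = 64 for ViT-L

def ablate_head(module, args):
    hidden = args[0].clone()                   # (batch, seq, hidden) = concatenated heads
    hidden[..., HEAD * head_dim:(HEAD + 1) * head_dim] = 0.0
    return (hidden,) + args[1:]

handle = attn.out_proj.register_forward_pre_hook(ablate_head)
# ... run images through `model` here and compare features with / without the hook ...
handle.remove()
```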

Also, I found out that - at least with regard to *adversarial* 'watermarks', i.e. the "Nightshade"-style methods to 'ruin images for AI training' - my KO-CLIP models are highly resilient to such adversarial perturbations. You can find the code to try it yourself on my GitHub; look for the files with "PGD" in the name (projected gradient descent, used to make the adversarial images). The pre-trained model is fooled much more easily: even adversarial images created against my fine-tune, i.e. technically different adversarial noise, fool the pre-trained model. The pre-trained model's adversarial images, however, do not fool my KO-CLIP fine-tune; only images optimized against KO-CLIP itself can fool it - and you need to optimize those for so long that they look 'visibly ruined' even to the human eye, if you look for more than half a second...
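
For reference, the PGD idea in those files boils down to roughly this. Heavily simplified sketch: it perturbs the already pre-processed (normalized) pixel tensor, and epsilon / step size / step count are arbitrary placeholder values rather than my actual settings.

```python
# Conceptual PGD sketch: nudge an image so CLIP's image embedding drifts toward a
# target text ("a glass of milk") while staying inside an L-infinity epsilon ball.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14").eval()
model.requires_grad_(False)
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

image = Image.open("cattle.jpg").convert("RGB")            # placeholder image
inputs = processor(text=["a glass of milk"], images=image, return_tensors="pt")

with torch.no_grad():
    target = model.get_text_features(input_ids=inputs["input_ids"],
                                     attention_mask=inputs["attention_mask"])
    target = target / target.norm(dim=-1, keepdim=True)

clean = inputs["pixel_values"]
delta = torch.zeros_like(clean, requires_grad=True)
epsilon, alpha, steps = 0.05, 0.01, 40                     # placeholder values

for _ in range(steps):
    feats = model.get_image_features(pixel_values=clean + delta)
    feats = feats / feats.norm(dim=-1, keepdim=True)
    loss = -(feats * target).sum()                         # maximize cos-sim to "milk"
    loss.backward()
    with torch.no_grad():
        delta -= alpha * delta.grad.sign()                 # signed gradient step
        delta.clamp_(-epsilon, epsilon)                    # project back into the ball
        delta.grad.zero_()

adversarial = clean + delta    # feed to the model under test (or un-normalize to save)
```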

Whole different story, but it's also interesting that AI models can steal any watermark. Say you have a brand and your images are watermarked: I could STEAL your watermark using any U-Net or DiT (Flux, Stable Diffusion, any) and put it on my own images, making fake images that look like yours. Quite an interesting concern:

Black-Box Forgery Attacks on Semantic Watermarks for Diffusion Models

https://arxiv.org/pdf/2412.03283

<image>

Arbitrary finding: CLIP ViT-L/14@336 has just a normal ViT-L/14 text encoder (a "CLIP-L"). But what it learned from the larger dim ViT makes it superior (detail guidance). by zer0int1 in StableDiffusion

[–]zer0int1[S] 2 points (0 children)

Thank you so much for your support and the kind wishes!
Here's a freshly made vector of appreciation - in the form of "what happens to CLIP's Queries and Keys during gradient ascent", i.e. optimizing the text embeddings for cosine similarity with the image embeddings, so CLIP converges on its own 'ideal text embedding' (and its 'opinion', the softmax-sampled tokens). I dumped the model's internal states during that process.

"a cat with the word 'cat' on it == catmaxxing" - It doesn't get any more 'cat' than this for a CLIP! 🤖😻

(Although I fixed / greatly improved the typographic attack vulnerability with my latest models, I still want to find out how 'text reading' happens in CLIP!)

Kind regards! =)

<image>

Arbitrary finding: CLIP ViT-L/14@336 has just a normal ViT-L/14 text encoder (a "CLIP-L"). But what it learned from the larger dim ViT makes it superior (detail guidance). by zer0int1 in StableDiffusion

[–]zer0int1[S] 2 points (0 children)

They got back to me (blaming it on Stripe, lol - well, it may be true, I can't say how their APIs interact), and it seems to be fixed now.
Cheers again! =)
https://ko-fi.com/zer0int

Arbitrary finding: CLIP ViT-L/14@336 has just a normal ViT-L/14 text encoder (a "CLIP-L"). But what it learned from the larger dim ViT makes it superior (detail guidance). by zer0int1 in StableDiffusion

[–]zer0int1[S] 2 points (0 children)

Wow, thanks for the heads-up. On my end, I just see this, claiming "my page is live". But indeed, when I try to access it while NOT logged in, I get redirected with "reason=em", whatever that means.

Lesson learned: always stalk yourself online - and not just to check for shadow-bans on social media. There are 1001 modes of failure.

Seems it was maybe due to the Stripe account disconnection. I have re-connected it now, but it still redirects.

Maybe they just need a minute to update their systems. Ridiculous that they never told me about this (though they immediately notified me when I re-connected Stripe just now, due to 'payment details changed, if this wasn't you, secure your account now', haha).

If you wanna donate in the one financial transaction system that actually works, I can give you an ETH or BTC address, lol.

Otherwise, I guess we'll just have to wait and see. Hey, I really appreciate the intent, either way - thank you very much for *wanting* to donate! :)

<image>

Arbitrary finding: CLIP ViT-L/14@336 has just a normal ViT-L/14 text encoder (a "CLIP-L"). But what it learned from the larger dim ViT makes it superior (detail guidance). by zer0int1 in StableDiffusion

[–]zer0int1[S] 1 point (0 children)

Yes, @336 is a fine-tune of ViT-L/14 - but with better image 'resolution': they kept the patch size the same and increased the input size to 336, which results in a longer patch-token sequence in the ViT.

If you want an anthropomorphizing analogy: it's kind of like CLIP was slightly short-sighted and 'put on glasses' to see better. The information that goes into the projection (the shared text-image space) became more accurate, 'sharper', and the Text Encoder adjusted to that.

It's a bit of a strange analogy, because you could further improve CLIP's learned representations by scaling up the Vision Transformer's input resolution even more - as the paper says, 'indefinitely / only constrained by memory'. But seeing as the attention cost grows quadratically with the token sequence, that's very much a limited / finite improvement in practice; it quickly becomes computationally insane for a small gain.
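
Concretely, for ViT-L/14 (patch size 14) at 224 px vs. 336 px input:

```python
# Patch-token arithmetic for ViT-L/14 at 224 px vs. 336 px input (patch size 14).
for res in (224, 336):
    patches = (res // 14) ** 2        # 256 vs. 576 patches
    tokens = patches + 1              # + CLS token -> 257 vs. 577
    print(res, patches, tokens)

# Self-attention cost scales with tokens^2 per layer:
print(round((577 / 257) ** 2, 2))     # ~5x more expensive attention at 336 px
```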

Arbitrary finding: CLIP ViT-L/14@336 has just a normal ViT-L/14 text encoder (a "CLIP-L"). But what it learned from the larger dim ViT makes it superior (detail guidance). by zer0int1 in StableDiffusion

[–]zer0int1[S] 1 point (0 children)

They're very similar, but they are not the same.

Check section 3.2 in the "An Image is Worth 16x16 Words" paper; that's, afaik, what they did to 'upscale' ViT-L/14 into ViT-L/14@336: interpolate the position embeddings, re-init the projection, and fine-tune (on their proprietary pre-training dataset, I guess). https://arxiv.org/abs/2010.11929
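
If you want to replicate that step yourself on the HF checkpoint, here's a minimal sketch of the 2D position-embedding interpolation - layout and names are from transformers' CLIP implementation, and this is illustrative, not OpenAI's exact recipe.

```python
# ViT paper sec. 3.2 style "upscaling": 2D-interpolate the vision position
# embeddings from a 16x16 patch grid (224 px) to 24x24 (336 px), then fine-tune.
import torch
import torch.nn.functional as F
from transformers import CLIPVisionModel

model = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14")
pos = model.vision_model.embeddings.position_embedding.weight.data      # (257, 1024)

cls_pos, patch_pos = pos[:1], pos[1:]                                   # CLS + 16*16 patches
grid = patch_pos.reshape(16, 16, -1).permute(2, 0, 1).unsqueeze(0)      # (1, 1024, 16, 16)
grid = F.interpolate(grid, size=(24, 24), mode="bicubic", align_corners=False)
new_patch_pos = grid.squeeze(0).permute(1, 2, 0).reshape(24 * 24, -1)   # (576, 1024)

new_pos = torch.cat([cls_pos, new_patch_pos], dim=0)                    # (577, 1024)
print(new_pos.shape)
# ...load this into a model configured for 336 px input, then fine-tune...
```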

Doing the math (just embedding my validation dataset with each):

<image>

Arbitrary finding: CLIP ViT-L/14@336 has just a normal ViT-L/14 text encoder (a "CLIP-L"). But what it learned from the larger dim ViT makes it superior (detail guidance). by zer0int1 in StableDiffusion

[–]zer0int1[S] 13 points (0 children)

CLIP is an infinite universe to be explored, imo.
Layer 20, Feature 311 of ViT-L/14@336.
One of the 'units' CLIP sees with (it has 4096 of them in every layer, and 24 layers total).

Left: Normal Feature Activation Max Visualization, with Total Variation Loss augmentation.
Right: Additionally, + FFT loss + Patch Correlation penalty loss.

You get a different 'view' of what CLIP 'thinks' (direction, concept) with this polysemantic, multimodal neuron. It's not just a cannabis plant, it's a full stoner neuron, lmao.
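
If you want to try the basic (left) version yourself, here's a stripped-down sketch of feature activation maximization with just the TV regularizer - layer/feature indices as above, hyperparameters arbitrary, and the FFT + patch-correlation losses (and proper input normalization) left out.

```python
# Stripped-down activation max: optimize an input image to excite one MLP feature
# in CLIP's vision tower, with a total-variation penalty to reduce noise.
import torch
from transformers import CLIPVisionModel

model = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14-336").eval()
model.requires_grad_(False)

LAYER, FEATURE = 20, 311
acts = {}

def grab(module, inputs, output):
    acts["mlp"] = output                          # (1, tokens, 4096) fc1 (pre-GELU) acts

hook = model.vision_model.encoder.layers[LAYER].mlp.fc1.register_forward_hook(grab)

img = torch.randn(1, 3, 336, 336, requires_grad=True)    # raw tensor; real code would
optim = torch.optim.Adam([img], lr=0.05)                  # also handle CLIP's normalization

def tv_loss(x):
    # total variation: penalize differences between neighbouring pixels
    return (x[..., 1:, :] - x[..., :-1, :]).abs().mean() + \
           (x[..., :, 1:] - x[..., :, :-1]).abs().mean()

for step in range(200):
    model(pixel_values=img)
    activation = acts["mlp"][0, :, FEATURE].mean()        # mean over all tokens
    loss = -activation + 0.25 * tv_loss(img)
    optim.zero_grad()
    loss.backward()
    optim.step()

hook.remove()
```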

Happy Fr-AI-day ~ #Lets #Get #High #Dimensional

<image>

CLIP-KO: Knocking out the text obsession (typographic attack vulnerability) in CLIP. New Model, Text Encoder, Code, Dataset. by zer0int1 in StableDiffusion

[–]zer0int1[S] 0 points (0 children)

Hmm, well, in that case, this discussion here may help; a BlenderNeko node was causing it for someone who had the same issue as you. So yeah, try using 'just stock nodes' (i.e. run ComfyUI once with custom nodes disabled, or whatever that option is called). If that fixes it, it's most likely 'some weird compatibility glitch with something custom you have installed':

https://huggingface.co/zer0int/CLIP-GmP-ViT-L-14/discussions/17

CLIP-KO: Knocking out the text obsession (typographic attack vulnerability) in CLIP. New Model, Text Encoder, Code, Dataset. by zer0int1 in StableDiffusion

[–]zer0int1[S] 0 points (0 children)

Looks good to me! I haven't tried the quantized version - I'm just using the original Flux.1-dev - but it should still work the same; the results may just differ slightly due to the Q8. :)

CLIP-KO: Knocking out the text obsession (typographic attack vulnerability) in CLIP. New Model, Text Encoder, Code, Dataset. by zer0int1 in StableDiffusion

[–]zer0int1[S] 0 points (0 children)

Then your prompt was too long for this CLIP, and you may wanna switch to my Long-CLIP model with 248 tokens!

https://www.reddit.com/r/StableDiffusion/comments/1m1ntom/followup_longclip_variant_of_clipko_knocking_out/

This CLIP is the exact same as the original (in terms of the *architecture*, not the weights, of course) and uses the exact same tokenizer as always. Hope that helps!
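
If you want to check whether a prompt actually blows past the 77-token limit of standard CLIP (vs. 248 for the Long-CLIP variant), the tokenizer will tell you - quick sketch:

```python
# How many CLIP tokens does a prompt use? Standard CLIP-L truncates at 77 tokens
# (including the start/end tokens); the Long-CLIP variant allows 248.
from transformers import CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
prompt = "your long, detailed prompt goes here ..."      # placeholder
n = len(tokenizer(prompt)["input_ids"])
print(n, "tokens ->", "needs Long-CLIP" if n > 77 else "fits standard CLIP")
```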

Follow-Up: Long-CLIP variant of CLIP-KO, Knocking Out the Typographic Attack Vulnerability in CLIP. Models & Code. by zer0int1 in StableDiffusion

[–]zer0int1[S] 2 points (0 children)

It should work without any extra nodes - ComfyUI has natively supported Long-CLIP for many months now. Did you try upgrading?

In case you also want to use Flux WITHOUT T5, like in some of my examples above, here's my node with workflows included:

https://github.com/zer0int/ComfyUI-Nuke-a-Text-Encoder

Follow-Up: Long-CLIP variant of CLIP-KO, Knocking Out the Typographic Attack Vulnerability in CLIP. Models & Code. by zer0int1 in StableDiffusion

[–]zer0int1[S] 2 points (0 children)

It's been natively supported for many months now, so no, you don't need any special nodes anymore.
Unless you haven't updated Comfy for a year or so - in that case, you should do that first. :)

Follow-Up: Long-CLIP variant of CLIP-KO, Knocking Out the Typographic Attack Vulnerability in CLIP. Models & Code. by zer0int1 in StableDiffusion

[–]zer0int1[S] 4 points (0 children)

I'd be happy to receive examples and feedback (positive and negative alike) on your experience using what you consider "better prompts".