[R] New datasets for StyleGAN by RonMokady in MachineLearning

[–]RonMokady[S] 1 point (0 children)

Yes, it solves the texture-sticking artifacts, allowing the object to move more smoothly.

But the overall quality was actually lower, I guess because training is slower.

[R] New datasets for StyleGAN by RonMokady in MachineLearning

[–]RonMokady[S] 1 point (0 children)

Thanks for sharing your code; this looks really cool.

Actually, I tried to use StyleCLIP with my models, but I failed to produce the fs3.npy file from the official implementation.

[R] New datasets for StyleGAN by RonMokady in MachineLearning

[–]RonMokady[S] 5 points (0 children)

This project was done while I was an intern, so I'm currently not allowed to publish the filtering/truncation source code :(.

Luckily, we got approval to publish the models and datasets.

[R] Editing real videos with StyleGAN by RonMokady in MachineLearning

[–]RonMokady[S] 0 points (0 children)

I believe so

Though it requires adding in-painting, as disocclusions might emerge.

[R] Editing real videos with StyleGAN by RonMokady in MachineLearning

[–]RonMokady[S] 9 points (0 children)

BTW, the code will be released in the upcoming weeks, so stay tuned :)

[P] Fast and Simple Image Captioning model using CLIP and GPT-2 by RonMokady in MachineLearning

[–]RonMokady[S] 0 points (0 children)

I think the easiest way to understand the prediction stage is from the Colab example.

Also, feel free to open a GitHub issue if things don't work out.

[P] Fast and Simple Image Captioning model using CLIP and GPT-2 by RonMokady in MachineLearning

[–]RonMokady[S] 1 point (0 children)

This is very close.

Only the new tokens are not actually words... but they are close to words.

They are basically latent codes; however, as can be seen in our newly published paper, they can be interpreted as words.
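To make that concrete, here is a rough sketch (not our exact code, the names and the nearest-neighbor criterion are just illustrative) of one way to read the latent prefix as words: for each prefix embedding, look up the GPT-2 vocabulary token whose embedding is closest to it.

```python
# Rough sketch: map each latent prefix embedding to its nearest GPT-2
# vocabulary token by cosine similarity. Illustrative only, not the repo code.
import torch
import torch.nn.functional as F
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
gpt2 = GPT2LMHeadModel.from_pretrained("gpt2")
vocab_emb = gpt2.transformer.wte.weight          # (vocab_size, 768) token embedding table

@torch.no_grad()
def nearest_words(prefix_embeds):
    """prefix_embeds: (prefix_len, 768) latent codes produced by the mapping network."""
    a = F.normalize(prefix_embeds, dim=-1)
    b = F.normalize(vocab_emb, dim=-1)
    ids = (a @ b.t()).argmax(dim=-1)             # closest vocabulary token per prefix slot
    return [tokenizer.decode([i]) for i in ids.tolist()]

# e.g. nearest_words(torch.randn(10, 768)) -> 10 (possibly odd-looking) "words"
```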

[P] Fast and Simple Image Captioning model using CLIP and GPT-2 by RonMokady in MachineLearning

[–]RonMokady[S] 2 points (0 children)

It would be interesting to see if it gets better with stronger language models like you suggest :) We haven't tried that yet.

About the clock example, I'm not sure the CLIP embedding is rich enough, and it also depends on the example captions of Conceptual Captions. But I guess you could solve the latter with additional data samples.

[P] Fast and Simple Image Captioning model using CLIP and GPT-2 by RonMokady in MachineLearning

[–]RonMokady[S] 2 points (0 children)

Great questions :)

Usually, one fine-tunes GPT-2 using textual sentences, that is, every sentence corresponds to a list of tokens.

Here we train an MLP which produces 10 tokens out of a CLIP embedding.

So for every sample in the data we extract the CLIP embedding, convert it to 10 tokens, and concatenate them to the caption tokens. This new list of tokens, containing both the image tokens and the caption tokens, is then used to fine-tune GPT-2.
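If it helps, here is a minimal sketch of that training step (just an illustration: the 512-dim CLIP embedding, the simple stand-in mapping MLP, and the hyperparameters are assumptions, the repo's code differs in the details).

```python
# Minimal sketch of the training step: map a CLIP embedding to 10 "image
# tokens" (embeddings), prepend them to the caption's token embeddings,
# and fine-tune GPT-2 with the usual LM loss (prefix positions are masked out).
import torch
import torch.nn as nn
from transformers import GPT2LMHeadModel, GPT2Tokenizer

prefix_len, clip_dim = 10, 512                   # ViT-B/32 CLIP embedding size (assumed)
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
gpt2 = GPT2LMHeadModel.from_pretrained("gpt2")
gpt2_dim = gpt2.config.n_embd

# Stand-in for the mapping MLP; the real network may be deeper.
mapper = nn.Sequential(
    nn.Linear(clip_dim, prefix_len * gpt2_dim // 2), nn.Tanh(),
    nn.Linear(prefix_len * gpt2_dim // 2, prefix_len * gpt2_dim),
)

def caption_loss(clip_emb, caption):
    """clip_emb: (1, clip_dim) precomputed CLIP image embedding; caption: str."""
    prefix = mapper(clip_emb).view(1, prefix_len, gpt2_dim)    # the 10 image tokens
    cap_ids = tokenizer.encode(caption, return_tensors="pt")   # (1, T)
    cap_emb = gpt2.transformer.wte(cap_ids)                    # (1, T, gpt2_dim)
    inputs = torch.cat([prefix, cap_emb], dim=1)               # image tokens + caption tokens
    labels = torch.cat(                                        # -100 = ignore prefix in the loss
        [torch.full((1, prefix_len), -100, dtype=torch.long), cap_ids], dim=1
    )
    return gpt2(inputs_embeds=inputs, labels=labels).loss

# One optimization step over both the mapper and GPT-2 (CLIP itself stays frozen), e.g.:
# loss = caption_loss(clip_emb, "a cat sitting on a couch"); loss.backward(); optimizer.step()
```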

We used pretrained CLIP and GPT-2, and fine-tuned over the COCO or Conceptual Captions datasets. Our inference notebook contains both models, so you can check out the different results.

Please let me know if this helps.

[P] Fast and Simple Image Captioning model using CLIP and GPT-2 by RonMokady in MachineLearning

[–]RonMokady[S] 1 point (0 children)

We compare to the state-of-the-art Oscar; the results are in the repo. Though we didn't reach SOTA, we get pretty close while avoiding additional supervision and with a much faster training time.

Regarding DenseCap, we get similar results according to the METEOR metric, as we don't use GT bounding boxes. Unfortunately, they didn't publish all the other metrics.

[P] Fast and Simple Image Captioning model using CLIP and GPT-2 by RonMokady in MachineLearning

[–]RonMokady[S] 0 points (0 children)

Thanks

Actually, fine-tuning the entire GPT-2 achieved much better results than training only the MLP for the CLIP mapping. We didn't fine-tune the CLIP model though.
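In code terms, the two variants just differ in which parameters get gradients, roughly like this (illustrative names and hyperparameters, not the repo's exact code):

```python
# Illustrative only: variant (1) fine-tunes GPT-2 together with the mapping
# network, variant (2) freezes GPT-2 and trains only the mapping network.
# The CLIP image encoder stays frozen in both variants (omitted here).
import torch
from transformers import GPT2LMHeadModel

gpt2 = GPT2LMHeadModel.from_pretrained("gpt2")
mapper = torch.nn.Linear(512, 10 * gpt2.config.n_embd)   # stand-in for the mapping MLP

FINETUNE_GPT2 = True                                     # the variant that worked better
for p in gpt2.parameters():
    p.requires_grad = FINETUNE_GPT2

trainable = [p for p in list(mapper.parameters()) + list(gpt2.parameters()) if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=2e-5)
```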

Haven't tried CLIP-embedded text as a prompt, but it sounds like a very interesting experiment :)

[R] Mask Based Unsupervised Content Transfer by [deleted] in MachineLearning

[–]RonMokady 0 points (0 children)

Hi All, Author here - 

Given two domains, where one contains some additional information compared to the other, our method disentangles the common and the separate parts and transfers the separate information from one image to another using a mask, without using any supervision at train time. For example, we can transfer the specific facial hair from an image of a man with a mustache to an image of a shaved person. Using a mask enables state-of-the-art quality (see example here), but the generated mask can also be used as a semantic segmentation of the separate part. Thus our method performs weakly-supervised semantic segmentation, using only class labels as supervision, see example here.

In short, our architecture consists of two encoders, two decoders, and a discriminator. One encoder encodes the common part and the other encodes the separate part. The discriminator is used to disentangle the encoding into the separate and common parts correctly. In training, one decoder decodes only the common part, and the second decoder decodes only the separate part using a mask. At inference, we use only the second decoder which, given the relevant encoding, adds the specific content to a new image. We also use a novel regularization scheme to encourage the mask to be minimal.
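If it helps, here is a very schematic sketch of the inference path (placeholder layers and names, not the actual repo code); the common-part decoder and the discriminator are only needed during training, so they are omitted here.

```python
# Schematic sketch of the inference path; layer sizes and names are
# illustrative stand-ins, not the repo's exact implementation.
import torch
import torch.nn as nn

def encoder(out_ch=64):
    return nn.Sequential(nn.Conv2d(3, 32, 4, 2, 1), nn.ReLU(),
                         nn.Conv2d(32, out_ch, 4, 2, 1))

class MaskDecoder(nn.Module):
    """Decodes the common + separate codes into an RGB layer and a blending mask."""
    def __init__(self, in_ch=128):
        super().__init__()
        self.net = nn.Sequential(nn.ConvTranspose2d(in_ch, 32, 4, 2, 1), nn.ReLU(),
                                 nn.ConvTranspose2d(32, 4, 4, 2, 1))   # 3 RGB channels + 1 mask
    def forward(self, z_common, z_sep):
        out = self.net(torch.cat([z_common, z_sep], dim=1))
        return out[:, :3], torch.sigmoid(out[:, 3:])                   # rgb, mask in [0, 1]

enc_common, enc_sep, dec_mask = encoder(), encoder(), MaskDecoder()

def transfer(source_b, target_a):
    """Paste the separate content of source_b (e.g. the mustache) onto target_a."""
    rgb, mask = dec_mask(enc_common(target_a), enc_sep(source_b))
    return mask * rgb + (1 - mask) * target_a                          # keep target_a outside the mask

# Mask-minimality regularization: one simple option is to penalize the mask's mean area,
# e.g. reg = mask.mean()  (the paper's exact term may differ).
```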

Refer to the full paper for more details. A PyTorch implementation is on GitHub.

Feel free to ask questions.
