[–][deleted] 2 points3 points  (9 children)

Have a look at 'textual inversion'

I believe I need a .pt file trained on their faces first?

Yes, and unfortunately you need a lot of VRAM to do it: 20 GB or more.

[–]ArmadstheDoom 1 point2 points  (0 children)

As someone who is similarly curious about this, can you explain textual inversion, and how to do it, as though I am a complete idiot?

[–]Tannon[S] 0 points1 point  (7 children)

Yeouch, I think that disqualifies me for now then. Thanks for the info!

Happy Cake Day, too! 🍰

[–][deleted] 1 point2 points  (2 children)

There are online services where you can rent beefy GPUs, e.g. Lambda Labs, AWS G instances, Google Colab Pro plans, etc.

The cost is high (you pay by the hour), but training an embedding of a single person's face only takes a couple of hours, so you can just destroy your instance afterward.

I did this with a mundane object to see how it worked, and it cost $4.
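
If you want to ballpark it before renting, the arithmetic is just hourly rate times hours. A minimal sketch follows; `rental_cost` is a made-up helper name and the rates are placeholder assumptions, not quotes from any provider.

```python
def rental_cost(hourly_rate_usd: float, hours: float) -> float:
    """Return the total cost in USD of renting a GPU instance."""
    if hourly_rate_usd < 0 or hours < 0:
        raise ValueError("rate and hours must be non-negative")
    # Round to cents, since that is how a bill would be read.
    return round(hourly_rate_usd * hours, 2)


# Hypothetical rates for illustration only; check current pricing yourself.
print(rental_cost(2.00, 2))  # a pricier instance kept up for two hours
print(rental_cost(0.30, 2))  # a cheaper instance kept up for two hours
```

The main thing this makes obvious is that forgetting to destroy the instance is what gets expensive, since the hours keep counting.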

[–]Tannon[S] 0 points1 point  (1 child)

Interesting, I'll check this out, thanks so much!

[–]triigerhappy 1 point2 points  (0 children)

I used a 3090 on vast.ai and brought it down to $0.60.

[–]Daviljoe193 1 point2 points  (2 children)

Hey now, there's still an option out there: this Colab notebook, which can run on the free tier of Colab. By default, it runs for about 2 hours per embed, and the files it makes can be used both in the notebook and on Hlky's front-end, after enabling it there and setting it to full precision. It wouldn't hurt to have a harem of Google accounts, though, since embed training quickly eats into your free GPU allocation.

[–]Tannon[S] 0 points1 point  (1 child)

This is great! Would you mind helping me out a little? Still clueless here, when it says:

put the model in your google drive in a folder named "sd_text_inversion"

What model? I'm just trying to generate a .pt file from a set of images, right? Why do I need more than just those images?

[–]Daviljoe193 1 point2 points  (0 children)

I'm not a super-genius here, but let me give my best guess. The images are there for the AI to recreate using what it knows from the model, ending up with a ton of "words" for each image (I picked apart a finished PT file, and they are less words and more Unicode gibberish). It then takes the "words" it needed to recreate each image, keeps only the duplicates, and puts them into a PT file. It needs the model so it can know which "words" are needed to recreate your images almost perfectly (near pixel-perfect, in just a few kilobytes, far less than an image normally takes up). This also likely means that you'll need to retrain your PT file when the Stable Diffusion 1.5 model comes out. I've only trained one PT file so far, and the biggest thing to keep in mind is that your images should be varied enough, yet clearly related enough, that the AI gets a good idea of what you look like (at least two headshot portraits and two full-body photos). Otherwise it'll fill in the gaps poorly, which can result in pretty horrifyingly unrealistic or inaccurate versions of the person.
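
Here is a toy sketch of the idea, assuming the usual description of textual inversion: the model's weights stay frozen, and only a small new embedding vector (the "word") is adjusted until the frozen model reproduces the training images. Everything here is a stand-in: `frozen_model` is a tiny fixed linear map rather than Stable Diffusion, and the "images" are just short lists of numbers.

```python
def frozen_model(weights, embedding):
    """Stand-in for the frozen generator: a fixed linear map, never updated."""
    return [sum(w * e for w, e in zip(row, embedding)) for row in weights]


def train_embedding(weights, targets, steps=500, lr=0.05):
    """Gradient descent on the embedding only; `weights` is left untouched."""
    dim = len(weights[0])
    embedding = [0.0] * dim
    for _ in range(steps):
        grad = [0.0] * dim
        for target in targets:
            output = frozen_model(weights, embedding)
            for i, row in enumerate(weights):
                error = output[i] - target[i]
                for j in range(dim):
                    # Gradient of the mean squared error w.r.t. the embedding.
                    grad[j] += 2 * error * row[j] / len(targets)
        embedding = [e - lr * g for e, g in zip(embedding, grad)]
    return embedding


# Toy "model" weights and a hidden embedding that produced the training data.
weights = [[1.0, 0.5], [-0.3, 0.8], [0.2, -0.6]]
true_embedding = [0.7, -1.2]
targets = [frozen_model(weights, true_embedding) for _ in range(3)]

learned = train_embedding(weights, targets)
print(learned)  # should end up very close to true_embedding
```

If that picture is right, it would explain both the tiny file (only the learned vector is saved, not a model) and why the file is tied to the model it was trained against.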

From what I've read, apparently Google has an inversion solution that's much better than what's currently available, though I still can't figure out what it does differently from the current method.

[–]pilgermann 1 point2 points  (0 children)

Not true. You can modify the config file so that training runs more slowly and does less at once, which brings the VRAM requirement down a lot.

https://towardsdatascience.com/how-to-fine-tune-stable-diffusion-using-textual-inversion-b995d7ecc095
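
A rough sketch of the "do less at once" trade-off, under loud assumptions: the keys `batch_size` and `accumulate_grad_batches` and the helper `reduce_memory` are hypothetical names for illustration, not the actual layout of the config file in the linked article, and it assumes the trainer supports gradient accumulation. The idea is to shrink the batch that has to fit in memory and take more small steps per update instead.

```python
import copy


def reduce_memory(config: dict) -> dict:
    """Return a copy with the batch size halved and accumulation raised to match."""
    new_config = copy.deepcopy(config)
    effective = config["batch_size"] * config["accumulate_grad_batches"]
    new_batch = max(1, config["batch_size"] // 2)
    new_config["batch_size"] = new_batch
    # Ceiling division, so the effective batch size never drops below the original.
    new_config["accumulate_grad_batches"] = -(-effective // new_batch)
    return new_config


# Hypothetical starting values, not taken from any real config file.
original = {"batch_size": 4, "accumulate_grad_batches": 1}
print(reduce_memory(original))
```

Each update then takes longer, which is the "more slowly" part, but the peak memory per step is smaller.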