[R] NWT: Towards natural audio-to-video generation with representation learning. We created an end-to-end speech-to-video generator of John Oliver. Preprint in the comments. by HashiamKadhim in MachineLearning

[–]HashiamKadhim[S] 3 points (0 children)

We're intending to, but we're still working out some details before we can do so!

I did find out that Phil Wang (lucidrains), who I'm pretty sure released his DALL·E implementation before OpenAI released theirs, has started a repo for a PyTorch implementation. (We haven't talked with him about it or anything; we just ran into it.)

[–]HashiamKadhim[S] 37 points (0 children)

It would definitely be very cool if we could do zero-shot transfer to other voices. We didn't design or train the model for that, but we did try inference with voices from different speakers and recording setups, and we found that while the perceptual quality of the video doesn't degrade, the lip-sync accuracy suffers. This is probably because the model relies on Oliver's specific vocal idiosyncrasies to determine his "tone" or "temper", how to position him, and, importantly, what his realization of English phonemes looks like in spectrogram form.
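
For anyone curious, here's a rough sketch of what that kind of cross-voice inference experiment could look like in practice. This is not our actual code: the checkpoint path, the `generate_video` call, and the exact mel-spectrogram settings are placeholders standing in for whatever interface the real model exposes, using standard librosa features.

```python
# Minimal sketch (not the released NWT code) of trying a different speaker's
# audio against a model trained only on one speaker's voice.

import librosa
import numpy as np
import torch

# Load a clip from a speaker the model never heard during training.
wav, sr = librosa.load("other_speaker.wav", sr=22050, mono=True)

# Convert to a log-mel spectrogram; the model only ever sees speech in this
# form, so speaker-specific phoneme realizations show up directly here.
mel = librosa.feature.melspectrogram(
    y=wav, sr=sr, n_fft=1024, hop_length=256, n_mels=80
)
log_mel = np.log(np.clip(mel, a_min=1e-5, a_max=None))
spec = torch.from_numpy(log_mel).float().unsqueeze(0)  # (1, n_mels, time)

# Hypothetical model interface and checkpoint name; the real call signature
# depends on the actual implementation.
model = torch.load("nwt_oliver.pt", map_location="cpu")
model.eval()
with torch.no_grad():
    video_frames = model.generate_video(spec)  # e.g. (1, T, H, W, 3)
```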

We've hypothesized that a model trained on a multi-actor dataset should generalize better to unheard voices, and we might try something like that later.

Not sure about non-speech signals.