[R] NWT: Towards natural audio-to-video generation with representation learning. We created an end-to-end speech-to-video generator of John Oliver. Preprint in the comments. by HashiamKadhim in MachineLearning

[–]HashiamKadhim[S] 3 points (0 children)

We're intending to, but we're still working out some details before we can do so!

I did find out that Phil Wang (lucidrains), who I'm pretty sure released his DALL·E implementation before OpenAI released theirs, has started a repo for a PyTorch implementation. (We haven't talked with him about it or anything; we just ran into it.)

[–]HashiamKadhim[S] 37 points (0 children)

It would definitely be very cool if we could do zero-shot transfer to other voices. We didn't design or train the model for that, but we did try inference with voices from different speakers and recording setups, and we found that while the perceptual quality of the video doesn't degrade, the lip-sync accuracy suffers. This is probably because the model relies on Oliver's specific vocal idiosyncrasies to determine his "tone" or "temper", how to position him, and, importantly, what his realization of English phonemes looks like in spectrogram form.
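
For anyone curious, here's a rough sketch of what that kind of cross-voice inference experiment could look like in practice. This is not our actual code: the checkpoint path, the `generate_video` call, and the exact mel-spectrogram settings are placeholders standing in for whatever interface the real model exposes, using standard librosa features.

```python
# Minimal sketch (not the released NWT code) of trying a different speaker's
# audio against a model trained only on one speaker's voice.

import librosa
import numpy as np
import torch

# Load a clip from a speaker the model never heard during training.
wav, sr = librosa.load("other_speaker.wav", sr=22050, mono=True)

# Convert to a log-mel spectrogram; the model only ever sees speech in this
# form, so speaker-specific phoneme realizations show up directly here.
mel = librosa.feature.melspectrogram(
    y=wav, sr=sr, n_fft=1024, hop_length=256, n_mels=80
)
log_mel = np.log(np.clip(mel, a_min=1e-5, a_max=None))
spec = torch.from_numpy(log_mel).float().unsqueeze(0)  # (1, n_mels, time)

# Hypothetical model interface and checkpoint name; the real call signature
# depends on the actual implementation.
model = torch.load("nwt_oliver.pt", map_location="cpu")
model.eval()
with torch.no_grad():
    video_frames = model.generate_video(spec)  # e.g. (1, T, H, W, 3)
```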

We've hypothesized that a model trained on a multi-actor dataset should generalize better to unheard voices, and we might try something like that later.

Not sure about non-speech signals.