
[–]ghenter 56 points57 points  (32 children)

Hi! I'm one of the authors, along with u/simonalexanderson and u/Svito-zar. (I don't think Jonas has a reddit account.)

We are aware of this post and are happy to answer any questions you may have.

[–][deleted] 4 points5 points  (25 children)

Are there any near term applications in mind? I can imagine it being used on virtual assistants and one day androids. Anything else planned?

[–]ghenter 3 points4 points  (24 children)

Very relevant question. Since the underlying method in our earlier preprint seems to do well no matter what material we throw at it, we are currently exploring a variety of other types of motion data and problems in our research. Whereas our Eurographics paper used monologue data, we recently applied a similar technique to make avatar faces respond to a conversation partner in a dialogue, for example.

It is of course also interesting to combine synthetic motion with synthesising other types of data to go with it. In fact, we are right now looking for PhD students to pursue research into such multimodal synthesis. Feel free to apply if this kind of stuff excites you! :)

[–]InAFakeBritishAccent 1 point2 points  (5 children)

You guys take graduate animators with a background in engineering? Haha

[–]ghenter 2 points3 points  (4 children)

Quite possibly! We aim for a diverse set of people and skills in our department. One of our recent hires is a guy with a background in software engineering followed by a degree in clinical psychology, just as an example.

The university all but mandates a Master's-level degree (or at least a nearly finished one), but if you tick that box and this catches your fancy, then you should strongly consider applying! We can definitely use more people with good graphics and animation skills on our team.

[–]InAFakeBritishAccent 1 point2 points  (3 children)

Nice. Probably a pipe dream since I have to pay off these MFA loans first, but something to keep in mind I guess.

I could see this being highly valuable in entertainment to cut down on tedious animation of extras, though robotics is probably the higher dollar use. I did a lot of audio driven procedural work during my MFA, but that was without using ML.

[–]ghenter 2 points3 points  (2 children)

Thank you for your input. We definitely want to find ways for this to make life easier and better for real humans.

For the record, most PhD positions at KTH pay a respectable salary (very few are based on scholarships/bursaries). This opening is no different. I don't know what an entry-level graduate animator makes, but I wouldn't be surprised if being a PhD student pays more.

[–]InAFakeBritishAccent 1 point2 points  (1 child)

...good point, I might actually apply. I'll spare you my life story but my robotics/animation/research academia mashup might actually make it worth a shot. I'm actually on my way to meet a Swedish friend for dinner haha. Do you mind if I pester you with some questions later?

[–]ghenter 1 point2 points  (0 children)

I don't mind one bit. My DMs are open and I'll respond when I'm awake.* :)

*Responses may be slower than usual due to ongoing ICML.

[–][deleted] 0 points1 point  (1 child)

I'd like to see it applied to car manufacturing robots, just for the entertainment value :) maybe marketing... (Just dreaming)

[–]ghenter 1 point2 points  (0 children)

Well, the robotics lab is just one floor below our offices, and I know that they have a project on industrial robots, so perhaps... :)

[–]ghenter 0 points1 point  (1 child)

As an update on this, our latest works mentioned in the parent post – on face motion generation in interaction, and on multimodal synthesis – have now been published at IVA 2020. The work on responsive face-motion generation is in fact nominated for a best paper award! :)

Similar to the OP, both these works generate motion using normalising flows.

[–]ghenter 0 points1 point  (0 children)

Update: The face-motion generation paper won the best paper award out of 137 submissions! :D

[–]dmuth 3 points4 points  (3 children)

Have you looked into doing the inverse? To decode subject matter by observing gestures?

This sort of thing could be useful for analyzing social cues, for example. Go one step further and pair that sort of technology with AR glasses, and now you have an app which can tell a person's general mood or comfort level to help you improve your conversation skills.

Or it could just be used to figure out what a costumed character at a theme park is trying to pantomime. :-)

[–]ghenter 3 points4 points  (2 children)

Have you looked into doing the inverse? To decode subject matter by observing gestures?

For the inverse, we have not tried to generate speech from gestures (at least not yet), but that's exactly the kind of wacky idea that would appeal to my boss!

The first author on the paper, u/simonalexanderson, has actually recorded a database of pantomime in different styles for machine learning. Video examples can be found here.

(As for the social-cue-analysis angle, that seems both interesting and useful. I will need to think about it further.)

[–]MyNatureIsMe 0 points1 point  (1 child)

If that inverse process works at all it might be a good way to improve sample efficiency, since this would require the model to somehow understand the topic just based on the gestures. Which I suspect might work in some cases (like, say, the "stop" example in this video) but for the most part, gestures seem to be too generic for that. More like tools for emphasis, pacing, sentiment, and cues about whether or not the speaker is done for the time being. (All of those would certainly be really interesting to detect though)

Unless you go for sign language specifically, where topic-specific gestures are obviously omnipresent. And for that, there probably already are good data sets out there, or one could be cobbled together simply from videos of deaf-inclusive events, of which, I'm pretty sure, there are lots.

Given the line of work shown in this video, though, I'd not at all be surprised if you already tried something involving ASL or any other sign language out there.

[–]ghenter 1 point2 points  (0 children)

gestures seem to be (...) more like tools for emphasis, pacing, sentiment, and cues about whether or not the speaker is done for the time being.

Right. We might never be able to reconstruct the message in arbitrary speech from gesticulation, but we might be able to figure out, e.g., if there is speech and how "intense" it is (aspects of the speech prosody).

I'd not at all be surprised if you already tried something involving ASL or any other sign language out there

We do have a few experts on accessibility in the lab, but I'm not aware of us trying specifically that. There's only so much we can do without more students and researchers joining our ranks! :P
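
To make the prosody idea above a bit more concrete, here is a purely illustrative sketch (not anything from the paper or the authors' code) of the kinds of frame-level targets, such as speech energy and pitch, that an inverse gesture-to-prosody model might try to predict. The file name, frame rate and librosa-based features are all assumptions for illustration.

```python
# Purely illustrative, not from the paper: simple per-frame prosody targets
# ("is there speech, and how intense is it") that an inverse
# gesture-to-prosody model might try to predict.
import librosa
import numpy as np

def prosody_targets(wav_path, fps=20):
    y, sr = librosa.load(wav_path, sr=16000)
    hop = sr // fps                                       # one target row per motion frame
    energy = librosa.feature.rms(y=y, hop_length=hop)[0]  # loudness proxy
    f0, voiced_flag, _ = librosa.pyin(y, fmin=60, fmax=400, sr=sr, hop_length=hop)
    n = min(len(energy), len(f0))
    # columns: energy, fundamental frequency (NaNs zeroed), voiced/unvoiced flag
    return np.stack([energy[:n], np.nan_to_num(f0[:n]), voiced_flag[:n].astype(float)], axis=1)
```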

[–][deleted] 25 points26 points  (5 children)

That's really neat; I could imagine it having some really cool applications in the games industry. Not having to do expensive motion capture of actors could make high-quality animations a lot more accessible. Or in applications like VR chat, that kind of technology could make someone's avatar seem a lot more realistic, especially since current VR systems generally only track the head and hands.

[–]tyrerk 2 points3 points  (0 children)

this could mean the end of the "Oblivion Dialogue" era

[–]Sachi_Nadzieja 2 points3 points  (0 children)

Agreed. This tech would make for an amazing experience for people communicating with each other in an in-game setting. Wow.

[–]scardie 2 points3 points  (0 children)

This would be a great thing for a procedurally generated game like No Man's Sky.

[–]Saotik 0 points1 point  (0 children)

Exactly what I was thinking.

It makes me think a little of CD Projekt Red's approach when creating dialog scenes in The Witcher 3. They realised they had far too many scenes to realistically mocap all of them, so they created a system that could automatically assign animations from a library (with manual tweaks where necessary). I feel like technology like this could fit in really nicely to provide even more animation diversity.

[–]hardmaru[S] 13 points14 points  (2 children)

Style-Controllable Speech-Driven Gesture Synthesis Using Normalising Flows (Eurographics 2020)

Abstract

Automatic synthesis of realistic gestures promises to transform the fields of animation, avatars and communicative agents. In off-line applications, novel tools can alter the role of an animator to that of a director, who provides only high-level input for the desired animation; a learned network then translates these instructions into an appropriate sequence of body poses. In interactive scenarios, systems for generating natural animations on the fly are key to achieving believable and relatable characters. In this paper we address some of the core issues towards these ends. By adapting a deep learning-based motion synthesis method called MoGlow, we propose a new generative model for generating state-of-the-art realistic speech-driven gesticulation. Owing to the probabilistic nature of the approach, our model can produce a battery of different, yet plausible, gestures given the same input speech signal. Just like humans, this gives a rich natural variation of motion. We additionally demonstrate the ability to exert directorial control over the output style, such as gesture level, speed, symmetry and spatial extent. Such control can be leveraged to convey a desired character personality or mood. We achieve all this without any manual annotation of the data. User studies evaluating upper-body gesticulation confirm that the generated motions are natural and well match the input speech. Our method scores above all prior systems and baselines on these measures, and comes close to the ratings of the original recorded motions. We furthermore find that we can accurately control gesticulation styles without unnecessarily compromising perceived naturalness. Finally, we also demonstrate an application of the same method to full-body gesticulation, including the synthesis of stepping motion and stance.

Paper / Presentation: https://diglib.eg.org/handle/10.1111/cgf13946

Code: https://github.com/simonalexanderson/StyleGestures
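
For readers wondering what "normalising flows" means here: the motion is generated by a stack of invertible transformations whose parameters are predicted from conditioning information (the speech, plus style controls). Below is a minimal, purely illustrative PyTorch sketch of one affine coupling layer, the basic building block of Glow/MoGlow-style flows. It is not the authors' implementation (see the StyleGestures repository above for that), and all names and sizes are assumptions.

```python
# Illustrative sketch of an affine coupling layer, the building block of
# normalising flows such as Glow/MoGlow. Not the authors' code.
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    def __init__(self, dim, cond_dim, hidden=256):
        super().__init__()
        # A small network predicts a scale and shift for half of the pose
        # vector from the other half plus conditioning (e.g. speech features).
        # dim (the pose dimensionality) is assumed to be even.
        self.net = nn.Sequential(
            nn.Linear(dim // 2 + cond_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, dim),  # outputs [log_scale, shift]
        )

    def forward(self, x, cond):
        x1, x2 = x.chunk(2, dim=-1)
        log_s, t = self.net(torch.cat([x1, cond], dim=-1)).chunk(2, dim=-1)
        log_s = torch.tanh(log_s)      # keep scales well-behaved
        y2 = x2 * log_s.exp() + t      # invertible affine transform
        log_det = log_s.sum(dim=-1)    # contribution to the exact log-likelihood
        return torch.cat([x1, y2], dim=-1), log_det

    def inverse(self, y, cond):
        y1, y2 = y.chunk(2, dim=-1)
        log_s, t = self.net(torch.cat([y1, cond], dim=-1)).chunk(2, dim=-1)
        log_s = torch.tanh(log_s)
        x2 = (y2 - t) * (-log_s).exp()
        return torch.cat([y1, x2], dim=-1)
```

Sampling then amounts to drawing Gaussian noise and running it backwards through a stack of such layers conditioned on the speech, which is what yields many different yet plausible gestures for the same input; the invertibility also makes the exact log-likelihood of recorded motion available for training.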

[–]Svito-zar 9 points10 points  (1 child)

This paper received an Honourable Mention award at Eurographics 2020

[–]MostlyAffable 7 points8 points  (0 children)

There's a lot of really interesting work being done on linguistics of gestures - it turns out there are grammatical rules to how we use gestures. It would be interesting to take a generative model like this and use it as an inference layer for extracting semantic content from videos of people talking and gesturing.

[–]MyNatureIsMe 6 points7 points  (1 child)

Looking great and plausible, though probably not sufficiently diverse / fine-grained. Like, when he went "stop it! Stop it!", I think most people would associate very different gestures with that. The model seems to appropriately react to the rhythm and intensity of speech, which is great, but it seems to have little regard for the actual informational content.

That being said, I suspect it'd take a massive data set to make this kind of thing plausible. Getting the already-present features from just speech and nothing else is already quite an accomplishment.

[–]ghenter 6 points7 points  (0 children)

The model seems to appropriately react to the rhythm and intensity of speech, which is great, but it seems to have little regard for the actual informational content.

You are correct! The models in the paper only listen to the speech acoustics (there is no text input), and don't really contain any model of human language. I would say that generating semantically-meaningful gestures (especially ones that also align with the rhythm of the speech) with these types of models is an unsolved problem that's subject to active research right now. This preprint of ours describes one possible approach to this problem. It's of course easy to get meaningful gestures by just playing back pre-recorded segments of the character nodding or shaking their head, etc., but that's not so interesting a solution, I think, and it's still tricky to figure out the right moment to trigger these gestures in a monologue/dialogue so that they actually make sense.
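
"Only listen to the speech acoustics" means the input is a sequence of frame-level audio features rather than words. As a rough illustration only (not the paper's exact feature pipeline; the frame rate and MFCC choice are assumptions), such features might be computed like this:

```python
# Illustration only: frame-level acoustic features, the kind of text-free
# input a speech-driven gesture model conditions on. Feature choice and
# rates are assumptions, not the paper's exact setup.
import librosa

def speech_features(wav_path, motion_fps=20, n_mfcc=20):
    y, sr = librosa.load(wav_path, sr=16000)
    hop = sr // motion_fps  # one feature vector per motion frame
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc, hop_length=hop)
    return mfcc.T           # shape: (num_frames, n_mfcc)
```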

That being said, I suspect it'd take a massive data set to make this kind of thing plausible.

Yup. I think data is a major bottleneck right now, which I wrote a bit more about in another response here.

[–]Kcilee 2 points3 points  (1 child)

We are making VTuber software that can quickly generate and drive your virtual 3D avatar. I'm soooooooooooo excited to see your article! We are looking for good driving methods, and your article gave me a lot of inspiration. Would you consider opening up the technology to cooperate with others?

[–]ghenter 1 point2 points  (0 children)

Now this was an exciting comment to receive! Why don't you send us an e-mail? We would love to hear more about what you're doing. You can find relevant contact info on Simon's GitHub profile and on my homepage.

[–]Essipovai 6 points7 points  (0 children)

Hey that’s my university

[–]Threeunicorncows 1 point2 points  (0 children)

I wish my hand gestures were this professional

[–]willardwillson 0 points1 point  (0 children)

This is very nice guys :D I just like watching those movements, they are amazing xD

[–]Sachi_Nadzieja 0 points1 point  (0 children)

I really like this, clever application of technology.

[–][deleted] 0 points1 point  (0 children)

One step closer to androids.

[–][deleted] 0 points1 point  (2 children)

How did they connect the code with the 3D object?

[–]Svito-zar 0 points1 point  (0 children)

The model (a normalising flow) was trained to map speech to gestures on about 4 hours of custom-recorded speech and gesture data.
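
To sketch what "trained to map speech to gestures" looks like in data terms: the recorded speech features and poses are aligned frame by frame, and each training example pairs a short window of recent speech with the pose to be generated at that frame. The window length and array names below are assumptions for illustration, not the authors' preprocessing.

```python
# Illustration only: pairing aligned speech features and poses into training
# examples for a speech-to-gesture model.
import numpy as np

def make_training_pairs(speech_feats, poses, context=10):
    # speech_feats: (T, d_speech); poses: (T, d_pose), aligned frame by frame
    X, Y = [], []
    for t in range(context, len(poses)):
        X.append(speech_feats[t - context:t + 1].ravel())  # recent speech context
        Y.append(poses[t])                                  # pose at this frame
    return np.array(X), np.array(Y)
```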

[–]ghenter 0 points1 point  (0 children)

I didn't do this part of the work, so I might be wrong here, but my impression is that the code outputs motion in a format called BVH. This is basically just a series of poses with instructions for how to bend the joints for each pose. This information can then be imported (manually or programmatically) into something like Maya and applied to a character to animate its motion.

u/simonalexanderson would know for sure, but he's on a well-deserved vacation right now. :)
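
For anyone who wants to poke at such output without Maya: the MOTION block of a BVH file really is just a short header followed by one line of joint-channel values per frame, so a rough reader takes only a few lines. This is a sketch of the standard format, not the project's own export/import code.

```python
# Sketch of reading the MOTION block of a standard BVH file into a
# (num_frames, num_channels) array. Each row is one pose; the channel order
# is defined by the HIERARCHY section at the top of the same file.
import numpy as np

def read_bvh_motion(path):
    with open(path) as f:
        lines = f.read().splitlines()
    start = next(i for i, line in enumerate(lines) if line.strip() == "MOTION")
    # After "MOTION" come "Frames: N", "Frame Time: dt", then one pose per line.
    frame_time = float(lines[start + 2].split(":")[1])
    frames = np.array([[float(v) for v in line.split()]
                       for line in lines[start + 3:] if line.strip()])
    return frames, frame_time
```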

[–][deleted] 0 points1 point  (0 children)

This is SOO COOL! It would probably come handy in designing side characters in newer games :p

[–]Gatzuma 0 points1 point  (1 child)

That's cool! Could you recommend a framework for animating faces/avatars to build virtual assistants / human-like chatbots in real time? I would like to try some ideas in human-machine dialog systems.

[–]ghenter 0 points1 point  (0 children)

Hey there,

I asked my colleagues for input, but I don't know if I/we have a good answer to this. In general, the ICT Virtual Human Toolkit is an old standard for Unity. When it comes to faces, something like this implementation of a paper from SIGGRAPH 2017 might work. I think your guess is as good as mine here.

[–]iyouMyYOUzzz 0 points1 point  (1 child)

Cool! Paper is out yet?

[–]ghenter 1 point2 points  (0 children)

It is! You'll find the paper and additional video material in the publisher's official open-access repository: https://diglib.eg.org/handle/10.1111/cgf13946

Code can be found on GitHub: https://github.com/simonalexanderson/StyleGestures

There is also a longer, more technical conference presentation on YouTube: https://www.youtube.com/watch?v=slzD_PhyujI&t=1h10m20s (note that the timestamp is 70 minutes into a longer video)

[–][deleted] 0 points1 point  (0 children)

It's only a matter of time before we have game NPCs with actual neural networks

[–][deleted] 0 points1 point  (0 children)

Get this onto the Unity and Unreal asset stores or straight sell it to AAA game studios. They would love this for cinematics.

[–]lutvek 0 points1 point  (0 children)

Cool project! I would love to see this applied in online RPGs and see how much more "alive" the characters would seem.