One-Shot Free-View Neural Talking-Head Synthesis for Video Conferencing (nvlabs.github.io)
126 points by kevinak on March 20, 2021 | 29 comments


It's not exactly the same, as this is designed for compression, but there's an excerpt from Infinite Jest about the rise and fall of "videophony", wherein users would get Video-Physiognomic Dysphoria, which was just anxiety from suddenly having to be presentable in a previously audio-only interaction. The 'solution', if I recall correctly, was the marketing of videophony-specific make-up, followed by pre-made-up latex masks, followed by beautified filters of the user, finally ending up at fully 3D-rendered perfect representations of the user that covered the camera and screen, such that the only thing viewing the interaction was each other's fake avatar.


One of Vernor Vinge's books has ultra-low-bandwidth videoconferences that are described in approximately the same way: the computer is passed the absolute minimum of data and tries to generate an image that sort of approximates the interlocutor, and when the bitrate gets too low it can end up straight-up hallucinating.


"Welcome to the Slow Zone".

A Fire Upon the Deep. One of my favorites.


In the movie Surrogates, people have physical robot bodies which they pilot through real-world interactions. The surrogate bodies can look like whatever you want. One poor guy has to go outside, and he gets severe social anxiety from people actually being able to look at him.


Very cool tech, especially for bandwidth reduction!

Also, I know a lot of people who do makeup / dress up for video meetings. The first-order model (e.g. https://github.com/alew3/faceit_live3) is not good enough for those things. Maybe Nvidia's algorithm is? I wonder if there's a project that lets you record your styled self to train a model which you can then use to transform your "out-of-bed" natural self into the styled version, haha :D


After a couple of months of staring at each other's faces, and probably having seen most "configurations" of one's appearance and room, people now just don't bother to switch on cameras. Not sure if people would like to stare at fake heads when they don't particularly enjoy the real ones... Maybe it could be more fun if you could choose fantasy attributes for your character, e.g. armor, an extra head, two noses, etc.


You don't really need a very sophisticated model for this.

Just use dlib to detect face landmarks and apply some very basic localized hue/contrast/brightness filters to imitate lipstick, mascara, and other make-up.

That's what some software (e.g. Zoom) can already do out of the box, and the results are very good.
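
A minimal sketch of that landmark approach, assuming OpenCV, dlib, and dlib's standard 68-point predictor file are available locally; the color and blending weights are just illustrative:

    import cv2
    import dlib
    import numpy as np

    detector = dlib.get_frontal_face_detector()
    predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

    def apply_lipstick(frame, color=(40, 20, 180), alpha=0.4):
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        for face in detector(gray):
            shape = predictor(gray, face)
            # Outer lip contour is landmarks 48-59 in the 68-point model.
            lips = np.array([(shape.part(i).x, shape.part(i).y)
                             for i in range(48, 60)], dtype=np.int32)
            mask = np.zeros(frame.shape[:2], dtype=np.uint8)
            cv2.fillPoly(mask, [lips], 255)
            mask = cv2.GaussianBlur(mask, (7, 7), 0)    # soften the edge
            overlay = np.zeros_like(frame)
            overlay[:] = color                          # flat lipstick color (BGR)
            weight = (mask / 255.0 * alpha)[..., None]  # per-pixel blend weight
            frame = (frame * (1 - weight) + overlay * weight).astype(np.uint8)
        return frame

The same mask-and-blend trick works for blush or eye shadow; only the landmark indices change.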


And having people select between scenes, as well as idling behavior for when you're not actually in the frame.


I'd go with reduced video call bandwidth, but I'd be thinking of a home assistant UI like Holly. -- https://www.google.com/search?q=red+dwarf+holly&tbm=isch


Yeah, I've been waiting for a lifelike assistant with voice recognition. But not Siri/Alexa: local-only, to help manage my own stuff.


Impressive results!

Personally and totally off topic... what I'd really like to see in a video synthesizer is something that takes my webcam input, detects eye position and pastes googly eyes onto my head.
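
Something like this rough sketch would probably get most of the way there - it uses OpenCV's stock Haar cascades (which ship with opencv-python); a real filter would want proper landmark tracking, but it's enough for googly eyes:

    import cv2

    face_cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    eye_cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_eye.xml")

    cap = cv2.VideoCapture(0)
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        for (fx, fy, fw, fh) in face_cascade.detectMultiScale(gray, 1.3, 5):
            # Only search for eyes inside the detected face region.
            roi = gray[fy:fy + fh, fx:fx + fw]
            for (ex, ey, ew, eh) in eye_cascade.detectMultiScale(roi):
                cx, cy, r = fx + ex + ew // 2, fy + ey + eh // 2, ew // 2
                cv2.circle(frame, (cx, cy), r, (255, 255, 255), -1)  # eyeball
                cv2.circle(frame, (cx, cy + r // 3), r // 2, (0, 0, 0), -1)  # pupil
        cv2.imshow("googly", frame)
        if cv2.waitKey(1) & 0xFF == ord("q"):
            break
    cap.release()
    cv2.destroyAllWindows()

Feed the result into a virtual camera (e.g. via the pyvirtualcam package) and it works in any meeting software.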


You mean something like SnapCam?


Amazing, thanks!

Ideally this'd be something more open that allows me to whip up Python scripts acting as a filter myself, but I love this either way... :)


I'm not sure I'd use it, but that sounds amazing.


Amazing technology. It was covered by Two Minute Papers on YouTube.

https://www.youtube.com/watch?v=dVa1xRaHTA0


They should now focus on the eyes, because so far they make it too easy to tell which is fake. Great development nonetheless! Soon we'll be able to program an expert system that joins a Zoom meeting while we're free to do other things, and then we'd get the meeting minutes and resolutions.


Ew. How about we leave people’s heads alone during a video call.

I think I’d much prefer a compressed view of someone’s actual head and pose to the eerie monster created by this algorithm.

Of course it’s research, and not practically in use yet, but...


The transmitting side should be able to compare this version to the ground truth, and if it diverges too much, tell the receiving end to fall back to “normal” blocky compression artifacts instead of keeping the eerie one.

So long as the face-specific compression faithfully reproduces the ground truth, it should be fine. In a way it’s similar to voice-specific compression for audio. Knowing what’s transmitted (a head) is information that should be used.
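
A hypothetical sender-side sketch of that check (neural_decoder, codec, and the threshold are all made up for illustration) - since the sender has the ground-truth frame, it can run the same decoder the receiver would and measure the divergence itself:

    import numpy as np

    DIVERGENCE_THRESHOLD = 0.002  # illustrative value, normalized MSE

    def mse(a, b):
        a = a.astype(np.float32) / 255.0
        b = b.astype(np.float32) / 255.0
        return float(np.mean((a - b) ** 2))

    def encode_frame(raw_frame, keyframe, keypoints, neural_decoder, codec):
        # Reproduce locally what the receiver would render from the keypoints.
        reconstruction = neural_decoder(keyframe, keypoints)
        if mse(raw_frame, reconstruction) < DIVERGENCE_THRESHOLD:
            return ("neural", keypoints)              # a tiny keypoint payload
        # The model has drifted too far: send a conventionally compressed frame.
        return ("fallback", codec.encode(raw_frame))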

I’d love to see one of these algorithms used in other context-specific areas where there is much less to “lose”: sports. Compressed streams of a grass pitch with players running after a ball have horrible compression artifacts when the camera pans at low bitrates. But the receiver should know what the pitch looks like where the camera pans to - it’s static, and we had it on screen a moment ago!


If you are using a compression algorithm, you want it to be the optimal algorithm, right? One that performs compression nearly optimally? Well, this is a step in that direction.

Compression is an AI-complete problem :)


Having heads in a call at all is stupid enough


I disagree. I am often in calls with half a dozen folks from a customer, none of whom I've met before, all with English names that I have more trouble telling apart than my native German names, and due to their audio setups and the different language I might have trouble distinguishing their voices. Worse, Microsoft Teams just shows a bubble with initials in it. Having some kind of visual anchor, like this kind of virtual representation, would really help me remember people and calls a lot better, and nobody would have to share their video if they'd just gotten out of bed.


On Google Hangouts, people can upload a static avatar image that shows in place of their initials. And when they speak, there is a visual sound level indicator so you can tell which person is talking. I haven't tried this in Teams but the process looks very similar.


I agree. In addition to providing a visual anchor, I find that it makes handoffs a lot easier - there are visual cues for when someone is about to talk, which make it easier to avoid talking over each other and to know when to stop talking.


I'm hearing impaired. The visual channel contains useful information for people like me, at least when there's adequate framerate and quality. (I have a feeling something like this synthesized video is going to be an anti-pattern for accessibility in practice, though.)


Basically agreed. As if home video work isn't creepy enough.

Otoh, the more virtual our projections become, the easier it will be to manipulate them for fun :)


Manager: Can everybody just wave their hands? Just want to see if anyone is using a deep fake.


So how far are we before I can have a virtual autonomous avatar replace me in meetings?


No code :(


Two things come to my mind, as a non-expert in AI/ML etc.:

1. People making deep fakes will be super happy.

2. Cops will be super happy too.



