This is really cool tech and impressive work by DeepMind.
But man does this scare me. I remember a quote: “You furnish the pictures and I’ll furnish the war.” – William Randolph Hearst, January 25, 1898. This was in the lead-up to the Spanish-American War.
Can you imagine tech like this, or tech like "deepfakes", being used today? Fake news that was text alone has done and is doing damage in elections around the world. Imagine that armed with pictures?!
In a dueling-NN architecture, many say the discriminator will be able to detect the fake images. I wonder whether there is a threshold where a produced image is just too damn close to a real picture for even an equally good discriminating NN to differentiate. In the end, both real and fake images are just pixel values... what would we do then?
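One relevant data point: the standard GAN analysis (Goodfellow et al.) already answers the limiting case. For a fixed generator, the best possible discriminator is

    D*(x) = p_data(x) / (p_data(x) + p_g(x))

so if the generated distribution p_g ever exactly matched the real one, even a perfect discriminator would output 1/2 for every image, i.e. a coin flip. Past that point, detection would have to rely on something outside the pixels (provenance, metadata, context).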
It will be interesting if it ever reaches the point where it can be automated and scaled. I predict a modernized repeat of the War of the Worlds tipping point, followed by ratcheted skepticism toward all kinds of temporal simulacra.
Then 3D imaging will enter widespread consumer use and prove very difficult for neural networks to reproduce convincingly, until it isn't. Trust will be restored in some kind of media until it's broken. Rinse and repeat.
Here's another scary GAN proof of concept [0]. In this case, researchers transferred someone's facial expressions and mouth movements onto footage of public figures in real time. Combined with DeepMind's new tech that seems to be able to produce human speech with believable cadence and inflection [1], you could make some very convincing fake footage.
This isn't remotely near the state of the art for raw image generation; that would be something like ProGAN or PixelCNN, which don't involve any reinforcement learning or paintbrushes and already do photorealistic synthesis. That horse has left the barn.
The point of the OP is that you're learning to generate little images in a much more difficult setup: you have to control some complex black-box system (like a paintbrush robot) to try to generate an image, with only crude success/failure feedback at the end of the sequence of actions.

The hope is that by going through this intermediate environment, instead of generating an entire image in a single shot via convolutions, it'll be learning more abstract structure about what makes up a face etc., so hypothetically it could do things like rotate faces in 3D (whereas something like ProGAN only sees faces as 2D blobs, so it can do things like add/subtract sunglasses or change hair color, but 3D transformations are beyond it).

And with this more abstract, deeper understanding, it should be able to speed up learning in settings like robotics (instead of paintings, imagine videos of humans controlling pick-and-place robot arms); you can see this as one way of approaching unsupervised learning and providing primitives which a higher-level agent can learn from faster (somewhat like how the GAIL architecture uses GANs for semi-supervised learning).
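Just to make the setup concrete, here's a toy sketch of that loop. This is my own made-up PaintEnv and a plain REINFORCE update, not the paper's actual architecture:

```python
# Minimal sketch of the setup described above, not DeepMind's actual code:
# the agent acts in a black-box "painting" environment for T steps, and the only
# reward is a discriminator's score of the finished canvas. PaintEnv and its
# dot-stamping "brush" are made up here purely for illustration; in the paper
# the environment is a real, non-differentiable rendering program.
import torch
import torch.nn as nn

H = W = 28   # canvas size (MNIST-like)
T = 10       # actions (brush strokes) per episode

class PaintEnv:
    """Toy stand-in for the renderer: each action picks a pixel and we stamp
    a small 3x3 dot there. The real environment exposes strokes, pressure, etc."""
    def reset(self):
        self.canvas = torch.zeros(H, W)
        return self.canvas.clone()

    def step(self, row, col):
        self.canvas[max(row - 1, 0):row + 2, max(col - 1, 0):col + 2] = 1.0
        return self.canvas.clone()

policy = nn.Sequential(  # maps current canvas -> logits over where to stamp next
    nn.Flatten(), nn.Linear(H * W, 128), nn.ReLU(), nn.Linear(128, H * W))
discriminator = nn.Sequential(  # "how real does this finished canvas look?"
    nn.Flatten(), nn.Linear(H * W, 128), nn.ReLU(), nn.Linear(128, 1))
opt = torch.optim.Adam(policy.parameters(), lr=1e-4)

def run_episode(env):
    state = env.reset()
    log_probs = []
    for _ in range(T):
        logits = policy(state.unsqueeze(0))
        dist = torch.distributions.Categorical(logits=logits)
        action = dist.sample()
        log_probs.append(dist.log_prob(action))
        state = env.step(action.item() // W, action.item() % W)
    # Sparse feedback: only the *finished* image is judged, REINFORCE-style.
    reward = torch.sigmoid(discriminator(state.unsqueeze(0))).detach()
    loss = -(reward * torch.stack(log_probs)).sum()
    opt.zero_grad(); loss.backward(); opt.step()
    return reward.item()

# (Training the discriminator against real images, and the actor-critic
# machinery the paper actually uses, are omitted here.)
```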
It seems to me like many of the computer-generated MNIST digits involved retracing the same contours multiple times.
Is it possible to (a) filter out these duplicate strokes, (b) convert them to heavier-weight single strokes, or (c) change the training regime to not produce duplicate strokes?
I can see that being useful for, e.g., a real robot with a limited amount of ink or lead (or time to draw each character). A rough sketch of option (a) is below.
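Purely as post-processing, it could look something like this. The `render()` helper that rasterizes a single stroke into a pixel mask is hypothetical, and the novelty threshold is arbitrary:

```python
# Rough sketch of (a): walk the ordered stroke list and drop any stroke whose
# pixels mostly retrace what's already on the canvas. `render(stroke)` is a
# hypothetical helper returning a boolean H x W mask for that stroke.
import numpy as np

def dedupe_strokes(strokes, render, h=28, w=28, novelty_threshold=0.3):
    covered = np.zeros((h, w), dtype=bool)
    kept = []
    for stroke in strokes:
        mask = render(stroke)
        total = mask.sum()
        if total == 0:
            continue
        new_pixels = np.logical_and(mask, ~covered).sum()
        if new_pixels / total >= novelty_threshold:  # keep only strokes adding enough new ink
            kept.append(stroke)
            covered |= mask
    return kept
```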
The reason for that is that I chose a particular brush from the set of available brushes ('dry brush'). Since MNIST digits are quite sharp and opaque, the agent tries to achieve this by retracing the contours. I guess the remedy is to pick an appropriate brush style or make the agent choose it.
Depends on whether you see the glass half-full or half-empty. Is it a DRL actor-critic where the reward & critic happen to be half of a GAN, or is it a GAN where the generator happens to receive a RL-style loss instead of the normal discriminator loss? Actor-critic and GANs have always been hard to tell apart: https://arxiv.org/pdf/1610.01945.pdf
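To make the analogy concrete with standard textbook forms (nothing specific to this paper): a policy-gradient actor-critic updates the actor with roughly

    ∇_θ J ≈ E[ ∇_θ log π_θ(a|s) · Q_w(s, a) ]

while the non-saturating GAN generator update is roughly

    ∇_θ J ≈ E_z[ ∇_θ log D_w(G_θ(z)) ]

In both cases a separately trained network (critic / discriminator) provides the only learning signal for the network producing the outputs (actor / generator), which is why the two framings are so hard to tell apart.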
Yeah, ok, so you basically reinvented inverse-graphics analysis-by-synthesis, stuck DEEP NEURAL and the GOOGLE DEEPMIND(TM) brand on it, and now you're acting like it's the bee's knees.
I'm starting to understand how Juergen Schmidhuber feels.
Yes, I saw that they are, but at that point, I'd have to ask where the novelty is besides transforming them from probabilistic problems to plain neural-network problems.
I would call "inverse graphics" a task. One can solve that task following different strategies. We demonstrate one way that uses RL and GANs and gives reasonably good results. For Omniglot, for example, there are works by Lake et al. that employ probabilistic perspective but the amount of hand-engineering involved makes their approach hard to apply to other tasks
"When trained to paint celebrity faces, the agent is capable of capturing the main traits of the face, such as shape, tone and hair style, much like a street artist would when painting a portrait with a limited number of brush strokes:"
That's interesting. Do we know how artists draw? Is it as "algorithmic" as the article lays it out? I don't draw, so I always assumed it was more intuitive and personal than a "step by step" process.
It is eye-opening, even among fellow manga artists, to see how different their processes sometimes are.
Some may start with a definite sketch, others may go straight to ink with only the barest suggestion of a layout. Sometimes they struggle with expressions and may whiteout and re-ink (up to seven times in one of the videos).
Some artists start inking with the eyes, some may start with an outline of the face. And so on.
There are a variety of methods. Some people will teach you formulaic approaches to drawing people/faces, and instruct you to always lay out the 'proper' measurements that most people fit, then just add detail. More traditional methods teach you to draw what you see, but focusing on the structural lines and forms of the person, while merging it with knowledge of anatomy, perspective, and lighting. And other methods are purely 'draw what you see', without additional context, trusting accurate copying to paper to look correct.
What any specific artist uses will vary greatly. But it usually falls into one of those three camps.
I think it's more that if you want to depict something with the minimal number of strokes, you really have no option but to look for the key, defining traits of the object. In that fashion, both the AI and the artist must operate similarly, simply due to the limitations implied by the task.
But depicting those traits is another matter. You can render a chin meeting the hair in all sorts of ways, but your choices are limited by your aesthetic preferences and your ability to draw that form.
Drawing is a highly mechanical process; choosing what/how to draw is a curated one.
I think there is a bit of both. If an artist learned on their own, it may be more intuition (or, step-by-step, but they don't realize it because they don't think about the individual steps). But if you take a drawing class you will learn a lot of steps that you can reproduce.
For a good, striking example, do a video search for "two point perspective drawing" and look at some of the tutorials/demonstrations that come up.
Intuitive and methodical drawing aren't mutually exclusive. You can derive a method from intuition. Skilled draughtsmanship is partly technical, but it will always lack something when confined to purely methodical rendering. Most people who draw cannot exactly explain how they do it. And yet it is a completely learnable skill, albeit somewhat difficult to teach.
While it is true that one can simulate painting by search (in some sense), the problem is that (naive) search doesn't always work (we have an example in the paper). Moreover, training an agent has the benefit of fast test-time inference (i.e., you give it an image, and it paints it almost instantly). Of course, we haven't achieved the ultimate goal yet, but it's a step in that direction.
Thanks for the reply. Where in the paper is the example given? Skimmed it but didn't see it. (Edit: nevermind, see it now)
Fast painting is a benefit I guess. My search/painting program is very computationally intensive.
Edit: I think I see the point of the paper now. Unguided search is going to be difficult in high-dimensional search spaces like this. So the NNs become a hopefully-effective heuristic guiding the search.
The (semi-)obvious next step is to do object/digit recognition with a Bayesian probability calculation, with probabilities based on this image-reconstruction process. In other words, we choose e.g. digits based on how likely they are to have been drawn to give the target image; a rough sketch is below.
I have experimented a little with this idea, but with no successful results so far (plain old NNs still beat it).
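In code, the idea is roughly something like this. Here `reconstruct(image, digit)` stands in for a class-conditioned drawing/reconstruction model, and the Gaussian pixel likelihood and `beta` are arbitrary illustrative choices:

```python
# Sketch of the idea: score each digit class by how well a class-conditioned
# drawing model reconstructs the target image, then pick the class with the
# highest posterior. `reconstruct(image, digit)` is hypothetical here.
import numpy as np

def classify(image, reconstruct, priors=None, beta=50.0):
    digits = range(10)
    if priors is None:
        priors = {d: 0.1 for d in digits}      # uniform prior over classes
    log_post = {}
    for d in digits:
        recon = reconstruct(image, d)
        err = np.mean((image - recon) ** 2)    # pixel-wise reconstruction error
        # crude Gaussian likelihood: p(image | class) ~ exp(-beta * err)
        log_post[d] = -beta * err + np.log(priors[d])
    return max(log_post, key=log_post.get)
```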
The main difference is that for search you need to specify an image to emulate, while the RL method aims to generate new ones after observing some examples.
Cool tech, scary possibilities.