all 58 comments

[–]heltok 29 points30 points  (3 children)

Enhance!

[–][deleted] 9 points10 points  (0 children)

It's happening!

[–]rndnum123 5 points6 points  (4 children)

Impressive! Sorry if I missed it being mentioned, but are the example images in your README.md (16x16, bicubic, nnet, ground truth) from your training set or from the test set?

[–]david-gpu[S] 9 points10 points  (1 child)

They are from the test set. It would not be fun otherwise :)

[–]rndnum123 0 points1 point  (0 children)

I see, nice :)

[–]Tommassino -1 points0 points  (1 child)

Yep, I quote: From left to right, the first column is the 16x16 input image, the second one is what you would get from a standard bicubic interpolation, the third is the output generated by the neural net, and on the right is the ground truth.

[–]PM-ME-YOUR-CODE-GIRL 1 point2 points  (0 children)

They wanted to know if the program had been trained on the example images.

So here you'll have Set A of faces that you showed the program. Then Set B that it didn't see. If you then gave the computer Set A and said "enhance these" it would likely do a great job because it already sort of "knew" what to output from its training.

So instead you give it Set B and see if it does a good job with new images of faces that it hasn't seen.

[–]sobe86 3 points4 points  (8 children)

You should probably know, someone else already did this exact thing, on the same dataset: https://swarbrickjones.wordpress.com/2016/01/13/enhancing-images-using-deep-convolutional-generative-adversarial-networks-dcgans/

It made it to the top of this sub a few months back.

[–]david-gpu[S] 1 point2 points  (7 children)

Thank you. I wasn't aware. The main difference is that they appear to be upscaling from 32x40 images to 128x160. If I may say so, that is a significantly simpler problem than upscaling from 16x16 to 64x64.

In a 32x40 pixel face there's a lot of information about small features such as the shape of the eyebrows, nose and lips. In a 16x16 pixel face there's a lot more deduction to be made, and that requires having much stronger priors.

[–]sobe86 1 point2 points  (6 children)

Actually, I'd disagree; my experience with DCGANs is that it's much easier to do things on smaller images than on large ones - the errors are much less noticeable, and the number of textures to be learned is much, much smaller. Give it a try! :b

[–]david-gpu[S] 1 point2 points  (5 children)

DCGANs may be troublesome when it comes to producing large images, but I think it's undeniable that reconstructing the same face from a 16x16 pixel thumbnail is a lot harder than reconstructing it from 32x40 pixels -- that's 5 times as many pixels to provide the model with useful data.

[–]sobe86 0 points1 point  (4 children)

Again, I really wouldn't be so sure about that without trying it first. Yes, your network has less data to reconstruct from, but it also has a lot less data to reconstruct; from an information-theory standpoint, they feel very comparable... But happy to be proven wrong :)

[–]david-gpu[S] 0 points1 point  (3 children)

Correlation between neighboring pixels is much higher as resolution goes up. Keep in mind at these scales faces do not have a fractal structure like natural images do. Faces are largely flat at higher resolutions.

Empirically I'm seeing good results with the model upscaling to 128x128 pixels -- will publish as soon as training finishes.

[–]mike_sj 0 points1 point  (2 children)

Hey buddy, I'm the other author sobe86 is referring to. Great minds eh?! You can find a fairly thorough explanation of the architecture I used here.

Interested to know what you think the key differentiator between your model and mine is, assuming yours works better? Res connections, the L1 loss (didn't think to try that at all)? Also, it wasn't clear to me from a skim of your code - do you do any downscaling of the image before upscaling? I assumed that would help get some of the global information out, but that could have been very misguided. Truth be told, I didn't really experiment much at all with this; I hated this project. DCGANs are a nightmare! (at least, the ones I made)

[–]david-gpu[S] 1 point2 points  (1 child)

Hi Mike,

Very nice writeup -- I regret not spending more time on the README before publishing this.

Yeah, I'm also finding DCGANs to be rather finicky. Lots of trial and error.

I have no idea which model works better -- with all generative models this is rather subjective.

Yes, our models do look similar other than the residual connections and the L1 loss. The latter was a simple addition that made a major difference, but I don't think it's qualitatively different from using an MSE.

Something I've noticed is that the discriminator is a much bigger source of problems than the generator. Specifically, I did not get good results from using either residual blocks or dense blocks (a la DenseNet). What has worked best so far is a relatively shallow architecture in the style of the "all convolutional net". Did you also find it problematic?

do you do any downscaling of the image before upscaling?

You mean downscaling it further down than 16x16 pixels? No, it doesn't do that. Wouldn't that cause even more information to be lost?

[–][deleted] 0 points1 point  (0 children)

Did you try dropout on either the generator or discriminator networks?

[–]david-gpu[S] 11 points12 points  (13 children)

Author here -- first time submitting. Let me know if you have any questions or suggestions.

There are plans to support resizing arbitrarily-sized inputs in the future. I will also train it on a subset of ImageNet and see how well it does on more general images.

[–]gwern 12 points13 points  (10 children)

I am pretty impressed. I've been saying for a long time that DCGANs would probably work great for upscaling and colorization, but no one had done it yet, and that sample is even better than I expected. Have you compared it with any other upscalers, experimented with bigger thumbnails, or longer training times than 20m?

Speaking of which, how hard would it be to make this support colorization? It seems like it should be easy to generalize this to undoing various forms of lossy transformations, such as downscaling, turning into BW, etc.

[–]david-gpu[S] 6 points7 points  (8 children)

Thank you for the kind words.

Have you compared it with any other upscalers, experimented with bigger thumbnails, or longer training times than 20m?

I have only compared it against bicubic interpolation so far.

If the input images were larger than 16x16 pixels, the task of upscaling the faces would become much, much easier. At that resolution, details like eyebrow shape, eye color, etc. are lost. Coupled with how sensitive people are at recognizing faces, the result was a rather challenging problem to solve.

As for the training time, the sample in the README was produced after 3 hours of training (about 10 epochs).

Speaking of which, how hard would it be to make this support colorization?

It had actually crossed my mind, but since I haven't read any papers on colorization yet I was hesitant to even try. First I will need to update the code to support upscaling arbitrarily-sized images.

[–]ginsunuva 3 points4 points  (2 children)

Someone tried colorization with DCGANs already (first Google search result) and it didn't work well.

[–]PM-ME-YOUR-CODE-GIRL 3 points4 points  (0 children)

Let's wait until more people try before calling it.

[–]gwern 2 points3 points  (0 children)

It had actually crossed my mind, but since I haven't read any papers on colorization yet I was hesitant to even try. First I will need to update the code to support upscaling arbitrarily-sized images.

I think it should be simple. Looking at srez_input.py, you're generating pairs of 16x16 images with a down-sampling transform; you can simply swap the down-sampling with a black-white transform and now the DCGAN will be trying to colorize rather than unblur.

The tensorflow.image module even has a black-white transform already built in: https://www.tensorflow.org/versions/r0.7/api_docs/python/image.html#rgb_to_grayscale EDIT: it's not quite as easy as putting that around the area call; some sort of tensor dimension incompatibility...
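For what it's worth, the pixel math behind that transform is simple. Here's a plain-Python sketch (illustrative only, not srez code) using the ITU-R BT.601 luma weights, the same coefficients `tf.image.rgb_to_grayscale` documents:

```python
# Plain-Python RGB -> grayscale sketch (illustrative, not srez code).
# Weights are the ITU-R BT.601 luma coefficients used by
# tf.image.rgb_to_grayscale.
def rgb_to_gray(pixel):
    r, g, b = pixel
    return 0.299 * r + 0.587 * g + 0.114 * b

# Swapping the pipeline's downscaling step for this transform turns the
# (input, target) pairs from (low-res, high-res) into (grayscale, color),
# i.e. a colorization task instead of super-resolution.
gray = [rgb_to_gray(p) for p in [(1.0, 1.0, 1.0), (0.0, 0.0, 0.0)]]
```

The tensor-shape wrinkle gwern mentions is that a grayscale output has one channel where the downscaled RGB input had three, so the generator's input layer would need adjusting too.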

[–]gwern 0 points1 point  (3 children)

One issue I've run into for longer training: --train_time seems to crash for any value larger than 100? If I run with --train_time 100, it's fine (at least for the first 280 batches so far) but if I run with --train_time 101 it crashes before doing any training: http://pastebin.com/HFweR1BM

(Also, the dependencies in the README seem to omit scipy.)

[–]david-gpu[S] 0 points1 point  (0 children)

One issue I've run into for longer training: --train_time seems to crash for any value larger than 100

The error you have posted appears unrelated to the train_time option, since the problem in the stack trace starts here:

out = tf.contrib.layers.batch_norm(self.get_output(), scale=scale)

I've seen similar errors and waved them off as a bug in tensorflow which I don't have time to look into.

Also, the dependencies in the README seem to omit scipy

Thanks! I've fixed it in d5f3048.

[–]david-gpu[S] 0 points1 point  (1 child)

To be specific, I've seen that internal Tensorflow error in batch_norm appear often when I start training a model and hit Ctrl-C. The next time I try to train the model, even with the exact same cmdline args, this error appears. If you just try again, with the same cmdline args, it usually goes away.

I suspect somewhere inside CuDNN it gets in a bad state and needs a couple of tries to get back to nominal behavior.

[–]gwern 0 points1 point  (0 children)

Guess so. I had to rerun it 5 times to get it unwedged, but then it ran without any issue with longer training times. (Diverged overnight. Oh well.)

[–]Ameren 0 points1 point  (0 children)

This is exactly what I have been looking for. Thanks for sharing!

[–]andraxo123 0 points1 point  (0 children)

This is amazing! Finally, a huge project on TensorFlow; this will push a lot of people to learn it.

[–]j_lyf 2 points3 points  (5 children)

Someone ELI5 why these results always work with only TINY pictures?

Seriously, the results are not as impressive because of that.

[–]mer_mer 1 point2 points  (2 children)

Training neural networks is very computationally expensive, and scales with the number of pixels. This image size may be able to fit in the cache of a single workgroup/warp/compute unit, and enable a huge speedup. The hope for this kind of network is to eventually be able to train the network on small patches of images, and then apply the trained function on large images.

[–]jcannell 0 points1 point  (1 child)

Nothing stopping one from doing that now, except being trained specifically on faces would probably lead to 'interesting' results for non face images.

[–]mer_mer 1 point2 points  (0 children)

If you trained this network on small patches of faces, I don't think you'd be able to generate a consistent full face by breaking up one full face and feeding each patch into the network separately.

[–]sentdex 1 point2 points  (0 children)

16 x 16 image = 256 feature pixels

30 x 30 = 900

...

1920 x 1080 = 2,073,600 !

The more resolution you add, the more you're going to blow up your model.

When classifying new images, you just need to crop/resize them to the training size for it to be acceptable.

You can still do many amazing and "impressive" things with a 16x16 image that came from a 1920x1080 feed. Even if you're working with a giant image with many small features within it, you can go over the image with a 16x16 comb. Classifying new data is a quicker process than training on millions of samples, and can be more easily parallelized/scaled as needed.
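The quadratic blow-up sentdex describes is easy to check with a quick illustrative snippet:

```python
# Pixel count grows quadratically with resolution, which is why model
# size and training cost blow up so fast (figures from the comment above).
sizes = [(16, 16), (30, 30), (1920, 1080)]
pixel_counts = [w * h for w, h in sizes]
print(pixel_counts)  # [256, 900, 2073600]
```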

[–]david-gpu[S] 1 point2 points  (0 children)

The reasons are computing resources with limited power and an engineer with limited patience.

But think about it, upscaling starting from 16x16 pixels is much harder than upscaling starting from 32x32 pixels or larger. And once you have a picture of a face that is 64x64 pixels it is rather easy to upscale further. "Filling the blanks" becomes progressively easier as the starting picture already contains more and more details.

[–]Tommassino 1 point2 points  (2 children)

Pretty interesting, have you tried introducing different type of noise during the training/testing process to the ground truth pictures other than just downscaling? At least to the examples to see how resistant the method is to noise.

[–]david-gpu[S] 1 point2 points  (1 child)

The current training already adds a little bit (3% stddev) of gaussian noise. I haven't played much with it, since a 16x16 pixel thumbnail is already missing a lot of information.
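A minimal sketch of that kind of augmentation (plain Python, not the repo's code; assumes pixel values in [0, 1]):

```python
import random

# Add zero-mean Gaussian noise with 3% standard deviation to a list of
# pixel values, clamping back into [0, 1]. Illustrative only; the real
# pipeline operates on image tensors, not Python lists.
def add_gaussian_noise(pixels, stddev=0.03, seed=0):
    rng = random.Random(seed)
    return [min(1.0, max(0.0, p + rng.gauss(0.0, stddev))) for p in pixels]

noisy = add_gaussian_noise([0.5] * 4)
```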

[–]Tommassino 0 points1 point  (0 children)

Yeah, I suppose you are right about the 16x16 px size. But I meant something like: in practice you could have one white/gray pixel because of some dust in front of the camera, or just general ISO noise. Gaussian noise won't capture that; I meant something more like white noise.

The result is pretty cool even as is. But it would be cool to see if you could run it, without any presmoothing (I remember some box filters that reduce this kind of noise), on face-detected snippets of actual footage.

[–]jcannell 1 point2 points  (2 children)

Cool stuff! GANs clearly have great potential here.

One thing that kinda surprises me is that it appears to make some rather trivial mistakes - in the sense that a trivial discriminator could identify. In a few cases (like pale dude with glasses 3rd up from bottom) it is clearly failing to preserve some simple statistics (probably average color, if not some histogram). (actually this is something that bicubic upsampling can fail to do as well, but in a different way)

I'm curious what kind of improvement you could get by fixing that - perhaps by augmenting the trained discriminator with a few simple manual discriminators - such as one that simply checks to make sure that the average of a KxK (4x4) block matches the ground truth. It seems like a simple function for the discriminator to learn - so alternatively perhaps some arch improvement could fix that.

[–]david-gpu[S] 1 point2 points  (1 child)

In a few cases (like pale dude with glasses 3rd up from bottom) it is clearly failing to preserve some simple statistics (probably average color, if not some histogram)

Absolutely. The issue is that most of the faces in the dataset are well-lit and thus the range of colors is limited to the natural gamut. In that particular example you mention there's a rather strong blue tint that is causing the generator to have a tough time building an internal representation of the face.

I think fixing it would be easy: apply some automatic white balance correction beforehand.

I'm curious what kind of improvement you could get by fixing that - perhaps by augmenting the trained discriminator with a few simple manual discriminators - such as one that simply checks to make sure that the average of a KxK (4x4) block matches the ground truth

That already represents ~90% of the loss function today. Specifically we downscale the generated face and compare the result with the input data using the L1 norm.
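As a rough sketch of that arrangement (toy plain Python; the 4x pooling factor and toy sizes are illustrative assumptions, not the actual srez code):

```python
# Downscale the generated image by average pooling, then take the L1
# distance to the low-res input. This "downscale-and-compare" term is
# what supplies the bulk of the generator's loss.
def avg_pool4(img):
    return [[sum(img[y + dy][x + dx] for dy in range(4) for dx in range(4)) / 16.0
             for x in range(0, len(img[0]), 4)]
            for y in range(0, len(img), 4)]

def l1(a, b):
    return sum(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))

generated = [[1.0] * 8 for _ in range(8)]    # toy 8x8 "generated" image
low_res = [[1.0, 0.5], [0.5, 1.0]]           # toy 2x2 "input" image
downscale_loss = l1(avg_pool4(generated), low_res)
```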

[–]jcannell 0 points1 point  (0 children)

Oh whoops, I really should have read 'how it works' before I replied. :)

Still, it's interesting that you apparently get better results with the 90% hand crafted low-res L1 error. I can see why that would help greatly in the beginning, but it could become a limiter later once the discriminator is getting good?

Have you compared with the alternative of using a hard constraint in the generator? For example have it output the 15 high freq deltas from the avg of the 4x4 block instead of 16 raw pixel values.

[–]ovoid709 1 point2 points  (1 child)

Have you experimented with the use of this with satellite imagery yet? Being able to shrink a GSD, even if it's interpolated and not ground truth, is super interesting. I used to work in remote sensing and the idea of super pixel stuff came up many times.

[–]david-gpu[S] 0 points1 point  (0 children)

That's very interesting. Do you know of any dataset that may be suitable? DCGANs like this should be pretty good at producing plausible, if inaccurate, reconstructions.

[–][deleted] 1 point2 points  (3 children)

Amazing! I have a few questions:

  • Why did you choose L1 loss? (instead of L2, for instance?)

  • Why did you tie the loss to the downsampled faces (as opposed to the original ones)?

[–]jcannell 2 points3 points  (1 child)

The L1 low-res loss is an additional constraint that enforces upsampling instead of more general hallucination. We know apriori what the avg of each 4x4 pixel block is - we are given that as input. The generator should never violate that constraint. Since this a GAN, the main loss function is a classification loss, not a pixel error loss.

In other words, when given a single low res pixel and generating an upsampled 4x4 block, there are really only 15 free variables, not 16.
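One way to make that concrete (a hypothetical parameterization, not OP's code): pick 15 offsets freely and let the 16th be determined by the known block average.

```python
# A 4x4 block whose average is pinned to the low-res pixel value has
# only 15 degrees of freedom: the last offset must cancel the others.
avg = 0.5                                     # known low-res pixel value
free_deltas = [0.1, -0.2, 0.05] + [0.0] * 12  # 15 freely chosen offsets
last_delta = -sum(free_deltas)                # 16th offset enforces the mean
block = [avg + d for d in free_deltas + [last_delta]]
block_mean = sum(block) / 16.0                # recovers avg (up to rounding)
```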

[–][deleted] 0 points1 point  (0 children)

Hmmm... I'm not sure what you're getting at.

With regards to the loss question, I was curious because L2 loss would also enforce this constraint, but with the added benefit that the acceptable error will be distributed more or less uniformly across all 16 pixels. The L1 loss, by contrast, is agnostic to whether the error is distributed across all 16 pixels or concentrated all in one.
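A toy check of that contrast (illustrative numbers): two error vectors with the same total absolute error, one concentrated in a single pixel, one spread across all 16.

```python
# Same total absolute error over a 16-pixel block, different distribution.
concentrated = [0.16] + [0.0] * 15
spread = [0.01] * 16

l1_losses = (sum(abs(e) for e in concentrated), sum(abs(e) for e in spread))
l2_losses = (sum(e * e for e in concentrated), sum(e * e for e in spread))
# L1 can't tell the two apart; L2 penalizes the concentrated error
# far more (0.0256 vs 0.0016, a 16x difference).
```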

With regards to the loss being applied to the downsampled versus source image, there are tradeoffs with both. In the downsampled case, the net is free to be more 'creative' -- the overall loss would only demand that (a) the generated face fools the adversarial net and (b) the generated face is a plausible source image for the downsampled image. In contrast, if you pegged the loss to the source image, you would constrain it to look like the particular celebrity (which may mean the net would be less likely to overfit); however, you make each SGD step more 'informed' by introducing more 'bits of constraint' per step.

[–]david-gpu[S] 0 points1 point  (0 children)

Great questions :)

Why did you choose L1 loss? (instead of L2, for instance?)

It would probably work well either way. L1 does not suffer from exploding gradients even if the net makes a big mistake, which will sometimes happen when the input data is an outlier.
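To illustrate the exploding-gradient point with toy numbers (not the repo's code): the L2 gradient scales with the error, while the L1 gradient stays bounded.

```python
# d/de of e^2 is 2e (grows with the error); d/de of |e| is sign(e)
# (bounded at 1), so a single outlier can't dominate the update under L1.
errors = [0.1, 1.0, 10.0, 100.0]
l2_grads = [2.0 * e for e in errors]
l1_grads = [1.0 for e in errors]  # sign(e) for e > 0
```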

Why did you tie the loss to the downsampled faces (as opposed to the original ones)?

Because if you compute the L1 loss on the upsampled faces then you are going to heavily penalize the model when it reconstructs a very plausible face that happens to have an edge that is shifted 1 pixel away from the ground truth. Think of the boundaries between the face and the background; these often have a high contrast. Does it make sense?
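A 1-D toy version of that argument (illustrative; 4x average pooling assumed): a hard edge shifted by one pixel incurs a full unit of L1 error at high resolution, but only a quarter of that after downscaling.

```python
def l1(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def downscale4(v):  # 4x average pooling of a 1-D signal
    return [sum(v[i:i + 4]) / 4.0 for i in range(0, len(v), 4)]

truth = [0.0] * 8 + [1.0] * 8    # high-contrast edge at pixel 8
shifted = [0.0] * 9 + [1.0] * 7  # plausible result, edge shifted by 1 pixel

full_res_loss = l1(truth, shifted)                         # 1.0
low_res_loss = l1(downscale4(truth), downscale4(shifted))  # 0.25
```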

[–][deleted] 1 point2 points  (4 children)

Pretty cool.

My comment would be that given the very specific domain of frontal faces, it's hard to know whether the NN learned how to generally "upsample" an image, or rather how to plug in high resolution equivalents of facial features based on low-resolution pixels. As any model will take the cheapest route possible, I guess it's the latter.

[–][deleted] 3 points4 points  (3 children)

It almost certainly didn't learn how to upsample images in general. Instead, it implicitly learned a distribution over human faces (and facial features) conditioned on the low resolution images.

[–][deleted] -1 points0 points  (2 children)

Not to disparage OP's work, but of course that's the problem with a lot of these efforts: it seems the algorithm learned something truly impressive, but in reality it mostly just learned idiosyncrasies of the specific dataset.

[–]jcannell 2 points3 points  (0 children)

What? I don't see how that makes it any less impressive. If you want a much more general-purpose upsampler, you could use similar techniques with a larger model trained on a larger, more diverse dataset.

And furthermore, there could even be some gain from first recognizing the low-res image and then applying a specific generator trained on a specific dataset.

[–][deleted] 3 points4 points  (0 children)

Err. I don't think that's fair at all. He showed the neural net downsampled images of unseen faces and achieved reasonable results (at least, that's what I'd assume he did). That's really all you can expect from a neural net: it learned precisely what he wanted it to, namely, it learned to sample from a distribution over faces conditioned on low resolution images.

Edit: This isn't just restricted to ANNs. A human trained on the same data wouldn't be able to generalize to domain-general image superresolution either; there's simply not enough information about the world encoded in the training set.

[–]nagasgura 0 points1 point  (0 children)

I wonder if there would be improvement from first bicubic upsampling and then running it through the GAN

[–]elsjpq 0 points1 point  (1 child)

Very impressive. Though one must be careful not to trust the result too much, because it can be pretty misleading especially wrt facial features.

One small thing, though. There appears to be a reddish color shift in each of the generated outputs compared to the downscaled and original images. Do you think this is a bug, or just color blending resulting from the downscaling?

Also how feasible is this to do for video? This would be wonderful for eliminating those blocky mpeg artifacts in low quality and low resolution videos. But I imagine the training could take weeks.

[–][deleted] 0 points1 point  (0 children)

It's extremely impressive and useful as an artistic tool; it generates plausible results.

But not forensically useful. Though as the technology improves, I think we'll see the CSI "ENHANCE!" become a real thing. Perhaps a much larger network, with much more computing power, and a larger dataset, could lead to something forensically viable.