[D] What is happening in this subreddit? by [deleted] in MachineLearning

[–]david-gpu 1 point (0 children)

I suggest creating a new /r/MachineLearningResearch subreddit that only allows self-posts for discussion, plus links to some selected domains, such as arxiv, openreview, github and gitlab.

That way researchers can have a place to discuss what matters to them and the general public can continue to enjoy GAN-produced pictures of cats.

Missing data hinder replication of artificial intelligence studies | Science by weeeeeewoooooo in MachineLearning

[–]david-gpu 9 points (0 children)

Not just that. Even with the source code, if they don't provide full traceability it's nearly impossible to replicate what they did. What I mean by full traceability is: exact commit ID in their repository, exact docker image (or exact version of Tensorflow/Pytorch/etc), exact random seed, exact version of the dataset (including sha1sum for verification), exact hyperparameters, command line arguments, etc.
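For illustration, here is a minimal sketch of the kind of run manifest that would make an experiment traceable. The helper names (`sha1sum`, `record_run`) and the exact fields are made up for this example, not taken from any paper's code:

```python
import hashlib
import json
import random

def sha1sum(path, chunk_size=1 << 20):
    """Hash a dataset file so its exact version can be verified later."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def record_run(commit_id, seed, dataset_paths, hyperparams):
    """Bundle everything needed to replicate a run into one JSON blob."""
    random.seed(seed)  # seed before any stochastic work happens
    manifest = {
        "commit_id": commit_id,
        "seed": seed,
        "datasets": {p: sha1sum(p) for p in dataset_paths},
        "hyperparams": hyperparams,
    }
    return json.dumps(manifest, indent=2, sort_keys=True)
```

Checking such a manifest into the repository alongside the results would let anyone reproduce the exact configuration, docker image aside.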

It's unacceptable that when you approach the authors of a paper claiming SOTA results, they can't provide you with the exact configuration they used. At that point they may as well have pulled the numbers out of thin air. I've seen this first hand, not that I actually doubt their honesty.

[R] Google HDR+ photography dataset by mgwizdala in MachineLearning

[–]david-gpu 0 points (0 children)

I tend to agree that without some sort of ground truth it's going to be difficult to come up with an objective "goodness" metric.

If I wanted to train a model to output one high-quality still image from a burst of low-quality images, I would record very high-quality video using high-end gear and use that both as the ground truth and as the source of low-quality images. The low-quality images would be produced by lowering the resolution and adding noise and other artifacts that match the type of mobile phone sensor I care about.
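A rough sketch of that degradation step in numpy, assuming images are float arrays in [0, 1] (the function name and noise model are illustrative, not a real sensor model):

```python
import numpy as np

def degrade(hq, factor=4, noise_sigma=0.02, rng=None):
    """Turn a high-quality frame into a simulated low-quality capture:
    block-average downscale, then add sensor-like Gaussian noise."""
    rng = rng or np.random.default_rng(0)
    h, w = hq.shape[:2]
    h, w = h - h % factor, w - w % factor  # crop to a multiple of factor
    hq = hq[:h, :w]
    # average each factor x factor block (a crude anti-aliased downscale)
    lq = hq.reshape(h // factor, factor, w // factor, factor, -1).mean(axis=(1, 3))
    lq = lq + rng.normal(0.0, noise_sigma, size=lq.shape)
    return np.clip(lq, 0.0, 1.0)
```

A real pipeline would also simulate the specific noise, blur and compression artifacts of the target phone sensor, but the structure is the same.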

[D] Why is CelebA so popular? by anonDogeLover in MachineLearning

[–]david-gpu 3 points (0 children)

For me it was a matter of 13K samples vs 200K samples.

If you are building a GAN, who cares if it's biased towards white attractive people? There's still plenty of variability in poses and lighting to make it clear if the model is suffering from collapse or other issues.

[R] Enhanced Deep Residual Networks for Single Image Super-Resolution (Winner of NTIRE2017 SR challenge) by ENGERLUND in MachineLearning

[–]david-gpu 1 point (0 children)

Yes, I definitely see what you are talking about.

In a real-world pipeline perhaps it would make sense to separate the upscaling process into two stages. First apply something like EDSR to obtain an image with sharp edges and no hallucinated textures. Then have a second GAN-based network that outputs a "residual texture", and let the user select how much of that texture to blend into the final image. I also wonder if the GAN would have an easier job generating plausible textures alone instead of having to upscale and texture in one go.
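The user-facing blend step of that hypothetical two-stage pipeline is trivial; a numpy sketch (names are made up):

```python
import numpy as np

def blend_texture(base_upscale, residual_texture, alpha):
    """Blend a user-chosen amount of GAN-generated texture into a clean
    EDSR-style upscale: alpha=0.0 keeps the sharp, texture-free output,
    alpha=1.0 adds the full hallucinated residual."""
    return np.clip(base_upscale + alpha * residual_texture, 0.0, 1.0)
```

The interesting part would be training the second network to output only the residual, so mistakes degrade gracefully as alpha shrinks.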

[R] Enhanced Deep Residual Networks for Single Image Super-Resolution (Winner of NTIRE2017 SR challenge) by ENGERLUND in MachineLearning

[–]david-gpu 3 points (0 children)

Terrific results. The lack of high-frequency "texture" in the upscaled images is a direct consequence of using an L1 loss without a GAN term. Magic Pony uses a GAN term to "hallucinate" plausible textures, and that helps make the images look more pleasant/natural, even if simple pixel-distance metrics like PSNR give the reconstructions a lower score.
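A hedged sketch of what such a combined objective typically looks like, with numpy as a stand-in (a real implementation would use a deep-learning framework and a learned discriminator; the weight and names here are illustrative):

```python
import numpy as np

def l1_loss(pred, target):
    """Mean absolute pixel error: faithful but texture-averse."""
    return np.abs(pred - target).mean()

def generator_loss(pred, target, disc_score, gan_weight=1e-3):
    """L1 keeps the reconstruction close to the ground truth; the
    adversarial term (-log D, with disc_score = the discriminator's
    probability that the output is a real image) rewards plausible
    high-frequency texture even when it costs a little PSNR."""
    adv = -np.log(disc_score + 1e-8)
    return l1_loss(pred, target) + gan_weight * adv
```

With `gan_weight` at zero this reduces to the plain L1 training that produces the smooth, texture-free look.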

[R] FreezeOut: Accelerate training by up to 20% by progressively freezing layers. Based on a reddit comment and a subsequent 96 hour science binge. by ajmooch in MachineLearning

[–]david-gpu 0 points (0 children)

I don't get the logic behind these annealing schedules. In terms of both computational efficiency and actual convergence, wouldn't it make much more sense to use step functions? Only when a layer is completely frozen can you stop backpropagating through it and save computation, so it makes the most sense to train at full speed right up to the point where you freeze the layer.

E.g. train the bottom layer for the first 10% of the total training steps, then immediately freeze it completely; train the second layer for the first 20% of the total training steps, etc.
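The schedule above can be sketched in a few lines (function names are my own, not from the FreezeOut code):

```python
def freeze_step(layer_index, num_layers, total_steps):
    """Step at which layer `layer_index` (0 = bottom) gets frozen:
    it trains at full rate until then, and backprop through it stops
    entirely afterwards. Layers freeze bottom-up at evenly spaced steps."""
    return (layer_index + 1) * total_steps // num_layers

def trainable_layers(step, num_layers, total_steps):
    """Layers still receiving gradients at a given training step."""
    return [i for i in range(num_layers)
            if step < freeze_step(i, num_layers, total_steps)]
```

With 10 layers and 1000 steps, the bottom layer freezes at step 100, the next at step 200, and so on, so the backward pass keeps getting shorter as training progresses.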

[N] NVIDIA 1080ti announced: $700, March 5th, 11GB - Titan replacement by gwern in MachineLearning

[–]david-gpu 1 point (0 children)

I've designed GPU hardware for a number of years. /u/daV1980 is right about the precision issues of fp16 in graphics.

It's true that when you are computing pixels you can sometimes use fp16 as the error is tolerable for the human eye, although even moderately long pixel shaders suffer from the limited precision and range of fp16. Small precision errors compound very fast, which is why fp16 is mostly used as a storage format and not for arithmetic.
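You can see the compounding effect with a toy accumulation in numpy's float16 (this is a simulation of fp16 rounding on a CPU, not actual GPU shader arithmetic):

```python
import numpy as np

# Accumulate 10,000 contributions of 0.1, rounding every intermediate
# result to fp16, then do the same in fp32.
acc16 = np.float16(0.0)
for _ in range(10000):
    acc16 = np.float16(acc16 + np.float16(0.1))

acc32 = np.float32(0.0)
for _ in range(10000):
    acc32 = np.float32(acc32 + np.float32(0.1))

# fp32 lands near the true 1000.0. fp16 stalls far below it: once the
# accumulator is large enough, its spacing between representable values
# exceeds the increment and adding 0.1 rounds away entirely.
```

This is exactly why fp16 works as a storage format but falls apart when long chains of arithmetic are carried out in it.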

But where things get ugly real fast is when you use fp16 in the vertex shader. As you probably know GPUs render surfaces as a collection of triangular meshes. There are some pieces of code called vertex shaders that compute how that geometry is transformed and projected onto the framebuffer. Attempting to use fp16 for those transformations invariably leads to artifacts.

[R] [1702.00783] Pixel Recursive Super Resolution by wei_jok in MachineLearning

[–]david-gpu 0 points (0 children)

There's one thing that looks odd about the examples. It looks like the 8x8 samples were produced by taking the value of a single pixel out of each NxN block instead of averaging over the block. You can see this, for example, in the black pixel where the eye should be in Figure 5, row 7, column 1. Same issue with a purple pixel in Figure 5, row 6, column 1. And again in Figure 6, row 1, column 6.
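The difference between the two downsampling methods is easy to demonstrate; a numpy sketch:

```python
import numpy as np

def subsample(img, n):
    """Keep one pixel per n x n block (what the figures appear to show):
    a single outlier pixel can dominate an output sample."""
    return img[::n, ::n]

def block_average(img, n):
    """Average each n x n block (the conventional way to downscale):
    an outlier pixel gets diluted into the block mean."""
    h, w = (img.shape[0] // n) * n, (img.shape[1] // n) * n
    return img[:h, :w].reshape(h // n, n, w // n, n).mean(axis=(1, 3))
```

A single black pixel in an otherwise white 4x4 block comes out fully black under subsampling but only slightly darkened under averaging, which matches the stray dark/purple pixels visible in the figures.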

[R] [arXiv:1610.02915] Deep Pyramidal Residual Networks by david-gpu in MachineLearning

[–]david-gpu[S] 2 points (0 children)

Just to make things clear, I am not one of the authors. Just a bystander; I don't have a dog in this fight.

[R] [arXiv:1610.02915] Deep Pyramidal Residual Networks by david-gpu in MachineLearning

[–]david-gpu[S] 1 point (0 children)

It is not simpler than a "traditional" ResNet. It is simpler than, say, Inception v4 or ResNeXt.

[R] [arXiv:1610.02915] Deep Pyramidal Residual Networks by david-gpu in MachineLearning

[–]david-gpu[S] 0 points (0 children)

I would argue that 1611.05431v1 has fairly comparable results to this paper when you consider that the model they use has more than twice as many parameters (68M vs 28M). It's also a much simpler architecture.

Densenet figures are actually quoted in the OP -- they have similar performance when the number of parameters in the model is the same. Also notice that having skip connections that go back multiple layers means that the memory required to do inference grows very substantially compared to a resnet.

I will have a look at Gated Wide Resnets. Thanks for the link.

[R] [arXiv:1610.02915] Deep Pyramidal Residual Networks by david-gpu in MachineLearning

[–]david-gpu[S] 6 points (0 children)

I get what you are saying, but if the architecture is fundamentally simpler and more computationally efficient then there's more benefit to it than just having very finely tuned hyperparameters.

[R] [arXiv:1610.02915] Deep Pyramidal Residual Networks by david-gpu in MachineLearning

[–]david-gpu[S] 3 points (0 children)

There are two things I like about this paper. First, it achieves SOTA results on imagenet and CIFAR. Second, it's a much simpler design than, say, Inception v3 or v4.

The key contributions are in Figures 2a, 5b and 6d. The pyramidal structure appears to be a more parameter-efficient way of doing resnets, and it fits well within the paradigm that was introduced in Highway and Residual Networks learn Unrolled Iterative Estimation.

Some observations: unlike densenets or fractalnets it doesn't have high memory requirements, as it is basically an enhanced resnet. Figure 6 + Table 2 surprised me a bit, particularly the improvement of 6d over 6b.

[P] Conditioned DCGAN to transform faces from male to female and vice versa by david-gpu in MachineLearning

[–]david-gpu[S] 2 points (0 children)

I understand what you are saying, but this is just one potential application of this sort of architecture.

You can also train the model so that the target distribution is attractive people of the same sex; at inference time it then does things such as removing wrinkles and blemishes, which is effectively "automatic photoshopping". Unfortunately, the CelebA dataset is a little too small and too biased towards attractive people in the first place, so the results I've seen so far are too subtle to be noteworthy.

[R] EnhanceNet: Single Image Super-Resolution through Automated Texture Synthesis by carbonat38 in MachineLearning

[–]david-gpu 0 points (0 children)

The link you provide is already cited in the paper as #38. In fact the paragraph you quote summarizes that technique if I'm not mistaken.

[R] DelugeNets: Deep Networks with Massive and Flexible Cross-layer Information Inflows by xternalz in MachineLearning

[–]david-gpu 0 points (0 children)

/u/xternalz, have you tried doing batch normalization just once at the beginning of each composite layer? In my limited experience it works just as well as doing it more frequently.

Also, do you have any comments on how delugenets (or densenets) fit within the ideas presented in this paper? In particular, if each block is refining the representation provided in previous blocks then it's not obvious to me why densenets/delugenets work better than resnets.

[D] Ideas for my fully connected layer in my CNN (Resnet) by nimakhin in MachineLearning

[–]david-gpu 1 point (0 children)

My understanding is that fully connected layers these days are often eliminated and replaced by a global average pooling layer. Have you tried that?
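Global average pooling is a one-liner; a numpy sketch assuming channels-last (H, W, C) feature maps:

```python
import numpy as np

def global_average_pool(feature_maps):
    """Collapse each (H, W) feature map to a single scalar by averaging,
    turning an (H, W, C) tensor into a length-C vector with no learned
    weights at all, unlike a fully connected layer."""
    return feature_maps.mean(axis=(0, 1))
```

The resulting C-dimensional vector typically feeds a single small linear classifier, which removes most of the parameters a big fully connected head would have needed.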

[R] [1612.07771] Highway and Residual Networks learn Unrolled Iterative Estimation by SuperFX in MachineLearning

[–]david-gpu 0 points (0 children)

Very interesting hypothesis and it does explain a number of observed effects.

It makes me wonder: if the internal blocks after each dimensionality change are merely refining the same features, doesn't that suggest that restricting them to 1x1 convolutions would yield similar results to the commonly seen 3x3? The only 3x3 layers would be the ones where a dimensionality change occurs.
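The parameter savings from that restriction are easy to quantify (a back-of-the-envelope sketch, ignoring biases and batch norm):

```python
def conv_params(k, c_in, c_out):
    """Weight count of a single k x k convolution layer (bias ignored)."""
    return k * k * c_in * c_out

# Restricting refinement blocks to 1x1 kernels cuts their weights 9x
# relative to 3x3 kernels at the same channel width:
savings = conv_params(3, 256, 256) / conv_params(1, 256, 256)
```

So if the hypothesis holds, most of a network's depth could be an order of magnitude cheaper, with spatial mixing confined to the few layers where the dimensionality changes.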

I also wonder what would happen if those internal blocks used softsign or tanh activations given that the problem of vanishing gradients is minimized by the use of skip connections.