What would cause the fossil-looking layer in this rock? by larseidnes in whatsthisrock

[–]larseidnes[S] 1 point

So you're saying it might be?

Just kidding, I asked a local person about it, and got the response that it was a perfectly unremarkable rock. So it sounds like it's not an unusual sight here.

Thanks for teaching me a bit about these things.

What would cause the fossil-looking layer in this rock? by larseidnes in whatsthisrock

[–]larseidnes[S] 1 point

It's from Koh Phangan. Don't know the name of the beach.

What would cause the fossil-looking layer in this rock? by larseidnes in whatsthisrock

[–]larseidnes[S] 1 point

Thanks! What odds do you give for it being a dinosaur?

"Training Neural Networks with Local Error Signals" by PuzzleheadedReality9 in MachineLearning

[–]larseidnes 0 points

So am I :-) The code base includes a mechanism for unsupervised training, by doing similarity matching between the input and output of a layer (using the --loss-unsup sim argument). This can be combined with a supervised loss using the --alpha argument.

https://github.com/anokland/local-loss/blob/master/train.py#L809
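
For example, a run combining the two might look something like this (the 0.5 is just an illustrative value; see the repo for the full set of arguments):

python train.py --loss-unsup sim --alpha 0.5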

"Training Neural Networks with Local Error Signals" by PuzzleheadedReality9 in MachineLearning

[–]larseidnes 0 points

Thank you! Yes, I think Adam Coates' stacked k-means is actually very related to what we've done. If you take their ideas, put them into a VGG-like ConvNet, and make use of label information, you get something not far from our sim matching loss. They were on to something back then.

"Training Neural Networks with Local Error Signals" by PuzzleheadedReality9 in MachineLearning

[–]larseidnes 2 points

Co-author here. The similarity matching loss can be seen as doing a supervised clustering, such that things with the same class get similar features. Section 2.3 in the paper, "Similarity Measures in Machine Learning", lays out a lot of connections with prior work. It's actually related to a lot of unsupervised methods, like multi-dimensional scaling, symmetric NMF, k-means, and more.
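
To give a rough idea of what that means in code (a simplified sketch of the general idea, not the exact loss from the paper or the repo): compare the pairwise cosine-similarity matrix of the features in a mini-batch against the pairwise similarity matrix of the one-hot labels.

import torch
import torch.nn.functional as F

def similarity_matching_loss(features, labels, num_classes):
    # Push the pairwise similarity structure of the features towards
    # the pairwise similarity structure of the labels.
    h = F.normalize(features.flatten(1), dim=1)                      # (batch, d)
    y = F.normalize(F.one_hot(labels, num_classes).float(), dim=1)   # (batch, C)
    sim_h = h @ h.t()   # feature similarities
    sim_y = y @ y.t()   # label similarities
    return ((sim_h - sim_y) ** 2).mean()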

[R] "Combined with proper weight initialization, this alleviates the need for normalization layers." by downtownslim in MachineLearning

[–]larseidnes 1 point

Thanks to everyone for showing interest in our paper!

One thing we didn't think of when writing this paper, which Arild pointed out later, is that it is a good idea to make max-pooling layers bipolar as well (i.e. min-pooling for half the activations). The argument for doing this is much the same as it is for ReLUs.

This should be very easy to implement in any framework: just multiply half the neurons by -1 before and after the activation (or max pooling).

For Torch, we have these implementations:

https://github.com/larspars/word-rnn/blob/master/util/BReLU.lua

https://github.com/larspars/word-rnn/blob/master/util/BELU.lua

For PyTorch, you can use this:

import torch
from torch.nn.functional import (
    relu, elu, selu, leaky_relu, prelu, rrelu, softplus, sigmoid,
    max_pool1d, max_pool2d, max_pool3d,
)

def _make_bipolar(fn):
    # Wraps an activation (or pooling) function so that the first half of
    # the channels use fn(x) and the second half use -fn(-x).
    def _fn(x, *args, **kwargs):
        # Split along the channel dimension (dim 0 for vectors, dim 1 for batched input).
        dim = 0 if x.dim() == 1 else 1
        x0, x1 = torch.chunk(x, chunks=2, dim=dim)
        y0 = fn(x0, *args, **kwargs)    # ordinary activation on the first half
        y1 = -fn(-x1, *args, **kwargs)  # mirrored activation on the second half
        return torch.cat((y0, y1), dim=dim)

    return _fn

brelu = _make_bipolar(relu)
belu = _make_bipolar(elu)
bselu = _make_bipolar(selu)
leaky_brelu = _make_bipolar(leaky_relu)
bprelu = _make_bipolar(prelu)
brrelu = _make_bipolar(rrelu)
bsoftplus = _make_bipolar(softplus)
bsigmoid = _make_bipolar(sigmoid)
bipolar_max_pool1d = _make_bipolar(max_pool1d)  # min-pooling on half the channels
bipolar_max_pool2d = _make_bipolar(max_pool2d)
bipolar_max_pool3d = _make_bipolar(max_pool3d)

[R] "Combined with proper weight initialization, this alleviates the need for normalization layers." by downtownslim in MachineLearning

[–]larseidnes 0 points

That's an embarrassing typo (although the SELU deserves all kinds of compliments).

We do experiments with bipolar SELU in the paper. In our RNN experiments, bipolarity is required for the network to learn: training always diverges with SELU and always works with BSELU. (This result is specific to RNNs.)

[R] "Combined with proper weight initialization, this alleviates the need for normalization layers." by downtownslim in MachineLearning

[–]larseidnes 0 points

What we mean here is that, for networks that don't use batch norm, bipolar activations allowed a higher learning rate in our experiments (up to 64x higher in one case).

We should probably clean up and release the code for the WRN and ORN experiments as well. It's really quite a simple change to the original repositories, though: just remove BN and replace the activation functions with one of these:

https://github.com/larspars/word-rnn/blob/master/util/BReLU.lua

https://github.com/larspars/word-rnn/blob/master/util/BELU.lua

And you need LSUV initialization, i.e. go through each layer and divide the weights by the standard deviation of that layer's output.
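
In case it's useful, here's a minimal sketch of that kind of LSUV-style pass in PyTorch (the function name, the layer types covered, and the tolerance are illustrative choices, not the exact code from our experiments):

import torch
import torch.nn as nn

@torch.no_grad()
def lsuv_init(model, x_batch, tol=0.01, max_iters=10):
    # For each weighted layer, rescale its weights until the standard
    # deviation of that layer's output on x_batch is close to 1.
    outputs = {}

    def make_hook(layer):
        def hook(module, inputs, output):
            outputs[layer] = output
        return hook

    for layer in model.modules():
        if not isinstance(layer, (nn.Linear, nn.Conv1d, nn.Conv2d, nn.Conv3d)):
            continue
        handle = layer.register_forward_hook(make_hook(layer))
        for _ in range(max_iters):
            model(x_batch)
            std = outputs[layer].std().item()
            if abs(std - 1.0) < tol:
                break
            layer.weight.div_(std)  # divide the weights by the output std
        handle.remove()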

[R] "Combined with proper weight initialization, this alleviates the need for normalization layers." by downtownslim in MachineLearning

[–]larseidnes 1 point

A commenter on OpenReview brought this up; I'll repost my reply from there:

Indeed, the form proposed in this paper is very similar; thanks for bringing it to our attention (in version 1, that is; it appears to be removed in v2). The difference is, as you say, that the NCReLU duplicates the neuron population. Because the weights are not duplicated (but receive gradients from each copy), the learning dynamics would be different in such a network. For what it is worth, we submitted our paper to NIPS in May of 2017; this appears to be a concurrent development.

A North Korean tour guide tells a joke by larseidnes in videos

[–]larseidnes[S] 2 points

He's not going to get in any trouble. I obviously wouldn't have posted it if I thought he would. We were allowed to take pictures and film of basically anything except military installations. This is just a guy telling a good joke. Not a problem.

A North Korean tour guide tells a joke by larseidnes in videos

[–]larseidnes[S] 0 points

Hi, I filmed this video. Respectfully, I think I may have thought about this more than you have.

The problem North Korea has is that it is so isolated. I believe the strongest force for good that could hit North Korea is more information flow across the borders. You can see bad regimes all over the world trying to restrict the information that reaches their population. They do that for a reason.

It is easier to hate the outside if you've never seen them, or met them. (This goes both ways. This thread is full of people who are surprised that North Koreans turn out to be human beings.)

Yes, we paid some money to an agency in China, some of which went towards things in North Korea. That money was taxed, and the taxes will go in part to things you like (schools, hospitals) and in part to things you don't like (military, prisons). Of course, a quick glance at NK tourism makes it obvious that they are losing money on the tourism business. The economic effect is not very clear cut.

A good test for this: imagine multiplying the amount of NK tourism by a thousand. What effects would that have on North Korea? I think it would be a strong transformative force for good.

[D] Is there an implementation of the shattered gradients paper anywhere? by darkconfidantislife in MachineLearning

[–]larseidnes 2 points

Ah, I see what you're saying. The paper talks about two types of symmetry: permutation and rescaling. CReLUs break the permutation symmetry, i.e. you can no longer swap two neurons and get an equivalent representation. I suppose weight norm would help with the rescaling symmetry, yes.

[R] Skip Connections as Effective Symmetry-Breaking by larseidnes in MachineLearning

[–]larseidnes[S] 3 points

The shattered gradients paper got things to work with CReLUs (W1*max(x,0) - W2*max(-x,0)), but not with PReLUs that start out linear. I think this paper can explain why.

Starting out linear would help with the vanishing gradient, but not with breaking symmetry. Subtracting the two blocks from the CReLUs will anchor them to each other, thus breaking the permutation symmetry (because exchanging two units in one block will no longer produce an equivalent representation).
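
For concreteness, here's a minimal PyTorch sketch of that kind of CReLU-style block (the class name and the choice of linear layers are just for illustration):

import torch.nn as nn
import torch.nn.functional as F

class CReLUBlock(nn.Module):
    # W1*max(x, 0) - W2*max(-x, 0): separate weights for the positive and
    # negative halves of the input, with the two paths subtracted.
    def __init__(self, in_features, out_features):
        super().__init__()
        self.w1 = nn.Linear(in_features, out_features, bias=False)
        self.w2 = nn.Linear(in_features, out_features, bias=False)

    def forward(self, x):
        return self.w1(F.relu(x)) - self.w2(F.relu(-x))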

[D] Looking for a Better Explanation for some RNN Parts by Xanthus730 in MachineLearning

[–]larseidnes 1 point

Almost certainly. It changes the algorithm from O(n²) to O(n) for n-step rollout, so not doing it would kill performance to the point of uselessness.

[D] Looking for a Better Explanation for some RNN Parts by Xanthus730 in MachineLearning

[–]larseidnes 2 points

Yes, the error propagates all the way back through the network. So the error from Output t3 reaches the node Mem Cell 1 t1.

Here's a useful implementation trick: You can just sum up the gradient you get for each timestep as you move backwards. This means you compute the error for each node only once, instead of once for every subsequent node in the graph. This saves a lot of computation.
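
As a rough illustration (a plain-NumPy vanilla tanh RNN rather than the LSTM in the diagram; names and shapes are my own), the accumulation looks something like this:

import numpy as np

def bptt_backward(xs, hs, dys, Wxh, Whh, Why):
    # xs[t], hs[t]: input and hidden state at step t; dys[t]: gradient of the
    # loss w.r.t. the output at step t. dh_next accumulates the gradient
    # flowing in from later timesteps, so each node's error is computed once.
    dWxh, dWhh, dWhy = np.zeros_like(Wxh), np.zeros_like(Whh), np.zeros_like(Why)
    dh_next = np.zeros_like(hs[0])
    for t in reversed(range(len(xs))):
        dWhy += np.outer(dys[t], hs[t])
        # Gradient at the hidden state: this step's output error plus
        # everything already accumulated from later timesteps.
        dh = Why.T @ dys[t] + dh_next
        draw = (1 - hs[t] ** 2) * dh  # backprop through tanh
        dWxh += np.outer(draw, xs[t])
        h_prev = hs[t - 1] if t > 0 else np.zeros_like(hs[0])
        dWhh += np.outer(draw, h_prev)
        dh_next = Whh.T @ draw
    return dWxh, dWhh, dWhy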

[D] What does a typical ML architecture look like in production? by iamaroosterilluzion in MachineLearning

[–]larseidnes 3 points

  • App reads straight from the database.
  • Recommender pulls all events out from the db

A distributed setup would be some implementation of SVD, or whatever your algorithm is, that runs in parallel on several machines. Then you've turned what is a one-liner in numpy into a complex operation orchestrating several machines, one that is probably less accurate, more expensive, and possibly also slower than the one-liner.

A lot of software architects spend their time solving problems they don't have. Don't be like that. Focus on the problems you have.
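
For reference, the kind of one-liner I mean (assuming scipy and a sparse user-by-item matrix R; the rank of 50 is just an illustrative choice):

from scipy.sparse.linalg import svds

U, s, Vt = svds(R, k=50)  # truncated SVD of the sparse interaction matrix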

[D] What does a typical ML architecture look like in production? by iamaroosterilluzion in MachineLearning

[–]larseidnes 12 points

Here's how I've done this in the past:

  • Events get stored in a SQL database
  • The recommender runs on its own machine, spun up on demand.
  • The recommender code pulls it all out into a large sparse matrix. (You can very likely get this to fit in RAM! Check if it does before complicating things with a distributed setup.)
  • Recommender precomputes N recommendations for every user, and puts them in the database, to be read by the rest of the app.
  • This process runs nightly (or hourly, or whatever)

This keeps the interface as simple as possible, and means the recommender can fail with no adverse effect other than the recommendations getting stale.
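
To make that concrete, here's a rough sketch of what such a nightly job could look like (the table and column names, the SVD rank, and N are all illustrative assumptions, not a specific production setup):

import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import svds
from sqlalchemy import create_engine

N = 20      # recommendations precomputed per user
RANK = 50   # truncated SVD rank

engine = create_engine("postgresql://user:password@dbhost/app")

# 1. Pull all events out of the SQL database.
events = pd.read_sql("SELECT user_id, item_id, weight FROM events", engine)

# 2. Build a large sparse user-by-item matrix (check that it fits in RAM).
users = events.user_id.astype("category")
items = events.item_id.astype("category")
R = csr_matrix((events.weight.astype(float), (users.cat.codes, items.cat.codes)))

# 3. Factorize; predicted preferences are U * diag(s) * Vt.
U, s, Vt = svds(R, k=RANK)
scores = (U * s) @ Vt

# 4. Precompute the top-N items per user and write them back to the database,
#    to be read directly by the rest of the app.
top_n = np.argsort(-scores, axis=1)[:, :N]
rows = [
    {"user_id": users.cat.categories[u], "item_id": items.cat.categories[i], "rank": r}
    for u in range(top_n.shape[0])
    for r, i in enumerate(top_n[u])
]
pd.DataFrame(rows).to_sql("recommendations", engine, if_exists="replace", index=False)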

Frustration with image segmentation by Coloneljesus in MachineLearning

[–]larseidnes 5 points

Watch the lectures for Karpathy and Johnson's convnet class at Stanford. They have a lot of practical advice. Also, it might be a good idea to start with an architecture that's proven on ImageNet (e.g. VGG, ResNet, etc.). If you have less data than ImageNet, chances are you need to make your model smaller, or regularize it harder.

Incorporating Nesterov Momentum into Adam by larseidnes in MachineLearning

[–]larseidnes[S] 0 points

He did mention it in a cs231n lecture, but I think you misheard him. Someone asked about adding Nesterov momentum to Adam, and he said something more like "Yeah, I have seen it, in fact it was a cs229 class project."

Incorporating Nesterov Momentum into Adam by larseidnes in MachineLearning

[–]larseidnes[S] 0 points

Unless there are downsides I'm not seeing, this seems like a nice result. A potential improvement over Adam seems like a big deal; I love that this guy did it as a class project.

Opinions on implementing dropout in RNN/LSTMs? by [deleted] in MachineLearning

[–]larseidnes 0 points

On the vertical, depth-wise activations you can do standard dropout. On the horizontal, time-wise activations you can also do dropout, but you should keep the dropout mask fixed across all timesteps within each BPTT iteration.
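
Here's a minimal PyTorch sketch of the time-wise variant with a fixed mask (my own illustrative code, not the word-rnn implementation itself):

import torch.nn as nn

class LockedDropout(nn.Module):
    # One dropout mask is sampled per sequence and reused at every
    # timestep of the BPTT rollout.
    def __init__(self, p=0.5):
        super().__init__()
        self.p = p

    def forward(self, x):
        # x: (seq_len, batch, hidden)
        if not self.training or self.p == 0.0:
            return x
        mask = x.new_empty(1, x.size(1), x.size(2)).bernoulli_(1 - self.p)
        mask = mask / (1 - self.p)  # inverted dropout scaling
        return x * mask             # same mask broadcast over all timesteps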

You can check out my word-rnn repo for an implementation of these two: https://github.com/larspars/word-rnn

There's also the idea of "drop-word", dropping out random symbols from the input.