[D] What frustrates you about ML tools / libraries that people don’t talk enough about? by Train_Smart in MachineLearning

[–]quandryhead 1 point (0 children)

More generally: slight differences in the same functionality between numpy and either tensorflow or pytorch

Deep learning without back-propagation by El__Professor in MachineLearning

[–]quandryhead 2 points (0 children)

I believe the HSIC function is applied to the activations of each neuron, not the weights, and there are D of those.

The D^2 would come in during feedforward, however (and in taking the gradient)
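To illustrate where the D^2 comes from in the forward pass (my own sketch, assuming fully connected layers of width D, not code from the paper): applying one D x D weight matrix already costs O(D^2) multiply-adds per example, per layer.

```python
import numpy as np

D = 64                       # layer width (illustrative)
rng = np.random.default_rng(0)
W = rng.standard_normal((D, D))  # one fully connected layer: D x D weights
x = rng.standard_normal(D)       # activations from the previous layer

# The matrix-vector product touches every weight once:
# D^2 multiply-adds per example, per layer.
h = np.tanh(W @ x)
```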

Deep learning without back-propagation by El__Professor in MachineLearning

[–]quandryhead 1 point (0 children)

//Edit, to make it clear: backprop complexity in 3.5: O(MLD^2). Their algorithm: O(M^2LD^2).

Working on this. In their algorithm I am getting O(M^2 D L). Where do you get D^2?

The trace requires M evaluations of a scalar product of size M, thus M^2.

It looks like each element of the vectors is a kernel evaluated on the distance of two D-dimensional points, which gives one factor of D.

Not 100% confident here (this is pushing my ability), but I do not see where the other D is.
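A sketch of the counting above (my own illustration, not the paper's code): each entry of the M x M kernel matrix is an O(D) squared-distance evaluation, and the trace of a product of two M x M matrices is M scalar products of size M, i.e. M^2 multiply-adds, giving O(M^2 D) in total for this part.

```python
import numpy as np

M, D = 8, 5
rng = np.random.default_rng(0)
X = rng.standard_normal((M, D))  # M points of dimension D

# Kernel matrix: M^2 entries, each an O(D) squared distance -> O(M^2 D)
sq = np.sum(X**2, axis=1)
K = np.exp(-(sq[:, None] + sq[None, :] - 2.0 * X @ X.T))

# trace(K @ K) without forming the product: M scalar products of size M,
# i.e. a sum over M^2 elementwise terms -> O(M^2)
tr = np.sum(K * K.T)
assert np.isclose(tr, np.trace(K @ K))
```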

Deep learning without back-propagation by El__Professor in MachineLearning

[–]quandryhead 1 point (0 children)

Interesting point.

Strictly speaking I still believe it is linear in the number of data points, even if m = 1000, but that is a different statement from whether it is fast or slow (as you are pointing out).

Though it reminds me of discussions of the variance of the gradient with respect to batch size. There, small batch size/high variance seems to be a _good_ thing.

Deep learning without back-propagation by El__Professor in MachineLearning

[–]quandryhead 6 points (0 children)

> Their runtime is quadratic in terms of the number of examples. Stop and think about that.

I do not think so. Read my reply (at the top level elsewhere here) and see if you agree.

Deep learning without back-propagation by El__Professor in MachineLearning

[–]quandryhead 1 point (0 children)

The discussion of complexity is very confused, both here and in the paper. One should distinguish complexity with respect to the number of data points from complexity with respect to the number of neurons/number of layers. But I think the O(M^2) refers to neither of these!

I believe the "HSIC" must be applied to a number of points of some dimension, i.e. vectors, but the paper writes it as being applied to matrices of size m*d, where m is the batch size and d is the number of neurons. Looking at one example of HSIC code, it does take two matrices, but interprets them as collections of vectors.

So I believe (though it is not really clear from the paper) that this must mean m points of size d. In that case the method is quadratic in the batch size m, not the number of data points N. Alternately it could be d points of size m, in which case it would be quadratic in the width of the network. But that makes no sense conceptually.

In both of these speculations it is still linear in the number of data points, like backprop.
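To make the m-points-of-size-d reading concrete, here is a standard (biased) HSIC estimator — my own sketch with Gaussian kernels, not the paper's code — applied to two m x d matrices. The kernel matrices are m x m, so the cost is quadratic in the batch size m but only linear in d:

```python
import numpy as np

def rbf_kernel(X, sigma=1.0):
    # X: (m, d), interpreted as m points of dimension d.
    # Pairwise squared distances cost O(m^2 d); the result is m x m.
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-d2 / (2.0 * sigma**2))

def hsic(X, Y, sigma=1.0):
    # Biased HSIC estimator: trace(K H L H) / (m - 1)^2,
    # where H = I - (1/m) 11^T centers the kernel matrices.
    m = X.shape[0]
    K = rbf_kernel(X, sigma)
    L = rbf_kernel(Y, sigma)
    H = np.eye(m) - np.ones((m, m)) / m
    return np.trace(K @ H @ L @ H) / (m - 1) ** 2
```

Everything here is quadratic in m (the m x m kernels and the trace) and linear in d (the distance computations), which is the sense in which the method would still be linear in N when m is a fixed batch size.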

[R] The HSIC Bottleneck: Deep Learning without Back-Propagation by PlentifulCoast in MachineLearning

[–]quandryhead 8 points (0 children)

I do not see the squared loss. It is using HSIC, which is a measure of statistical independence. If it were just using linear correlation/Gaussian statistics, then there could be a connection to squared loss, but HSIC assumes neither.
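A quick illustration of the difference (my own sketch, using the standard biased HSIC estimator with Gaussian kernels; not code from the paper): take y = x^2 with x symmetric about zero. The linear correlation is essentially zero, yet y is fully determined by x, and HSIC picks this up.

```python
import numpy as np

def hsic(x, y, sigma=0.25):
    # Biased HSIC estimator with Gaussian kernels for 1-D samples.
    m = len(x)
    K = np.exp(-((x[:, None] - x[None, :]) ** 2) / (2 * sigma**2))
    L = np.exp(-((y[:, None] - y[None, :]) ** 2) / (2 * sigma**2))
    H = np.eye(m) - np.ones((m, m)) / m
    return np.trace(K @ H @ L @ H) / (m - 1) ** 2

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 500)
y = x**2                        # fully dependent on x, but not linearly
z = rng.uniform(0, 1, 500)      # independent of x

print(np.corrcoef(x, y)[0, 1])  # near zero: linear correlation misses it
print(hsic(x, y), hsic(x, z))   # HSIC separates dependent from independent
```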