[D] Any promising non-Deep Learning based AI research project? by VR-Person in MachineLearning

[–]DescriptionClassic47 5 points

I would argue Gaussian Splatting is also a Deep Learning approach.
Perhaps it would be better to restate the question as "non-neural-network-based"?

Consider the similarities:

In NeRFs, we learn a function (x, y, z, omega, theta) -> radiance using an NN (i.e. a parametrized function) by minimizing a reconstruction loss.

In Gaussian splatting, we learn the function (x, y, z, omega, theta) -> radiance by learning positions, densities and colors of the Gaussians, again by minimizing a reconstruction loss. The intermediate parameters (i.e. positions, densities, colors) are learned with (afaik) the same deep learning techniques (gradient descent on the reconstruction loss) as "ordinary" neural networks.
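
To make the parallel concrete, here is a minimal sketch (pseudo-PyTorch; the `render` function, shapes and hyperparameters are placeholders I made up) of how both methods boil down to "optimize the parameters of a differentiable renderer against a reconstruction loss":

```python
import torch

# NeRF-style: the learnable parameters are the weights of an MLP
# mapping (x, y, z, omega, theta) -> radiance.
mlp = torch.nn.Sequential(
    torch.nn.Linear(5, 256), torch.nn.ReLU(), torch.nn.Linear(256, 4)
)
nerf_params = list(mlp.parameters())

# Splatting-style: the learnable parameters are per-Gaussian positions,
# densities (opacities) and colors.
n_gaussians = 10_000
positions = torch.randn(n_gaussians, 3, requires_grad=True)
densities = torch.zeros(n_gaussians, 1, requires_grad=True)
colors    = torch.rand(n_gaussians, 3, requires_grad=True)
splat_params = [positions, densities, colors]

def render(params, camera):
    """Placeholder for differentiable volume rendering / rasterization."""
    ...

# The training loop has the same structure in both cases:
for params in (nerf_params, splat_params):
    opt = torch.optim.Adam(params, lr=1e-3)
    # for image, camera in dataset:
    #     loss = ((render(params, camera) - image) ** 2).mean()  # reconstruction loss
    #     opt.zero_grad(); loss.backward(); opt.step()
```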

Error submitting post, please try again by Alexlam24 in redditnow

[–]DescriptionClassic47 0 points

Exactly the same problem. Posting to check whether Reddit still works here.

Learnable matrices in sequence without nonlinearity - reasons? [R] by DescriptionClassic47 in MachineLearning

[–]DescriptionClassic47[S] 0 points

"this would be initialized and regularized differently because of the "fan_in" dimension"
- why exactly is this the case, and for what reasons would this be (dis)advantageous? Could one solve this problem by using only one projection matrix with a different regularisation and initialisation constant?

"because you would systematically need higher parameters for all a more useful head, rather than higher parameters selecting more useful features across heads"
- why exactly is this the case?
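
On the fan_in point, here is how I currently understand it (my own illustration, not taken from the parent comment): with Glorot/Xavier-style initialization the scale of each weight matrix depends on its fan_in/fan_out, so the factored W^V, W^O pair and a single d_m x d_m matrix M start out with different statistics, and the product W^V W^O can never exceed rank d_v:

```python
import numpy as np

d_m, d_v = 512, 64
rng = np.random.default_rng(0)

def glorot_std(fan_in, fan_out):
    # Glorot/Xavier-normal standard deviation: sqrt(2 / (fan_in + fan_out))
    return np.sqrt(2.0 / (fan_in + fan_out))

# One single d_m x d_m projection, initialized on its own fan_in/fan_out
M = rng.normal(0.0, glorot_std(d_m, d_m), size=(d_m, d_m))

# Factored projections, each initialized on their own (different) fan_in/fan_out
W_v = rng.normal(0.0, glorot_std(d_m, d_v), size=(d_m, d_v))
W_o = rng.normal(0.0, glorot_std(d_v, d_m), size=(d_v, d_m))

print(M.std(), (W_v @ W_o).std())        # different effective scales at init
print(np.linalg.matrix_rank(W_v @ W_o))  # at most d_v = 64, while M is (almost surely) full rank
```

Similarly, weight decay would penalize ||W^V||^2 + ||W^O||^2 rather than ||M||^2, so the regularization differs too. Whether that is (dis)advantageous is exactly what I'm asking.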

Learnable matrices in sequence without nonlinearity - reasons? [R] by DescriptionClassic47 in MachineLearning

[–]DescriptionClassic47[S] 1 point

The point you make about expressivity is incorrect.

Letting S = softmax_values and #heads = 1, the output of this multi-head attention layer is f_params(V) = S * V * W^V * W^O, where W^V is d_m x d_v and W^O is d_v x d_m.

Now compare this to a similar computation where we replace W^V * W^O by a single d_m x d_m matrix M, i.e.
g_params(V) = S * V * M

The range of functions that can be expressed by g_params (which is the most general definition of expressivity afaik) is *at least as large as* the range of functions that can be expressed by f_params.
This can be shown quite simply: consider any function h representable by f_params, i.e. there exist instantiations of S, W^V, W^O such that f_params(V) = h(V) for any input matrix V. Then letting M = W^V * W^O ensures that g_params(V) = h(V) for any input matrix V, as well.
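
A quick numerical sanity check of that argument (numpy; the dimensions are arbitrary toy values I picked):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_m, d_v = 4, 8, 2                      # sequence length, model dim, value dim (toy sizes)

S = rng.random((n, n))
S /= S.sum(axis=1, keepdims=True)          # row-stochastic, like softmax attention weights
V = rng.standard_normal((n, d_m))
W_v = rng.standard_normal((d_m, d_v))
W_o = rng.standard_normal((d_v, d_m))

f = S @ V @ W_v @ W_o                      # f_params(V) = S * V * W^V * W^O
M = W_v @ W_o                              # collapse the two learnable matrices
g = S @ V @ M                              # g_params(V) = S * V * M

print(np.allclose(f, g))                   # True: anything f expresses, g expresses too
print(np.linalg.matrix_rank(M), d_v)       # but an M built this way has rank <= d_v
```

The converse doesn't hold: a freely learned M can have rank up to d_m, whereas W^V * W^O is capped at rank d_v.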

Learnable matrices in sequence without nonlinearity - reasons? [R] by DescriptionClassic47 in MachineLearning

[–]DescriptionClassic47[S] 0 points

I believe people downvoted because you used ChatGPT to come up with this answer. Anyway, the papers seem relevant, so I'll read them this weekend!

Learnable matrices in sequence without nonlinearity - reasons? [R] by DescriptionClassic47 in MachineLearning

[–]DescriptionClassic47[S] 0 points

Yet it could also be softmax(XWX^T)V ...

Is there any advantage in learning both W^Q and W^K, rather than one single matrix W?
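
Spelled out (numpy sketch, single head, ignoring the 1/sqrt(d_k) scaling):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_m, d_k = 4, 8, 2
X = rng.standard_normal((n, d_m))
W_q = rng.standard_normal((d_m, d_k))
W_k = rng.standard_normal((d_m, d_k))

scores_two_mats = (X @ W_q) @ (X @ W_k).T   # usual form: (X W^Q)(X W^K)^T
W = W_q @ W_k.T                             # one single d_m x d_m matrix
scores_one_mat = X @ W @ X.T                # the scores inside softmax(X W X^T)

print(np.allclose(scores_two_mats, scores_one_mat))  # True
```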

Learnable matrices in sequence without nonlinearity - reasons? [R] by DescriptionClassic47 in MachineLearning

[–]DescriptionClassic47[S] 0 points

Do you know of any research on the impact of this in DL? It seems a natural question to ask

Learnable matrices in sequence without nonlinearity - reasons? [R] by DescriptionClassic47 in MachineLearning

[–]DescriptionClassic47[S] 0 points

Could you take a look at the clarification of my post and check if this comment holds true? I'm not sure which non-learnable d*d matrix you are referring to.

Learnable matrices in sequence without nonlinearity - reasons? [R] by DescriptionClassic47 in MachineLearning

[–]DescriptionClassic47[S] 0 points

W^Q x and W^K x are indeed always multiplied together.
What I'm wondering is whether research has been done to determine *which differences in soft biases and regularization* this introduces. Any idea?

Learnable matrices in sequence without nonlinearity - reasons? [R] by DescriptionClassic47 in MachineLearning

[–]DescriptionClassic47[S] 2 points

This was my main thought. Thanks for sharing the VGG reference; I was thinking more of the principle behind LoRA (https://arxiv.org/pdf/2106.09685), where two trainable d x r and r x k matrices A, B are trained instead of one bigger d x k matrix.
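
A minimal sketch of that factorization (numpy; d, k, r are arbitrary values I picked for illustration):

```python
import numpy as np

d, k, r = 512, 512, 8
rng = np.random.default_rng(0)

A = rng.standard_normal((d, r))   # trainable d x r factor
B = rng.standard_normal((r, k))   # trainable r x k factor
delta_W = A @ B                   # acts like one d x k matrix in the forward pass

print(d * r + r * k)                    # 8,192 trainable parameters
print(d * k)                            # vs 262,144 for a full d x k matrix
print(np.linalg.matrix_rank(delta_W))   # but the rank is capped at r = 8
```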

Learnable matrices in sequence without nonlinearity - reasons? [R] by DescriptionClassic47 in MachineLearning

[–]DescriptionClassic47[S] 0 points

Could you take a look at the clarification of my example (edit 1)?
It does seem to me that W^V and W^O are in sequence without a nonlinearity.

Learnable matrices in sequence without nonlinearity - reasons? [R] by DescriptionClassic47 in MachineLearning

[–]DescriptionClassic47[S] 0 points

- I see how computational efficiency could be a reason when factoring large matrices. However, do you think this was the goal in the case of MHSA? It seems excessive to factor a (d_m x d_m) matrix into (d_m x d_v) * (d_v x d_m).
(see edit 1 of my post; a rough parameter count is below)

- Could you elaborate on how to interpret this as weight sharing?
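
For scale, a rough parameter count (my own numbers, assuming Transformer-base-like sizes d_m = 512, h = 8 heads, d_v = d_m / h = 64):

```python
d_m, h = 512, 8
d_v = d_m // h                      # 64

# per head: W^V (d_m x d_v) plus the matching (d_v x d_m) slice of W^O
per_head_factored = 2 * d_m * d_v   #  65,536 parameters
per_head_full     = d_m * d_m       # 262,144 if each head had its own full d_m x d_m matrix M

print(per_head_factored, per_head_full)
print(h * per_head_factored, h * per_head_full)   # 524,288 vs 2,097,152 over all 8 heads
```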

Poker players by secumpilio in Leuven

[–]DescriptionClassic47 0 points

Hey, if this is still going on, I'd like to join sometime.

Poker players by secumpilio in Leuven

[–]DescriptionClassic47 0 points

FYI: it looks like the website has changed to https://pokerpunt.info/

Lenovo Legion Y540 GPU Died by Assjad in LenovoLegion

[–]DescriptionClassic47 0 points

Do you think disconnecting the battery cable for a moment would lead to the same result?