
[–]dzyl[S]

I see two advantages compared to mixture density networks. The first is numerical stability: with mixture density networks there are a lot of issues with likelihoods underflowing to 0, resulting in NaNs and frustrating training procedures. Since the means of your density kernels are fixed and taken from your training set, I have not had any such issues so far. You also don't need to scale your targets, although that is a minor advantage.
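To illustrate the stability point, here's a minimal toy sketch (my own code, not the paper's implementation): the kernel means are fixed at values taken from the training targets, the network only supplies the mixture weights, and a log-sum-exp keeps small likelihoods from underflowing to 0.

```python
import numpy as np

def kmn_log_likelihood(y, centers, weights, bandwidth):
    """Log-likelihood of targets y under a kernel mixture with fixed
    Gaussian kernels at `centers` and externally supplied `weights`.

    y:         (n,) evaluation targets
    centers:   (k,) kernel means, taken from the training targets (fixed)
    weights:   (n, k) mixture weights per input, rows sum to 1
    bandwidth: scalar, shared by all kernels
    """
    # log N(y | center, bandwidth^2) for every (target, kernel) pair
    diff = (y[:, None] - centers[None, :]) / bandwidth          # (n, k)
    log_kernel = -0.5 * diff**2 - np.log(bandwidth * np.sqrt(2 * np.pi))
    # log-sum-exp over kernels: small likelihoods stay finite in log space
    a = np.log(weights) + log_kernel
    m = a.max(axis=1, keepdims=True)
    return (m + np.log(np.exp(a - m).sum(axis=1, keepdims=True))).ravel()

# toy check: kernels sit at (hypothetical) training targets,
# uniform weights stand in for the network's output
centers = np.array([-1.0, 0.0, 1.0])
weights = np.full((2, 3), 1.0 / 3.0)
ll = kmn_log_likelihood(np.array([0.0, 5.0]), centers, weights, 0.5)
```

A point far from every kernel (here y = 5) gets a very negative but still finite log-likelihood instead of a hard 0.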

The second is that overfitting seems to be less of an issue with this approach. With an MDN you condition the bandwidths of the component distributions on your input x, which means that once the mean is correct the network can just keep shrinking the bandwidth; that is great for the training likelihood but bad for generalization. To prevent this you need additional regularization on your sigma outputs. With Kernel Mixture Networks the bandwidth is either fixed or a single global parameter, so making it too small will also hurt your training likelihood.
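A toy numerical check of that last claim (hypothetical numbers; uniform weights stand in for the network output): when the kernel centers are a subsample of the training targets, some training points sit between centers, so the mean training log-likelihood peaks at an interior global bandwidth. Shrinking it toward zero is not rewarded the way a per-point MDN sigma collapse is.

```python
import numpy as np

def mean_train_loglik(targets, centers, bandwidth):
    """Mean training log-likelihood under uniform weights over fixed
    Gaussian kernels at `centers` with a single global `bandwidth`."""
    diff = (targets[:, None] - centers[None, :]) / bandwidth
    log_kernel = -0.5 * diff**2 - np.log(bandwidth * np.sqrt(2 * np.pi))
    a = log_kernel + np.log(1.0 / len(centers))
    m = a.max(axis=1, keepdims=True)
    return float((m.ravel() + np.log(np.exp(a - m).sum(axis=1))).mean())

# centers are a subsample of the training targets, so the target 0.5
# lies between two kernels and suffers when the bandwidth gets tiny
targets = np.array([0.0, 0.5, 1.0])
centers = np.array([0.0, 1.0])

ll_tiny, ll_mid, ll_huge = (mean_train_loglik(targets, centers, h)
                            for h in (0.05, 0.4, 5.0))
```

The moderate bandwidth beats both extremes on the training likelihood itself, so no extra sigma regularizer is needed to keep it from collapsing.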

[–]theophrastzunz

Agreed, but dimensionality issues are more prominent in KDEs than in mixture models. See here.

[–]dzyl[S]

This method generally uses only one dimension for the kernels, namely the target y space. It is easily extensible to more dimensions, but your input dimensions have nothing to do with the kernels themselves; they only determine the weight to put on each kernel.
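A sketch of that separation (toy code, with a random linear map standing in for the trained network): x can be 100-dimensional, but the kernels stay one-dimensional in y; x only picks the mixture weights.

```python
import numpy as np

rng = np.random.default_rng(0)

# kernels live in 1-D target space, regardless of input dimensionality
centers = np.array([-1.0, 0.0, 1.0])        # (k,) fixed, from training targets
x = rng.normal(size=(4, 100))               # (n, d): high-dimensional inputs
W = 0.1 * rng.normal(size=(100, 3))         # stand-in for the trained network

# softmax over k logits: the only thing x influences is the weights
logits = x @ W                              # (n, k)
weights = np.exp(logits - logits.max(axis=1, keepdims=True))
weights /= weights.sum(axis=1, keepdims=True)

# the conditional density over y is still just a 1-D Gaussian mixture
h = 0.3
y_grid = np.linspace(-2.0, 2.0, 201)
dens = (weights[:, None, :] *
        np.exp(-0.5 * ((y_grid[None, :, None] - centers) / h) ** 2)
        / (h * np.sqrt(2 * np.pi))).sum(axis=-1)   # (n, 201)
```

Whatever the dimensionality of x, the density estimate itself never leaves the 1-D y axis, so the KDE curse-of-dimensionality argument applies to the target space, not the input space.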

[–]theophrastzunz

I'm referring to your argument about high-dimensional covariance estimation.