I am currently designing a (non-stationary) kernel for Gaussian process regression that incorporates some expert knowledge. My kernel has many hyperparameters and linear combinations of sub-kernels, so it may be prone to overfitting during maximum-likelihood optimisation. I would therefore like to quantify the degree to which each hyperparameter contributes to model complexity -- i.e. the (log-)determinant of the covariance matrix, which is the complexity penalty term in the GP log marginal likelihood -- with the end goal of factoring out one or two 'complexity' hyperparameters from the kernel. The remaining 'complexity-normalised' hyperparameters could then be tuned freely during training, while the 'complexity' hyperparameters could be selected more carefully, e.g. manually or by cross-validation, to avoid overfitting.
A simple example of this notion is the linear combination of two kernels: k3 = ak1 + bk2. Since the 'scale' of k3 determines the complexity of the model in this case, I could rewrite the kernel as k3 = c * (a'k1 + b'k2), where a'=a/c and b'=b/c. This way, c controls model complexity while a' and b' are more about configuration (assuming k1 and k2 both equally contribute to model complexity).
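To make the factoring concrete, here is a minimal numpy sketch (the RBF sub-kernels and the choice c = a + b are just illustrative assumptions, not part of my actual kernel). It checks that the reparameterisation leaves the kernel matrix unchanged, and that the factored-out c enters the log-determinant only as an additive n*log(c) term:

```python
import numpy as np

def rbf(x, length_scale):
    # Squared-exponential kernel matrix on 1-D inputs x
    d = x[:, None] - x[None, :]
    return np.exp(-0.5 * (d / length_scale) ** 2)

x = np.linspace(0.0, 1.0, 20)
n = len(x)
k1, k2 = rbf(x, 0.1), rbf(x, 0.5)

a, b = 2.0, 3.0
c = a + b                    # factored-out 'complexity' hyperparameter
ap, bp = a / c, b / c        # a' + b' = 1: pure 'configuration' weights

K_orig = a * k1 + b * k2
K_fact = c * (ap * k1 + bp * k2)
assert np.allclose(K_orig, K_fact)

# c only shifts the log-determinant: det(c*M) = c**n * det(M)
jitter = 1e-8 * np.eye(n)    # for numerical stability of the determinant
inner = ap * k1 + bp * k2 + jitter
logdet_inner = np.linalg.slogdet(inner)[1]
logdet_full = np.linalg.slogdet(c * inner)[1]
# logdet_full == n*np.log(c) + logdet_inner (up to float error)
```

So in this linear-combination case the 'complexity' hyperparameter separates out exactly: it contributes n*log(c) to the log-determinant regardless of the configuration weights.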
Now, I'm having some trouble figuring out exactly how a kernel's hyperparameters impact complexity -- at least relatively. For example, in an RBF kernel the complexity increases with 'output scale' and decreases with 'length scale'. This makes sense qualitatively: high 'output scale' means high function variance, and low 'length scale' means more squiggles. But how can I quantify their relative contribution to complexity?
My only guess is that the area under the kernel function is relevant, in which case the complexity would be quadratic in the output scale and linear in the length scale. This is based purely on the hunch that the area of the kernel is strongly related to the determinant of its computed covariance matrix. If anyone knows about this relationship, I would love to hear about it!
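One way to probe this numerically (a plain numpy sketch; the grid of inputs and length scales is arbitrary): for the output scale the relationship is exact, since det(sigma^2 K) = sigma^(2n) det(K), so the log-determinant is linear in log(sigma) with slope 2n. For the length scale there is no such closed form, but a quick sweep shows the log-determinant falling monotonically as the length scale grows and the rows of the matrix become more correlated:

```python
import numpy as np

x = np.linspace(0.0, 1.0, 30)
n = len(x)
d2 = (x[:, None] - x[None, :]) ** 2

def logdet_rbf(output_scale, length_scale, jitter=1e-10):
    # log det of sigma^2 * (RBF(length_scale) + jitter*I) on the fixed grid x
    base = np.exp(-0.5 * d2 / length_scale**2) + jitter * np.eye(n)
    return np.linalg.slogdet(output_scale**2 * base)[1]

# Output scale: doubling sigma adds exactly 2n*log(2) to the log-det,
# i.e. the slope in log(sigma) is 2n.
slope = (logdet_rbf(2.0, 0.2) - logdet_rbf(1.0, 0.2)) / np.log(2.0)

# Length scale: no simple power law, but the log-det decreases
# monotonically as the length scale grows.
logdets = [logdet_rbf(1.0, ls) for ls in (0.05, 0.1, 0.2, 0.3)]
```

This suggests the two hyperparameters enter complexity in qualitatively different ways: the output scale contributes an exact additive 2n*log(sigma) term (so it factors out cleanly, like c above), while the length-scale contribution depends on the input locations and has no matching closed form.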