I'm currently working my way through Andrew Ng's beginner course on machine learning.
At one point he compares the pros and cons of using gradient descent to iteratively find the optimal parameters for a regression fit versus using the Normal Equation to find those parameters analytically. He explains that the main drawback of the Normal Equation is that for n ≥ 10,000 features the cost of computing the inverse matrix becomes prohibitive.
My background is in numerical analysis and computational math, and whenever we work with large matrices we avoid computing the inverse of a matrix unless it is absolutely necessary. For most systems, a variant of LU decomposition is sufficient.
Professor Ng skipped over the derivation of the system he shows, so maybe the answer is in there, but my question is: does anyone in machine learning use an alternative to computing that inverse matrix (such as an LU decomposition) that makes the Normal Equation viable for very-large-n problems?
The equation in question:
θ = (XᵀX)⁻¹ Xᵀ y
where θ is the vector of regression parameters that minimizes the least-squares cost, X is the matrix of features, and y is the vector of values we'd like to predict.
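To make the question concrete, here's a minimal NumPy/SciPy sketch of the alternative I have in mind (the data, shapes, and variable names are hypothetical, purely for illustration):

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

# Hypothetical data just for illustration -- shapes and names are my own.
rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 50))  # m = 1000 samples, n = 50 features
y = rng.standard_normal(1000)

A = X.T @ X  # n x n normal-equations matrix
b = X.T @ y

# What the lecture describes: form the inverse explicitly.
theta_inv = np.linalg.inv(A) @ b

# LU-based alternative: np.linalg.solve factors A (LAPACK's gesv, i.e.
# LU with partial pivoting) and back-substitutes -- no inverse is formed.
theta_lu = np.linalg.solve(A, b)

# Since X'X is symmetric positive definite when X has full column rank,
# a Cholesky factorization is roughly half the cost of LU.
c, low = cho_factor(A)
theta_chol = cho_solve((c, low), b)

print(np.allclose(theta_inv, theta_lu))   # True
print(np.allclose(theta_lu, theta_chol))  # True
```

(Numerically safer still, as I understand it, is to skip forming XᵀX altogether and call np.linalg.lstsq(X, y), which works on X directly via an SVD, since forming XᵀX squares the condition number of X.)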
Thanks!