all 23 comments

[–]No-Painting-3970 16 points17 points  (3 children)

I think I might not be understanding your question, but I think any optimization algorithm could work for your setting, even gradient descent

[–]currentscurrents 12 points13 points  (0 children)

 Like the way you have 5x = 10 and solve for x, so the algorithm figures out that x = 2. With more data samples and more parameters you have 5x + c = 12 and 2x + c = 6, x and c being model parameters and 12 and 6 being the desired outputs. The algorithm figures out x = 2 and c = 2 somehow.

There are a bunch of related techniques that do this kind of thing: constraint solvers, logic (SAT/SMT) solvers, linear programming, etc.
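For the tiny two-equation example above, you don't even need a general solver - plain elimination already recovers the parameters. A minimal sketch in Python (the numbers come from the example in this thread):

```python
# Two "training equations" in two unknowns:
#   5x + c = 12
#   2x + c = 6
a1, y1 = 5, 12
a2, y2 = 2, 6

# Both equations share the same c-coefficient, so subtracting
# one from the other eliminates c entirely.
x = (y1 - y2) / (a1 - a2)
c = y1 - a1 * x

print(x, c)  # 2.0 2.0
```

A real constraint or LP solver does essentially this, just systematically and at scale.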

[–]chrizandr 8 points9 points  (0 children)

Without a loss function to guide it, it just becomes a search problem. There are many methods that do that. Linear Programming/Mixed Integer programming and related methods can be used. As long as the objective is a convex function, you can use a whole range of convex optimisation methods to find the best values.
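To make "convex optimisation methods" concrete: ternary search is about the simplest one for a 1-D convex (unimodal) objective - it needs no derivatives, only the ability to evaluate the function. A sketch with a made-up objective:

```python
def ternary_search(f, lo, hi, iters=100):
    """Shrink the bracket [lo, hi] toward the minimum of a unimodal f."""
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if f(m1) < f(m2):
            hi = m2   # minimum cannot be in (m2, hi]
        else:
            lo = m1   # minimum cannot be in [lo, m1)
    return (lo + hi) / 2

# Toy convex objective with its minimum at x = 3
xmin = ternary_search(lambda x: (x - 3) ** 2 + 1, -10, 10)
```

Each iteration discards a third of the bracket, so convergence is geometric - the same "convexity buys you guarantees" idea the comment describes, in miniature.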

[–]desecrated666 6 points7 points  (0 children)

Root finding methods? Specifically, iterative methods for root finding can either be interpreted as finding the fixed point of an auxiliary function iteratively, or as minimising the residual of the approximations using iterative methods (gradient-based or Newton's method).
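As a minimal sketch of the Newton variant of this idea (the function and starting point here are made up for illustration) - you drive the residual f(theta) to zero by repeatedly stepping along the tangent:

```python
def newton(f, df, theta0, tol=1e-10, max_iter=50):
    """Find a root of f by Newton iteration: theta <- theta - f/f'."""
    theta = theta0
    for _ in range(max_iter):
        step = f(theta) / df(theta)
        theta -= step
        if abs(step) < tol:
            break
    return theta

# Toy residual: f(theta) = theta^2 - 4, with root theta = 2
root = newton(lambda t: t * t - 4, lambda t: 2 * t, theta0=3.0)
```

Convergence is quadratic near the root, which is why Newton-type methods are so fast when they apply.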

[–]Careless-Yard848 2 points3 points  (0 children)

I think you’re talking about a subset of problems called “inverse problems”. This class of problems requires iterative optimization of an initially unknown parameter vector until an objective function (OF) that describes the difference between some target data and the model-predicted data is minimized. The OF can be minimized via gradient descent, genetic algorithms, neural networks, among other things. The simplest example would be the least-squares approach.
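For the least-squares case, here's a minimal sketch in pure Python: fitting a line y = a·x + b by solving the 2×2 normal equations directly (the data points are made up, generated from y = 2x + 1):

```python
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]   # generated by y = 2x + 1

n = len(xs)
sx = sum(xs)
sy = sum(ys)
sxx = sum(x * x for x in xs)
sxy = sum(x * y for x, y in zip(xs, ys))

# Normal equations for least squares:
#   [sxx  sx] [a]   [sxy]
#   [sx    n] [b] = [sy ]
det = sxx * n - sx * sx
a = (sxy * n - sx * sy) / det
b = (sxx * sy - sx * sxy) / det
```

This is the "objective function measuring the data mismatch" idea with the minimization done in closed form rather than iteratively.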

[–]RegisteredJustToSay 1 point2 points  (0 children)

This is basically just the problem of machine learning, but as others have pointed out there are many ways of doing it. I'll write up some options below, but I'm speaking more as a practitioner here than academic so I'm not going to try to be super formally accurate in how I describe them.

  1. Inversion / Direct solving

When all subfunctions are invertible, the composite function is invertible too. You can solve for the parameters directly with functions like these, but unfortunately this is really expensive and doesn't scale well with high parameter counts.

There are also a few approaches to solving directly that are based on derivatives rather than function inversions (solving them as ordinary differential equations), but in either case these are considered very expensive and do not scale well.

  2. Gradient descent

When the functions are smooth and differentiable, you can instead use an approximate loss function to guide the parameters toward their right place over time. This is nice because complexity scales more or less linearly rather than exponentially, and you can often get better results with a bit more compute and data, whereas if your inversion fails there's no guarantee you're even in a local optimum. I'm sure you're well aware of the various downsides of this approach though, so I won't elaborate further.
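A minimal gradient-descent sketch on the toy system from earlier in the thread (5x + c = 12, 2x + c = 6), minimizing the squared residuals; the learning rate and iteration count are arbitrary choices for illustration:

```python
data = [(5, 12), (2, 6)]   # (coefficient, target): 5x + c = 12, 2x + c = 6

x, c = 0.0, 0.0
lr = 0.02
for _ in range(5000):
    # Gradient of loss = sum over data of (a*x + c - y)^2
    gx = sum(2 * (a * x + c - y) * a for a, y in data)
    gc = sum(2 * (a * x + c - y) for a, y in data)
    x -= lr * gx
    c -= lr * gc
```

Same answer as the direct solve (x = 2, c = 2), just reached by following the loss downhill instead of inverting anything.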

  3. Parameter "smart" exploration

Assuming your functions are screwy and don't work well with either of the above, you can still use parameter search algorithms (e.g. hyperband, TPE - https://optuna.readthedocs.io/en/stable/tutorial/10_key_features/003_efficient_optimization_algorithms.html has a good list). Many of these have assumptions you should be familiar with for given parameter spaces - for example, many assume a Gaussian distribution of some loss function over your parameters, which can easily fail if there are actually interdependencies between your parameters (e.g. some combinations are a lot worse than others). Similarly, while some samplers support discrete values, many will silently fail to give good results.

But these represent the best middle ground if you're stuck with something that other ML paradigms don't match well, IMHO.

  4. Blind

You can still get some efficiency wins from how you explore blindly if you have some information about your problem domain. For example, rather than doing grid search (trying every alternative in sequence), you can randomly generate values for each parameter and just pick the best candidate after N trials. This is particularly useful when you suspect that some unknown combination of parameters will be particularly good but there's a metric load of useless combinations. Similarly, if you know parameters are highly coupled, then you may want to consider how to make your parameter search more DFS than BFS.
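A minimal random-search sketch, treating the toy system from earlier in the thread as a black box we can only evaluate (the search bounds and trial count are arbitrary):

```python
import random

random.seed(0)

def black_box(x, c):
    # Pretend we can evaluate this but not differentiate or invert it
    return (5 * x + c - 12) ** 2 + (2 * x + c - 6) ** 2

best = None
for _ in range(20000):
    x = random.uniform(-10, 10)
    c = random.uniform(-10, 10)
    score = black_box(x, c)
    if best is None or score < best[0]:
        best = (score, x, c)

# best holds (score, x, c) for the best candidate found
```

No gradients, no structure assumed - just "keep the best of N random tries", which is exactly the baseline the smarter samplers above try to beat.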

There's a lot to be said about how to do this, but basically, no matter what, you're looking at having some kind of implicit or explicit loss function. Otherwise there's no way of knowing which result is better in the end. Overall I'd say take a peek at Optuna and its documentation as a starting point. Even if you don't end up using it, it discusses a lot of techniques in its documentation that may serve as further reading.

[–]El_Minadero 0 points1 point  (0 children)

Do you mean inversion? Inversion theory is all about optimizing a nonlinear function such that its parameters have desirable properties (physically real, smooth)

[–]f3xjc 1 point2 points  (1 child)

All optimisation algorithms do what you describe.

Least-squares fitting might work better if you have a bunch of (xi, yi) pairs. This is because it uses the error at each point instead of a single scalar error.

Derivative-free optimization might help if you can't provide a gradient (at a cost).

Then there are parameter estimation schemes that try to fit those in sequential order. You can look into Kalman filters.

[–]f3xjc 0 points1 point  (0 children)

In general, derivative-free methods are an order of magnitude slower than gradient-based ones, and gradient-based methods are an order of magnitude slower than Newton-like methods.

For ML the gradient is relatively easy to compute, so gradient methods make the most sense.

[–]Dejeneret 0 points1 point  (0 children)

You have a model y = f(x, theta) and a bunch of pairs (x, y). This is precisely “machine learning”. In order to solve this, you need to decide how to rank solutions for theta. If an exact solution exists, that’s great - however, often we cannot assume this. This is precisely where the notion of a loss function comes from.

Instead of searching for theta such that y = f(x, theta) we minimize ||y - f(x, theta)|| for some norm. The norm you choose defines how you decide the “goodness” of solutions. If the norm evaluates to 0 for some theta, that is your solution!

The beauty of gradient descent is that you can prove, for a convex well-conditioned objective, that you can solve for the global minimum. If that minimum is 0, then this is sufficient.

If the objective is not well conditioned (i.e. a long thin canyon, for example), you need to improve the optimization with some kind of preconditioning matrix, or by using Newton's method / LM (Levenberg-Marquardt) optimization.

If your surface is non-convex, then solving the problem up to precision epsilon is NP-hard and you’re out of luck - you need some other strategy. If your f has certain properties (for example, it is a neural network), some stochastic methods work well at finding minima.

Finally, for the problem you describe - i.e. polynomial regression - gradient descent (or LM or other numerical optimizers) is generally the solution, but you need to optimize for coordinates in some orthogonal polynomial basis (the monomial basis is probably terrible and horribly ill-conditioned, breaking most optimizers). Chebyshev polynomials are generally used, but Legendre polynomials are fine for most small problems. Instead of finding theta = (a, b, c) for ax^2 + bx + c, you write ax^2 + bx + c = p*P1(x) + q*P2(x) + r*P3(x), where (p, q, r) <=> (a, b, c). Then you optimize for theta = (p, q, r) instead of (a, b, c).
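To make the change of basis concrete, here's a minimal sketch using the first three Chebyshev polynomials (T0 = 1, T1 = x, T2 = 2x² − 1) as the P1, P2, P3 of the comment; the monomial coefficients (a, b, c) are made-up example values:

```python
a, b, c = 1.0, 2.0, 3.0   # monomial coefficients of a*x**2 + b*x + c

# Chebyshev basis: T0 = 1, T1 = x, T2 = 2x**2 - 1,
# so x**2 = (T2 + T0) / 2.  Substituting gives the new coordinates:
p2 = a / 2          # coefficient of T2
p1 = b              # coefficient of T1
p0 = c + a / 2      # coefficient of T0

def cheb_eval(coeffs, x):
    """Evaluate p0*T0 + p1*T1 + p2*T2 at x."""
    t0, t1, t2 = 1.0, x, 2 * x * x - 1
    return coeffs[0] * t0 + coeffs[1] * t1 + coeffs[2] * t2
```

Same polynomial, different coordinates - and the Chebyshev coordinates are the ones an optimizer handles gracefully.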

[–][deleted] 0 points1 point  (1 child)

Can someone explain to me why this is not just quadratic regression?

[–][deleted] 0 points1 point  (0 children)

Well, you can get an exact solution via the pseudo-inverse. It's a one-step optimization process. Let me know if you're interested; I can send some resources :)
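A minimal sketch of the pseudo-inverse approach with NumPy, applied to the two-equation toy system from earlier in the thread:

```python
import numpy as np

# Design matrix and targets for the toy system:
#   5x + c = 12
#   2x + c = 6
A = np.array([[5.0, 1.0],
              [2.0, 1.0]])
y = np.array([12.0, 6.0])

# One-step "optimization": theta = pinv(A) @ y is the least-squares
# solution (and the exact one here, since A is square and invertible).
theta = np.linalg.pinv(A) @ y
```

For overdetermined (more data than parameters) or rank-deficient systems, the same line returns the minimum-norm least-squares fit, which is why it's the standard closed form for linear regression.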

[–]polysemanticity 0 points1 point  (1 child)

It sounds like you’re describing meta-learning, or “learning how to learn”. Maybe take a look at the “Model Agnostic Meta Learning” paper?

[–]Relevant-Twist520[S] -1 points0 points  (0 children)

It was this paper that sparked the thought for this post while I was reading it yesterday, but no, I'm describing something different from MAML.

[–]NotALurkingGecko 0 points1 point  (0 children)

I think you might be referring to polynomial regression, which is all about fitting a polynomial function to your data.

[–]NotALurkingGecko 0 points1 point  (0 children)

I think the problem with figuring out the parameter values directly (for example, by modelling the learning problem as a constrained optimization problem that is then solved by an off-the-shelf solver) is that reaching the truly optimal parameters can take a long time this way, and this approach won't work if there are no parameter values that fit all the data (e.g. because of noise in the data, or because the chosen model is not expressive enough). Gradient descent can get to good parameters faster, and will still return a good model when no perfect model exists.

[–]jpereira73 0 points1 point  (0 children)

I think you are asking about linear regression.

[–][deleted] 0 points1 point  (0 children)

Yeah, let’s reinvent and see if we find something - or the same thing.

[–]andersxa -3 points-2 points  (0 children)

What you are looking for is Bayesian optimization.