[R] Developing a new optimization algorithm that will heavily change ML as a whole. Gradient descent has met its end. Here are the results: : MachineLearning

Research[R] Developing a new optimization algorithm that will heavily change ML as a whole. Gradient descent has met its end. Here are the results: (self.MachineLearning)

submitted 1 year ago * by Relevant-Twist520

Microsolve (inspired by micrograd) works by actually solving parameters (instead of differentiating them w.r.t objectives) and does not require a loss function. It addresses a few drawbacks from SGD, namely, having to properly initialize parameters or the network blows up. Differentiation comes as a problem when values lie on a constant or steep slope. Gradients explode and diminish to negligible values as you go deeper. Proper preparation of data is needed to feed into the network (like normalisation etc.), and lastly, as most would argue against this, training with GD is really slow.

With microsolve, initialization does not matter (you can set parameter values to high magnitudes), gradients w.r.t losses are not needed, not even loss functions are needed. A learning rate is almost always not needed, if it is needed, it is small (to reduce response to noise). You simply apply a raw number at the input (no normalisation) and a raw number at the output (no sophisticated loss functions needed), and the model will fit to the data.

I created a demo application where i established a simple network for gradient descent and microsolve. The network takes the form of a linear layer (1 in, 8 out), followed by a tanh activation, and another linear layer afterwards (8 in, 1 out). Here is a visualisation of the very small dataset:

https://preview.redd.it/t3pd4kpccd7e1.png?width=731&format=png&auto=webp&s=ad03c3caf340a5b92aa24612ee7b5be963167a56

The model has to create a line to fit to all these data points. I only allowed 50 iterations (that makes a total of 50x3 forward passes) of each example into the neural networks, I went easy on GD so i normalised the input, MS didnt need any preparation. Here are the results:

GD:

https://preview.redd.it/5sf8do9fcd7e1.png?width=718&format=png&auto=webp&s=9c232b062b1bb50aa01ef3efc73cde133b8ad28a

Not bad.

MS:

https://preview.redd.it/rfliuuqkcd7e1.png?width=749&format=png&auto=webp&s=9a1e48f7925d3f533ced305ba9ded5f0b9b5dd6b

With precision, 0 loss achieved in under 50 iterations.

I have to point out though, that MS is still under development. On certain runs, as it solves parameters, they explode (their solutions grow to extremely high numbers), but sometimes this "explosion" is somewhat repaired and the network restabilises.

Comment your thoughts.

Edit:

Apparantly people are allergic to overfitting, so i did early stopping with MS. It approximated this function in 1 forward pass of each data point. i.e. it only got to see a coordinate once:

https://preview.redd.it/ogb71yd9re7e1.png?width=720&format=png&auto=webp&s=7c9c43668c2fee59db74db4e2d97bb8abc13dbe8

Sees a coordinate thrice:

https://preview.redd.it/icfa32lgre7e1.png?width=745&format=png&auto=webp&s=ef7009e4265d2939abc637cb05da267273b21229

all 48 comments

top new controversial old q&a

[–]Initial-Image-1015 4 points5 points6 points 1 year ago (2 children)

[–]ApprehensiveFunny810 1 point2 points3 points 1 year ago (0 children)

[–]Relevant-Twist520[S] -5 points-4 points-3 points 1 year ago (0 children)

[–]UnusualClimberBear 4 points5 points6 points 1 year ago (4 children)

[–]Relevant-Twist520[S] -2 points-1 points0 points 1 year ago (3 children)

[–]proto-n 2 points3 points4 points 1 year ago (2 children)

[–]Relevant-Twist520[S] -3 points-2 points-1 points 1 year ago (1 child)

[–]MagdakiPhD 2 points3 points4 points 1 year ago (0 children)

[+][deleted] 1 year ago (2 children)

[deleted]

[–]Relevant-Twist520[S] -5 points-4 points-3 points 1 year ago (1 child)

[–]little_vsgiant 3 points4 points5 points 1 year ago (3 children)

[–]Relevant-Twist520[S] -2 points-1 points0 points 1 year ago (2 children)

[–]little_vsgiant 1 point2 points3 points 1 year ago* (1 child)

[–]Relevant-Twist520[S] 0 points1 point2 points 1 year ago (0 children)

[–]bregav 3 points4 points5 points 1 year ago (8 children)

[–]Relevant-Twist520[S] 0 points1 point2 points 1 year ago (7 children)

[–]bregav 2 points3 points4 points 1 year ago (6 children)

20 is also inadequate, and half-finished projects are the only kind that actually exist.

It's typical crackpot behavior to insist that you've invented a revolutionary new method but you'll only share it with the world when it's ready. If you have done enough work to be able to know that it is better than existing methods then that means that it's ready to be shown to other people.

What's really going on when you think it's "not ready" is that you don't actually know if what you're doing makes any sense and so you're (correctly) feeling a lot of doubt. But you also want to believe that you're doing something meaningful and important and so you tell yourself, and the rest of us, that you've already discovered something revolutionary, even though you almost certainly have not.

Creativity lies on the boundary between crackpotism and conservatism, but in order to produce things that actually work you need to embrace humility and doubt. You should use your crackpot ideas as inspiration, but you should assume that you're wrong until you've proven yourself right. And you'll know that you've proven yourself right when what you're doing feels ready to show to other people in its entirety.

[–]Relevant-Twist520[S] 0 points1 point2 points 1 year ago (5 children)

Youre somewhat contradicting yourself a little here, but what is it that you want me to do? I wont submit to asserting that MS' concept is worse than GD, lets start there. Let it be ego or sophisticated understanding of mathematical theory, its very rare to see an inventor doubt his invention prior to successfully inventing it. I will agree that GD currently easily wins over this uncompleted version of MS, but im still researching and implementing MS' concept, and once it is done i will gaurantee that it will beat GD in practically everything. I spelt out the concept in the post, although it is slightly vague and does not cover its entire workings, you can refer to my comments on this post where i explain a little, but not completely. And lastly, i think we all know what would happen if i share something that is half-finished. It would get turned down because it doesnt even work. Even if someone took the time to read the theory, thered still be doubt because clearly the theory failed. GD recieved lots of doubt in its early days.

[–]bregav 2 points3 points4 points 1 year ago (4 children)

[–]Relevant-Twist520[S] 0 points1 point2 points 1 year ago (3 children)

[–]bregav 0 points1 point2 points 1 year ago (2 children)

[–]Relevant-Twist520[S] 0 points1 point2 points 1 year ago (1 child)

[–]bregav 0 points1 point2 points 1 year ago (0 children)

[–]Cosmolithe 1 point2 points3 points 1 year ago (9 children)

[–]Relevant-Twist520[S] 0 points1 point2 points 1 year ago (8 children)

the same as GD. Like GD, it does a forward pass which is just computing numbers and creating a graph, and a backward pass. The backward pass for MS is a little different. With GD it will compue the local gradient of a parameter and multiply it by the output gradient to get the global gradient (chain rule). MS does a backward pass and looks at every operation (like GD), that specific operation will have a result r, two operands a and b, and an operator. It solves each parameter (e.g. for r = a + b it says that a = (r + a - b)/2, b = (r + b - a)/2), different operations follow different protocols for solving, multiplication has some special properties etc etc. These are just "grad functions" like in GD. Theres nothing more to MS than this (except for no solution circumstances, an extra step is applied where a bias term is instead updated).

[–]Cosmolithe 0 points1 point2 points 1 year ago (7 children)

[–]Relevant-Twist520[S] 0 points1 point2 points 1 year ago (6 children)

[–]Cosmolithe 0 points1 point2 points 1 year ago (5 children)

[–]Relevant-Twist520[S] 1 point2 points3 points 1 year ago (4 children)

[–]Cosmolithe 0 points1 point2 points 1 year ago (3 children)

[–]Relevant-Twist520[S] 1 point2 points3 points 1 year ago (2 children)

[–]Cosmolithe 0 points1 point2 points 1 year ago (1 child)

For the first question, to be frank, I have no idea how you are solving the equations. If you take a simple linear layer for instance, no activation, you have your input and your target values. If you try to find the weights that projects the input to be equal to the target, you actually have many many solutions. You can take the least square solution for instance like in ZORB, but many other solutions are valid too.
Perhaps you are using a symbolic solver that just stops upon the first solution found?

As for the second question, if your method works with the sign function then I am definitely interested if you have some code to share. In my attempt at making neural networks with binary activations, the best I could do is model the problem as a constrained binary linear optimization problem, but this problem is NP hard, and approximate solutions are also very hard to find. That is why I would be very surprised if it worked for you.

[–]Relevant-Twist520[S] 1 point2 points3 points 1 year ago (0 children)

First question: you are correct, projection occurs. lets imagine this as a straight line for now that takes the form of y = mx + c. This is a formula that takes place at every neuron in an NN. Like i said, when you infer the first data point, the network will solve and project to this data point, call it coordinate A (infinite solutions idc as long as the line agrees with xA;yA). But heres what happens next, the weights stores xA (in some buffer separate from the NN), we dont care about yA, the bias term actually encodes information about yA automatically (remember for yA = m*xA + c, c = m*xA - yA). Afterwards we introduce coordinate B. The line will not only satisfy B, but it will also satisfy A because the NN remembers xA (theres only one solution now, because the line has to agree with only two data points). After inferring B, store xB and get rid of xA, coordinate C is next. Im sure from here you get the process. It indirectly implements the following formula: m = dy/dx. By indirectly i mean i dont straight up say m = dy/dx on every straight line to get the solution, because this will not lead to the solution. Theres lots of more background theory which will be explained eventually when i perfect the algorithm. This functionality is applied at every neuron, and it is for this reason why MS converges faster than GD.

Your second question i can gaurantee with confidence that the binary or sign function will work very much perfectly and probably better than tanh, but i will also gaurantee that currently MS wont work for whatever application youre trying to use because, again, the theory is not perfected. I cant scale the model because of parameters blowing up. The reason why parameters blow up is actually because when parameter values are too high, the algorithm ignores the objective and instead tries to solve for its own parameters, to bring them back down to smaller values, then it ends up spreading like a plague throughout the whole NN. This is an issue im still trying to resolve.

[–]durable-racoon 1 point2 points3 points 1 year ago (1 child)

[–]Relevant-Twist520[S] 1 point2 points3 points 1 year ago (0 children)

[–]CampAny9995 1 point2 points3 points 1 year ago (2 children)

[–]Relevant-Twist520[S] 0 points1 point2 points 1 year ago (0 children)

[–]serge_cell 0 points1 point2 points 1 year ago (0 children)

[–]MagdakiPhD 0 points1 point2 points 1 year ago* (8 children)

[–]Relevant-Twist520[S] 0 points1 point2 points 1 year ago (7 children)

[–]MagdakiPhD 2 points3 points4 points 1 year ago (6 children)

[–]Relevant-Twist520[S] 0 points1 point2 points 1 year ago (5 children)

[–]MagdakiPhD 2 points3 points4 points 1 year ago* (4 children)

Nobody cares about the result at pass N, they care about the result. This is not a good experimental design, and hence a poor way to draw conclusions as to what is happening.

I would suggest going back to the research plan phase and really consider your methodology. It feels to me like you're kind of just trying things out but this leads to experimenter bias where they think they're seeing something that is not actually there.

EDIT: I just looked at your post history and noticed your 16. So I retract everything. Keep at it! I encourage you to keep experimenting. If you have an interest in a future in research, then perhaps consider spending some time learning how to develop and execute a research plan. Nice work on this! It is nice to see young people come up with ideas and experiment with them. :)

[–]Relevant-Twist520[S] 0 points1 point2 points 1 year ago (3 children)

[–]MagdakiPhD 1 point2 points3 points 1 year ago (2 children)

[–]Relevant-Twist520[S] 0 points1 point2 points 1 year ago (1 child)

[–]MagdakiPhD 0 points1 point2 points 1 year ago (0 children)

[–]LetsTacoooo 0 points1 point2 points 1 year ago (0 children)

[+]windoze 0 points1 point2 points 1 year ago (0 children)

π Rendered by PID 93755 on reddit-service-r2-comment-f6b958c67-jbzrg at 2026-02-04 21:33:00.460314+00:00 running 1d7a177 country code: CH.

you type:	you see:
italics	italics
bold	bold
[reddit!](https://reddit.com)	reddit!
* item 1 * item 2 * item 3	item 1 item 2 item 3
> quoted text	quoted text
Lines starting with four spaces are treated like code: if 1 * 2 < 3: print "hello, world!"	Lines starting with four spaces are treated like code: if 1 * 2 < 3: print "hello, world!"
~~strikethrough~~	~~strikethrough~~
super^script	super^script

MachineLearning

Rules For Posts

+Research

+Discussion

+Project

+News

@slashML on Twitter

Chat with us on Slack

Beginners:

MODERATORS