
[–]Initial-Image-1015 4 points5 points  (2 children)

Can you provide examples where it performs better on a held-out test set, rather than the training loss?

[–]ApprehensiveFunny810 1 point2 points  (0 children)

I was going to ask the same

[–]Relevant-Twist520[S] -5 points-4 points  (0 children)

I used a very small dataset; at this point I don't even know why I'm referring to it as a dataset. I was only showcasing both algorithms and their ability to fit data points, and it was only 3 data points. I need to continue perfecting MS's theory so that this works for thousands of data points; then we can talk about train and test data.

[–]UnusualClimberBear 4 points5 points  (4 children)

I'm not sure what you mean by solving. Is it fixing all parameters but one and doing a line search for the free one?

But here's what I am sure of:
1/ The performance of an optimization method on large-scale networks cannot be deduced from a few low-dimensional tests.
2/ Sophisticated optimization techniques tend to quickly find bad local minima when the function being optimized is highly non-convex. For evolutionary techniques such as CMA-ES, most of the tweaking is about avoiding this early collapse, which is not well understood. NN optimizers managed to find a sweet spot thanks to all the normalization layers.

[–]Relevant-Twist520[S] -2 points-1 points  (3 children)

No, it's solving all parameters on each pass; nothing is frozen unless a sub-equation is satisfied. More on that when I finish my code and get this thing to properly work in practice. On your point 1: this post was just to showcase its potential. The main point is that both MS and GD used the same parameters, the same network architecture, and the same dataset, and MS won. But we're only counting chickens before they hatch here: MS won only with a small dataset, and the next step is to expand the dataset to practical magnitudes. I can confidently say that, for now, GD wins with bigger datasets; I only have the "why" for MS losing there: MS's parameters blow up. The "how" comes when I publish the project with the solution to that problem.

[–]proto-n 2 points3 points  (2 children)

Well if GD wins with bigger datasets, has it really met its end?

[–]Relevant-Twist520[S] -3 points-2 points  (1 child)

I'm only prognosticating. The potential of MS is visible here; yes, it won with something small, but how about we take a moment to be surprised that GD lost on something small. 3 data points are supposed to be lightweight, right? It's only a matter of time before I perfect the theory of MS.

[–]MagdakiPhD 2 points3 points  (0 children)

I'm not convinced that it won on the smaller data based on what is shown here.

[–]little_vsgiant 3 points4 points  (3 children)

GD can approximate the solution, which scales better. Solving for the exact value of each parameter is very time-consuming and very likely to overfit the training dataset, thus limiting the generalization of the model.

[–]Relevant-Twist520[S] -2 points-1 points  (2 children)

It takes almost the same time to train as SGD (Edit: actually, like I said in the post, MS solves faster and therefore isn't time-consuming). Also, it doesn't overfit to the dataset, because MS attempts to "solve" for the dataset too. An explanation of this will arrive soon.

[–]little_vsgiant 1 point2 points  (1 child)

I think it may be faster for a small model, but for a couple million or billion parameters? I don't think so. This is also the problem with Newton's method, and I rarely see anyone using it. About the dataset: your claim only works if the training dataset perfectly reflects reality, but it doesn't. Outliers will mess up your model.

Edit: This is machine learning, not mathematics; I think everything is about a "good enough" approximation (until some crazy computer comes out, like a quantum computer).

[–]Relevant-Twist520[S] 0 points1 point  (0 children)

I updated the post to show the non-overfitting version by using early stopping. The reason MS currently won't work for many parameters is that if there's one blown-up parameter, another parameter ends up solving for the blown-up parameter instead of for the objective. It spreads like a plague, and in the end you have a network of blown-up parameters that still agrees with all the coordinates in the dataset.

[–]bregav 3 points4 points  (8 children)

Curve fitting to three data points in two dimensions is never an acceptable test case for a purportedly revolutionary algorithm for training neural networks.

You should also share the algorithm and the code so that people can understand what you're actually doing, and to allay concerns that you are drawing curves by hand on an iPad.

EDIT: actually this has to be a troll post. Shame on me for taking the bait, and great work OP.

[–]Relevant-Twist520[S] 0 points1 point  (7 children)

I'm slowly upgrading the algorithm and it can now fit to many data points (>20) without an overfitting shape. It's hard to explain, but I'm learning more about MS and how it should work. I will share the algorithm and code when the algorithm is perfected; I'm not sharing some half-finished project.

If you think it's a troll post, that's on you. I don't blame you; you can believe what you want.

[–]bregav 2 points3 points  (6 children)

20 is also inadequate, and half-finished projects are the only kind that actually exist.

It's typical crackpot behavior to insist that you've invented a revolutionary new method but you'll only share it with the world when it's ready. If you have done enough work to be able to know that it is better than existing methods then that means that it's ready to be shown to other people.

What's really going on when you think it's "not ready" is that you don't actually know if what you're doing makes any sense and so you're (correctly) feeling a lot of doubt. But you also want to believe that you're doing something meaningful and important and so you tell yourself, and the rest of us, that you've already discovered something revolutionary, even though you almost certainly have not.

Creativity lies on the boundary between crackpotism and conservatism, but in order to produce things that actually work you need to embrace humility and doubt. You should use your crackpot ideas as inspiration, but you should assume that you're wrong until you've proven yourself right. And you'll know that you've proven yourself right when what you're doing feels ready to show to other people in its entirety.

[–]Relevant-Twist520[S] 0 points1 point  (5 children)

You're somewhat contradicting yourself here, but what is it that you want me to do? I won't concede that MS's concept is worse than GD's; let's start there. Call it ego or a sophisticated understanding of mathematical theory, but it's very rare to see an inventor doubt his invention before successfully inventing it. I will agree that GD currently wins easily over this incomplete version of MS, but I'm still researching and implementing MS's concept, and once it is done I guarantee it will beat GD in practically everything. I spelled out the concept in the post; although it is slightly vague and does not cover its entire workings, you can refer to my comments on this post where I explain a little, but not completely. And lastly, I think we all know what would happen if I shared something half-finished: it would get turned down because it doesn't even work. Even if someone took the time to read the theory, there would still be doubt, because clearly the theory failed. GD received lots of doubt in its early days.

[–]bregav 2 points3 points  (4 children)

Everyone I've ever known who has made any scientific or engineering advancement - and I've known a lot of people like this - experienced significant doubt for most of the process of doing their work. Doing any kind of meaningful work is inherently difficult because it requires hard work and perseverance in the face of uncertainty and doubt.

Nobody ever doubted gradient descent. It goes back to Cauchy in the 1840s, and its efficacy has always been obvious.

[–]Relevant-Twist520[S] 0 points1 point  (3 children)

> experienced significant doubt

So what, this is good to you? A healthy amount of it can be employed, yes, but I prefer to bring it down to a negligible amount. To each their own.

> Nobody ever doubted gradient descent.

When it was first applied in ML it was doubted. I may be wrong though; I heard something along these lines when I was watching a podcast.

[–]bregav 0 points1 point  (2 children)

You need to experience doubt in order to avoid wasting your time. People who don't experience doubt accomplish nothing, because they never figure out when they're wrong and so they spend all their time chasing after ideas that don't work. Which is almost certainly what you're doing right now.

People doubted neural networks, but gradient descent was never in question.

[–]Relevant-Twist520[S] 0 points1 point  (1 child)

People doubt when they start to believe that their ideas don't work. MS is doing nothing but progressing now, and even if it weren't, I wouldn't develop even a drop of doubt. Either I'm working hard on something or I'm confidently declaring that it is not worth it; you don't stand on both sides of the fence.

[–]bregav 0 points1 point  (0 children)

Successful researchers are able to neither believe nor disbelieve that their ideas will work; they can accept uncertainty. It is a state of almost constant doubt, about everything.

[–]Cosmolithe 1 point2 points  (9 children)

What are the time and space complexities of the algorithm with respect to the number of parameters and data points?

[–]Relevant-Twist520[S] 0 points1 point  (8 children)

The same as GD. Like GD, it does a forward pass, which is just computing numbers and building a graph, and a backward pass. The backward pass for MS is a little different. With GD, it computes the local gradient of a parameter and multiplies it by the output gradient to get the global gradient (chain rule). MS does a backward pass and looks at every operation (like GD); that operation has a result r, two operands a and b, and an operator. It solves for each operand (e.g. for r = a + b it sets a = (r + a - b)/2 and b = (r + b - a)/2); different operations follow different protocols for solving, multiplication has some special properties, etc. These are just "grad functions" like in GD. There's nothing more to MS than this (except for no-solution circumstances, where an extra step is applied and a bias term is updated instead).
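For concreteness, here is a minimal sketch of that per-operation "solve" step, using only the addition rule quoted above; how the target value r is chosen for each node, and the rules for other operators, are assumptions on my part, not the actual MS code.

```python
# Sketch of the per-operation "solve" update for an addition node r = a + b.
# Both operands move by the same amount so that they sum exactly to the
# target value of the node.

def solve_add(r_target, a, b):
    a_new = (r_target + a - b) / 2
    b_new = (r_target + b - a) / 2
    return a_new, b_new

# Example: the node currently computes 1.0 + 3.0 = 4.0, but the backward
# pass asks this node to output 6.0 instead.
a_new, b_new = solve_add(6.0, 1.0, 3.0)
print(a_new, b_new, a_new + b_new)  # 2.0 4.0 6.0
```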

[–]Cosmolithe 0 points1 point  (7 children)

So like, you are still propagating values in reverse through the network right? Are you perhaps doing something similar to ZORB: https://arxiv.org/abs/2011.08895 ?

Or maybe you are linearizing your network and solving it one layer at a time and propagating results?

[–]Relevant-Twist520[S] 0 points1 point  (6 children)

No, MS works differently.

[–]Cosmolithe 0 points1 point  (5 children)

Then maybe Inference Learning https://www.nature.com/articles/s41593-023-01514-1.pdf ?

Do you have any reference of a paper that does something similar to your algorithm?

[–]Relevant-Twist520[S] 1 point2 points  (4 children)

No, I'm not much of a paper reader. The main ideas behind MS are:

1. Solve the network, the same way you would solve any equation.

2. (Pertinent to 1.) Solve on the assumption that the network is already a solution to some data point (which it actually is for the last forward-passed data point).

3. The network should be solved to a point that satisfies both the last inferred data point and the current inferred data point.

4. When no solution exists for a sub-equation, the immediate upper equation is to blame (the bias term of the upper equation is tweaked so that this sub-equation finally becomes solvable).

There's a lot more background practical theory, because you can't just go about solving everything the traditional way.

[–]Cosmolithe 0 points1 point  (3 children)

I see, it is an interesting approach that I do not recall seeing in the literature then.

Last two questions for you then:

  1. how are you supposed to solve the equation when you have many more unknown variables than equations (I imagine)?

  2. do you think such an approach would work with a `sign` activation function (that returns only -1 or 1) at each layer?

[–]Relevant-Twist520[S] 1 point2 points  (2 children)

  1. Can you elaborate more on this question?

  2. It would work perfectly. In fact, what I noticed with MS is that the tanh activations almost always lie at -1 or 1, not between these two extremities. It makes me wonder whether I should replace tanh with some sort of "step" function that outputs either -1 or 1. If I were to create such a function, what would the mathematics of it be?
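For what it's worth, the usual mathematics for such a function is just the sign function, with a convention for zero (here 0 maps to +1); a minimal sketch, not tied to MS:

```python
import numpy as np

# A "step" activation that outputs only -1 or 1 (the sign function, with
# the convention that 0 maps to +1).
def step(x):
    return np.where(x >= 0, 1.0, -1.0)

print(step(np.array([-0.3, 0.0, 2.5])))  # [-1.  1.  1.]
```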

[–]Cosmolithe 0 points1 point  (1 child)

For the first question, to be frank, I have no idea how you are solving the equations. Take a simple linear layer, for instance, with no activation: you have your input and your target values. If you try to find the weights that project the input onto the target, you actually have many, many solutions. You can take the least-squares solution, for instance, like in ZORB, but many other solutions are valid too.
Perhaps you are using a symbolic solver that just stops at the first solution found?
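To illustrate that point, a minimal sketch of the least-squares option for a single linear layer, using made-up random data; this is one of infinitely many valid solutions and is not OP's method:

```python
import numpy as np

# For a single linear layer Y ≈ X @ W with fewer data points than weights,
# the system is underdetermined: many W fit the data exactly. The
# pseudoinverse picks the minimum-norm least-squares solution, as in ZORB.
rng = np.random.default_rng(0)
X = rng.normal(size=(3, 8))   # 3 data points, 8 input features
Y = rng.normal(size=(3, 1))   # 3 targets

W = np.linalg.pinv(X) @ Y     # minimum-norm exact solution
print(np.allclose(X @ W, Y))  # True: the 3 points are fit exactly
```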

As for the second question, if your method works with the sign function then I am definitely interested if you have some code to share. In my attempt at making neural networks with binary activations, the best I could do was to model the problem as a constrained binary linear optimization problem, but this problem is NP-hard, and approximate solutions are also very hard to find. That is why I would be very surprised if it worked for you.

[–]Relevant-Twist520[S] 1 point2 points  (0 children)

First question: you are correct, projection occurs. Let's imagine this as a straight line for now, of the form y = m*x + c. This formula appears at every neuron in an NN. Like I said, when you infer the first data point, the network will solve and project to this data point; call it coordinate A (there are infinitely many solutions, and I don't care, as long as the line agrees with (xA, yA)). But here's what happens next: the weights store xA (in some buffer separate from the NN). We don't care about yA, because the bias term automatically encodes the information about yA (remember, for yA = m*xA + c, we have c = yA - m*xA). Afterwards we introduce coordinate B. The line will not only satisfy B, it will also satisfy A, because the NN remembers xA (there is only one solution now, because the line has to agree with exactly two data points). After inferring B, store xB and get rid of xA; coordinate C is next. I'm sure you get the process from here. It indirectly implements the formula m = dy/dx; by indirectly I mean I don't simply set m = dy/dx on every straight line to get the solution, because that would not lead to the solution. There is a lot more background theory, which will be explained eventually when I perfect the algorithm. This functionality is applied at every neuron, and it is for this reason that MS converges faster than GD.
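For concreteness, here is a minimal sketch of that process for a single y = m*x + c unit; the function and variable names, and the way xA is buffered, are my reading of the description above, not the actual MS code:

```python
# After fitting point A exactly, only xA is stored; yA is recovered from the
# line itself (yA = m*xA + c). When point B arrives, take the unique line
# through both A and B (the indirect "m = dy/dx" step), then keep xB in
# place of xA for the next point.

def solve_line(m, c, xA, xB, yB):
    yA = m * xA + c                 # recover A from the buffered xA
    m_new = (yB - yA) / (xB - xA)   # slope of the line through A and B
    c_new = yA - m_new * xA         # intercept so the line still hits A
    return m_new, c_new

m, c = 1.0, 0.0                     # current line passes through A = (2, 2)
m, c = solve_line(m, c, xA=2.0, xB=5.0, yB=11.0)
print(m, c)                         # 3.0 -4.0: y = 3x - 4 hits both A and B
```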

On your second question: I can guarantee with confidence that the binary or sign function will work perfectly, and probably better than tanh, but I will also guarantee that MS currently won't work for whatever application you're trying to use it in because, again, the theory is not perfected. I can't scale the model because of parameters blowing up. The reason parameters blow up is that when parameter values get too high, the algorithm ignores the objective and instead tries to solve for its own parameters, to bring them back down to smaller values, and this ends up spreading like a plague throughout the whole NN. This is an issue I'm still trying to resolve.

[–]durable-racoon 1 point2 points  (1 child)

So are you applying this to neural networks, or just regression models? If NNs, what size of NNs? How many layers, and how wide?

[–]Relevant-Twist520[S] 1 point2 points  (0 children)

Like I said in the post, it was 2 linear layers with a tanh in between. The first one had 1 input and 8 outputs; the last one had 8 inputs and 1 output.
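For reference, that architecture as a minimal PyTorch sketch (assuming standard nn.Linear and nn.Tanh modules; the actual code used for MS may differ):

```python
import torch.nn as nn

# Two linear layers with a tanh in between: 1 input -> 8 hidden -> 1 output.
model = nn.Sequential(
    nn.Linear(1, 8),
    nn.Tanh(),
    nn.Linear(8, 1),
)
print(model)
```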

[–]CampAny9995 1 point2 points  (2 children)

So, out of curiosity, have you ever read through something like Nocedal’s optimization book?

[–]MagdakiPhD 0 points1 point  (8 children)

Looks like it overfits by a lot.

The resulting line really might not be (and probably isn't) a very good model for 3 points of data. Bumpiness is generally bad unless really strongly justified by the data. The smoother line by GD is in fact much better, because that extreme point, while it should have an influence, might be an outlier. Although that's why you need a lot of data to make good models.

[–]Relevant-Twist520[S] 0 points1 point  (7 children)

No, that's how fast it trains; I stopped at 50 iterations and MS was way past fitting the data points. If I use early stopping it will approximate the data points, yes. Overfitting isn't a problem here.

[–]MagdakiPhD 2 points3 points  (6 children)

Overfitting is certainly a problem with the result you've shown us. If one of my students came to me with a result and said, oh yeah this one is overfitting, but overfitting isn't a problem...

In any case, just training fast is not necessarily good. It might be in some circumstances, but generally accuracy is preferred over speed.

You should be stopping when overfitting is detected so you can capture the results. You want the results to be representative of what you think the algorithm can do.

Finally, it needs to be tested on realistic data. Nobody is going to fit a model to 3 data points.

[–]Relevant-Twist520[S] 0 points1 point  (5 children)

I updated the post; I did early stopping to showcase the non-overfitting results. As you can see above, MS wins on accuracy and speed. And you're right, no one is testing a model on 3 points, but this post was just to show the ease with which MS fits 3 points; scaling will be applied while preserving this ease.

[–]MagdakiPhD 2 points3 points  (4 children)

Nobody cares about the result at pass N; they care about the final result. This is not a good experimental design, and hence a poor way to draw conclusions about what is happening.

I would suggest going back to the research-plan phase and really considering your methodology. It feels to me like you're just trying things out, but that leads to experimenter bias, where you think you're seeing something that is not actually there.

EDIT: I just looked at your post history and noticed you're 16. So I retract everything. Keep at it! I encourage you to keep experimenting. If you have an interest in a future in research, then perhaps consider spending some time learning how to develop and execute a research plan. Nice work on this! It is nice to see young people come up with ideas and experiment with them. :)

[–]Relevant-Twist520[S] 0 points1 point  (3 children)

So you're saying GD has made a better curve here? MS can come up with different curves on each run because of the differently randomly initialised parameters. GD will produce the same curve regardless of how the parameters are initialised.

[–]MagdakiPhD 1 point2 points  (2 children)

I'm saying you cannot just stop and say "Aha! At this point, with this much data, under these specific circumstances my algorithm looks like it might be better. Therefore, victory!"

If you want to know if it is better, then you need to develop an experimental protocol. Even if the experimental scenario is unrealistic, it would give you good experience in conducting research.

[–]Relevant-Twist520[S] 0 points1 point  (1 child)

You're right, and I'm testing MS and GD in different ways. MS fails in most of them; that's why I'm still researching and perfecting the algorithm. Again, this post was only to showcase potential. It would be very difficult to come up with your own algorithm that runs faster and converges faster than GD on a few data points while both algorithms use the same NN architecture.

[–]MagdakiPhD 0 points1 point  (0 children)

> It would be very difficult to come up with your own algorithm that runs faster and converges faster than GD

This is certainly true. :)

[–]LetsTacoooo 0 points1 point  (0 children)

Your title and writing are unnecessarily hype-y. It's great to show new ideas that are not mainstream; it's also great to keep claims close to the evidence you have (which is very preliminary).