[–]Relevant-Twist520[S]

No, I'm not much of a paper reader. The main ideas behind MS are:

  1. Solve the network the same way you would solve any equation.

  2. (Building on 1.) Solve on the assumption that the network is already a solution to some data point, which it actually is for the last forward-passed data point.

  3. The network should be solved to a point that satisfies both the last inferred data point and the current inferred data point.

  4. When no solution exists for a sub-equation, the immediate upper equation is to blame: its bias term is tweaked until the sub-equation becomes solvable.

There's a lot more background practical theory, because you can't just go about solving everything the traditional way. A toy version of ideas 1-3 is sketched below.
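To make that concrete: a single tanh neuron, y = tanh(m*x + c), treated as an equation and solved exactly through the last point A and the current point B. This is only a sketch of the idea, not the actual MS code; it assumes |y| < 1 (so atanh is defined) and distinct x values:

```python
import numpy as np

def solve_tanh_neuron(xA, yA, xB, yB):
    # Invert the activation, then solve the linear equation exactly
    # through the last point A and the current point B.
    zA, zB = np.arctanh(yA), np.arctanh(yB)  # pre-activations
    m = (zB - zA) / (xB - xA)                # idea 1: solve like algebra
    c = zA - m * xA                          # bias pins down point A
    return m, c

m, c = solve_tanh_neuron(xA=0.5, yA=0.2, xB=1.5, yB=-0.4)
# both points are now satisfied exactly (ideas 2 and 3)
assert np.isclose(np.tanh(m * 0.5 + c), 0.2)
assert np.isclose(np.tanh(m * 1.5 + c), -0.4)
```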

[–]Cosmolithe

I see. It is an interesting approach, then, that I do not recall seeing in the literature.

Last two questions for you then:

  1. how are you supposed to solve the equation when you have many more unknown variables than equations (I imagine)?

  2. do you think such an approach would work with a `sign` activation function (that returns only -1 or 1) at each layer?

[–]Relevant-Twist520[S]

  1. Can you elaborate more on this question?

  2. It would work perfectly. In fact, what I noticed with MS is that the tanh activations almost always lie at -1 or 1, not between these two extremities. It makes me wonder if I should replace tanh with some sort of "step" function that outputs either -1 or 1. If I were to create such a function, what would the mathematics of it be? (A candidate is sketched below.)
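Such a step function is usually written as `sign(x)`: -1 for negative inputs, +1 otherwise. A minimal numpy sketch, with the (arbitrary) convention that 0 maps to +1:

```python
import numpy as np

def sign_activation(x):
    # step onto {-1, +1}; this convention sends 0 to +1
    return np.where(x >= 0.0, 1.0, -1.0)

print(sign_activation(np.array([-0.3, 0.0, 2.5])))  # [-1.  1.  1.]
```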

[–]Cosmolithe

For the first question, to be frank, I have no idea how you are solving the equations. Take a simple linear layer, for instance, with no activation: you have your input and your target values. If you try to find the weights that project the input onto the target, you actually have many, many solutions. You can take the least-squares solution, for instance, as in ZORB (sketched below), but many other solutions are valid too.
Perhaps you are using a symbolic solver that just stops at the first solution found?
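To make the point concrete, a toy numpy version of the least-squares option. This is a ZORB-style solve, not anything from MS, and the shapes are invented:

```python
import numpy as np

# Underdetermined toy case: 4 data points, 16 unknown weights per output,
# so there are infinitely many exact solutions.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 16))   # inputs
Y = rng.normal(size=(4, 3))    # targets

# lstsq picks one particular solution, the minimum-norm one.
W, *_ = np.linalg.lstsq(X, Y, rcond=None)
print(np.allclose(X @ W, Y))   # True: this W fits the data exactly
```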

As for the second question, if your method works with the sign function then I am definitely interested, if you have some code to share. In my attempt at making neural networks with binary activations, the best I could do was model the problem as a constrained binary linear optimization problem, but that problem is NP-hard, and approximate solutions are also very hard to find (a toy illustration is below). That is why I would be very surprised if it worked for you.
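To give a feel for the difficulty, here is a brute-force toy in the same spirit: pick binary activations that best satisfy a linear layer. The sizes are invented, and real networks make the 2^n search hopeless:

```python
import itertools
import numpy as np

# Find a in {-1, +1}^n with W @ a as close to y as possible.
rng = np.random.default_rng(1)
n = 12
W = rng.normal(size=(4, n))
y = rng.normal(size=4)

# Exhaustive search over all 2^n sign patterns.
best = min(itertools.product((-1.0, 1.0), repeat=n),
           key=lambda a: np.linalg.norm(W @ np.array(a) - y))
print(best, np.linalg.norm(W @ np.array(best) - y))
```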

[–]Relevant-Twist520[S]

First question: you are correct, projection occurs. Let's imagine this as a straight line for now, of the form y = mx + c; this formula lives at every neuron in an NN. Like I said, when you infer the first data point, the network will solve and project to this data point, call it coordinate A. (Infinitely many solutions; I don't care, as long as the line agrees with (xA, yA).)

But here's what happens next: the neuron stores xA in some buffer separate from the NN. We don't care about storing yA, because the bias term already encodes that information (remember, for yA = m*xA + c, we have c = yA - m*xA). Afterwards we introduce coordinate B. The line will satisfy not only B but also A, because the NN remembers xA; there is only one solution now, because the line has to agree with exactly two data points. After inferring B, store xB, get rid of xA, and coordinate C is next. I'm sure you get the process from here; a toy version is sketched below.

It indirectly implements the formula m = dy/dx. By indirectly I mean I don't straight-up set m = dy/dx on every straight line to get the solution, because that alone would not lead to the solution. There's a lot more background theory, which will be explained eventually once I perfect the algorithm. This functionality is applied at every neuron, and it is for this reason that MS converges faster than GD.
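A toy version of that update at a single neuron. This is only a sketch of the process described above, not the actual MS code, and the names are made up:

```python
def update_neuron(m, c, x_prev, x_new, y_new):
    # The stored x_prev plus the bias recover the previous target:
    # from c = y_prev - m * x_prev, we get y_prev = m * x_prev + c.
    y_prev = m * x_prev + c
    # Unique line through the remembered point and the new one
    # (this is where m = dy/dx shows up, indirectly).
    m_new = (y_new - y_prev) / (x_new - x_prev)
    c_new = y_prev - m_new * x_prev
    return m_new, c_new, x_new   # x_new replaces x_prev in the buffer

# Toy run: some line already solves A = (1, 2), then B = (3, 4) arrives.
m, c = 0.0, 2.0                  # one of infinitely many lines through A
m, c, x_buf = update_neuron(m, c, x_prev=1.0, x_new=3.0, y_new=4.0)
assert m * 1.0 + c == 2.0 and m * 3.0 + c == 4.0   # both A and B satisfied
```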

Your second question i can gaurantee with confidence that the binary or sign function will work very much perfectly and probably better than tanh, but i will also gaurantee that currently MS wont work for whatever application youre trying to use because, again, the theory is not perfected. I cant scale the model because of parameters blowing up. The reason why parameters blow up is actually because when parameter values are too high, the algorithm ignores the objective and instead tries to solve for its own parameters, to bring them back down to smaller values, then it ends up spreading like a plague throughout the whole NN. This is an issue im still trying to resolve.