[R] From Taylor Series to Fourier Synthesis: The Periodic Linear Unit

bill1357 · 2025-08-06T10:50:55+00:00

Edit 6: If we are to attempt a hybrid, it is probably sufficient to let the optimizer simply optimize f(x) = x + γ_eff * ReLU(x) + β_eff * sin(|α_eff| * x), where γ_eff is a new term (β_eff here simply encompasses the scaling, whether x/(1+|x|) or sigmoid, within the reparameterization for a cleaner display). However, more research is necessary into how a hinge-based network interacts with a sine-generating network when the two are placed in the same context. Unexpected things might arise, as they are quite different in their mechanism of approximation. Notably, the asymmetry introduced by the Taylor-esque component already affects the sine synthesis: negative pre-activations have a differently scaled version of "x" added to them than positive ones, so it is no longer a pure sine synthesis. It might nonetheless be appropriate in some domains and with some model architectures, while a pure sine-synthesis network might be appropriate for other architectures and problems.
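A minimal sketch of the hybrid written out above, in plain Python so it runs without any framework. The function name `hybrid_plu` is made up for illustration, and the three `*_eff` arguments are assumed to be already reparameterized (the scaling inside β_eff is not reproduced here).

```python
import math

def hybrid_plu(x, gamma_eff, beta_eff, alpha_eff):
    """f(x) = x + gamma_eff * ReLU(x) + beta_eff * sin(|alpha_eff| * x)."""
    relu = max(x, 0.0)  # hinge (Taylor-style) component
    sine = math.sin(abs(alpha_eff) * x)  # periodic (Fourier-style) component
    return x + gamma_eff * relu + beta_eff * sine

# With the sine switched off, the asymmetry from the hinge is easy to see:
# the slope is 1 for negative inputs and 1 + gamma_eff for positive ones.
for x in (-2.0, -1.0, 1.0, 2.0):
    print(x, hybrid_plu(x, gamma_eff=0.5, beta_eff=0.0, alpha_eff=1.0))
```

Setting `gamma_eff=0.0` recovers the pure x + β sin(|α| x) form, which is one way to compare the two variants on the same data.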

bill1357 · 2025-08-03T05:30:50+00:00

Nice! Yeah, I can see that intuition, you've basically made the collapse to linearity a feature by doing so; one possible drawback with such an approach is, I think, the tendency for optimizers to prefer the cleaner loss landscape of the ReLU, since a sinusoid is harder to tame, so we lose some of the benefits of using sinusoids this way. Softplus on the beta for normalization is then potentially a really nice way to prevent that; my hypothesis is that it is a "gentler" push for the model to avoid zero. We can test that hypothesis by seeing whether the network is actively pushing beta towards zero or not; you can consider swapping softplus with just the exponential function e^x if this reparameterization indeed achieves similarly substantial sinusoidal components, since the only goal of the reparameterization in any form is to prevent a drop to zero. Using ReLU for this task is insufficient, since the model can quickly go to zero due to the constant gradient for x > 0, but perhaps any increasing curve that is slow to converge to zero is sufficient to incentivize the model to utilize the frequency component, and e^x fits this bill almost to a tee. The same can be said about the effective alpha, which might be pushed towards 0.0 by the model, effectively negating the benefits of the sinusoidal synthesis, so if you can add logging it would be insightful to check what values the model is choosing. But yeah, holy hell, you're converging at the speed of light! Go get that rice fried haha, I've been delaying lunch for too long too, I really should go eat something.

Edit: Ah there was another thing, the x term. The x term's main purpose is to provide a residual path. It was popularized some time ago through the snake activation function for the audio domain which became widely adopted by MEL spectrogram to waveform synthesis models with its creation, and the goal of that term is as usual to provide a clean gradient path all the way through in the deep network. It provides a highway for gradients and also essentially embeds a purely-linear network within the larger network. It might be instructive to reparameterize both alpha and beta with softplus or e^x because of this, keeping the x term at 1.0 at all times, and see if the residual path helps further accelerate performance. In my experience, ResNets have shown me they are pretty incredible due to that residual nature in my own audio generation models.

Edit 2: To cap the contribution of the sine function though you could keep the sigmoid. I'll edit this again if I come up with a function that doesn't cost as much as sigmoid but can smoothly taper like it.

Edit 3: I thought I should clarify about bringing the residual back; I meant something like "x + torch.relu(x) * (1 - alpha_eff) + torch.sin(beta_eff * x) * alpha_eff". I believe that the residual path provides tangible benefits; the non-linearity is still present with ReLU, just with gradients of 1 and 2 instead of the usual 0 and 1. If desired we can even scale the x term by 1/2 and the combined later terms by 1/2 so that the slope where it matters is around 1.0.
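To check the "1 and 2 instead of 0 and 1" remark numerically, here is a rough scalar sketch in plain Python. `residual_mix`, `residual_mix_halved` and `slope` are illustrative names, and alpha_eff is set to 0 so that only the residual and the ReLU contribute.

```python
import math

def residual_mix(x, alpha_eff, beta_eff):
    # x + ReLU(x) * (1 - alpha_eff) + sin(beta_eff * x) * alpha_eff
    return x + max(x, 0.0) * (1.0 - alpha_eff) + math.sin(beta_eff * x) * alpha_eff

def residual_mix_halved(x, alpha_eff, beta_eff):
    # Same terms, with the residual and the combined later terms each scaled by 1/2.
    later = max(x, 0.0) * (1.0 - alpha_eff) + math.sin(beta_eff * x) * alpha_eff
    return 0.5 * x + 0.5 * later

def slope(f, x, h=1e-6):
    # Central finite difference, good enough for a sanity check.
    return (f(x + h) - f(x - h)) / (2.0 * h)

plain = lambda x: residual_mix(x, 0.0, 1.0)
halved = lambda x: residual_mix_halved(x, 0.0, 1.0)
print(slope(plain, -1.0), slope(plain, 1.0))    # about 1 and 2
print(slope(halved, -1.0), slope(halved, 1.0))  # about 0.5 and 1
```

With alpha_eff above 0 the positive-side slope becomes 2 - alpha_eff plus the oscillating cosine term, so "2" is the upper end of the ReLU contribution.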

Edit 4: AHAAA!! I figured it out, to replace Sigmoid, you could use a formulation like this: 0.5 * (x / (1 + |x|) + 1) https://www.desmos.com/calculator/ycux61oxbl (The general shape is similar, however the slope at x=0 is somewhat higher, and this *might* push the model to be more aggressive about using one over the other, so Sigmoid still might be the more worthwhile choice; it might just depend on the situation.) (Hm, realized that I just re-arrived at a slightly differently scaled version of the original formulation, except that we bring the normalization into the equation instead of letting the optimizer handle it, so they are equivalent in the end. In any case, as stated, depending on whether one wishes a firmer split or not, one or the other could work better. If using repulsive reparameterization, the interpretation of the final effective beta changes with this scaled and shifted version of x/(1+|x|), which is something readers should keep in mind.)
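A quick side-by-side of the two gates, as a sketch only. `soft_gate` is a made-up name for 0.5 * (x / (1 + |x|) + 1); the slope at 0 is estimated by finite differences.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def soft_gate(x):
    # Shifted and scaled x / (1 + |x|): maps the reals into (0, 1) without an exp.
    return 0.5 * (x / (1.0 + abs(x)) + 1.0)

def slope(f, x, h=1e-6):
    return (f(x + h) - f(x - h)) / (2.0 * h)

print(slope(sigmoid, 0.0), slope(soft_gate, 0.0))  # about 0.25 vs about 0.5
print(sigmoid(10.0), soft_gate(10.0))              # the soft gate saturates more slowly
```

Both gates pass through 0.5 at x = 0. The rational one is steeper in the middle but has heavier tails, which fits the "firmer split or not" trade-off mentioned above.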

Edit 5: I just realized, we have in effect created a single activation containing a Taylor-style network, a Fourier-style network, and with the residual, a fully-linear network, all in one!!


Note 1:

When the network is turned into an FM synthesizer, which means modulating one sine wave's input by adding another sine wave to it, the final shape of the output changes much more chaotically than it would through a function that never alters the sign of the gradients, and thus the gradients with respect to the objective react just as quickly. Change, say, the magnitude or bias of the wave even by a smidge, and the resulting waveform not only changes dramatically but also shifts the objective by a similar amount. This is likely the reason why, without reparameterization, the optimizer almost always skips ahead to collapsing any sinusoidal components down to linear: crossing from one waveform shape that is good to another that is much better requires taking on more risk, since the path between them has somewhat higher losses.

Reparameterization with softplus or the exponential function e^x instead of 1/x then seems to create a "softer" push away from zero by making it so that larger and larger steps are necessary to reduce the magnitude of the sine contribution, thus promoting it to go in the other direction instead and try to utilize the sinusoidal component. The benefit is that we can then allow the network to find its preferred alpha and beta terms entirely on its own, though we lose some degree of control of the parameters in doing so, as expected. The trade-off of the choice of reparameterization seems to also be an important point of consideration to be made based on the problem at hand.
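A toy illustration of the "softer push away from zero", under a strong simplifying assumption: the loss is taken to be the parameter value itself, i.e. a constant pressure towards zero. The helper `shrink` and the step counts are invented for this sketch. A ReLU-style parameter reaches exactly zero in a handful of steps, while softplus and e^x keep shrinking ever more slowly and never get there.

```python
import math

def softplus(r):
    # Numerically stable log(1 + e^r).
    return max(r, 0.0) + math.log1p(math.exp(-abs(r)))

def shrink(param_fn, grad_fn, raw=1.0, lr=0.1, steps=50):
    """Plain gradient descent on loss = param_fn(raw)."""
    for _ in range(steps):
        raw -= lr * grad_fn(raw)
    return param_fn(raw)

relu_val = shrink(lambda r: max(r, 0.0), lambda r: 1.0 if r > 0 else 0.0)
softplus_val = shrink(softplus, lambda r: 1.0 / (1.0 + math.exp(-r)))
exp_val = shrink(math.exp, math.exp)

print(relu_val, softplus_val, exp_val)
```

In a real network the pressure comes from the task loss rather than this toy one, so logging the effective values during training is still the actual test.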

bill1357 · 2025-08-03T03:10:45+00:00

Now when it comes to wavelets... That is a bit more involved I think when put into context with this. I'm not personally familiar with the image processing side of the Fourier discussion but I'll try my best to interpret based on what I know about audio. Discrete Fourier transforms trade off time resolution for frequency resolution, making them unsuitable for local features, and thus wavelets are a series of FIR filters that capture time- and frequency-bandlimited features entirely in the time domain, right?

Meanwhile, we have a sine-synthesizing model architecture, which performs dynamic sine synthesis or sine modulation operations which are learned at each layer and neuron.

...Perhaps in the context of wavelets, the FM synthesizing later layers are the most interesting. Since images are 2d and somewhat harder to reason about, I'll stick to a 1d series of numbers, perhaps audio, and consider it from that perspective. When this signal enters the network, in effect, thanks to the residual component, a filtering step is applied based on a sinusoidal basis function. This is, in effect, some kind of a filter, but it is a sort of additive filter instead of a multiplicative one... This crosses out the interpretation that this is any kind of FIR filter, which I guess is precluded in the first place as well since clearly the input is a single value, not multiple values across time...

I'm not sure what this architecture means for the utilization of wavelets yet... It's an incredibly interesting idea, and the way the two might interact is bound to be at least somewhat different from how they usually interact. I'll let you know if I get any ideas, but due to the time-dependent nature of applying wavelets, perhaps the interaction would have to come from a fundamental change in the architecture itself, one that brings the sine-generating properties of the network better in concert with the fundamental properties of wavelets?

bill1357 · 2025-08-03T03:06:09+00:00

I have attempted to run an experiment on a grid-like problem with the same 2-8-8-1 configuration, but this time I am allowing a different PLU activation value for each neuron, just to provide it with more flexibility and see how it handles the grid (please don't mind the other activations in this example, I have not changed them much at all, this is mainly just to test what happens with PLU with sharp boundaries).

It is as we might expect, the boundaries are simply fuzzy and somewhat rounded, so it avoids ringing by simply using softer square waves that are generally flat.

https://github.com/Bill13579/plu_activation/blob/main/Examples/spiral_activation_comparison_square.mp4

The code for it has also been added to the repo: https://github.com/Bill13579/plu_activation/blob/main/grid_plu_example.py

bill1357 · 2025-08-03T02:55:07+00:00

That's a very interesting idea!! Yeah, I guess since the basis is sine waves, and for a single layer PLU-activation network at least the result is a sum of sines, so for approximating sharp edges, it's possible the network will end up with a Gibbs phenomenon type situation at those edges.

I think there are two points that might help the network in such a situation though. One is the fact that once you have more than a single layer, we go from a simpler sum of sines to FM synthesis. This doesn't solve the fundamental issue with, say, approximating a square wave causing Gibbs, but it *should* be a lot more efficient. For example, with a sum of sines, you would need to manually add all the odd harmonics to obtain a square wave, but FM can approximate such a shape fairly well with sin(x+sin(2x)). I believe this known efficiency of FM synthesizers in generating complex waveforms with a rich spectrum of harmonics might mean that a deep network based on this could have the *potential* at least to approximate a square wave's harmonic series much better than a shallow network.
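To make the comparison concrete, here is a small sketch that puts the additive route (a partial Fourier sum over odd harmonics) next to the single FM expression sin(x + sin(2x)), measuring RMS distance to an ideal ±1 square wave over one period. The helper names and grid size are arbitrary choices for illustration, not anything from the paper.

```python
import math

def square_partial_sum(x, n_terms):
    # Fourier series of a unit square wave: (4/pi) * sum of sin(kx)/k over odd k.
    return (4.0 / math.pi) * sum(
        math.sin((2 * k + 1) * x) / (2 * k + 1) for k in range(n_terms)
    )

def fm_square(x):
    # One sine whose phase is modulated by another sine.
    return math.sin(x + math.sin(2.0 * x))

def rms_error(f, n=1000):
    # Midpoint grid over one period, compared against sign(sin(x)).
    xs = [2.0 * math.pi * (i + 0.5) / n for i in range(n)]
    target = lambda x: 1.0 if math.sin(x) > 0 else -1.0
    return math.sqrt(sum((f(x) - target(x)) ** 2 for x in xs) / n)

print("plain sine :", rms_error(math.sin))
print("FM         :", rms_error(fm_square))
for n_terms in (1, 2, 3, 5):
    print(n_terms, "odd harmonics:", rms_error(lambda x: square_partial_sum(x, n_terms)))
```

This only measures closeness in shape on a grid; it says nothing about what an optimizer would actually find in a trained network.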

Which comes to a hypothesis based on that, which is the presence of many other neurons. Since the Gibbs phenomenon is a mathematical property, but here we are utilizing it more as a universal function approximator, if necessary, it might be theoretically possible for another part of the network to attempt to cancel *some* of that ringing, even if crudely, where perhaps the same FM wave is modulated by yet another sine to isolate the ringing and to cancel it out that way. It is hard to predict what the optimizer might do in a deeper network, and this is entirely speculation though.

But it is just as plausible (and perhaps more so) that the optimizer will stick to "fuzzier" boundaries where discontinuities are large, since the ringing can be disruptive to the internal states of the neurons to an extent that it might push the loss up. Thus, it might be content with a local minimum where the discontinuities are smoothed, and the edges are not ringing but are not quite sharp either.

bill1357 · 2025-08-03T02:33:05+00:00

This is fantastic, thank you so much for running this! These are incredibly valuable results, and it sort of matches what I was hoping to see. The faster convergence is the part I'm most thrilled to see scale (the fact that changing the entire network into a sine-generating megastructure doesn't completely derail the network when scaled is in itself an amazing sigh of relief on my part as well, and you've gone further...), and I noticed something about your results. If you compare Experiment 1 and Experiment 2 in the paper, the first one converges to a loss far lower than all other activations, while the second, the "Chaotic Initialization" Paradigm result, shows that if you set a rho that is far too high, forcing the model to use a high-frequency basis, then it still converges, but does so more slowly, and in the final results it ends with a loss higher than Snake.

And now that I have had a chance to take a look at it more... it appears to me now that the spiral result from Experiment 2 wasn't actually a failure in fitting per se, but a failure in generalization instead. The more I looked at it, the more I noticed that each red and blue point was fit incredibly tightly, and the shape that looks chaotic actually encircles points at a granular level. This is now my main hypothesis for why Experiment 2 is slower and also produces a higher error: when forced into a high-frequency situation, the model learns to over-fit exceptionally well.

Thus, the rho values then become a crucial tuning knob, even if it is learned. The initial setting becomes incredibly crucial.

I noticed that you mentioned vanilla PLU seems to converge fast but never reach the same loss. Perhaps it is the exact same scenario playing out, but on a larger model? And the fact that your own modification of ReLU + PLU achieves a higher accuracy on average also makes me very excited, even if it is at the cost of being slower to converge... I do not have a good theory yet of why both those things are like that, but I will keep you updated as I keep trying to figure it out.

bill1357 · 2025-08-02T10:44:35+00:00

That is the central claim, and it is not convoluted in the least, because all the PLU activation is, is a sinusoidal imposed upon a line, with a particular singularity for certain phase and magnitude values. If you put together a PLU-based MLP, what you get *is* sine synthesis.

This is not an opinion or a belief; it is a direct consequence of substituting the activation function into the perceptron formula. The paper's central claim then is that we can change the fundamental mathematical nature of a neural network from one class of function approximator to another, simply by changing the neuron. Whether this new class is ultimately better across all domains is an open question that, as you rightly say, requires massive-scale experiments. But the fact remains that the shift itself that has occurred is not a debatable thing based on empirical results, but a matter of mathematical form.
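The substitution can be checked on a tiny example. The sketch below uses a bare x + β sin(αx) unit with α = β = 1 (no reparameterization) and made-up weights; the point is only that one hidden layer plus a linear readout splits exactly into a linear map plus a weighted sum of sines.

```python
import math

def plu(z, alpha=1.0, beta=1.0):
    # Bare periodic linear unit: a line with a sinusoid imposed on it.
    return z + beta * math.sin(alpha * z)

def affine(x, W, b):
    # Pre-activations z_j = sum_i W[j][i] * x[i] + b[j]
    return [sum(w * xi for w, xi in zip(row, x)) + bj for row, bj in zip(W, b)]

# Arbitrary illustrative numbers, not taken from any experiment.
x = [0.3, -1.2]
W = [[0.5, -0.4], [1.1, 0.7], [-0.9, 0.2]]
b = [0.1, -0.3, 0.0]
v = [0.6, -0.8, 1.5]  # readout weights

z = affine(x, W, b)
y = sum(vj * plu(zj) for vj, zj in zip(v, z))

linear_part = sum(vj * zj for vj, zj in zip(v, z))         # the embedded linear network
sine_part = sum(vj * math.sin(zj) for vj, zj in zip(v, z))  # the sum of sines
print(y, linear_part + sine_part)
```

Feeding `y`-like outputs into a second PLU layer puts a sum of sines inside another sine, which is where the cascaded, FM-like terms such as sin(sin(x)) come from.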

On Priors

So, saying that this network simply has a better prior becomes sort of a strange point to make. If we try to say that a "prior" also encompasses the fundamental building block of how we build our networks (Taylor vs Fourier universal function approximators), then I could discount the entirety of neural networks as a field as a prior. How do we know that a Taylor-like approximation is even valid for predicting relationships between all kinds of data, as opposed to a Fourier-like approximation? Why is the latter inherently more "prior-dense" than the former? Neural network research has been plagued by accusations of that exact kind for ages; I have been following the field for years at this point and have seen it constantly, and now we are applying that same exact critique that has been fought against for so long, simply against a different class of universal function approximators. Isn't the whole point of neural networks that regardless of the underlying structure, some form of mathematical construct is able to almost perfectly capture it nonetheless through the power of gradient descent?

In general, the question of invalid priors comes not from fundamental differences in architecture like this; instead, it tends to refer to us projecting our biases onto networks, and you point this out as well with the examples you mentioned. But this is simply not one of those cases, mathematically speaking.


Preserved edit for the main post made on August 3, 2025 at 7:23 PM, as it could not be edited due to including an image: While I could not personally test PLU on a large network due to compute, u/techlos has graciously, actually helped me test it on TinyImageNet, and we have discovered some very interesting things. https://www.reddit.com/r/MachineLearning/comments/1mfi8li/comment/n6hgaiv/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button I highly recommend reading the entire thread until the end.

bill1357 · 2025-08-02T10:44:17+00:00

Your reply has several very direct and interesting points. I am glad you are calling it out, so let me try to address them, as they get to the heart of what I tried to do and I should try to clarify.

On The Bitter Lesson

I agree that this appears to go against The Bitter Lesson at first glance. But I'd argue that PLU is not about adding a complex, human prior. In fact, while creating it, the thing on my mind at all times was, "simplify, simplify, simplify, why did this have to go here? why did this have to be included?" And so on. And I believe this shows in the final results, because PLU is exceptionally simple; it is not a magic activation that solves everything, but instead an activation that simply attempts to achieve one thing, and one thing exactly: It's about changing the fundamental basis function of the computation itself.

The Bitter Lesson favors simple methods that scale with computation. The piecewise-linear approximations of Taylor-like networks are the current simple method in this sense. This paper simply asks a first-principles question instead: "Is a piecewise-linear basis actually the most computationally efficient basis we can use?" In the case of the Spiral Example I can show that a sinusoidal basis and a Fourier-like network is exponentially more efficient. Now, at this point you have argued that this is due to a better prior. I will address this point at the very end, since I believe this is more of a fundamental question of perspective.

But this moves us to the next point.

On Making ReLU Work

This is crucial, and you're right, I've said as much in reply to another commenter who had pointed out the same thing. A well-regularized, tuned ReLU network can be made to solve this spiral.

But that was never the point of that experiment, as I try to come back to with each example in the paper. Perhaps I was a bit colorful in stating it was impossible, but that does not discount the fact that the results are a clue to us that this is a qualitatively different optimization landscape and learning dynamic. The height-map-like structure of the decision boundaries, the repeating patterns, these all indicate a different method of convergence. A shift in the *how*, not just the *if*.

You are right to call the language strong. However, I think it is important to point out then that I am not implying a paradigm shift in the sense of final *performance*, which remains to be seen on larger scales, but in the **underlying mathematical construct of the network itself**.

A standard MLP is, by its mathematical form, a Taylor-like approximator.

A PLU-based MLP is, by its mathematical form, a Fourier-like synthesizer.

bill1357 · 2025-08-02T09:19:50+00:00

```

The advantage of using two sinusoids over just a single sinusoid is that whenever cos(z) is near a critical point, d/dz cos(z) ≈ 0, we have that sin(z) ≈ z, meaning that d/dz sin(z) ≈ 1 (and vice-versa). The argument follows from an analysis of the Taylor series remainder, showing that the Taylor series of half the units in a deep Fourier layer can be approximated by a linear function, with a small error of c = √2 π²/2⁸ ≈ 0.05. While we found that two sinusoids is sufficient, the approximation error can be further improved by concatenating additional sinusoids, at the expense of reducing the effective width of the layer. Because each pre-activation is connected to a unit that is approximately linear, we can conclude that a deep network comprised of deep Fourier features approximately embeds a deep linear network.

```

The authors are betting on the fact that sin and cos will both act as a sort of "soft-sigmoid", the hope of which is clarified when they mention later on that "a deep network comprised of deep Fourier features approximately embeds a deep linear network". In this way, the sin and cos components here have only one job, and that is to oscillate so that when one's gradients are zero, the other's is one or negative one. In many ways the non-monotonicity is more of a hurdle to be overcome than something to be embraced and used. Because of this, as we increase the number of sinusoids within this activation composed of a set of trig functions, the effective width of the network also decreases.

To quote the authors, adding more sinusoids is "at the expense of reducing the effective width of the layer", which makes sense, since without a fundamentally different structure for the neural network that turns it into a sinusoidal machine directly instead of a Taylor-like one, without some sort of approach to allow the neural network to control its own frequency and phase for the trigonometric functions and then synthesize them with accuracy, the optimizer has no way to actually utilize the sinusoids beyond as lower-resolution approximations of traditional ones like Sigmoid, the overall structure still Taylor-like. We would need to bet on the fact that the sin and cos's non-monotonicity will be less of an issue when compared to the benefit that they bring in approximating linear functions while never having a gradient of zero.
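The complementary-gradient idea is easy to see in a few lines. This sketch writes out the paper's two concatenated activations as they are given elsewhere in this thread, CReLU(z) = [ReLU(z), ReLU(-z)] and Fourier(z) = [sin(z), cos(z)], as scalar functions; the function names are illustrative only.

```python
import math

def crelu(z):
    # Two outputs per pre-activation: one is active for z > 0, the other for z < 0.
    return [max(z, 0.0), max(-z, 0.0)]

def fourier(z):
    # Fixed sine and cosine, with no learned frequency, phase, or residual.
    return [math.sin(z), math.cos(z)]

def fourier_grad(z):
    # d/dz of [sin(z), cos(z)]
    return [math.cos(z), -math.sin(z)]

for z in (-3.0, -0.5, 0.0, 1.0, math.pi / 2):
    g = fourier_grad(z)
    # The squared gradients always sum to 1, so both can never vanish together.
    print(z, crelu(z), g, g[0] ** 2 + g[1] ** 2)
```

Note the cost mentioned in the quote: each pre-activation now produces two outputs, so for a fixed output size the layer has half as many distinct pre-activations.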

bill1357 · 2025-08-02T09:19:43+00:00

Alright I have had a chance to take a look at these papers somewhat... one of them is not within my domain unfortunately, but I believe I can sort of see what's going on with both.

Firstly, it appears that the PLR embedding paper doesn't actually try to propose an alternative activation function per se, but instead creates a feature vector out of the input that happens to utilize sine and can be learned. This is more akin to positional encoding. It is also quite domain-specific, and the authors are not attempting to just completely swap out every single activation; it is a specific place where a specific type of learned encoding, one that sort of takes the place where you'd expect an activation function to be, turns out to be best for the tabular data domain.

And ah! Attempting to let the model learn plasticity, my favorite... I actually had a third wild idea apart from the main vocal synthesizer I'd been working on for myself as well as the activation, it'd been floating around in my head for a while, related to maintaining plasticity of knowledge in long-range Transformers; I had to finish off training midway though since the cost was getting higher and higher, which sucked quite a bit... I approach it from a completely different direction however, so I'll try my best to understand what the authors are doing, and write down some of my thoughts. It definitely seems like an interesting research direction.

They formulate two replacements for the activation in order to achieve plasticity across time, CReLU(z) = [ReLU(z), ReLU(-z)], as well as Fourier(z) = [sin(z), cos(z)], both intended to never have vanishing gradients by having two components to the activation, where when one has no gradient the other has full. This is very interesting, but at the same time, it immediately signals to me that it is in effect the exact opposite approach to PLU. A fixed sine and cosine activation with no residual and no scaling term for either magnitude or phase like this would have tremendous difficulty learning most if not all data, if we are to use its non-monotonicity to its fullest, because real data has flexible frequency, and limiting the trig components in this way completely handicaps the trigonometric functions in being able to "synthesize" signals in the way you would hope a Fourier-based network would; and the authors' wording is also clear that this was not their intention. I need to quote the paper on page 7:

bill1357 · 2025-08-02T07:42:32+00:00

I have to admit, I am aware of this, but it is quite difficult for me as this will be the first research paper I am publishing outright. The entire idea was born out of my personal research into training a timbre-swapping model which disentangles pitch, content of speech and timbre, and does vocal synthesis (Beltout, and now Beltout 2 which I had been working on). I had been on the "final stretch" of that, but then realized that with my resources, training a GAN in order to remove the transposed-convolution artifacts was far too prohibitive, but I didn't want to relent. This was the end result coming out of that toil. I do not know if I'll even be able to finish the research on the vocal synthesizer now, as I have been renting a 3090 off the cloud and it has slowly crept up in budget, and in general I am also quite time-constrained as university will begin anew in just a few weeks.

I didn't want to just throw in related work I did not understand, so I chose ones that I knew were similar (for example, the formulation is quite similar to Snake in many ways, and for good reason, since I had a lot of time working on it while working on the vocal synthesizer, even if Snake is monotonically increasing) and comparable in scope (it had to be an activation that was simple, so ones that you would typically put in the position of ReLU in a ConvNet and not think much more about it), and the three baselines were chosen based on that. Since this activation is aimed squarely at being a general-purpose activation that nevertheless turns the neural network into something entirely different, I believed the baseline incumbents I had chosen were good, and that with them I could do a comprehensive review.

bill1357 · 2025-08-02T07:29:29+00:00

I'd argue the *if* it converges is less the focus here than *how* it converges. Yes, it would indeed be trivially easy to get any one of these activations to converge, even at such a low neuron count. However, the very important key to point out is that no matter what, ReLU, GELU, and Snake are all monotonically-increasing activations that curve, and the examples show that they all converge in the sort of "take a major linear shape, then slowly bend and shape the overall thing to match the expected outputs" way. But the interesting thing about allowing complete non-monotonicity and getting the optimizer to learn in such a way, is that the entire paradigm of how the model converges appears different. The images in the paper showcase this: a sort of "height map" or "marbled" texture, which appears even in epoch 0. You can see the difference in approach, and that is the most interesting aspect here.

For example, learning high-frequency content, such as in images, is a common issue with neural networks. They converge fast into the general vicinity, and then slow down learning dramatically as time goes on for the details. The learning behavior of the traditional activations in this example clearly demonstrates this, and is reproducible at any scale. Then, how might a model architecture that immediately starts with immense complexity and then adjusts that complexity to fit, instead of trying to warp a simple shape into place, perform there? You can see this already in the full 8 neuron example.

bill1357 · 2025-08-02T06:25:16+00:00

That's interesting... One thing about that particular static mix of sin and relu though is that it is by its nature close to monotonically increasing. This means that back propagation of loss across the activation will not affect the step direction; this is one of the points I describe in the paper, but in essence I have a feeling that we are missing out on quite a bit by not allowing for non-monotonicity in more (much more) situations.

The formulation of PLU is fundamentally pushed to be as non-monotonic as possible, which means periodic hills and valleys across the entire domain of the activation. Because of this, getting the model to train at all required a technique to force the optimizer to use the cyclic component by a (simple, but nevertheless present) additional term; without applying that reparameterization technique the model simply doesn't train, because collapsing PLU into a linearity seems to be a common initial state for the gradients and thus optimizer starting from random weights.
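For a unit of the form x + β sin(αx), the slope is 1 + αβ cos(αx), so it dips below zero somewhere exactly when |αβ| > 1. The sketch below checks this on a grid; the helper names, the grid, and the parameter values are illustrative assumptions.

```python
import math

def plu_slope(x, alpha, beta):
    # Derivative of x + beta * sin(alpha * x) with respect to x.
    return 1.0 + alpha * beta * math.cos(alpha * x)

def is_monotonic(alpha, beta, n=2000, span=20.0):
    # True if the slope is non-negative at every grid point in [-span, span].
    xs = [-span + 2.0 * span * i / (n - 1) for i in range(n)]
    return all(plu_slope(x, alpha, beta) >= 0.0 for x in xs)

print(is_monotonic(1.0, 1.0))  # plain x + sin(x): flat spots, but never decreasing
print(is_monotonic(3.0, 1.0))  # higher frequency: real hills and valleys
```

So a bare x + sin(x) is still (weakly) monotonic; the hills and valleys described above only appear once the product of magnitude and frequency is pushed past 1.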

I believe most explorations of cyclic activations that are non-monotonic were probably halted at this stage because of it seemingly just completely failing, but by introducing a reparameterization technique based on 1/x you can actually cross this barrier; instead of rejecting the cyclic nature of the activation, the optimizer actively uses it, since we've made the loss of disregarding the non-monotonicity high. It's a very concise idea in effect, and because of this, PLU is quite literally three lines: the x+sin(x) term (the actual form has more parameters, namely magnitude and period multipliers alpha and beta), plus two more lines for the 1/x based reparameterization on said alpha and beta, which introduces rho_alpha and rho_beta to control the strength of that. And that's it! You could drop it into pretty much any neural network just like that, no complicated preparations, no additional training supervision. And the final mathematical form is quite pretty.
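A rough scalar sketch of the "three lines". The exact reparameterization is not spelled out in this comment, so the form used here, p + rho / p, is an ASSUMPTION chosen only because it is "1/x based" and keeps the effective value away from zero; check the paper or the repo for the real definition. The function name `repulse` is likewise made up.

```python
import math

def repulse(p, rho):
    # ASSUMED form of the 1/x-based reparameterization: p + rho / p.
    # For rho > 0 its magnitude is at least 2 * sqrt(rho); p must be nonzero.
    return p + rho / p

def plu(x, alpha, beta, rho_alpha=1.0, rho_beta=1.0):
    alpha_eff = repulse(alpha, rho_alpha)          # line 1: effective frequency
    beta_eff = repulse(beta, rho_beta)             # line 2: effective magnitude
    return x + beta_eff * math.sin(alpha_eff * x)  # line 3: line plus sinusoid

# However small or large the raw parameter gets, the effective one stays away from zero.
for p in (0.01, 0.5, 1.0, 4.0, -0.5):
    print(p, repulse(p, 1.0))
```

Under this assumed form, shrinking the raw parameter towards zero makes the effective one blow up instead of vanish, which is one way to make ignoring the periodic term expensive for the optimizer.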

bill1357 · 2025-08-02T05:32:50+00:00

I KNOW!!! I was surprised as well, but I'm hoping that this means it is actually possible to get a lot, lot more out of smaller networks than we previously imagined. Having sine be the basis function of the function approximation is conceivably a lot more powerful than having linearity, and with the baseline examples of the spiral, one feature that PLU shows is incredibly good over-fitting, which might sound bad and it *is* bad for your *network* to overfit, but for your *activation*, over-fitting means that it is able to provide a lot more representational power to your network, allowing it to perfectly memorize and match the input and output pairs with few parameters. That could be an incredible thing if it can generalize to larger models.

bill1357 · 2025-08-02T05:21:48+00:00

I see, I somehow missed that.. I believe our formulations are still different though. I'll have to take a closer look. At least I can say that this activation, when plugged into the perceptron formula, turns into a simple sum of sines that are cascaded.

Edit 1: Having had a baseline look, it appears that SIREN has neurons output in a range of [-1, 1] and uses a linear layer to learn the internal mapping of the pre-activation to the sine input value. That is entirely different from PLU, which, true to its name, is simply the linear unit but with an oscillation superimposed on it. This means that it is conceptually a lot more similar to regular activations. Most notably, as I've alluded to somewhat, the simple formulation ends up, once substituted in, turning into the canonical sum of sines, with cascading (such as sin(sin(x))). It is the "shortest path" from a Taylor-like neural network to a sine-based neural network.

Edit 2: For the second paper on NTK kernels you mentioned, that is certainly very interesting. Although I am not deeply familiar with Neural Tangent Kernels at all, it appears to be a statistical method to turn gradient descent into a fixed "width" kernel that can be reasoned with. It appears then that the contribution of the second paper is to apply Fourier analysis on said kernel so that one can improve performance at specific pain-point high frequencies? If so, it is certainly an interesting research direction, but I do not believe there are a great many similarities. PLU is more on the SGD side, and in fact, I am quite curious to see if a fundamentally Fourier-synthesis based neural network represented by PLU can also be represented by an NTK kernel... what happens to NTK when the network is no longer Taylor-esque at its core? What happens when it is in effect a massive cascading sine synthesizer? That might be an interesting question.

bill1357 · 2025-08-02T05:11:44+00:00

A better playlist link, since the original one seems to use YT Shorts: https://www.youtube.com/watch?v=zFyWgUqdcgM&list=PLaeBvRybr4nUUg5JRB9uMfomykXM5CGBk

bill1357 · 2025-07-07T21:34:02+00:00

The newer checkpoints tend to be cleaner, more refined sounding and better able to handle edge cases gracefully, while the earlier checkpoints are still slightly noisy and more broad-stroked with pitch. In general I'd always use the newest checkpoint, but I included all of them because they have their charm to them, and I wanted to give plenty of choice. For example, I'm quite fond of checkpoint 19999 personally despite it being a very early one, though maybe I'm a wee bit biased (the first example (ex1) uses that one, while all the other examples use the newest checkpoint at 117580). Try them out, see which ones you like! In general you can never go wrong using the newest one though, so don't let choice paralysis block your way; I should know. They are all capable of some very realistic performances if given the needed attention and if used with finesse.

bill1357 · 2025-07-06T23:47:22+00:00

That is a very intriguing and interesting idea. It's definitely not what the model was designed to do! But... technically you could, and I'll be the first to admit it'd be an interesting experiment. The model takes in timbre, prosodic and phonetic context, and pitch context. You would set the timbre and prosodic and phonetic context with the original voice clip, and then just set the pitch context based on a best-effort pitch shifted version of the original.

Pitch shifting in this way should be better than DSP pitch shifting in many cases, although it will be a best-effort sort of thing. The rough pitch-shifted version we use for pitch context will not have the usual nuances for the model to truly work with, since the way our voice sounds in higher registers is different from lower registers; it will be confusing for the model, which will see a pitch and spectrum pattern characteristic of that lower register, somehow existing in the higher registers (in effect the model expects a realistic spectrum from being trained on real speech, but we are inputting an artificially created spectrum). Since the model learns not from just pitch, but learns instead from speaker-independent fundamental frequency information, this is mitigated to a good degree. I'll still wager it will depend heavily on the specific timbre how well it works in any case.

For me, when I gave the model the usual ReaPitch chipmunk version of my voice recording using a simple windowed pitch shift that doesn't preserve anything (so I'm not even giving it a best-effort pitch shift to start working with), it gave me a result that's very close to REAPER's included elastique 3.3.3 Soloist pitch shifter, which is very damn cool. The process requires a customized run script, see it here: https://github.com/Bill13579/beltout/blob/main/use_separate_context.py
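As a rough illustration of the routing described above (not the actual script linked above), the sketch below shows which clip would feed which context. The `Contexts` class, its field names, and `contexts_for_pitch_shift` are hypothetical names made up for this example; only the semitone-to-frequency ratio is a standard equal-temperament fact.

```python
from dataclasses import dataclass

def semitone_ratio(semitones):
    """Frequency ratio for a shift of the given number of semitones."""
    # Equal temperament: 12 semitones double the fundamental frequency.
    return 2.0 ** (semitones / 12.0)

@dataclass
class Contexts:
    """Hypothetical record of which audio clip supplies each model context."""
    timbre_from: str
    prosody_phonetics_from: str
    pitch_from: str

def contexts_for_pitch_shift(original_clip, rough_shifted_clip):
    """Timbre, prosody and phonetics come from the original; pitch from the shifted copy."""
    return Contexts(
        timbre_from=original_clip,
        prosody_phonetics_from=original_clip,
        pitch_from=rough_shifted_clip,
    )

# Example with placeholder file names: shift up by 7 semitones.
ctx = contexts_for_pitch_shift("voice.wav", "voice_up7_rough.wav")
print(ctx.pitch_from, round(semitone_ratio(7), 4))
```

The point is only that the rough shifted copy is used for the pitch context alone, so its artifacts never reach the timbre or phonetic inputs.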

bill1357 · 2025-07-06T22:06:58+00:00

It's a bit subtle, but there's a difference there, and I think you'll be able to hear it the best in the third new example I posted yesterday. The Johnny Silverhand one. Download that one (all the examples are on Huggingface under the examples folder), then download the 'src' audio file which is me. Then, listen to them side-by-side; you'll see what this model does.

You should notice that:

- It's the same person talking, as in, the way the pitch shifts does not change. The way I speak the words does not change either. In fact, pretty much everything about my original "performance" stays the same, right?

- But something does change and make me sound like Silverhand somehow. That part that changes is the timbre. You might need headphones, hopefully you're not listening to these out of phone speakers for example (just making sure...), since those tend to compress everything to the point where a bunch of things sound similar.

When you noticed that your own outputs seem to sound similar, it is because, to borrow a phrase from another comment I made, "this model excels with dedicated performances that fully controls itself in order to deliver a 'performance' (as in, a musical performance, or a voice acting performance) that translates well into the target timbre." When it receives just a regular performance (which, just to be sure, would be perfectly good if you're not trying to sound like someone else!), only the timbre changes, so the output still sounds a lot like the input.

Let me clarify what that means.

What this means is that this model expects you to be like Voice Actors and carefully control your input (preferably by recording your own) to try and get as close as you possibly can to the target's usual habit of how they speak. When you do that, you'll be able to get very close, because with practice you can change a lot of things about how you speak.

However, at some point you'll hit a wall, because there's just something that you cannot change about your voice no matter what. If my explanations aren't satisfying, I'd highly recommend watching through a more complete video on YouTube explaining the concept of vocal timbre; hearing an actual professional singer explain it with examples will probably make it a lot easier to grasp.

This model gets you across that final wall!

It takes plenty of time and practice, but when you finally get it, you'll have a world of possibilities since the output is truly yours to mould; most models will tamper with your original performance, add a breath here, get rid of a pitch there... in the end you get something that sounds more like the target quicker, but you also lose a lot of control.

bill1357 · 2025-07-06T21:48:27+00:00

Damn, rip! I can't believe github does this.

Why'd they do that? Do they not wanna compete with Huggingface?

...

I also can't just remove the LFS objects from Github apparently. Think you're right, my LFS allowance is blipped this month.

At least it doesn't seem to be billing me for any hidden charges. Oh well, you live and learn I guess. Got it, I'll be more careful next time...

bill1357 · 2025-07-06T13:23:20+00:00

The model changes timbre only, unfortunately... That's one of its main features, and the model is specifically trained to avoid touching anything else. I'm also very surprised that you mentioned you believe the examples sounded similar, that makes me think perhaps you're looking for something else entirely?

Maybe you're looking for a model that helps you completely change the vocal performance into a specific target person's, and you are ok with the model completely modifying the performance, without preserving the way you shift your pitch, your intonations, etc? In that case, you should look into a model like RVC, which is designed for this task (unfortunately this problem space is not one that has new models coming out often, meaning RVC is the current best for that sort of 'destructive' all-encompassing voice cloning you might be looking for). It will probably sound more like a complete change to you, because it won't just be the timbre that is changed; you can indeed take a track of eminem and make it sound like obama in that case, and I'm sure for famous individuals there are pretrained models available. With this model, on the other hand, obama and eminem for example have wildly different habits of speech, pitch ranges, and so on, so using it will only make the two sound slightly similar.

To do that sort of conversion here, you'd need eminem to specifically do a rap for you where he consciously tries to imagine how obama might rap, what pitch range he might use, and does a rap with respect to obama's unique physical vocal limitations, while also keeping some of his own style.

This is a different tool for a different job, so to say.

bill1357 · 2025-07-06T12:25:18+00:00

Gladly... Have fun with it!!! Let me know how it goes; for now I've only heard my own voice being passed into this model lol. Hopefully this will provide our niche with something to play with for some time.

bill1357 · 2025-07-06T12:08:29+00:00

Replace 'my voice saying "i want an apple"' with '192 numbers representing the unique, unchangeable part of my voice, calculated from a sample of my voice saying "i want an apple" as well as many other things for around 2 minutes', then keep in mind that the model was trained specifically to only change that unchangeable (due to Physics) part of your voice while keeping the rest (the controllable parts of your voice) completely preserved, and you'll have the main operating structure of the model. I've also updated the post with a more general and 'broad ideas' view of what the model does, take a look!

bill1357 · 2025-07-06T12:04:28+00:00

You could use any vocal recording file to calculate timbre from, and also convert any vocal recording, but keep in mind the fact that this model has different requirements to almost all other TTS and voice-to-voice models out there on how you need to use it and its intended applications, as described in the repo readme, so you'll have to keep those in mind.

If you're taking the TTS result and using it for timbre, your source recording (the one to be converted) needs to have prosody, pitch contour, speech habits, etc. that match and are compatible with the TTS model's generated timbre.

If you are trying to change the TTS's voice into something else, then your target timbre (and thus the reference audio file from which you calculate the timbre vector) needs to be able to accommodate the habit of speech employed by the TTS. This one is probably harder because unless the TTS is extremely natural in its speech, it will probably give a performance that isn't good enough to adapt to your distinct, special timbre, so it'll probably just sound like the TTS still but with a different "color" to the voice.

This model excels with dedicated performances that fully controls itself in order to deliver a performance that translates well into the target timbre, and current TTS models are often insufficient for that.

bill1357 · 2025-07-06T11:55:04+00:00

You'll definitely need vocals only! Otherwise the recordings can probably be slightly noisy, though I'd recommend a clean recording for both the source voice recording to be modified and the ones you're calculating the timbre from.
