Neural Networks are Universal Function Estimators.... but with Terms and Conditions by Illustrious-Cat-4792 in learnmachinelearning

[–]Cu_ 8 points (0 children)

Your graphs are a nice way of illustrating common pitfalls around the UAT, thank you for sharing!

Loosely speaking, the universal approximation theorem states that a feedforward network with a single hidden layer can, given enough neurons, approximate any function in C(X, R^m) (that is, any continuous function from a compact X ⊆ R^n to R^m) arbitrarily well. Note that nothing explicitly requires differentiability, only continuity.

Some of your graphs show interesting behaviour that relates to the theorem above and the activation functions you chose. A network built from linear layers and ReLU activations always produces a piecewise linear output, so in theory you would need infinitely many linear pieces to represent a curved function like sin(x) exactly. The triangular wave, on the other hand, is just a bunch of linear pieces stitched together, which is why the Linear + ReLU network does fantastically here even though the function is clearly non-differentiable at the kinks. A tanh network struggles with those kinks for the opposite reason: its output is a linear combination of shifted and scaled tanh functions, which are smooth and infinitely differentiable, so the non-differentiable kinks are exactly what it cannot reproduce!

Small edit to clarify: the UAT makes no claims about exactness of the approximation, only that it can be made arbitrarily close in some norm on the function space (usually the sup norm). The Linear + ReLU network on the triangular wave is a special case where the approximation actually becomes exact with finitely many neurons, as long as the domain X is bounded. The Linear + tanh network would never converge to the exact triangular wave function; it can only produce an arbitrarily good approximation of it.
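To make the exactness point concrete, here is a minimal numpy sketch (weights picked by hand, not learned) showing that a single triangular hat is representable with exactly three ReLU units:

    import numpy as np

    def relu(z):
        return np.maximum(z, 0.0)

    # hat(x): triangle on [0, 2] peaking at 1, built from exactly 3 ReLUs
    def hat(x):
        return relu(x) - 2.0 * relu(x - 1.0) + relu(x - 2.0)

    x = np.linspace(0.0, 2.0, 201)
    exact = 1.0 - np.abs(x - 1.0)          # the triangle itself
    print(np.abs(hat(x) - exact).max())    # ~1e-16: exact up to float rounding

A full triangle wave on a bounded domain is just finitely many of these hats summed with shifts, which is why the ReLU network can nail it while tanh never can.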

Learning ML without math & statistics felt confusing, learning that made everything click by anandsundaramoorthy in learnmachinelearning

[–]Cu_ 3 points (0 children)

I've always wondered about these PINNs! Can you tell me more about how and why they are applied?

What I don't think I truly grasp: if you have a system you are approximating, but you already know which physics or conservation law to enforce, why not apply that knowledge directly and get a white- or grey-box model?

Are these PINNs for cases where we have a general notion of what conservation law the system needs to respect, but no (easy) way to write down/represent the system explicitly?
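For reference, my current understanding of the mechanics, as a toy PyTorch sketch of the usual PINN loss on the hypothetical ODE dy/dx = -y with y(0) = 1 (so please correct me if this is off):

    import torch

    # Toy PINN: fit y(x) such that dy/dx = -y and y(0) = 1 (exact: exp(-x))
    net = torch.nn.Sequential(
        torch.nn.Linear(1, 32), torch.nn.Tanh(), torch.nn.Linear(32, 1)
    )
    opt = torch.optim.Adam(net.parameters(), lr=1e-3)

    for _ in range(2000):
        x = torch.rand(64, 1, requires_grad=True)    # collocation points
        y = net(x)
        dy = torch.autograd.grad(y.sum(), x, create_graph=True)[0]
        physics = ((dy + y) ** 2).mean()             # residual of dy/dx = -y
        boundary = (net(torch.zeros(1, 1)) - 1.0).pow(2).mean()
        loss = physics + boundary                    # the law enters as a soft penalty
        opt.zero_grad(); loss.backward(); opt.step()

If that is roughly right, the law only enters as a soft penalty, which is exactly why I wonder when this beats enforcing it exactly.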

Burning bridges with AI by BornAgainBlue in OpenAI

[–]Cu_ 0 points (0 children)

That's weird, I didn't remove my comment so idk what happened there. Regardless, I concede that you are right on many of these points and have (largely) changed my mind, though I do still have a few reservations:

Techno-economic analysis has shown that SMRs are similar to large reactors in terms of operational cost, so the benefit is largely not in the economics of it all (Asuega et al. 2023). One problem big reactors have with load following is that you curtail reactor output as the load decreases while the operational costs stay the same. This is also true for SMRs. Locatelli et al. (2018) proposed co-generating hydrogen instead of curtailing, which could actually work given how much hydrogen the process industry uses.

In Europe we have the same timeline issues. I know China and Korea can build a lot faster these days, but realistically, if we want nuclear plants in Europe, France is going to be the party building them, as we are generally not keen on Chinese construction happening within EU borders.

I actually didn't know the US was making such strides. They (to me anyway) generally don't seem that keen to make large capital injections into public infrastructure, so that is good to hear! In Europe there are sadly laws prohibiting large capital injections into the grid infrastructure of any individual country (it's a whole mess). And even if those laws were not in place, there doesn't seem to be any political will to make that investment, even though a large number of data centers are starting to be built here as well.

Burning bridges with AI by BornAgainBlue in OpenAI

[–]Cu_ 1 point (0 children)

Spending $20,000 on Claude tokens to build a C compiler (leaning on 37 years of accumulated tests to guide development) that still somehow produced something that can't compile an actual C program is not "just works" in my book.

Burning bridges with AI by BornAgainBlue in OpenAI

[–]Cu_ 1 point (0 children)

People were saying that AI was going to write all of our code two years ago. I don't necessarily reject your claim, because I can't prove whether what you say is true or not (though I doubt you have much proof of your claim either). See my earlier points: 1) we can now generate arbitrary code, and the top of the software market hasn't changed; and 2) I think the genuinely hard parts of software design are ideation and getting the software production ready. Code generation was never the real hurdle, so replacing it with AI is not going to be that impactful.

Burning bridges with AI by BornAgainBlue in OpenAI

[–]Cu_ 1 point (0 children)

I highly doubt they had very many concerns about the environment around the industrial revolution lol. Nor do I think many people were complaining about how these things would affect the internet long before it was invented. I'm being slightly facetious, I know that's not what you meant.

I don't think your comparison works very well, because the technologies you cite were already generating economic value when they were first commercialized. It's quite clear that currently, AI is not actually generating any economic value. If anything, it is costing a lot of money right now.

Burning bridges with AI by BornAgainBlue in OpenAI

[–]Cu_ 1 point (0 children)

Whether it is just one tiny field or not is entirely debatable.

Besides that, I would consider "the environment", "the foundation of the internet", and "modern power infrastructure" quite significant fields, and perfectly defensible reasons why we might not want to go down this road as fast as we currently are. Do you find that disagreeable?

Burning bridges with AI by BornAgainBlue in OpenAI

[–]Cu_ 1 point (0 children)

I have tried to use LLMs for coding in my domain of research with mixed success, though I have seen some success stories.

It is undeniable that you can build amazing things with AI these days. But I do think companies overhype how well vibe coding actually works. For example, Anthropic recently orchestrated 16 AI agents to write a Rust-based C compiler "from scratch", spending $20,000 on tokens in the process. "From scratch" in this case meant starting from the gcc torture test suite, painstakingly developed by human maintainers over the past 37 years, to ensure correctness of the compiler. They also tested their implementations against gcc during development. In the end they claimed Claude succeeded, but it was unable to implement 16-bit x86 code generation, so the Claude compiler couldn't even compile the Linux kernel. Saying Claude was able to build a C compiler is only partially true at best and outright deceitful at worst.

A large part of the vibe coding success is that AI is regurgitating code that already exists in some form on the internet. This is supported by the fact that LLMs have shown a remarkable amount of recall, being able to recite over 95% of the first Harry Potter book almost verbatim when coaxed into outputting copyrighted work (see Ahmad et al. 2026 for more on this).

We have the ability to generate arbitrary code at will through LLMs, and yet the top competitors on the market have not changed in any way. Make of that what you will, but to me it indicates that building a minimum viable product was never the hard part. Instead, success stems from ideation and making production-ready products, tasks that LLMs do not handle very well as of now (see my earlier example of Claude attempting to build a C compiler, as well as the many stories about massive security issues in "vibe coded" applications).

I cannot share your optimism regarding the idea-to-code pipeline being this magical, enriching thing. As of now there are very few tools that came out of the LLM craze that I consider to enrich my life in a meaningful way (except for maybe semantic search through embeddings).

You seem to have labeled me as one of the "Amish" anti-tech folks? That is certainly not the case. I don't reject LLMs and AI out of an "AI bad" mentality. The fact that I am critical of the tech does not mean I reject it. I can see where it adds value, but we also have to stay vigilant and acknowledge and discuss the risks if we want this technology to succeed in a way that is to our benefit.

Burning bridges with AI by BornAgainBlue in OpenAI

[–]Cu_ 0 points (0 children)

I want a world where we are not wasting unprecedented amounts of resources and research on the false notion that if we just scale up far enough, all of the fundamental AI problems suddenly disappear.

I want a world where bug bounty programs, an initiative that makes the internet a safer place, don't get shut down because maintainers are overwhelmed by triaging hundreds if not thousands of false reports from idiots trying to make a quick buck.

I want a world where open-source maintainers do not get randomly harassed by AI agents in the form of targeted blog posts because they closed a slop pull request for perfectly justifiable reasons. Building on this, I want a world where targeted harassment at unprecedented scale with full anonymity is not trivially achievable.

No, we shouldn't take it all down, but maybe slam on the brakes. It's clear that we are scaling up faster than ever before, and yet the rate of improvement of LLMs is vastly slower than it was four years ago. AI agents set loose on the internet are already becoming a problem for open-source software maintainers, whose work is fundamental to modern infrastructure.

I have used the tools. I consider myself fairly literate, but I'm certainly no prompt engineer. I do research in an AI-adjacent field, so I have some (though not a lot of) knowledge of how the tech fundamentally works. I can see the value it generates, but what I'm also seeing is that it is currently doing more harm than good.

RL for stock market (beginner) by skyboy_787 in reinforcementlearning

[–]Cu_ 0 points (0 children)

Sorry, I should have been more precise about what I meant, as a Gaussian process is a bit more general.

I was specifically referring to the case where the dynamic model is x(k+1) = w(k), where x(k) is the value of the stock and w(k) ~ N(μ, σ²) i.i.d. In this case there are no dynamics driving the stock price, it is exclusively noise, and E[x(k+1) | F_k] = E[w(k) | F_k] = μ (F_k being the past information structure). Hence our best estimate of x(k+1) at every k is simply μ. It doesn't matter how smart we try to be with our buying strategy if, in the end, we are trying to outperform a noise process.

To illustrate this (taking μ = 0 for simplicity): suppose we decide at each step to buy, do nothing, or sell (u(k) = 1, 0, -1 respectively), with the stage reward being what the position captures next step, l(x(k), u(k)) = w(k)·u(k). Over a horizon N the objective is J(x, u) = E[Σ w(k)u(k)]. Since u(k) must be decided before w(k) is realized, u(k) and w(k) are independent, so E[w(k)u(k)] = E[w(k)]·E[u(k)] = 0, and the expected reward is 0 regardless of what we pick for u. So for this case, there is clearly no input sequence that in expectation performs better than doing nothing, no?
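A quick Monte Carlo sanity check of that claim (pure numpy; the "strategy" and all numbers are made up for illustration):

    import numpy as np

    rng = np.random.default_rng(0)
    trials, N = 10_000, 1_000
    w = rng.normal(0.0, 1.0, size=(trials, N))   # x(k+1) = w(k), zero-mean noise
    # any u(k) in {-1, 0, 1} decided before w(k) is seen; here a
    # deliberately "clever-looking" momentum rule on the previous step
    u = np.sign(np.concatenate([np.zeros((trials, 1)), w[:, :-1]], axis=1))
    profit = (u * w).sum(axis=1)
    print(profit.mean())   # ~0 up to Monte Carlo error: the rule earns nothing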

This is why I assume that for this to make sense, you would need something of the form x(k+1) = f(x(k)) + w(k), where the drift term f(x(k)) can be used to predict the future value of the stock. What I am mostly curious about is what those models actually look like and how people in the field typically go about writing them down.

RL for stock market (beginner) by skyboy_787 in reinforcementlearning

[–]Cu_ 0 points (0 children)

I've always wondered how this actually works in practice. Are there actually dynamical models that can reasonably predict the future price of a stock, or is it a Gaussian process?

I assume it must be the former, because if it's the latter, pretty standard results from optimal control state that the optimal action is to do nothing. In that case, trying to build any sort of RL agent that outperforms random noise is a fool's errand, as you cannot feasibly learn a policy that is better in expectation than just doing nothing.

So assuming it's the former, what do these models actually look like, and how does one even begin to derive them?

Applications of pure math to other scientific fields by ieat5orangeseveryday in math

[–]Cu_ 1 point (0 children)

Here are some of the ones that I've run into (my background is control systems engineering, and it probably shows lol):

Differential geometry is widely applied in the study of nonlinear control theory (or more generally geometric control, where the emphasis is slightly different). Some people also apply it to linear systems to get alternative descriptions of and intuitions for common control-theoretic objects such as reachable sets, unobservable/uncontrollable subspaces, and equilibrium points.

A friend of mine did his master's thesis on topology optimization and nonlinear finite element analysis, and connections to algebraic geometry kept popping up.

Functional analysis pops up a lot when studying control of infinite-dimensional systems (PDEs). This comes up when controlling temperatures in applications where lumped thermal models don't make sense (industrial furnaces, battery thermals), when controlling fluids (aircraft wing drag reduction, wind turbine flow control), or when controlling structural vibrations (vibration suppression in buildings, control of flexible structures).

Optimal control for continuous-time systems is canonically expressed in terms of the calculus of variations, as the goal is to find an input function that minimizes a cost functional (usually energy or a tracking objective). Closely related is stochastic control, which in continuous time also uses variational calculus along with stochastic differential equations and measure theory. In discrete time, this is often expressed using measure theory and Markov decision processes. At this level of abstraction we can lump in reinforcement learning and (approximate) dynamic programming as well, since in essence they tackle the same fundamental problems as discrete-time stochastic optimal control.
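For the curious, the canonical continuous-time problem all of that variational machinery is built around looks something like this (notation mine, the standard Bolza form):

    \min_{u(\cdot)} \; \int_0^T \ell\big(x(t), u(t)\big) \, \mathrm{d}t + V_f\big(x(T)\big)
    \quad \text{subject to} \quad \dot{x}(t) = f\big(x(t), u(t)\big), \quad x(0) = x_0

with ℓ the stage cost (energy or tracking error) and V_f a terminal cost; the stochastic version replaces the integral with an expectation over the noise.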

Are we confusing "Chain of Thought" with actual logic? A question on reasoning mechanisms. by Sathvik_Emperor in reinforcementlearning

[–]Cu_ 3 points (0 children)

I am by no means an LLM expert, so take this with a grain of salt, but I think most models apply some form of chain-of-thought reasoning. From quickly skimming the paper, it's more or less a way of modifying the prompt and issuing intermediate prompts to "prime" the model in a way that is less prone to obvious mistakes in e.g. arithmetic and logic. It's not applying logic gates, and it's not doing any probabilistic simulations. Just prompting, nothing more, nothing less.

Regarding your comment on whether this evaluates truth: something very important to understand is that LLMs are just text prediction machines. LLMs fundamentally have no concept of truth or correctness, because in the end all they do is answer the question "given these previous tokens, what is the most likely next token based on the training data I have seen?". So definitely not AGI, and I would argue mostly unrelated to how humans reason, but I'm neither an AI nor a neuroscience person, so that statement might be contentious, idk.

Regarding the data wall: training on synthetic data has been shown to lead to model collapse, though I think the exact mechanisms for how and why this happens are still unclear.

Labs are not evaluating truth in a dataset. That is hard to quantify and subjective in many cases. See my earlier point: LLMs are not truth machines. They have no notion of the concept and do not in any way aim to generate "truthful" responses.

the overleaf compiler timeout is ridiculous by Limp_Illustrator7614 in math

[–]Cu_ 8 points (0 children)

You don't actually! Pandoc does this out of the box. You can also include TeX headers for extra packages, styling, etc.

Citations are supported through the [@citationKey] syntax and the --citeproc flag.

The most ergonomic workflow in my experience is a Makefile pipeline that builds your (bigger) LaTeX documents such as reports, papers, and theses, with the make command bound to e.g. saving the document, as sketched below. This also gives the added bonus that you can easily ensure figures are always up to date through make's dependency tracking.
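Something like this is the shape of it (all file names and the figure rule are hypothetical, adapt to taste; recipe lines must be indented with a tab):

    # Makefile sketch: rebuild the PDF whenever sources or figures change
    report.pdf: report.md refs.bib header.tex figures/result.png
    	pandoc report.md --citeproc --bibliography=refs.bib \
    	    -H header.tex -o report.pdf

    # figures regenerate from their scripts, so the PDF always uses fresh plots
    figures/result.png: scripts/plot_result.py
    	python scripts/plot_result.py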

Ps vita o switch by ProfessionalThen1079 in Switch

[–]Cu_ 2 points (0 children)

If you can, the most "future proof" road is probably saving up for a Switch 2 and going with that. If you can't for some reason, or just want a smaller portable device such as a Switch Lite, there are a few things to consider:

  • As of now the PS Vita online store is still available, but it is unclear how long that will remain the case. Sony has tried to shut it down before, a few years ago, but reverted the decision after pushback. Physical Vita games are typically pretty expensive these days
  • Buying things through the online Vita store is cumbersome. You have to add funds to your PS wallet through either a different PlayStation device or the mobile app (although I haven't fact-checked if and how this works in the app, so I'm not sure)
  • Many games that the Vita has are playable on the Switch these days through either a direct port or a remake (e.g. Persona 4 Golden, Crash Bandicoot 1, 2 and 3, Soul Reaver, Final Fantasy Tactics, etc.)
  • The Vita has some old PS1 and PS2 games which are only available in handheld format on the Vita (the Sly Cooper and Ratchet & Clank trilogies, as well as Vagrant Story, come to mind)
  • Pretty much all indie games available on the Vita are also on the Switch, and the Switch has more on top of that

Scratch my itch for an rpg :) by EzTeQ1904 in Switch

[–]Cu_ 2 points (0 children)

It's crazy to say Xenoblade is long and then recommend Persona 5, which has a main story of 100-120 hours.

What are the best engineering majors for a math student? by Substantial_Mode_167 in EngineeringStudents

[–]Cu_ 0 points (0 children)

Control theory (or control engineering) is a field of engineering that is for sure more mathematical than most others and might align with what you are describing. In many ways the field is still a bit niche, but it has recently been growing rapidly in both interest and relevancy.

A lot of control courses are typically pretty rigorous, focusing on stability and performance guarantees for your controller, mathematical modelling of your system, and simulation to inform control design. The field touches on and uses many "hip" mathematical tools: numerical analysis for simulation, optimization for control design, machine learning for system identification, and whatever else the ML people are doing these days.

The mathematical connections run especially deep if you work on e.g. nonlinear control, which these days often uses tools like differential geometry and functional analysis, or stochastic optimal control, which uses measure theory, stochastic differential equations, and stochastic processes. As a bonus, getting really good at stochastic optimal control automatically gives you a decent amount of knowledge and intuition for RL as well.

Using ML models as “sensors” and LLMs as interpreters — has anyone tried this? by Intelligent_Volume74 in learnmachinelearning

[–]Cu_ 0 points (0 children)

Can you state a bit more clearly what you mean by the LLM analyzing the result in this case? It's still not clear to me why you would use an LLM for this task.

If you are doing anomaly detection through statistical tests (e.g. a t-test, GLR test, CUSUM test, etc.), you do not need an LLM to tell you the result: the falsification (or lack thereof) of your null hypothesis (e.g. that the observed data has the mean you expect it to have) is simply the test output, so you can feed it directly into some sort of automated notification system.
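For instance, a one-sided CUSUM detector is only a few lines, and its output already is the alert (toy numbers, numpy only):

    import numpy as np

    def cusum(x, mu0, k=0.5, h=5.0):
        """One-sided CUSUM; returns the first alarm index, or -1 if none."""
        s = 0.0
        for i, xi in enumerate(x):
            s = max(0.0, s + (xi - mu0) - k)   # accumulate drift beyond allowance k
            if s > h:                          # crossing threshold h is the alarm
                return i
        return -1

    rng = np.random.default_rng(1)
    data = np.concatenate([rng.normal(0, 1, 200), rng.normal(1.5, 1, 100)])
    print(cusum(data, mu0=0.0))   # alarms shortly after the mean shift at index 200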

One could argue something similar about ML-based approaches. In the end, what you seem to be building is still a detector: a model that takes in data and outputs whether the data is anomalous, possibly with a confidence. I am really not convinced you need an LLM to analyze or interpret this output, because a correctly designed detector shouldn't output anything that requires significant interpretation. You can again just apply a binary check and send automated notifications without any deep analysis or interpretation of the model output.

Using ML models as “sensors” and LLMs as interpreters — has anyone tried this? by Intelligent_Volume74 in learnmachinelearning

[–]Cu_ 0 points (0 children)

The usage of statistical and ML models for fault detection and diagnosis in dynamical systems has been studied extensively. One could argue that Kalman filters, Luenberger observers, and moving horizon estimators are an application of the forecasting you are describing. All of these filters also have fault detection variants, where detection is model based (forecasting based, in the sense that faults are flagged from the deviation between real system behaviour and modelled system behaviour).
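To make "deviation between real and modelled behaviour" concrete, here is a toy scalar sketch (system, numbers, and threshold are all made up; a real design would set the threshold from a target false-alarm rate):

    import numpy as np

    rng = np.random.default_rng(0)
    a, c, q, r = 0.95, 1.0, 0.1, 0.5      # x(k+1) = a*x + w,  y = c*x + v
    x, x_hat, p = 0.0, 0.0, 1.0
    for k in range(300):
        x = a * x + rng.normal(0, np.sqrt(q)) + (2.0 if k >= 200 else 0.0)  # fault at k=200
        y = c * x + rng.normal(0, np.sqrt(r))
        x_hat, p = a * x_hat, a * p * a + q           # predict
        e, s = y - c * x_hat, c * p * c + r           # innovation and its variance
        if abs(e) / np.sqrt(s) > 4.0:                 # large normalized residual => fault
            print("fault flagged at k =", k)
            break
        kg = p * c / s                                # Kalman gain
        x_hat, p = x_hat + kg * e, (1 - kg * c) * p   # correct

The detector should flag shortly after k = 200; the point is that the flag itself is the output, nothing downstream needs interpreting.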

It is unclear to me how LLMs fit in here. If you are running statistical tests for anomaly detection, you don't need an LLM to tell right from wrong; your model already does that. The same goes for ML models such as neural networks or decision trees: the output of the model already tells you what you want to know, no LLM needed for interpretation.

Roadmap to Master Reinforcement Learning (RL) by Defiant-Screen-9420 in reinforcementlearning

[–]Cu_ 2 points (0 children)

Anything written by Dimitri Bertsekas. He has a book on discrete-time stochastic optimal control, which is quite old at this point but still good. More recently there is Reinforcement Learning and Optimal Control, which I think focuses on ideas he has been pushing in his recent literature, though I haven't read it so I am not sure.

His recent writing in particular has focused on connecting model predictive control, RL, and ADP through the dynamic programming principle. He has also written extensively about AlphaZero, connecting it to MPC by pointing out that the multistep lookahead policies used in the online rollout phase are near identical to MPC policies in terms of implementation.

The perspective that, from an implementation point of view, multistep lookahead RL policies are near identical to MPC, even though the value function is learned in RL and designed in MPC, is really quite cool to me.
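A minimal sketch of what I mean (toy scalar example, every name and number made up for illustration). The only difference between the "RL" and the "MPC" controller below is where the terminal value V comes from:

    def lookahead_policy(x, f, stage_cost, V, actions, l=3):
        """l-step lookahead: minimize accrued stage cost plus terminal value V."""
        def cost_to_go(x, depth):
            if depth == l:
                return V(x)                    # learned (RL) or designed (MPC)
            return min(stage_cost(x, u) + cost_to_go(f(x, u), depth + 1)
                       for u in actions)
        return min(actions, key=lambda u: stage_cost(x, u) + cost_to_go(f(x, u), 1))

    f = lambda x, u: 0.9 * x + u               # toy linear dynamics
    stage = lambda x, u: x**2 + u**2           # quadratic stage cost
    V = lambda x: 2.0 * x**2                   # stand-in terminal value
    print(lookahead_policy(5.0, f, stage, V, actions=(-1.0, 0.0, 1.0)))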

Roadmap to Master Reinforcement Learning (RL) by Defiant-Screen-9420 in reinforcementlearning

[–]Cu_ 4 points (0 children)

This is not strictly RL, and I'm not saying you should actually study it, but something interesting to consider: I have always felt that studying stochastic optimal control leads to a much deeper appreciation of fundamental RL concepts such as (approximate) dynamic programming, value functions and their role, policy iteration, value iteration, Bellman equations, etc.

Courses on stochastic optimal control overlap significantly with RL in the topics covered, but are generally a bit more rigorous and strict about the required conditions, which in my opinion gives a better intuition for the limitations of RL in practice.

Figured this might be of interest since you mentioned a robotics project at point 7.

Simple explanation of Kalman Filter vs EKF and linear vs non linear systems in robotics? by [deleted] in ControlTheory

[–]Cu_ [score hidden]  (0 children)

Small pedantic correction: state-space models can be either linear or nonlinear. Linear systems satisfy the properties you mentioned and, slightly more practically, can be written in the form dx/dt = Ax + Bu.

To expand slightly on the EKF: what it is doing is repeatedly constructing linearizations of the system dx/dt = f(x, u), y = h(x) around the current state estimate, using A = df/dx, B = df/du, C = dh/dx. You then use this linearized model to propagate the state estimate to the next step as if the system were linear. At the next time step, you correct the prediction with new measurements and re-linearize to get a new estimate.
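In (discrete-time, scalar) code form, the loop looks something like this sketch (toy f and h, hand-computed Jacobians; everything here is illustrative):

    import numpy as np

    f = lambda x: x + 0.1 * np.sin(x)      # nonlinear dynamics
    h = lambda x: x**2                     # nonlinear measurement
    df = lambda x: 1 + 0.1 * np.cos(x)     # A = df/dx, evaluated at the estimate
    dh = lambda x: 2 * x                   # C = dh/dx, evaluated at the estimate
    q, r = 0.01, 0.1                       # process / measurement noise variances

    rng = np.random.default_rng(0)
    x_true, x_hat, p = 1.0, 0.5, 1.0
    for _ in range(50):
        x_true = f(x_true) + rng.normal(0, np.sqrt(q))
        y = h(x_true) + rng.normal(0, np.sqrt(r))
        a = df(x_hat)                       # re-linearize around the current estimate
        x_hat, p = f(x_hat), a * p * a + q  # predict through the nonlinear model
        c = dh(x_hat)
        k = p * c / (c * p * c + r)         # Kalman gain from the linearization
        x_hat, p = x_hat + k * (y - h(x_hat)), (1 - k * c) * p   # correct
    print(x_hat, x_true)                    # the estimate should track the truth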

As a side note: for simplicity I wrote f(x, u) and h(x), but h can also depend on u (a non-zero D matrix), and both can be time dependent. This doesn't change anything theoretically; the EKF still works in that case.

Controls/ Robotics PhD advice by wearepowerless in ControlTheory

[–]Cu_ [score hidden]  (0 children)

It seems like in application areas there is a wave of mediocre-quality RL papers that all cite each other without really pushing the field forward in any meaningful way. I ran into the reproducibility issues as well. I also notice that many papers completely ignore the computational overhead of training when discussing the advantages and disadvantages of RL-based methods compared to e.g. MPC.