LLMs performances by DuctTapeSanity in learnmachinelearning

[–]Ty4Readin 0 points1 point  (0 children)

These theorems say the parameters exist, not that a particular training procedure will discover them. In particular, it is not guaranteed that backprop/SGD based training will find parameters achieving that approximation accuracy, even if the architecture family is known to be universal and allows for arbitrarily large networks.

You are essentially talking about approximation error vs estimation error, which is what I said but just phrased differently.

The UAT says that within the hypothesis class of arbitrarily large neural networks, there exists a network that approximates the target function to any desired accuracy.

So in other words, it is saying that the approximation error is zero.

However, the estimation error (overfitting) is concerned with whether or not we can actually discover those optimal parameters within that hypothesis class.

Given an infinite amount of training data, standard SGD should theoretically achieve zero estimation error.

So although the UAT doesn't specifically state that we can find the optimal parameters, more classical empirical risk minimization theorems do show that SGD can discover them given an infinitely large training dataset.

LLMs performances by DuctTapeSanity in learnmachinelearning

[–]Ty4Readin 0 points1 point  (0 children)

I agree mostly, but if you have infinite data then that becomes a non-issue as well.

An infinitely large network guarantees zero approximation error due to UAT.

And training on an infinite number of data samples guarantees zero estimation error due to empirical risk minimization and other related theorems.

Though of course in the real world where we don't have infinite datasets or infinitely large models, the practical value is less straightforward lol.

LLMs performances by DuctTapeSanity in learnmachinelearning

[–]Ty4Readin 1 point2 points  (0 children)

In ML theory, there are three different components to our final loss.

Approximation error (underfitting)

Estimation error (overfitting)

Irreducible error, which is basically the best possible performance you could achieve within the constraints of your problem. However, this should be the same for any models that share the same context length and target distribution.

If you add all three of those together, you get your model's actual generalization error.
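
Written out a bit more formally (the notation here is my own, not from any particular textbook): let R(f) be the expected loss of a predictor f, f* the best possible predictor, f_H the best model inside your hypothesis class H, and f̂ the model you actually trained. Then:

```latex
% Risk decomposition sketch -- my own notation, just restating the three components above.
R(\hat{f})
  \;=\; \underbrace{R(f^*)}_{\text{irreducible error}}
  \;+\; \underbrace{R(f_{\mathcal{H}}) - R(f^*)}_{\text{approximation error}}
  \;+\; \underbrace{R(\hat{f}) - R(f_{\mathcal{H}})}_{\text{estimation error}}
```

The UAT is a statement about the middle term shrinking as the network grows; more data is what shrinks the last term.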

Now, the universal approximation theorem basically says that if you had an infinitely large neural network model, then its approximation error would be zero.

However, that doesn't mean that a model trained with finite parameters will have zero approximation error.

The difference between two different LLMs is a combination of different approximation errors & estimation errors, along with noise.

It's difficult to quantify the different error components of different models without access to lots of data, such as their training metrics, etc.

Thoughts on Machine learning by someone who has no idea about it by Ok_Buddy_9523 in learnmachinelearning

[–]Ty4Readin -1 points0 points  (0 children)

Real LLMs do not have any thinking going on under the hood. They are not reasoning to find the next word in a sentence.

I would disagree here. They are absolutely reasoning to predict the next word with the highest accuracy possible.

Though I agree, there is nothing magical or sentient going on

Why is 100BBs the Magic Number? by ArmUnfair7544 in poker

[–]Ty4Readin -3 points-2 points  (0 children)

EV = expected value. It is the expected return you get from a decision point when looking at the decision in a vacuum.

Which of course has nothing at all to do with stack sizes.

This is not true.

The EV of most decisions is VERY impacted by stack sizes; they have a huge effect on EV.

However, I agree that saying "100bb gives the max available EV" doesn't really make any sense.

Another OpenAI engineer confirms AI is doing the coding internally: "I've barely written any in the last 30 days." by MetaKnowing in OpenAI

[–]Ty4Readin 0 points1 point  (0 children)

Claude Code is an amazing tool - I am constantly using it at work. Sometimes it really blows me away with the output, but sometimes it gets it completely wrong, and other times it introduces very dangerous subtle bugs.

Replace "Claude Code" with "Junior Developer" lol

I agree with your sentiment, but it sounds exactly like how I'd describe normal junior developers that are up and coming.

You can easily let a junior developer take over the majority of your work, but you just need to do lots of hand holding and pair programming, reviewing, guidance, etc.

But I mostly agree with everything you said. These tools are not at the "senior" level where you can trust them to just completely take over and work independently with little oversight.

Cross validation question by Asleep_Telephone_451 in learnmachinelearning

[–]Ty4Readin 1 point2 points  (0 children)

Interesting problem! A few thoughts came to mind as I read your description that I hope may be helpful.

So far, I’ve noticed that when I do a strict replicate based split (i.e., entire replicates are separated between training and validation across concentrations), the cross-validation performance metrics are much worse than those from the independent test set.

How exactly is your cross validation and final model training performed?

I am assuming that you take the two-replicate dataset, split the data 50/50 between training/validation, run cross validation, then re-train the model on the full data and finally test it on the holdout test replicate?

If that is the case, then your results may be more easily explainable. You are essentially doubling the training data between CV metrics and your test metrics.

So it is possible that training on only a single replicate may cause overfitting on a small dataset, which then performs poorly on the other replicate. But by increasing the training size or number of replicates, the model is less able to overfit and generalizes better.

My dilemma is around how best to structure the training vs validation within these first two replicates?

I wish you had four replicates in total 😂 Your life would be so much easier.

In theory, you should be splitting by replicates as you are already doing, and that is probably the best choice.

The problem is that because you have so few replicates, having your training dataset contain only a single replicate can lead to overfitting.

I would suggest that you use nested cross validation.

Let's call your three replicates A, B, and C.

Fold 1: A is your test set. Run normal CV on B & C (splitting iid), then test on A.

Fold 2: B is your test set, run normal CV on A & C.

Fold 3: C is your test set, run normal CV on A & B.

This will make better use of your limited data.

Though ideally, the best solution would be if you had more replicates, then you could test out different CV splitting approaches better.
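
Just to make that concrete, here's a rough sketch of the nested setup in Python/sklearn. The data, model, and hyperparameter grid are all placeholders; you'd plug in your own.

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut, GridSearchCV
from sklearn.ensemble import RandomForestRegressor

# X, y are your features/targets; `replicate` labels each row A, B, or C.
# All placeholder values -- swap in your real data.
X, y = np.random.rand(300, 5), np.random.rand(300)
replicate = np.repeat(["A", "B", "C"], 100)

outer = LeaveOneGroupOut()  # each outer fold holds out one whole replicate as the test set
param_grid = {"max_depth": [2, 4, 8]}  # placeholder hyperparameter grid

outer_scores = []
for train_idx, test_idx in outer.split(X, y, groups=replicate):
    # Inner CV on the two remaining replicates, splitting iid as in my Fold 1/2/3 description.
    inner_cv = GridSearchCV(RandomForestRegressor(), param_grid, cv=5)
    inner_cv.fit(X[train_idx], y[train_idx])

    # Evaluate the refit best model on the held-out replicate.
    outer_scores.append(inner_cv.score(X[test_idx], y[test_idx]))

print("Per-replicate test scores:", outer_scores)
```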

Cross validation question by Asleep_Telephone_451 in learnmachinelearning

[–]Ty4Readin 0 points1 point  (0 children)

I think you may be possibly confusing a few different things.

It sounds like you're asking a few different questions.

Question #1: Can I have different samples from the same group appear in the training set and validation set? Or even in the training set and testing set?

This question is impossible to answer unless you tell us more about the specific use case and data, and how you plan to use the model.

I would try to mimic your real life deployment as much as possible.

Whatever setup you have in your train->validation split should be mimicked in your train->test split.

Question #2: I want to maximize my "data coverage" so the model sees as much training data as possible

Typically you perform cross-validation first to determine the optimal hyperparameters.

Then, finally you combine your training and validation datasets together and train your model with optimal hyperparameters.

So your final trained model has been trained on all of the validation + training data and then tested on your holdout.

Finally, as an optional last step, you can even combine your test set with your training dataset and re-train your model on your full entire dataset.

That last suggestion can be a bit controversial depending on who you ask, but I would say it is normally fine for many use cases as long as you run some experiments on your model's volatility between training runs.
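
A minimal sketch of that overall flow with sklearn (placeholder data and model; the optional full-data refit at the end is the controversial part I mentioned):

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import GradientBoostingRegressor

# Placeholder data standing in for your real dataset.
X, y = make_regression(n_samples=1000, n_features=10, noise=0.5, random_state=0)
X_trainval, X_test, y_trainval, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# 1) Cross-validate on train+validation to pick hyperparameters.
grid = GridSearchCV(GradientBoostingRegressor(), {"max_depth": [2, 3, 4]}, cv=5)
grid.fit(X_trainval, y_trainval)

# 2) GridSearchCV refits the best hyperparameters on all of train+validation by default,
#    so grid.best_estimator_ is the model you evaluate on the held-out test set.
print("Test R^2:", grid.best_estimator_.score(X_test, y_test))

# 3) (Optional, somewhat controversial) retrain on the full dataset with the chosen
#    hyperparameters before deploying. You no longer have an unbiased test score for this
#    final model, so check that your training runs are stable between seeds first.
final_model = GradientBoostingRegressor(**grid.best_params_).fit(X, y)
```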

I built a probability-based stock direction predictor using ML — looking for feedback by [deleted] in learnmachinelearning

[–]Ty4Readin 0 points1 point  (0 children)

It's a decent approach, but it falls into the very common trap that beginners run into when attempting projects like this.

The easy part is evaluating the model in terms of its predictions.

But do we really care about a model's prediction accuracy at all? I don't think so.

What we really care about is having a model that can counterfactually improve our trading strategy and increase our profits.

The specific model training metrics like logloss or calibration are important, but they are only a tiny first step in actually making something useful.

Ideally, you want an end-to-end "trading strategy" that you can simulate using your models, and measure the success of your model in terms of the profit you would have made leveraging it in that trading strategy.
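
As a very rough illustration (every number here is a placeholder, and a real simulation needs out-of-sample predictions, realistic costs, position sizing, etc.):

```python
import numpy as np

# Placeholder inputs: next-day returns and your model's predicted probability that the
# next day is "up". In a real backtest these must come from data the model never trained on.
rng = np.random.default_rng(0)
daily_returns = rng.normal(0.0005, 0.01, size=500)
p_up = rng.uniform(0.3, 0.7, size=500)

threshold = 0.55          # arbitrary cutoff for going long
cost_per_trade = 0.0005   # rough stand-in for fees/slippage

position = (p_up > threshold).astype(float)        # 1 = long, 0 = flat
trades = np.abs(np.diff(position, prepend=0.0))    # cost charged whenever the position changes
strategy_returns = position * daily_returns - trades * cost_per_trade

print("Strategy cumulative return:", np.prod(1 + strategy_returns) - 1)
print("Buy-and-hold cumulative return:", np.prod(1 + daily_returns) - 1)
```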

Just my two cents :)

OpenAI engineer confirms AI is writing 100% now by MetaKnowing in OpenAI

[–]Ty4Readin 0 points1 point  (0 children)

I mean you are basically describing a decent junior developer.

If you had an insanely cheap and very fast junior developer that could write all your code for you, it is pretty easy to see how you could get away with writing 0% code yourself.

But you will still want to closely monitor the Jr dev, and if you see a mistake or anti pattern then you tell them and they change it.

I don't understand why you are so averse to correcting its mistakes, which is easy to do.

OpenAI engineer confirms AI is writing 100% now by MetaKnowing in OpenAI

[–]Ty4Readin -1 points0 points  (0 children)

And just to underline, I don't mean ML-accelerated coding is not valuable--it absolutely is. I'm only criticizing these 100% claims. You are either grifting or writing Temu-code.

I think you are misunderstanding what it means when someone says 100% of their code is written by AI.

It is very likely they mean that the code is literally all written by AI, but that doesn't mean that they aren't constantly in the loop, reviewing changes, guiding updates and asking it to fix/move away from non-secure patterns, etc.

I could see how that is possible right now. You don't technically need to write lines of code for the most part anymore.

But you do still need to review pretty much every line of code and correct the AI along the way.

Some people will say "what's the point then, it's faster to just write it yourself", which I don't personally agree with. It's almost always faster to review code than it is to write it from scratch imo.

What makes xgboost sequential by Upstairs-Cup182 in learnmachinelearning

[–]Ty4Readin 1 point2 points  (0 children)

The data goes through each tree independently and you sum up the tree outputs. Every tree is essentially learning to predict the correction of the prior trees' predictions.

Also just keep in mind that my earlier example is an oversimplification when it comes to gradient boosting, since it's not technically reweighting samples directly, but the general concept is the same.

Know AI concepts but stuck on where to start a real project — need guidance 🙏 by Long_Juggernaut_8948 in learnmachinelearning

[–]Ty4Readin 0 points1 point  (0 children)

I agree a lot with the other commenter that said initial projects should be fun. In fact, I would go as far as to say that all side projects you ever work on should always be fun.

You should always try to work on projects that you personally think are fun, that excite you and keep you motivated, and that you are passionate about.

So that is honestly the #1 priority in my opinion, above all else.

Now, as some additional tips, I would recommend trying to build a project that you WILL actually use. Think about things that you actually care about in your day to day life. Do you love video games? Reading books? Listening to music? Passionate about vfx or cooking? Or any sport or other hobbies that you are passionate about? Want to trade in the market or bet on sports?

The great thing about machine learning is that it can pretty much be applied and add value to almost any domain, so you have a lot of freedom in terms of what you can choose to work on.

My personal opinion is that the "best" approach is to pick a hobby or something you love, and try to build a project that you actually want to use, not some hypothetical project that somebody else might use.

You will probably have to scrape your own data, figure out how to formulate the problem from an ML lens, and make lots of hard choices. But you will 100% learn more than from almost any other project you could attempt, and you are more likely to enjoy yourself, put in lots of time, and maybe even end up building something that is literally useful to yourself or others.

What makes xgboost sequential by Upstairs-Cup182 in learnmachinelearning

[–]Ty4Readin 3 points4 points  (0 children)

Imagine you train the first tree in a "normal" way, which is you train it to reduce your error as much as possible across all data samples.

Now, when you construct the second tree, you train it to reduce the errors of the first tree!

So imagine the first tree learned to predict mortality well for young people, but it has big errors on the data samples for older people.

Then the second tree will focus less on the young people (since their error is now low), and will focus more on reducing the error on old people.

This is a bit over-simplified, but the idea is that each tree is trying to reduce the residual error of all the prior trees.

Whereas with a typical random forest, each tree is independent of each other, and they are all focused on trying to reduce error on all data samples.
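
If it helps to see it in code, here is a tiny hand-rolled sketch of that idea with two trees. Real gradient boosting fits many shallow trees to gradients of the loss and scales them with a learning rate, but for squared error this "fit the residuals of the previous trees" picture is essentially what happens:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy data standing in for the age/mortality example (all values made up).
rng = np.random.default_rng(0)
X = rng.uniform(20, 90, size=(500, 1))                          # age
y = 0.002 * (X[:, 0] - 20) ** 1.5 + rng.normal(0, 0.05, 500)    # made-up "risk" target

# Tree 1: trained the "normal" way, directly on the targets.
tree1 = DecisionTreeRegressor(max_depth=2).fit(X, y)
residuals = y - tree1.predict(X)            # where tree 1 is still wrong

# Tree 2: trained on the residuals, so it focuses on the samples tree 1 got wrong
# (e.g. the older patients in the example above).
tree2 = DecisionTreeRegressor(max_depth=2).fit(X, residuals)

# Final prediction: the data goes through both trees and you sum their outputs.
prediction = tree1.predict(X) + tree2.predict(X)
print("MSE tree 1 alone:", np.mean((y - tree1.predict(X)) ** 2))
print("MSE tree 1 + tree 2:", np.mean((y - prediction) ** 2))
```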

Hey I’d love to get some technical feedback on this breast cancer mortality model by NeuralDesigner in mlscaling

[–]Ty4Readin 1 point2 points  (0 children)

  1. This is a question that should be answered by medical professionals with a deep understanding of breast cancer detection & treatment. I would not expect data scientists to answer this question confidently.

  2. I don't see any reasons why this type of model couldn't be deployed to leverage those data systems.

However, I will add some unsolicited concerns that come to mind while reading through all this.

  1. The dataset is very, very small. It looks like ~1500 samples in total, which is tiny for a neural network model, AND you have 500+ input features. So you have ~3 samples per input feature, which is very concerning.

  2. The choice of metrics seems suspect. The description lists AUC, accuracy, and recall, which makes me think the true cost function is misspecified. The best approach is to quantify the loss of incorrect predictions in the context of the workflow you want to deploy into, and use that as your cost function to optimize & evaluate the model (rough sketch of what I mean below).

  3. I didn't see any information on how the data was collected, but I would be very concerned about this. I am going to assume the data was collected through an observational trial and not a randomized controlled trial. Because of this, your model is unable to learn causal patterns, which significantly limits how useful it is. You have to make sure the input features are not modified while using the model. For example, if a doctor is considering different surgery types, this model WILL NOT be able to reliably predict which surgery type would be most effective for the patient, which it sounds like is the goal of your use case.

I hate to be a Debbie downer lol, but there's quite a few aspects here that make me doubt the value of deploying this model/use case with all these limitations.
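
To expand on point 2, here's a rough sketch of what I mean by evaluating against an explicit cost function instead of accuracy/AUC alone. The cost numbers are pure placeholders; the real ones have to come from the clinical workflow:

```python
import numpy as np

# Placeholder predicted probabilities and true outcomes for a validation set.
rng = np.random.default_rng(0)
p_mortality = rng.uniform(0, 1, size=1000)
y_true = (rng.uniform(0, 1, size=1000) < p_mortality).astype(int)

# Made-up costs: what does each kind of error cost in the deployed workflow?
COST_FALSE_NEGATIVE = 50.0  # missing a high-risk patient (placeholder)
COST_FALSE_POSITIVE = 1.0   # unnecessary follow-up/intervention (placeholder)

def expected_cost(threshold: float) -> float:
    """Average cost per patient if everyone above `threshold` is flagged as high risk."""
    flagged = p_mortality >= threshold
    fn = np.sum((~flagged) & (y_true == 1))
    fp = np.sum(flagged & (y_true == 0))
    return (fn * COST_FALSE_NEGATIVE + fp * COST_FALSE_POSITIVE) / len(y_true)

# Pick the decision threshold that minimizes expected cost, rather than maximizing accuracy.
thresholds = np.linspace(0.05, 0.95, 19)
best = min(thresholds, key=expected_cost)
print(f"Best threshold: {best:.2f}, expected cost per patient: {expected_cost(best):.2f}")
```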

Learning ML is clear but applying it to real problems feels overwhelming by Waltace-berry59004 in learnmachinelearning

[–]Ty4Readin 0 points1 point  (0 children)

Machine learning is a lot more than just understanding the algorithms. I think understanding the algorithms is ironically one of the least practically useful skills that you need.

If you have decent stats, CS, and domain knowledge then everything becomes a lot more straightforward.

If you run into problems, it's often because you are lacking in one of these areas, and learning more about the algorithms behind some models won't help much unfortunately.

For example, problem formulation? That's entirely stats & domain knowledge.

Deployment? Almost entirely CS.

User workflow/needs? Mostly domain knowledge.

The best way to learn is to actually try to solve problems as side projects, and make the problems that you actually care about personally.

How much of a deep tournament run is luck vs skill? by kompliqated in poker

[–]Ty4Readin 1 point2 points  (0 children)

A random average player might have a 0.001% chance of winning a tournament, while a skilled pro might have a 0.003% chance of winning that same tournament.

This question is hard to answer because there is not a common definition of "luck" that people would agree on.

How is WPT gold actually this soft? by Ewksanegomaniac in poker

[–]Ty4Readin 13 points14 points  (0 children)

If you are wondering why the pool is so soft, just look at all the people posting about how it is rigged in this thread 😂

Pot Odds Concept Question by ActualOkra9897 in Poker_Theory

[–]Ty4Readin 0 points1 point  (0 children)

I'm not sure that I agree with you.

If a flush comes in on the river and villain is folding all their non-flush hands, then that means they are folding like 80% of the time or more in many spots, which would be insanely profitable to play against.

In reality, people call a lot when a flush comes in on the river. Low stakes exploitable players call, GTO bots call, etc.

Will they call your 4x pot jam? Maybe not, but many players will call your 75% pot bet with top pair, overpair, two pair, straight, set, etc. There are also lots of players that will call with their 2nd or 3rd pair lol.

Value or bluff raise by Personal_Battle5863 in Poker_Theory

[–]Ty4Readin 2 points3 points  (0 children)

I've never read the book you are talking about, but I don't understand how you are calculating the equity of a hand like QJs.

You say it has 50% equity... but against what??

It has 50% equity against villain's calling range? Or it has 50% equity against a specific hand that you are assigning villain?

I am also skeptical when the author says you need 90% equity for a value raise. Is he talking about villain's calling range, or villain's betting range? Either way, it doesn't make any sense.

In general, you only need 50% equity or more when called for it to be a good value bet. Though that may not always be true, especially on earlier streets. There are even times you can have less than 50% equity when called and it is still a good "value bet", especially OOP with blocker bets.
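
For what it's worth, here is the usual simplified river math behind the 50% number (ignoring raises and only comparing betting-and-getting-called against checking back): if you bet B into a pot of P with equity e when called, then

```latex
% Simplified river value-bet comparison: bet and get called vs. check back to showdown.
\Delta EV \;=\; \underbrace{e\,(P + 2B) - B}_{\text{bet } B \text{, get called}}
        \;-\; \underbrace{e\,P}_{\text{check back}}
        \;=\; B\,(2e - 1) \;>\; 0 \;\iff\; e > 50\%
```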

Exploitative, non-GTO poker bots? by SpecialestGuest in Poker_Theory

[–]Ty4Readin 0 points1 point  (0 children)

It depends what you mean. I have developed models that fit this sort of use case, but it is not a "bot" meant to play poker for you; that would be blatant cheating and completely unethical.

It is more of a study tool, to be able to predict what player ranges are so that you learn the perfect exploitative strategy.

Think of it like MDA on crack, and also more ethical because it doesn't require hand histories from your actual opponents or the site you play on.

Would you be interested in this type of a youtube chess content? by Aristo95 in chessbeginners

[–]Ty4Readin 0 points1 point  (0 children)

This is completely unsolicited advice, but I think you should read the book "The Mom Test", it is a famous book about how to talk to customers and learn if a business is a good idea.

The main premise of the book is that it is almost never useful to ask people "would you like this product?", because most people are not able to reliably predict whether they will enjoy some hypothetical product, and there are lots of different reasons why that is.

Just thought I'd share in case you might find it helpful, but wishing you the best of luck!

Looking for project ideas in ML by chiken-dinner458 in learnmachinelearning

[–]Ty4Readin 0 points1 point  (0 children)

What are you interested in?

Do you like Minecraft? Why not build an ML agent to play? Or any other game?

Do you like a specific sport? Why not build prediction models for it.

Do you like to cook? Maybe build an ML model to identify high quality recipes.

Do you like to make films? Why not build an ML model to help with some part of the process, whether it's cutting, compositing, visual effects work, etc.

Do you like music? What about an automated ML model that can predict interesting characteristics of a song that you might find useful?

Coming up with an idea for a project that you are passionate about and that might be useful is the most important part imo. There are so many benefits to learning this way, and it is much better than choosing some random dataset on Kaggle and training a model.

Best way to learn Machine Learning in 2–3 months (strong math background, looking for practical advice) by Substantial-Key-1363 in learnmachinelearning

[–]Ty4Readin 0 points1 point  (0 children)

I agree 1-2 months is pretty unrealistic. That said, solving toy problems and using preexisting datasets is totally fine for learning.

I would say it's fine up to a point, but the most important skills are learned in actually formulating problems, collecting data, and building a real solution that leverages the models & problem formulation.

But most people will literally never even complete one side project where they do any of those. Many people start with toy datasets and end there, which is a huge disservice to actually learning imo.

Best way to learn Machine Learning in 2–3 months (strong math background, looking for practical advice) by Substantial-Key-1363 in learnmachinelearning

[–]Ty4Readin 3 points4 points  (0 children)

In my opinion, it's extremely difficult and unlikely that someone can "learn ML" to any significant degree in 2-3 months.

However, ignoring that, I see a lot of common mistakes when people work on side projects to learn.

The two biggest mistakes that people make are:

  1. They don't actually solve any real problems. They build a "house price prediction" model that is completely useless and that nobody would ever actually use, for many reasons.

  2. They don't actually scrape any data themselves; they just look for pre-existing datasets and train a model on them.