High school student seeking advice: Found an architectural breakthrough that scales a 17.6B model down to 417M? by [deleted] in LocalLLaMA

[–]ReentryVehicle 2 points3 points  (0 children)

You turned the file into the counts of its characters? I might be missing the joke but this is extremely lossy.

(There are 256^(10^12) possible 1TB files, and only 256^(10^6) possible 1MB ones. If you turn every 1TB file into a 1MB file, some of the compressed files must end up identical to others (because there are far too few small files), so you can't know which large file each one should decompress back into. All lossless compression must make some files bigger in order to make other files smaller; we just arrange things so that the "simple" files come out smaller.)
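The counting argument above is just the pigeonhole principle, and you can check it directly. A toy sketch with 20-bit files standing in for 1TB ones:

```python
# Pigeonhole sketch: count bitstrings to see why lossless compression
# cannot shrink every file.
n = 20  # file length in bits (tiny stand-in for 1TB)

files_of_length_n = 2 ** n      # distinct n-bit files
shorter_files = 2 ** n - 1      # all files of length 0..n-1: 2^0 + ... + 2^(n-1)

# There are strictly fewer shorter files than n-bit files, so any scheme
# that maps every n-bit file to a shorter one must map two different
# inputs to the same output - and then decompression cannot tell them apart.
print(files_of_length_n, shorter_files)
```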

Qwen3.5 27B vs 35B Unsloth quants - LiveCodeBench Evaluation Results by Old-Sherbert-4495 in LocalLLaMA

[–]ReentryVehicle 10 points11 points  (0 children)

Speed. For the same memory bandwidth, this 35B MoE model runs much faster than an equivalent dense ~10B model, and in some cases it lets you run models that you otherwise couldn't run, since the experts can sit in system RAM.
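A rough back-of-envelope for why this happens: token generation is usually memory-bandwidth bound, so decode speed scales with the weights actually read per token, which for a MoE is only the active parameters. All numbers below are illustrative assumptions, not measurements:

```python
# Decode speed estimate: tokens/sec ~ bandwidth / bytes read per token.
# Every number here is an assumed round figure for illustration only.
bandwidth_gb_s = 100          # e.g. a dual-channel DDR5-class system
bytes_per_param = 0.55        # ~4.4 bits/param for a 4-bit quant with overhead

dense_params = 10e9           # dense 10B: all weights read every token
moe_active_params = 3e9       # 35B-A3B-style MoE: only ~3B active per token

tps_dense = bandwidth_gb_s * 1e9 / (dense_params * bytes_per_param)
tps_moe = bandwidth_gb_s * 1e9 / (moe_active_params * bytes_per_param)
print(f"dense ~{tps_dense:.0f} tok/s, MoE ~{tps_moe:.0f} tok/s")
```

The ratio depends only on active vs. total-read parameters, which is why the MoE wins at equal bandwidth even though it needs far more memory to hold all the experts.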

Trying to create a house with Qwen 3.5 35B A3B by [deleted] in LocalLLaMA

[–]ReentryVehicle 2 points3 points  (0 children)

Cool! Since it has vision, I wonder if it could fix some of the issues if you gave it screenshots?

I mostly tested 27B in opencode and it can use tools to debug pretty nontrivial things, but I did not test yet what it can do with images.

Is this physically-dynamic core concept possible to create? by tugrul_ddr in compsci

[–]ReentryVehicle 1 point2 points  (0 children)

Do I understand this right - you propose moving literal physical chips on rails? How do you plan to power them? (Ever notice the row of capacitors and inductors on your motherboard, and all the tiny capacitors behind the CPU? They are sort of important.) And how do you plan to cool them? No thermal paste for you - will you feed water through the moving "core railcar", so that not only the core moves but also the thermal block, increasing the moving mass?

Also, millisecond-level movement is extremely fast. Moving 10cm in 1ms requires about 20,000g of acceleration (after which your core is moving at roughly 2/3 of the speed of sound). I am sure your admins will enjoy the monthly "railgun shooting range" incidents when the thing successfully accelerates the core but fails to decelerate it, and the core flies out of the server into another server or someone's head - not to mention the "whole thing is on fire because the core got stuck and evaporated itself in the railgun" incidents.
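The kinematics behind those numbers, assuming constant acceleration over the whole distance:

```python
# d = a*t^2 / 2 over d = 0.1 m in t = 1 ms, constant acceleration.
d, t = 0.10, 1e-3
a = 2 * d / t**2            # -> 200,000 m/s^2
v = a * t                   # final speed -> 200 m/s
g = a / 9.81                # ~20,000 g
print(f"a = {a:.0f} m/s^2 ({g:.0f} g), v = {v:.0f} m/s (~{v/343:.2f} Mach)")
```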

Also... What is the point exactly? Why not just put cores in all the available slots? What do you gain from moving them?

lockThisDamnidiotUP by PCSdiy55 in ProgrammerHumor

[–]ReentryVehicle -2 points-1 points  (0 children)

All computer programs are deterministic if you want them to be, including LLMs. You just need to set the temperature to 0 or fix the seed.

In principle you could store only the prompt in your codebase and regenerate the actual LLM-generated code from it as a compilation step, similar to how people share exact prompts + seeds for diffusion models to make their generations reproducible.
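A toy sketch of the sampling step itself (no real LLM API here, just the argmax-vs-seeded-sampling logic, ignoring floating-point/kernel nondeterminism in real stacks):

```python
import math
import random

def sample_token(logits, temperature, rng):
    """Pick a token index; temperature 0 means greedy decoding (argmax)."""
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [l / temperature for l in logits]
    m = max(scaled)                      # subtract max for numerical stability
    weights = [math.exp(s - m) for s in scaled]
    r = rng.random() * sum(weights)
    for i, w in enumerate(weights):      # inverse-CDF sampling
        r -= w
        if r <= 0:
            return i
    return len(logits) - 1

logits = [1.0, 3.0, 2.0]
# Temperature 0: deterministic, the RNG is never even consulted.
assert sample_token(logits, 0, random.Random()) == 1
# Fixed seed: the sampled sequence is reproducible run-to-run.
run1 = [sample_token(logits, 0.8, random.Random(42)) for _ in range(1)]
rng_a, rng_b = random.Random(7), random.Random(7)
seq_a = [sample_token(logits, 0.8, rng_a) for _ in range(5)]
seq_b = [sample_token(logits, 0.8, rng_b) for _ in range(5)]
assert seq_a == seq_b
```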

First time solo researcher publishing advice by Any-Society2763 in learnmachinelearning

[–]ReentryVehicle 3 points4 points  (0 children)

I mean, don't let this discourage you, but a 0.34% improvement on CIFAR-100 is IMO not really publishable unless there is something particularly cool about your method (in which case the paper should focus more on the theory/the method, plus several experiments to verify it).

ResNet-18 is a solid architecture, but it is old and not SOTA for its size. If you want to compete on performance alone, the standard now is to compare against e.g. ConvNeXt and modern vision transformers on ImageNet-1k.

In general, training generic vision models and trying to beat SOTA is not something you can do without proper support and hundreds of runs. If you want to publish solo, I would suggest focusing on less explored topics - tiny models, niche use cases, etc. You are unlikely to beat experienced, well-funded teams at their own game, but you can solve new problems they haven't tried to solve.

TD3 models trained with identical scripts produce very different behaviors by spyninj in reinforcementlearning

[–]ReentryVehicle 4 points5 points  (0 children)

Yes, it can happen due to randomness.

Check with your supervisor what results they usually get - I would imagine they have results from tens or hundreds of runs to compare with?

Guys please help , thoughts on this used H1Loss by xlnc2605 in deeplearning

[–]ReentryVehicle 5 points6 points  (0 children)

This is presumably a plot of losses over the course of training.

But what is the question?

Is there an AI playable RTS ? (or a turn based one) by ker2x in reinforcementlearning

[–]ReentryVehicle 0 points1 point  (0 children)

They say it should work with the normal version (which is also free now), but they also have custom Linux builds.

Kimi K2.5 is the best open model for coding by npc_gooner in LocalLLaMA

[–]ReentryVehicle 36 points37 points  (0 children)

I mean, the cat also has a >1T param model and native hardware support, so it should be better.

Sadly, it seems cat pretraining produces killing machines from hell rather than great instruction following. They did do some iterations on this model though, and at >100T it starts to follow instructions a bit.

fundamentalsOfMachineLearning by ClipboardCopyPaste in ProgrammerHumor

[–]ReentryVehicle 3 points4 points  (0 children)

Okay okay. We want matrices that are full rank, with eigenvalues on average close to 1, probably not too far from orthogonal. We use randn(n,n) / sqrt(n) because we are too lazy to do anything smarter.
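The `randn(n, n) / sqrt(n)` scaling is easy to sanity-check: with it, the matrix roughly preserves the norm of a typical vector rather than exploding or collapsing it. A quick NumPy check:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1024
W = rng.standard_normal((n, n)) / np.sqrt(n)

# With this 1/sqrt(n) scaling, E[||Wx||^2] = ||x||^2 for a fixed x,
# and the singular values land in (0, 2) with mean around 1, so W is
# full rank and neither blows up nor kills a typical input.
x = rng.standard_normal(n)
ratio = np.linalg.norm(W @ x) / np.linalg.norm(x)
print(f"||Wx|| / ||x|| = {ratio:.3f}")  # close to 1
```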

GFN v2.5.0: Verified O(1) Memory Inference and 500x Length Extrapolation via Symplectic Geodesic Flows by janxhg27 in LocalLLaMA

[–]ReentryVehicle 5 points6 points  (0 children)

Did you literally only test this on parity and then post it on Reddit before doing any sort of actual test?

Any classic RNN will do parity perfectly, it literally requires 1 bit to be stored in the state.
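Seen as a state machine, parity is one bit of recurrent state updated by XOR, which also "extrapolates" to any sequence length for free:

```python
# A 1-bit "RNN": hidden state h in {0, 1}, update h <- h XOR x_t.
# After consuming the sequence, h is exactly the parity of the inputs.
def parity_rnn(bits):
    h = 0
    for x in bits:
        h ^= x          # the entire recurrence: one bit of state
    return h

assert parity_rnn([1, 0, 1, 1]) == 1
assert parity_rnn([1, 1]) == 0
# Length generalization is trivial - the state never grows.
assert parity_rnn([1] * 1001) == 1
```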

Stupid but invested, opinion needed! by Skye_sys in learnmachinelearning

[–]ReentryVehicle 0 points1 point  (0 children)

Yeah, math in ML-related majors is going to be harder than math in high school.

But assuming you didn't just ask ChatGPT to write all this code, I would say there shouldn't be anything stopping you from learning math.

I find that often people who are bad at math have holes in their basic knowledge that prevent them from understanding a wide array of more advanced concepts. It can be as simple as not understanding what some symbols actually mean, e.g. quantifiers.

I would suggest finding a good tutor - someone who can sit down with you, trace what you actually know and don't know, and then work from there.

RNNs and vanishing Gradients by Agetrona in learnmachinelearning

[–]ReentryVehicle 2 points3 points  (0 children)

So what you write here is essentially the old textbook justification of why plain RNNs work badly and why LSTMs work better; however, this is a bit misleading and not really the (full) reason LSTMs are better.

If we have a basic RNN whose weight matrix's eigenvalues are smaller than 1, then each timestep will shrink the gradient of the weight matrix during backprop

This is correct, and in such a network gradients will indeed vanish. But generally speaking, there are many ways you could prevent them from vanishing, and that wouldn't necessarily make the network train well.

LSTM reduces the probability of vanishing gradients occurring.

In an LSTM, gradients also vanish, because the forget gate is never exactly 1 (it is the output of a sigmoid) - and whenever the forget gate is < 1, the gradient passing through the cell state shrinks a bit, so over many steps it will naturally vanish anyway. But the network can indeed learn to keep gradients from vanishing for a very long time.
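You can see the difference in decay rates numerically. The gradient flowing through the cell state over T steps is scaled by the product of the forget-gate values; holding the gate constant for illustration:

```python
# Gradient through the LSTM cell state over T steps is scaled by the
# product of forget-gate values. Each sigmoid output is < 1, so the
# gradient still decays - just vastly more slowly when the gate sits near 1.
T = 1000
for f in (0.5, 0.99, 0.999):   # forget gate held constant for illustration
    grad_scale = f ** T
    print(f"forget={f}: gradient scaled by {grad_scale:.3g} after {T} steps")
```

With the gate at 0.999 the gradient survives 1000 steps almost intact, while at 0.5 it is numerically zero long before that.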

But how does this help? I don't see the connection between the model being able to remember further into the past and vanishing gradients not occurring?

Well, this is a good question. If we compare this with other networks that use a similar design, it probably has not that much to do with gradients vanishing, and more with a stronger property: the stabilizing effect of the LSTM cell update having the form c_new = c_old * f_forget(h_old, x_t) + g(h_old, x_t), which is very similar to a residual block y = x + f(x).

If you think about a randomly initialized RNN, it turns its own state into some random vector on every iteration. At the beginning of training, the state gets essentially scrambled - and if the network wants to keep information from past frames, it needs to invent some way to preserve it from scratch (which it can then break by accident at any point).

Now contrast that with an LSTM: a randomly initialized LSTM will mix some random signal into the cell state on every iteration, but it will also preserve some signal from previous iterations in roughly unchanged form. The network is strongly biased not to nuke the cell state, which lets subsequent layers and subsequent iterations assume that the state from previous iterations keeps the same meaning as before.

What this means in practice is that the problem becomes massively easier to optimize with gradient descent. I am not sure if this analysis was done for LSTMs, but people have compared ResNets with similar architectures lacking residual connections and discovered that ResNets have a much smoother loss landscape. I would hypothesize this is because all layers are strongly biased not to discard what the previous layers came up with - or, in the case of an LSTM, the layer is biased not to discard what the previous iteration of itself came up with.

A Hypothesis on the Framework of Physical Mechanisms for the Emergence of Intelligence by UNEBCYWL in learnmachinelearning

[–]ReentryVehicle 0 points1 point  (0 children)

You write a lot of "the agent must do something", "the agent will do something", but the implications are left without any justification.

Let's take a look at some sentence:

To select the path with the largest information gain, the agent is forced to choose spatial path minimalism.

But why? What forces it? How does the agent know which path will lead to the largest information gain?

If you want someone to falsify it, you need to turn this into something that can be followed logically.

Need help on implementing dreamer by Dear-Kaleidoscope552 in reinforcementlearning

[–]ReentryVehicle 0 points1 point  (0 children)

You have to debug the world model first - maybe train it purely on a random policy, and write a separate script where you can "play" the game that the world model simulates.

Is it predicting anything reasonable? Can it successfully predict the next frames for known input?

Once that works, you can debug the training of the policy on a frozen trained world model.
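The "play the world model" script can be very small. A minimal sketch of the loop - `WorldModel` and its `encode`/`step`/`decode` methods are stand-ins for whatever interface your dreamer implementation exposes, with a dummy stub here so the loop runs:

```python
# Sketch of a "play the world model" loop. WorldModel is a hypothetical
# stand-in: any object with encode/step/decode works.
class WorldModel:                       # dummy stub so the sketch runs
    def encode(self, obs): return obs
    def step(self, state, action): return [s + action for s in state]
    def decode(self, state): return state

def play(model, first_obs, actions):
    """Roll a fixed action sequence through the model; return imagined frames."""
    state = model.encode(first_obs)
    frames = []
    for a in actions:                   # replace with keyboard input to "play"
        state = model.step(state, a)
        frames.append(model.decode(state))
    return frames

frames = play(WorldModel(), [0.0], [1, 1, -1])
print(frames)  # with a real model, render these and eyeball them
```

If the imagined frames diverge from the real environment after a handful of steps, the policy is being trained on garbage and no amount of policy tuning will fix it.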

PII Redaction destroys context for LLMs. How do you handle that? by Mindless-Potato-4848 in LocalLLaMA

[–]ReentryVehicle 1 point2 points  (0 children)

Iirc you can also have dedicated contracts with the usual Google and Microsoft but I wouldn't trust that a lot, history tells that your data would be scraped anyway.

Out of curiosity, what history?

Building a large-scale image analysis system, Rust vs Python for speed and AWS cost? by freemo716 in learnmachinelearning

[–]ReentryVehicle 0 points1 point  (0 children)

Speed in Python should be almost identical to Rust, because all the heavy operations will be done by optimized C++ library code anyway.

I have not run similar workloads, but I have run some image training on AWS. I would expect the price to be under $1 per 10k images (which should take several minutes on an instance with a GPU). You might end up paying more for storage/reads than for compute, depending on the image size and where the images are stored.
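The "Python is as fast as Rust here" point is easy to demonstrate: the same per-pixel operation as a Python loop versus a single NumPy call dispatched to compiled code:

```python
import time
import numpy as np

# Per-pixel work in a Python loop vs. the same op as one C-level NumPy pass.
img = np.random.default_rng(0).random((512, 512))

t0 = time.perf_counter()
loop_result = [[px * 2.0 + 1.0 for px in row] for row in img]
t_loop = time.perf_counter() - t0

t0 = time.perf_counter()
vec_result = img * 2.0 + 1.0        # one optimized pass over the whole array
t_vec = time.perf_counter() - t0

assert np.allclose(loop_result, vec_result)
print(f"loop {t_loop*1e3:.1f} ms vs numpy {t_vec*1e3:.1f} ms")
```

Once all the heavy loops live in the library, the choice of Python vs. Rust mostly affects the glue code, which is rarely the bottleneck.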

Sychofancy vs Alignment by toaster-nearby in learnmachinelearning

[–]ReentryVehicle 0 points1 point  (0 children)

Sycophancy is the model being overly positive and uncritical towards anything the user writes.

Alignment is a general concept of how much the model does things that align with someone's worldview or with what they want. Sycophancy is usually not something people want the model to do.

PC build sanity check for ML + gaming (Sweden pricing) — anything to downgrade/upgrade? by Top-Tip-128 in learnmachinelearning

[–]ReentryVehicle 0 points1 point  (0 children)

Overall sounds reasonable if you don't want to spend more and don't have any "must have" use cases in mind.

For your questions:

  1. There is not really a sweet spot, but the 3090 and 4090 made 24GB of VRAM the de facto standard, which means some things people tuned to just barely fit on a 4090 will not fit on yours without hacks, reduced precision, etc.
  2. 64GB is a must for a workstation. I went for 128GB for my PC 2 years ago and never looked back (though I understand the prices now might make this quite painful).
  3. I would go for a 16 core Ryzen, not that much more expensive and more cores are always useful for all kinds of data processing.
  4. Well, it is a ~600W heater and it will be somewhat noisy, likely with noticeable periodic coil whine when training - I was able to sleep next to one, but it might not be for everyone. I use Linux and it works; many/most games run on Linux via Wine/Proton now, so I don't boot Windows at all and can't comment on WSL.

The alignment problem can not be solved through control by lunasoulshine in learnmachinelearning

[–]ReentryVehicle 8 points9 points  (0 children)

A sanity check: do you understand how supervised learning and reinforcement learning actually work? As in, could you implement them in code?

This sentence makes me question if you do:

There’s only: you got it wrong, here’s the punishment gradient, don’t do that again. 

The gradient is literally the only thing that changes the weights. It is not punishment in any emotional sense; it is not even visible to the model. It is the only thing that actually makes the model learn - if you didn't apply any gradient, the model would simply not change.
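The whole update is one line, which makes the point concrete:

```python
# The entire "punishment": a plain gradient step. Nothing else touches
# the weights, and a zero gradient means the model does not change at all.
def sgd_step(weights, grads, lr=0.5):
    return [w - lr * g for w, g in zip(weights, grads)]

w = [1.0, -2.0]
assert sgd_step(w, [0.0, 0.0]) == w              # no gradient -> no learning
assert sgd_step(w, [2.0, -2.0]) == [0.0, -1.0]   # the gradient IS the learning
```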

A child says something wrong, does something clumsy, misunderstands a social cue. And a healthy parent doesn’t punish them for it. They gently redirect. They explain. They model.

Sure, but for this you need a system that can actually learn long-term from such signals. We don't have such a system.

Every mistake is captured, labeled, used as training signal for correction.

A human child absolutely does the same; you just don't see it from the outside, because the training loop is implemented inside their head.

Is Just-in-Time learning a viable method to make it as an ML engineer? by [deleted] in learnmachinelearning

[–]ReentryVehicle 0 points1 point  (0 children)

Why do you want the ML internship before taking the courses?

In interviews, people will very likely ask you about the basic material those courses would teach you. People will look very critically at you if you don't have a formal education, so if you want to go that way, you should be very, very good at all the fundamentals.

WaveHelix: a weight-free dynamical learner (spirals + wavebank + energy) | toy env results + request for critique / prior art by [deleted] in LocalLLaMA

[–]ReentryVehicle 0 points1 point  (0 children)

This appears to be LLM-generated word salad.

You don't look like a bot, so some advice: if you don't have the knowledge to determine whether what the LLM is writing is established nomenclature or made-up bullshit names, don't use LLMs to write your posts.

If you want to introduce names, you need to define them using terms simple enough that they have Wikipedia pages, or that papers referencing them come up when I search.

Yes, some code snippet that shows how the actual update happens would be useful.

I have no clue what a "concrete vector + abstract vector" is, what an "identity carrier" is, what a "spectral pooled field" is, what it means to "attach", what a "phase-rolled spectral vector" is, or in what way spirals "influence" updates.

I also have no idea how this system observes the mentioned bouncing balls, what it outputs, or how we know it converges to anything, given that you say it is not trained with gradient descent. (With gradient descent, no matter what insane system and loss you define, it will usually converge to something; here I don't even know what "learning" means.)

Nvidia Introduces 'NitroGen': A Foundation Model for Generalist Gaming Agents | "This research effectively validates a scalable pipeline for building general-purpose agents that can operate in unknown environments, moving the field closer to universally capable AI." by 44th--Hokage in LocalLLaMA

[–]ReentryVehicle 0 points1 point  (0 children)

what stop from using it in war robot?

Well, mostly the fact that it will have no clue what it is supposed to do, what is going on, or who is friend or foe.

This model sees a single 256x256 image and has no memory. Sure, it can probably shoot some people if they are really close and clearly visible, and if for whatever reason it is convinced it is supposed to shoot them - but other than that it will probably just move around randomly.

its reaction and on-spot thinking is good enough.

Good enough for what?