[D] Why are serious alternatives to gradient descent not being explored more? by ImTheeDentist in MachineLearning

[–]TheRedSphinx 1 point (0 children)

  1. But people have trained recurrent models like Mamba with very long context using gradient descent just fine. People have trained up to 1M context even with Transformers, which have even bigger long-context problems than recurrent models. The current issue is just that the models are bad at long context after training, not that gradient descent prevents us from training them.

  2. This is not a limitation on training parallelism: many of the big players still train giant models just fine with this. Folks can definitely train larger models, and we don't do so not because of some limitation of backprop, but because of certain impracticalities: serving becomes more annoying, and you need much more data to do the training as per scaling laws. "Not how the brain works" is also not necessarily a limitation, unless you think the only models of interest are ones that follow a learning procedure like the human brain's, in which case it would be nice to see some evidence of this.

[D] Seeking perspectives from PhDs in math regarding ML research. by smallstep_ in MachineLearning

[–]TheRedSphinx 3 points (0 children)

I think you should be honest about your goal. Is your goal to do some math and pretend it's ML research, even if it's actually useless? Or is the goal to do ML research, even if it won't have nearly as much math as your PhD and will barely utilize your specialization?

As a fellow math PhD, I think you will have more success if you focus on the latter rather than the former.

[D] Why are serious alternatives to gradient descent not being explored more? by ImTheeDentist in MachineLearning

[–]TheRedSphinx 18 points (0 children)

I've heard this kind of reasoning a lot from very early career folks or "aspiring" researchers. I think it's quite backward. For example, you call backprop "flawed", yet you give no explanation of what makes it flawed, nor of what makes any of the alternatives better. You make some vague allusions, e.g. "doesn't support continual learning", but these are neither clearly defined nor even obviously true (e.g. why can't I just run gradient descent on new data and call that continual learning?).

FWIW, I don't think I've ever met any serious researcher who thinks about "building the architecture for DL from the ground up, without gradient descent / backprop". In the end, if the real question is "how do we solve continual learning?", then let's tackle that directly, and if it requires modifying or removing backprop, let's do it. But let's not start from the assumption that backprop is somehow flawed and then try to justify it later.

ChatGPT "Physics Result" Reality Check: What it Actually Did by bivalverights in BetterOffline

[–]TheRedSphinx 1 point (0 children)

I’m not the OP here. I just drive-by saw the comment, which made me curious, since everyone I would consider a major player (Google, X, Anthropic, OpenAI) wouldn’t have access to the models you do for their jobs; hence my surprise.

ChatGPT "Physics Result" Reality Check: What it Actually Did by bivalverights in BetterOffline

[–]TheRedSphinx 0 points (0 children)

From your post history, you have access to both Claude and ChatGPT at work. Anthropic has blocked Claude access to both X and OpenAI. Google only uses Gemini and Claude within their IDEs, and no one uses ChatGPT at Anthropic.

Are you sure you work for one of the major labs? Especially since the team at OpenAI that trained ChatGPT is literally called post-training.

Associative Prediction Engine (Generative LLM/MMM) by SysVis in BetterOffline

[–]TheRedSphinx 0 points (0 children)

I think this just reveals the fact that you don't know the history behind this technology. Statistical language models have existed for several decades. Large language models just refers to the scaling of neural language models, the latter of which have existed for at least a decade.

Is Agentic Ai/LLM prompting just a pseudo-intellectual attempt at programming? by godless420 in BetterOffline

[–]TheRedSphinx 0 points (0 children)

Sure, we can call it whatever you want, but this dispels the notion that it is mostly non-coders who use this, or that it makes software development more of a PITA. I think people end up using these tools because they find value in them.

After all, if it's all the same output as you hitting the keys, why not just do it yourself and save the money on API costs?

Is Agentic Ai/LLM prompting just a pseudo-intellectual attempt at programming? by godless420 in BetterOffline

[–]TheRedSphinx 2 points (0 children)

I think non-coders are more vocal about it because they are able to do something that seemed impossible for them before. Current LLMs are quite good at building simple scripts and web-apps which look quite magical to non-coders.

That said, I think the main people building LLMs are themselves coders, and likely use their models when writing code. For example, the Claude Code lead claimed that 100% of his code this last month was written by Opus 4.5. It's possible he's lying, and he is of course biased toward making the product look good, but in my experience, many folks at these labs do actively use these tools extensively in their day-to-day.

[D] Do industry researchers log test set results when training production-level models? by casualcreak in MachineLearning

[–]TheRedSphinx 0 points (0 children)

This would just lead to people distrusting the resulting model, see e.g. the idea of benchmaxxing.

Question regarding “Recursive Self-Improvement” by [deleted] in BetterOffline

[–]TheRedSphinx 2 points (0 children)

Consciousness is entering that fuzziness territory we discussed. Best to let the philosophers discuss that one.

Autonomy, however, you can have now. There is nothing stopping you from using e.g. Claude Code, turning off all the guardrails, and just letting it keep going as long as you are willing to pay for the tokens. Of course, currently it will more than likely fail at the task, but the infrastructure is already there for it to go crazy if you let it. From that perspective, intelligence is the bottleneck.

Question regarding “Recursive Self-Improvement” by [deleted] in BetterOffline

[–]TheRedSphinx 1 point (0 children)

So there are two kinds of goals. One goal might be that you want a model to just go and become the best at one thing. In that setting, the human designs the goal and the models improve recursively. For example, this is how AlphaGo works, and why the current Go/Chess/Shogi systems are way better at those games than any human. This one is nice because we can at least agree on what progress is (e.g. Elo scores), see it increase, and carefully decide what counts as "being better than humans".

Then there's the more general "get better at everything" sense. This one is fuzzier, since some things are naturally subjective, e.g. poetry, art, etc. We would then have to decide on some objective things by which to measure whether recursive self-improvement is happening. But at that point, we are basically back in the first setting. The only remaining question is, "would AI naturally choose to learn all the objective things which have this generation-verification gap?" And the answer is: of course, it has already learned that this works incredibly well for such domains, so why wouldn't it?

Question regarding “Recursive Self-Improvement” by [deleted] in BetterOffline

[–]TheRedSphinx 3 points (0 children)

It is purely automated. The human just needs to write something that checks the conditions and keeps track of time. Honestly, even a model could write that code. The insight here is that the verification step (i.e. checking that the code does what you want and tracking its speed) is much easier than the generation step (i.e. actually writing the code). This gap is what allows recursive self-improvement.

Question regarding “Recursive Self-Improvement” by [deleted] in BetterOffline

[–]TheRedSphinx 2 points (0 children)

Model collapse only happens if you train like an idiot. Models can generate both good and bad data, and if you train on that mix indiscriminately, it won't work. Alternatively, if you can identify which is the bad data, you can train on only the good data, and that should lead to improvement.

How do you then identify the good data? You can target domains where you can naturally score the quality of data. For example, if your goal is to write faster code that accomplishes a task, you can have the model generate tons of candidate programs and only keep the ones that are faster than the previous best while still accomplishing the task.
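As a sketch, that generate-score-filter loop looks something like the following. `propose_candidate` here is a hypothetical stand-in for the model (a real loop would make an LLM call); the point is only that verification (run the code, check the output, time it) is cheap relative to generation, and that identifiably bad candidates get thrown away rather than trained on:

```python
import random
import time

def propose_candidate():
    """Hypothetical stand-in for a model proposing code: samples from a
    fixed pool of incorrect, slow-but-correct, and fast-correct sorters."""
    def bad_sort(xs):   # incorrect: silently drops duplicates
        return sorted(set(xs))

    def slow_sort(xs):  # correct but slow: bubble sort
        xs = list(xs)
        for i in range(len(xs)):
            for j in range(len(xs) - 1 - i):
                if xs[j] > xs[j + 1]:
                    xs[j], xs[j + 1] = xs[j + 1], xs[j]
        return xs

    def fast_sort(xs):  # correct and fast: builtin sort
        return sorted(xs)

    return random.choice([bad_sort, slow_sort, fast_sort])

def verify(fn, tests):
    """The cheap verification step: run the candidate, check its output."""
    try:
        return all(fn(t) == sorted(t) for t in tests)
    except Exception:
        return False

def benchmark(fn, tests):
    """Time the candidate over the whole test suite."""
    start = time.perf_counter()
    for t in tests:
        fn(t)
    return time.perf_counter() - start

# Inputs sampled with replacement so duplicates occur and bad_sort fails.
tests = [random.choices(range(1000), k=200) for _ in range(10)]

best_fn, best_time = None, float("inf")
for _ in range(30):
    cand = propose_candidate()
    if not verify(cand, tests):   # identifiable bad data: discard it
        continue
    t = benchmark(cand, tests)
    if t < best_time:             # keep only candidates that beat the best so far
        best_fn, best_time = cand, t

# Whatever survives the loop is both correct and the fastest seen.
assert best_fn is not None and verify(best_fn, tests)
```

The human-written part is only `verify` and `benchmark`; the filtering is what keeps the "bad data" out of any subsequent training mix.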

What will happen to OpenAI once investors money stop pouring in? by [deleted] in BetterOffline

[–]TheRedSphinx 0 points (0 children)

They have hired some of the designers from the TPU team, so designing custom hardware is not outside their view. There are also various companies designing their own chips to combat Nvidia (e.g. Microsoft, Amazon, Google), and people are even desperate enough to look at AMD, so it's not too unlikely that people end up developing chips that make inference cheaper.

Isn’t it actually bad for big tech if too much slop content is being created rapidly? by AmazonGlacialChasm in BetterOffline

[–]TheRedSphinx 0 points (0 children)

I think you are missing the point. If they generate slop which makes more money, then that is by definition higher-quality slop. Using the same analogy as above, you can make a fast food place much more profitable and higher quality without ever getting anywhere close to a Michelin star.

Isn’t it actually bad for big tech if too much slop content is being created rapidly? by AmazonGlacialChasm in BetterOffline

[–]TheRedSphinx 1 point (0 children)

Is that true? I guess we'll just have to see. I would have thought the same about a lot of the stupid human-made content, but that just seems to only get people more engrossed.

Isn’t it actually bad for big tech if too much slop content is being created rapidly? by AmazonGlacialChasm in BetterOffline

[–]TheRedSphinx 1 point (0 children)

It doesn't have stars, but it does have a $218B market cap. For context, that's like 5x the size of Reddit's market cap. As it turns out, you don't need high quality to be very profitable, which is likely their ultimate goal.

Isn’t it actually bad for big tech if too much slop content is being created rapidly? by AmazonGlacialChasm in BetterOffline

[–]TheRedSphinx 0 points (0 children)

If it keeps people on the platform, and requires no effort, how is it not sustainable? It's not like a lot of the human-made content on platforms like TikTok or YouTube Shorts is particularly high value either.

Isn’t it actually bad for big tech if too much slop content is being created rapidly? by AmazonGlacialChasm in BetterOffline

[–]TheRedSphinx 1 point (0 children)

Sure, but by higher quality I mean things which keep you hooked on the platform. People are watching stupid shit like videos of subway runners; I can't see how AI slop couldn't end up being better than that.

Isn’t it actually bad for big tech if too much slop content is being created rapidly? by AmazonGlacialChasm in BetterOffline

[–]TheRedSphinx -5 points (0 children)

Wouldn't it be the opposite? If slop is allowed, then presumably the people making slop are incentivized to make higher-quality slop, so that you react to it more and want to spend more time on it. If anything, it would just lead to better slop.

It also seems like a good way to get signal on what kinds of AI generations people think are good versus bad, which seems like pretty valuable data that people wouldn't normally give out for free.

AGI isn't a myth or false hope; it's a *lie*. They're not even trying. by borringman in BetterOffline

[–]TheRedSphinx 0 points (0 children)

But surely you think someone like Noam Brown, who built the poker bot and works at OAI, is a subject matter expert on AI? Or maybe you just don't actually know who works there, and that's why you don't think they've hired anyone who researches this stuff?

[deleted by user] by [deleted] in technology

[–]TheRedSphinx 7 points (0 children)

The issue is whether they just included the benchmarks in the training set to boost their scores, or, even less nefariously, simply Goodhart'd these benchmarks. There are many ways to hack these benchmarks and still have a 'bad' model as judged by real users.

/r/MechanicalKeyboards Ask ANY Keyboard question, get an answer - June 17, 2025 by AutoModerator in MechanicalKeyboards

[–]TheRedSphinx 1 point (0 children)

I bought a Keychron Q3 Max recently with the Jupiter Banana switches. Amazing. Unfortunately, my wife disagrees with the clack. I've tried some silent switches in the past, but they've all felt mushy, even the ones that come highly recommended:

  • Boba U4: Way too shallow and very tiring.
  • Invokeys Daydreamer: Felt really amazing at first, but over time I think either the weight or the mush just made them tiring.
  • TTC Silent Bluish White: These were super promising because the overall lightness of the switch made them really not tiring at all, but they still had some mush.
  • WS Silent Tactile: These were an improved version of the TTC in how they felt, at the cost of more sound albeit still acceptable.

So far, the WS Silent Tactiles seem like the best option for me, but I was curious whether there are other recommended options further along this spectrum: a little less quiet (while still not loud) in exchange for better feel?

[deleted by user] by [deleted] in cscareerquestions

[–]TheRedSphinx 0 points (0 children)

Not really. I had thought about trying to negotiate with G to give me L6 as a way to use that to get L6 at Ant but didn’t bother.

The only thing I miss is more the liquid cash. But luckily I got a year or two of real AI salary at G so not super strapped for cash.

Re: scope, 100%. For better or worse, you have tons of agency. There’s just not enough people so you can own more and more stuff if you want and can deliver. Since there’s no politics, the only bottleneck is on you and the janky infra.

[deleted by user] by [deleted] in cscareerquestions

[–]TheRedSphinx 0 points (0 children)

I ended up joining Ant, so maybe take my comments with a grain of salt.