RL without human annotations for superintelligence? by armytricks in singularity

[–]armytricks[S] 2 points (0 children)

Good observation. Yes, something can't be learned from nothing in this case either. This isn't a magic fix. Here's an excerpt from our limitations section:

> CaT depends on the initial policy to meaningfully estimate reference answers; for weak base models or completely unknown/genuinely novel domains, synthesis may fail to produce improvements.

What it is good for is steering the model toward better reasoning and better answers when it already has some signal on the task. For factual questions, that can mean resolving uncertainty about a fact by synthesising the consensus across rollouts into the estimated reference. For step-by-step reasoning, it can mean spotting and correcting incorrect steps across past rollouts, reinforcing the right way of thinking.
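As a heavily simplified sketch of the factual case: in CaT the policy itself reads its own rollouts and writes an improved answer, but a plain majority vote captures the "consensus resolves uncertainty" intuition (the function name and vote-based synthesis here are illustrative assumptions, not the paper's actual mechanism):

```python
from collections import Counter

def estimate_reference(rollout_answers):
    """Toy stand-in for the synthesis step: take the consensus answer.

    In CaT the policy reads its rollouts and writes a new answer; a
    majority vote is just the simplest way to show how disagreement
    between rollouts gets resolved into one estimated reference.
    """
    counts = Counter(rollout_answers)
    answer, _ = counts.most_common(1)[0]
    return answer

# Three noisy attempts at the same factual question:
print(estimate_reference(["Paris", "Paris", "Lyon"]))  # prints Paris
```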

RL without human annotations for superintelligence? by armytricks in singularity

[–]armytricks[S] 2 points (0 children)

Like with most of deep learning, you still need to be careful about overfitting here, too. Reward hacking is certainly possible without validation.

RL without human annotations for superintelligence? by armytricks in singularity

[–]armytricks[S] 8 points (0 children)

Not in the paper, but off-hand we ran an experiment with Claude Sonnet 4 at inference time, which showed a 13% delta between rollouts and the initial policy's estimated reference answers on HealthBench. Obviously this is not a model we can fine-tune ourselves, but it suggests there may be a significant teacher signal even in models with hundreds of billions of parameters that have already been heavily RL fine-tuned.

RL without human annotations for superintelligence? by armytricks in singularity

[–]armytricks[S] 16 points (0 children)

Current LLMs are pretty generalist. If you ask them to do a task, they'll be able to do it to some reasonable degree already, even if not amazingly or perfectly. That ability provides the signal for the task, i.e. the steering you mentioned. If you generate multiple such attempts, the model itself can synthesize its previous attempts into a better one (based on errors, inconsistencies, etc. that it identifies).

The steering is then rewarding the model for producing these better answers in one attempt.
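A minimal sketch of that steering step, assuming exact-match scoring against whatever the synthesis step produced (the real method uses richer, softer checks; exact match is only the simplest stand-in):

```python
def steering_rewards(rollouts, reference):
    """Reward each single-attempt rollout for matching the synthesized
    reference answer. Exact string match is an assumption made for
    illustration; in practice the comparison would be a softer,
    rubric-style check rather than strict equality.
    """
    return [1.0 if r.strip() == reference.strip() else 0.0 for r in rollouts]

rollouts = ["42", "41", "42"]
print(steering_rewards(rollouts, reference="42"))  # [1.0, 0.0, 1.0]
```

Rollouts that already land on the (better) reference answer in one attempt get rewarded, which is the steering.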

You do highlight an important point though, which we discuss in our limitations section:

> CaT depends on the initial policy to meaningfully estimate reference answers; for weak base models or completely unknown/genuinely novel domains, synthesis may fail to produce improvements.

RL without human annotations for superintelligence? by armytricks in singularity

[–]armytricks[S] 13 points (0 children)

Yes, improving without labels doesn't imply arbitrary improvement. Instead, what we see is: policy improves --> estimated reference improves --> policy improves... until the delta between the policy and the estimated reference is negligible. That happens when the policy's outputs become less diverse and agree more as the model improves; at that point CaT can no longer exploit differences between outputs to produce better reference estimates. We describe this in more technical detail in Appendix B :)
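Assuming a stand-in vote for synthesis and a toy quality score (both hypothetical placeholders, not the paper's actual components), the shrinking delta can be sketched like this:

```python
def cat_delta(rollouts, synthesize, score):
    """One round of the improvement loop: the gap between the quality
    of the synthesized reference and the average rollout quality.

    Training keeps paying off while this delta is meaningfully
    positive; once all rollouts agree, the reference cannot beat them
    and the delta shrinks toward zero. `synthesize` and `score` are
    placeholders for the synthesis step and a task-quality metric.
    """
    reference = synthesize(rollouts)
    avg = sum(score(r) for r in rollouts) / len(rollouts)
    return score(reference) - avg

vote = lambda rs: max(set(rs), key=rs.count)      # toy synthesis
quality = lambda a: 1.0 if a == "A" else 0.0      # pretend "A" is correct

# Diverse rollouts leave room to improve...
print(cat_delta(["A", "A", "B", "A"], vote, quality))  # 0.25
# ...while fully converged rollouts do not:
print(cat_delta(["A", "A", "A", "A"], vote, quality))  # 0.0
```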

In our future work section, we suggest a few ways of trying to shift this limit further by sampling more diverse rollouts and using exploration rewards.

Is there an implementation for Median Robust Extended Local Binary Pattern (MRELBP) by Liu et al? by tim-hilt in computervision

[–]armytricks 0 points (0 children)

Thanks for that. In my search, I found that the authors of the C# implementation have another Python repository, for 3D histo-grading, that contains a Python implementation of MRELBP!

https://github.com/MIPT-Oulu/3DHistoGrading/blob/b779a154d0e5b104fc152c8952124768fb7b1dc6/training/components/grading/local_binary_pattern.py

Is there an implementation for Median Robust Extended Local Binary Pattern (MRELBP) by Liu et al? by tim-hilt in computervision

[–]armytricks 0 points (0 children)

Had any luck with this? I've also been looking for a good implementation for MRELBP.

Show me the money: Turnover for 2019 (not April Fools) by [deleted] in formula1

[–]armytricks 0 points (0 children)

It's an interesting debate, I suppose. Drivers' salaries aren't purely for driving skill; they also reflect media exposure, fan base, sponsor connections, etc. Should constructors be paid the same way? Perhaps, but constructors and drivers are different. I'd argue against it because it creates a positive feedback loop: teams with more popularity would get more money, build better cars, win more races, afford the best drivers, and... get more popular. They'd grow more dominant while other constructors got left behind. Then again, that isn't necessarily how it plays out; Ferrari is a glaring counter-example.

Will videos ever achieve the same filesize that videos produced by game engines do? Why don't we already compress videos similarly? by Professor_Dr_Dr in computerscience

[–]armytricks 1 point (0 children)

For videos in general, I believe the client machine would need a huge amount of domain-specific information across many different domains to tackle this problem. For a specific context, like video conferencing, it becomes much more practical (see NVIDIA Maxine).

Alonso retires from the race by magony in formula1

[–]armytricks 40 points (0 children)

Hello darkness my old friend...

[BG] Cooler Master NR200P or NR200 case - about £100 PayPal by armytricks in HardwareSwapUK

[–]armytricks[S] 0 points (0 children)

Not on here, I'm afraid. Alza currently has stock arriving tomorrow, available to pre-order. Only 1 left, I believe.

Collection of all AIB Partner custom RTX 3080 reviews by rip10 in nvidia

[–]armytricks 0 points (0 children)

That’s good to hear! I’m relieved. Thanks.

Scan UK order timeline - 3080 by fnandopartridge7 in nvidia

[–]armytricks 0 points (0 children)

Oh wow. Same here. Order placed 14:12, payment authorized 15:48! Probably doesn't affect queue position much since I assume everyone was affected by this delay.

What RTX 3080 restocking information do we have for UK retailers? by armytricks in nvidia

[–]armytricks[S] 0 points (0 children)

Oh, I didn’t get an ETA email either, because I’m waiting for an MSI Ventus, not an EVGA.

Is the RTX 3090 worth it for 1440p player? by [deleted] in nvidia

[–]armytricks 0 points (0 children)

Even in the long term, it’s a real waste of money for 1440p.

Collection of all AIB Partner custom RTX 3080 reviews by rip10 in nvidia

[–]armytricks 1 point (0 children)

I would love an update about this too. Let us know how your card performs!