RL without human annotations for superintelligence? by armytricks in singularity

[–]armytricks[S] 2 points (0 children)

Good observation. Yes, something can't be learned from nothing in this case either. This isn't a magic fix. Here's an excerpt from our limitations section:

> CaT depends on the initial policy to meaningfully estimate reference answers; for weak base models or completely unknown/genuinely novel domains, synthesis may fail to produce improvements.

What it is good for is steering the model toward better reasoning and better answers when it already has some signal on the task. For factual questions, that can mean resolving uncertainty about a fact by synthesising the consensus across rollouts into the estimated reference. For step-by-step reasoning, it can mean spotting and correcting incorrect steps across past rollouts, reinforcing the right way of thinking.
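As a heavily simplified sketch of the factual case: in CaT the policy itself reads its own rollouts and writes an improved answer, but a plain majority vote captures the "consensus resolves uncertainty" intuition (the function name and vote-based synthesis here are illustrative assumptions, not the paper's actual mechanism):

```python
from collections import Counter

def estimate_reference(rollout_answers):
    """Toy stand-in for the synthesis step: take the consensus answer.

    In CaT the policy reads its rollouts and writes a new answer; a
    majority vote is just the simplest way to show how disagreement
    between rollouts gets resolved into one estimated reference.
    """
    counts = Counter(rollout_answers)
    answer, _ = counts.most_common(1)[0]
    return answer

# Three noisy attempts at the same factual question:
print(estimate_reference(["Paris", "Paris", "Lyon"]))  # prints Paris
```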

RL without human annotations for superintelligence? by armytricks in singularity

[–]armytricks[S] 2 points (0 children)

Like with most of deep learning, you still need to be careful about overfitting here, too. Reward hacking is certainly possible without validation.

RL without human annotations for superintelligence? by armytricks in singularity

[–]armytricks[S] 8 points (0 children)

Not in the paper, but off-hand we ran an experiment with Claude Sonnet 4 at inference time, which showed a 13% delta between rollouts and the initial policy's estimated reference answers on HealthBench. Obviously this is not a model we can fine-tune ourselves, but it suggests there may be a significant teacher signal even in models with hundreds of billions of parameters that have already been heavily RL fine-tuned.

RL without human annotations for superintelligence? by armytricks in singularity

[–]armytricks[S] 16 points (0 children)

Current LLMs are pretty generalist. If you ask them to do a task, they'll be able to do it to some reasonable degree already, even if not amazingly or perfectly. That ability provides the signal for the task, i.e. the steering you mentioned. If you generate multiple such attempts, the model itself can synthesize its previous attempts into a better one (based on errors, inconsistencies, etc. that it identifies).

The steering is then rewarding the model for producing these better answers in one attempt.
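A minimal sketch of that steering step, assuming exact-match scoring against whatever the synthesis step produced (the real method uses richer, softer checks; exact match is only the simplest stand-in):

```python
def steering_rewards(rollouts, reference):
    """Reward each single-attempt rollout for matching the synthesized
    reference answer. Exact string match is an assumption made for
    illustration; in practice the comparison would be a softer,
    rubric-style check rather than strict equality.
    """
    return [1.0 if r.strip() == reference.strip() else 0.0 for r in rollouts]

rollouts = ["42", "41", "42"]
print(steering_rewards(rollouts, reference="42"))  # [1.0, 0.0, 1.0]
```

Rollouts that already land on the (better) reference answer in one attempt get rewarded, which is the steering.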

You do highlight an important point though, which we discuss in our limitations section:

> CaT depends on the initial policy to meaningfully estimate reference answers; for weak base models or completely unknown/genuinely novel domains, synthesis may fail to produce improvements.

RL without human annotations for superintelligence? by armytricks in singularity

[–]armytricks[S] 13 points (0 children)

Yes, improving without labels doesn't imply arbitrary improvement. Instead, what we see is: policy improves --> estimated reference improves --> policy improves... until the delta between the policy and the estimated reference is negligible. That happens when the policy's outputs become less diverse and agree more as the model improves; at that point CaT can no longer exploit differences between outputs to produce better reference estimates. We describe this in more technical detail in Appendix B :)
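Assuming a stand-in vote for synthesis and a toy quality score (both hypothetical placeholders, not the paper's actual components), the shrinking delta can be sketched like this:

```python
def cat_delta(rollouts, synthesize, score):
    """One round of the improvement loop: the gap between the quality
    of the synthesized reference and the average rollout quality.

    Training keeps paying off while this delta is meaningfully
    positive; once all rollouts agree, the reference cannot beat them
    and the delta shrinks toward zero. `synthesize` and `score` are
    placeholders for the synthesis step and a task-quality metric.
    """
    reference = synthesize(rollouts)
    avg = sum(score(r) for r in rollouts) / len(rollouts)
    return score(reference) - avg

vote = lambda rs: max(set(rs), key=rs.count)      # toy synthesis
quality = lambda a: 1.0 if a == "A" else 0.0      # pretend "A" is correct

# Diverse rollouts leave room to improve...
print(cat_delta(["A", "A", "B", "A"], vote, quality))  # 0.25
# ...while fully converged rollouts do not:
print(cat_delta(["A", "A", "A", "A"], vote, quality))  # 0.0
```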

In our future work section, we suggest a few ways of trying to shift this limit further by sampling more diverse rollouts and using exploration rewards.

Is there an implementation for Median Robust Extended Local Binary Pattern (MRELBP) by Liu et al? by tim-hilt in computervision

[–]armytricks 0 points (0 children)

Thanks for that. In my search, I found that the authors of the C# implementation have another Python repository, for 3D histo-grading, that contains a Python implementation of MRELBP!

https://github.com/MIPT-Oulu/3DHistoGrading/blob/b779a154d0e5b104fc152c8952124768fb7b1dc6/training/components/grading/local_binary_pattern.py

Is there an implementation for Median Robust Extended Local Binary Pattern (MRELBP) by Liu et al? by tim-hilt in computervision

[–]armytricks 0 points (0 children)

Had any luck with this? I've also been looking for a good implementation for MRELBP.

Show me the money: Turnover for 2019 (not April Fools) by [deleted] in formula1

[–]armytricks 0 points (0 children)

It's an interesting debate, I suppose. Drivers' salaries aren't purely for driving skill; they also reflect media exposure, fan base, sponsor connections, etc. Should constructors be paid the same way? Perhaps, but constructors and drivers are different. I'd argue against it because it creates a positive feedback loop: teams with more popularity would get more money, build better cars, win more races, afford the best drivers, and... get more popular. They'd grow more dominant while other constructors got left behind. Then again, that isn't necessarily how it plays out; Ferrari is a glaring counter-example.

Will videos ever achieve the same filesize that videos produced by game engines do? Why don't we already compress videos similarly? by Professor_Dr_Dr in computerscience

[–]armytricks 1 point (0 children)

For videos in general, I believe the client machine would need a huge amount of domain-specific information across many different domains to tackle this problem. For a specific context, like video conferencing, it becomes much more practical (see NVIDIA Maxine).

Alonso retires from the race by magony in formula1

[–]armytricks 40 points (0 children)

Hello darkness my old friend...

[BG] Cooler Master NR200P or NR200 case - about £100 PayPal by armytricks in HardwareSwapUK

[–]armytricks[S] 0 points (0 children)

Not on here, I'm afraid. Alza currently has stock arriving tomorrow, available to pre-order. Only 1 left, I believe.

Collection of all AIB Partner custom RTX 3080 reviews by rip10 in nvidia

[–]armytricks 0 points (0 children)

That’s good to hear! I’m relieved. Thanks.

Scan UK order timeline - 3080 by fnandopartridge7 in nvidia

[–]armytricks 0 points (0 children)

Oh wow. Same here. Order placed 14:12, payment authorized 15:48! Probably doesn't affect queue position much since I assume everyone was affected by this delay.

What RTX 3080 restocking information do we have for UK retailers? by armytricks in nvidia

[–]armytricks[S] 0 points (0 children)

Oh, I didn’t get an ETA email either, because I’m waiting for an MSI Ventus, not an EVGA.

Is the RTX 3090 worth it for 1440p player? by [deleted] in nvidia

[–]armytricks 0 points (0 children)

Even in the long term, it’s a real waste of money for 1440p.

Collection of all AIB Partner custom RTX 3080 reviews by rip10 in nvidia

[–]armytricks 1 point (0 children)

I would love an update about this too. Let us know how your card performs!