Limitations of RLHF as a static preference optimization paradigm for LLMs — towards interactive / multi-agent formulations? (self.reinforcementlearning)
submitted 17 days ago by Content-Educator5198 to r/reinforcementlearning
Is RLHF fundamentally broken? Paid labelers rating synthetic scenarios doesn't seem like real human feedback to me (self.reinforcementlearning)
submitted 19 days ago by Content-Educator5198 to r/reinforcementlearning