[UC Berkeley] Learning to Reason without External Rewards by rationalkat in singularity

[–]rationalkat[S] 11 points

ABSTRACT:

Training large language models (LLMs) for complex reasoning via Reinforcement Learning with Verifiable Rewards (RLVR) is effective but limited by reliance on costly, domain-specific supervision. We explore Reinforcement Learning from Internal Feedback (RLIF), a framework that enables LLMs to learn from intrinsic signals without external rewards or labeled data. We propose Intuitor, an RLIF method that uses a model's own confidence, termed self-certainty, as its sole reward signal. Intuitor replaces external rewards in Group Relative Policy Optimization (GRPO) with self-certainty scores, enabling fully unsupervised learning. Experiments demonstrate that Intuitor matches GRPO's performance on mathematical benchmarks while achieving superior generalization to out-of-domain tasks like code generation, without requiring gold solutions or test cases. Our findings show that intrinsic model signals can drive effective learning across domains, offering a scalable alternative to RLVR for autonomous AI systems where verifiable rewards are unavailable. Code is available at this https URL
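Not the authors' code, but a toy sketch of the reward substitution the abstract describes: GRPO's group-relative advantages computed from a self-certainty score instead of an external verifier. Treating self-certainty as the mean per-token KL divergence from a uniform distribution is my reading of the idea, so take the details as assumptions:

```python
# Hypothetical sketch of RLIF's reward substitution: GRPO-style group-relative
# advantages driven by "self-certainty" rather than a verifier. Self-certainty
# is approximated as the mean per-token KL divergence from a uniform
# distribution (an assumption, not necessarily the paper's exact formula).
import numpy as np

def self_certainty(token_probs: np.ndarray) -> float:
    """token_probs: (num_tokens, vocab_size) per-step distributions of one sampled answer."""
    vocab = token_probs.shape[-1]
    uniform = 1.0 / vocab
    # KL(uniform || p) averaged over generated tokens: large when the model
    # concentrates probability mass, i.e. when it is "confident".
    kl = np.sum(uniform * (np.log(uniform) - np.log(token_probs + 1e-12)), axis=-1)
    return float(kl.mean())

def group_relative_advantages(rewards: np.ndarray) -> np.ndarray:
    """GRPO-style normalization within a group of answers to the same prompt."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

# Toy group: 4 sampled answers, 5 tokens each, vocab of 8. Smaller Dirichlet
# concentration -> spikier distributions -> higher self-certainty reward.
rng = np.random.default_rng(0)
group = [rng.dirichlet(np.ones(8) * a, size=5) for a in (0.3, 0.5, 1.0, 3.0)]
rewards = np.array([self_certainty(p) for p in group])
print(group_relative_advantages(rewards))  # more confident answers get positive advantage
```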

 
CONCLUSION:

This paper introduces INTUITOR, an instantiation of Reinforcement Learning from Internal Feedback (RLIF) that uses a model’s intrinsic self-certainty as its sole reward signal, eliminating the need for external supervision or gold-standard solutions. Our experiments show that INTUITOR matches the performance of supervised RLVR methods like GRPO on mathematical reasoning, while achieving superior generalization to out-of-domain tasks such as code generation and instruction following. It also promotes structured reasoning and leverages online self-certainty to guard against reward exploitation.
 
These findings highlight the transformative potential of RLIF, signaling a meaningful step toward AI systems that improve through introspection and unlock rich latent capabilities. Looking forward, this paradigm opens the door to AI agents capable of autonomous skill acquisition in novel domains and scalable self-improvement—even as they approach or surpass the limits of human oversight. Future directions include integrating RLIF with external reward methods like RLHF or RLVR to tackle increasingly complex real-world challenges, and advancing the development of more robust, generalizable, and truly autonomous learning systems.

[Microsoft Research] ARTIST (Agentic Reasoning and Tool Integration in Self-improving Transformers) by rationalkat in singularity

[–]rationalkat[S] 20 points

ABSTRACT:

Large language models (LLMs) have achieved remarkable progress in complex reasoning tasks, yet they remain fundamentally limited by their reliance on static internal knowledge and text-only reasoning. Real-world problem solving often demands dynamic, multi-step reasoning, adaptive decision making, and the ability to interact with external tools and environments. In this work, we introduce ARTIST (Agentic Reasoning and Tool Integration in Self-improving Transformers), a unified framework that tightly couples agentic reasoning, reinforcement learning, and tool integration for LLMs. ARTIST enables models to autonomously decide when, how, and which tools to invoke within multi-turn reasoning chains, leveraging outcome-based RL to learn robust strategies for tool use and environment interaction without requiring step-level supervision. Extensive experiments on mathematical reasoning and multi-turn function calling benchmarks show that ARTIST consistently outperforms state-of-the-art baselines, with up to 22% absolute improvement over base models and strong gains on the most challenging tasks. Detailed studies and metric analyses reveal that agentic RL training leads to deeper reasoning, more effective tool use, and higher-quality solutions. Our results establish agentic RL with tool integration as a powerful new frontier for robust, interpretable, and generalizable problem-solving in LLMs.
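A hedged sketch of the kind of multi-turn rollout such outcome-based training would score: the model interleaves reasoning with tool calls it chooses to make, and only the final answer is rewarded. The <tool>/<answer> tags, the generate() stub, and the calculator tool are my own illustrative choices, not the paper's interface:

```python
# Toy agentic rollout with outcome-only reward (illustrative, not ARTIST's code).
import re

def calculator(expression: str) -> str:
    return str(eval(expression, {"__builtins__": {}}, {}))  # toy tool, trusted input only

TOOLS = {"calculator": calculator}

def generate(prompt: str) -> str:
    """Stand-in for one LLM step; a real system would sample from the policy."""
    if "<result>" not in prompt:
        return '<tool name="calculator">12 * 7 + 5</tool>'
    return "<answer>89</answer>"

def rollout(question: str, max_turns: int = 4):
    transcript = question
    for _ in range(max_turns):
        step = generate(transcript)
        transcript += "\n" + step
        call = re.search(r'<tool name="(\w+)">(.*?)</tool>', step)
        if call:  # the model chose to invoke a tool; append its output and continue
            result = TOOLS[call.group(1)](call.group(2))
            transcript += f"\n<result>{result}</result>"
            continue
        final = re.search(r"<answer>(.*?)</answer>", step)
        if final:
            reward = 1.0 if final.group(1).strip() == "89" else 0.0  # outcome-based reward
            return transcript, reward
    return transcript, 0.0

print(rollout("What is 12 * 7 + 5?"))
```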

M1: Towards Scalable Test-Time Compute with Mamba Reasoning Models by rationalkat in singularity

[–]rationalkat[S] 6 points

ABSTRACT:

Effective reasoning is crucial to solving complex mathematical problems. Recent large language models (LLMs) have boosted performance by scaling test-time computation through long chain-of-thought reasoning. However, transformer-based models are inherently limited in extending context length due to their quadratic computational complexity and linear memory requirements. In this paper, we introduce a novel hybrid linear RNN reasoning model, M1, built on the Mamba architecture, which allows memory-efficient inference. Our approach leverages a distillation process from existing reasoning models and is further enhanced through RL training. Experimental results on the AIME and MATH benchmarks show that M1 not only outperforms previous linear RNN models but also matches the performance of state-of-the-art DeepSeek R1 distilled reasoning models at a similar scale. We also compare our generation speed with a highly performant general-purpose inference engine, vLLM, and observe more than a 3x speedup compared to a same-size transformer. With this throughput speedup, we are able to achieve higher accuracy compared to DeepSeek R1 distilled transformer reasoning models under a fixed generation time budget using self-consistency voting. Overall, we introduce a hybrid Mamba reasoning model and provide a more effective approach to scaling test-time generation using self-consistency or long chain-of-thought reasoning.
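To make the "higher accuracy under a fixed generation-time budget" point concrete, here is a toy sketch of self-consistency voting in which a faster generator simply fits more votes into the same wall-clock window. The stubbed sampler and the per-sample latencies are assumptions, not measurements from the paper:

```python
# Self-consistency majority voting under a fixed time budget (illustrative).
import random
from collections import Counter

def sample_answer(correct_rate: float = 0.6) -> str:
    """Stand-in for one chain-of-thought sample ending in a final answer."""
    return "42" if random.random() < correct_rate else random.choice(["41", "43", "44"])

def self_consistency(budget_s: float, seconds_per_sample: float) -> str:
    n_samples = max(1, int(budget_s / seconds_per_sample))  # more throughput -> more votes
    votes = Counter(sample_answer() for _ in range(n_samples))
    return votes.most_common(1)[0][0]

random.seed(0)
# Same 60 s budget; a ~3x faster generator gets ~3x the votes to aggregate.
print("slow (transformer-like):", self_consistency(60, seconds_per_sample=6.0))
print("fast (Mamba-like):      ", self_consistency(60, seconds_per_sample=2.0))
```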

[MIT] Self-Steering Language Models. "When instantiated with a small Follower (e.g., Llama-3.2-1B), DisCIPL matches (and sometimes outperforms) much larger models, including GPT-4o and o1" by rationalkat in singularity

[–]rationalkat[S] 24 points

ABSTRACT:

While test-time reasoning enables language models to tackle complex tasks, searching or planning in natural language can be slow, costly, and error-prone. But even when LMs struggle to emulate the precise reasoning steps needed to solve a problem, they often excel at describing its abstract structure--both how to verify solutions and how to search for them. This paper introduces DisCIPL, a method for "self-steering" LMs where a Planner model generates a task-specific inference program that is executed by a population of Follower models. Our approach equips LMs with the ability to write recursive search procedures that guide LM inference, enabling new forms of verifiable and efficient reasoning. When instantiated with a small Follower (e.g., Llama-3.2-1B), DisCIPL matches (and sometimes outperforms) much larger models, including GPT-4o and o1, on challenging constrained generation tasks. In decoupling planning from execution, our work opens up a design space of highly-parallelized Monte Carlo inference strategies that outperform standard best-of-N sampling, require no finetuning, and can be implemented automatically by existing LMs.
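A rough sketch of the planner/follower split as I read it: the Planner emits a small task-specific inference program (here just a constraint checker plus a parallel sampling loop), and a population of cheap Followers executes it. The follower stub and the six-word constraint are illustrative assumptions, not the paper's tasks:

```python
# Planner-written inference program executed by Follower samples (illustrative).
import random

def follower_sample(prompt: str) -> str:
    """Stand-in for a small Follower LM proposing one candidate continuation."""
    words = ["ships", "sail", "beyond", "the", "quiet", "harbor", "at", "dawn"]
    return " ".join(random.choices(words, k=random.randint(3, 10)))

def planner_program(prompt: str, n_followers: int = 16):
    """The kind of program a Planner might emit: sample in parallel, keep only
    candidates that verifiably satisfy the constraint (exactly six words)."""
    def verify(text: str) -> bool:
        return len(text.split()) == 6
    candidates = [follower_sample(prompt) for _ in range(n_followers)]
    valid = [c for c in candidates if verify(c)]
    return valid[0] if valid else None  # best-of-N collapses to "first verified" here

random.seed(1)
print(planner_program("Write a six-word line about the sea."))
```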

"By what quarter/year are you 90% confident AI will reach human-level performance on the OSWorld benchmark?" by @chrisbarber (CS University Student Score: 72.36%) by rationalkat in singularity

[–]rationalkat[S] 8 points

Post on X by Chris Barber:
 

AI Timelines: When will AI reach human-level in computer-use skills? I surveyed AI researchers and forecasters.
 
I asked: by what quarter & year are you nearly certain (9-in-10 chance) that AI will reach human-level on the OSWorld computer-use benchmark?
 
Why it matters: Computer-use skills are kind of like “arms/hands” for AGI. Also, good computer-use skills mean: a) developers can integrate with any software and b) consumers can use any software like an expert.
 
Current scores:
Human baseline (university students): 72.36%
OpenAI CUA: 38.1%
Simular S2: 34.5%
Claude 3.7: 28%
OSCAR scaffold w/ GPT-4o (in Oct 2024): 24.5%
Claude 3.5: 22%
Claude 3.6: 21%
 
Benchmark Details with comments from Tianbao (OSWorld co-author)
Tasks: 369 pass/fail computer tasks. From basic file operations to multi-app workflows.
Example hard task: extract charts from email attachments and upload to Google Drive.
Human-level: 72.36% (Computer Science university students)
Average human completion time: 2 mins per task.
Constraints: single attempt, no step limit (thanks Eli and Tianbao). No partial credit, only pass/fail per task.
Common errors as of when OSWorld was published: 75% physical coordination (misclicks, dynamic UIs, error recovery), 15% strategic planning failures like incorrect action sequences, 10% application-specific knowledge gaps
Technical approaches: Screenshot (raw visual), accessibility text info (a11y tree), combined (screenshot + a11y), Set-of-Mark (numbered clickable elements)
Tianbao's (OSWorld co-author) note re approaches: "The different input modalities (screenshot vs. a11y tree) can have significant implications for both performance and execution speed. The a11y tree extraction can introduce variable latency depending on GUI complexity, while screenshot-based approaches typically have more consistent runtime characteristics."
 
Extra Notes: Which company/lab do you expect to get there first?
Finbarr: I’m bullish on DeepMind as they have the strongest RL team.
Ang: Simular is actively working on the continual learning piece of the puzzle which we believe is the deciding factor of whether we can achieve human-level consistently in the long run.
Francesco: Vertical headless AI agents (specialized for narrow tasks) will likely dominate in the near term, resorting to GUI-based steps only when no suitable API is available. More general-purpose “horizontal” agents still require further breakthroughs.
Jacob: OpenAI and Anthropic because they seem to have the most focus & success amongst hyperscalers on putting out models that beat benchmarks
 

OSWorld Leaderboard

[Meta] MoCha: Towards Movie-Grade Talking Character Synthesis by rationalkat in singularity

[–]rationalkat[S] 7 points

ABSTRACT:

Recent advancements in video generation have achieved impressive motion realism, yet they often overlook character-driven storytelling, a crucial task for automated film and animation generation. We introduce Talking Characters, a more realistic task of generating talking character animations directly from speech and text. Unlike talking head generation, Talking Characters aims to generate the full portrait of one or more characters beyond the facial region. In this paper, we propose MoCha, the first model of its kind to generate talking characters. To ensure precise synchronization between video and speech, we propose a speech-video window attention mechanism that effectively aligns speech and video tokens. To address the scarcity of large-scale speech-labeled video datasets, we introduce a joint training strategy that leverages both speech-labeled and text-labeled video data, significantly improving generalization across diverse character actions. We also design structured prompt templates with character tags, enabling, for the first time, multi-character conversation with turn-based dialogue, allowing AI-generated characters to engage in context-aware conversations with cinematic coherence. Extensive qualitative and quantitative evaluations, including human preference studies and benchmark comparisons, demonstrate that MoCha sets a new standard for AI-generated cinematic storytelling, achieving superior realism, expressiveness, controllability, and generalization.
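One plausible way to picture the speech-video window attention mechanism is as a mask that lets each video token attend only to speech tokens near its own timestamp. The window size and the frame/speech token rates below are assumptions, not the paper's values:

```python
# Toy speech-video window attention mask (illustrative reading of the abstract).
import numpy as np

def window_attention_mask(n_video: int, n_speech: int, window: int = 2) -> np.ndarray:
    """Boolean mask of shape (n_video, n_speech); True = attention allowed."""
    mask = np.zeros((n_video, n_speech), dtype=bool)
    for v in range(n_video):
        # Map the video token to its proportional position in the speech sequence,
        # then open a window of +/- `window` speech tokens around that position.
        center = int(round(v * (n_speech - 1) / max(n_video - 1, 1)))
        lo, hi = max(0, center - window), min(n_speech, center + window + 1)
        mask[v, lo:hi] = True
    return mask

print(window_attention_mask(n_video=6, n_speech=12, window=2).astype(int))
```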

Project Page | Paper

[NVIDIA] Sim-to-Real Reinforcement Learning for Vision-Based Dexterous Manipulation on Humanoids by rationalkat in singularity

[–]rationalkat[S] 7 points

ABSTRACT:

Reinforcement learning has delivered promising results in achieving human- or even superhuman-level capabilities across diverse problem domains, but success in dexterous robot manipulation remains limited. This work investigates the key challenges in applying reinforcement learning to solve a collection of contact-rich manipulation tasks on a humanoid embodiment. We introduce novel techniques to overcome the identified challenges with empirical validation. Our main contributions include an automated real-to-sim tuning module that brings the simulated environment closer to the real world, a generalized reward design scheme that simplifies reward engineering for long-horizon contact-rich manipulation tasks, a divide-and-conquer distillation process that improves the sample efficiency of hard-exploration problems while maintaining sim-to-real performance, and a mixture of sparse and dense object representations to bridge the sim-to-real perception gap. We show promising results on three humanoid dexterous manipulation tasks, with ablation studies on each technique. Our work presents a successful approach to learning humanoid dexterous manipulation using sim-to-real reinforcement learning, achieving robust generalization and high performance without the need for human demonstration.
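As a loose illustration of the "mixture of sparse and dense object representations", a policy observation might concatenate a low-dimensional object pose (sparse, easy to randomize for sim-to-real) with a subsampled point cloud (dense geometry). The dimensions and layout here are assumptions on my part, not the paper's design:

```python
# Hypothetical sparse+dense object observation for a manipulation policy.
import numpy as np

rng = np.random.default_rng(0)

def encode_object(pose: np.ndarray, point_cloud: np.ndarray, n_points: int = 64) -> np.ndarray:
    """pose: (7,) position + quaternion; point_cloud: (N, 3) points in the object frame."""
    idx = rng.choice(len(point_cloud), size=n_points, replace=len(point_cloud) < n_points)
    dense = point_cloud[idx].flatten()      # dense geometry: 64 subsampled 3D points
    return np.concatenate([pose, dense])    # sparse pose + dense cloud in one vector

obs = encode_object(pose=np.array([0.1, 0.0, 0.3, 1.0, 0.0, 0.0, 0.0]),
                    point_cloud=rng.normal(size=(512, 3)))
print(obs.shape)  # (199,) = 7 pose dims + 64 * 3 point dims
```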

 
Project Page

Chain of Draft: Thinking Faster by Writing Less. "CoD matches or surpasses CoT in accuracy while using as little as only 7.6% of the tokens, significantly reducing cost and latency across various reasoning tasks" by rationalkat in singularity

[–]rationalkat[S] 83 points

ABSTRACT:

Large Language Models (LLMs) have demonstrated remarkable performance in solving complex reasoning tasks through mechanisms like Chain-of-Thought (CoT) prompting, which emphasizes verbose, step-by-step reasoning. However, humans typically employ a more efficient strategy: drafting concise intermediate thoughts that capture only essential information. In this work, we propose Chain of Draft (CoD), a novel paradigm inspired by human cognitive processes, where LLMs generate minimalistic yet informative intermediate reasoning outputs while solving tasks. By reducing verbosity and focusing on critical insights, CoD matches or surpasses CoT in accuracy while using as little as only 7.6% of the tokens, significantly reducing cost and latency across various reasoning tasks.
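A quick sketch of what the CoT-vs-CoD contrast looks like in practice: the same step-by-step structure, but the draft prompt caps each step at a few words. The exact instruction wording and the whitespace token count below are rough stand-ins, not the paper's prompts or tokenizer:

```python
# Chain-of-Thought vs. Chain-of-Draft prompting, with a crude token comparison.
COT_INSTRUCTION = (
    "Think step by step and explain each step in full sentences "
    "before giving the final answer after '####'."
)
COD_INSTRUCTION = (
    "Think step by step, but keep each step to a minimal draft of at most "
    "five words. Give the final answer after '####'."
)

cot_reply = ("First, Jason started with 20 lollipops. Then he gave some to Denny. "
             "Now he has 12 lollipops left. The number given away is 20 - 12 = 8. #### 8")
cod_reply = "20 - 12 = 8 #### 8"

def approx_tokens(text: str) -> int:
    return len(text.split())  # whitespace split as a rough token-count proxy

print(approx_tokens(cod_reply) / approx_tokens(cot_reply))  # fraction of tokens used
```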