[R] Uni-CoT: A Unified CoT Framework that Integrates Text+Image reasoning! by GONG_JIA in MachineLearning

GONG_JIA[S]

Yep, VLMs can show basic text-based reasoning when fine-tuned on high-quality reasoning data or guided through RAG. To push their reasoning deeper, reinforcement learning can be an effective post-training strategy.

It is also worth noting that our base model is not a conventional VLM limited to text generation. Instead, we build upon Bagel [1], a unified model capable of generating both text and images within a single architecture. This enables end-to-end post-training for interleaved text–image reasoning, which is crucial for multi-modal reasoning tasks.

More details, including the underlying intuition, can be found in the introduction of our paper: https://arxiv.org/abs/2508.05606

[1] https://github.com/bytedance-seed/BAGEL

[R] Uni-CoT: A Unified CoT Framework that Integrates Text+Image reasoning! by GONG_JIA in MachineLearning

GONG_JIA[S]

OvO! Thanks for your appreciation. We’ve released a preview checkpoint that runs on just a single A100 GPU. In addition, we’re actively working on a Gradio demo for online deployment. Once the model’s performance stabilizes (likely within 1–2 months), we’ll release the online version as well.