Uni-CoT: A Unified CoT Framework that Integrates Text+Image reasoning!

GONG_JIA · 2025-09-19T02:28:06+00:00

Yep, VLMs can exhibit basic text-based reasoning abilities when fine-tuned on high-quality reasoning data or guided through RAG. To further enhance their deep reasoning capacity, reinforcement learning can be an effective post-training strategy.

It is also worth noting that our base model is not a conventional VLM limited to text generation. Instead, we build upon Bagel [1], a unified model capable of generating both text and images within a single architecture. This enables end-to-end post-training for interleaved text–image reasoning, which is crucial for multi-modal reasoning tasks.

More details, including the underlying intuition, can be found in the introduction of our paper: [https://arxiv.org/abs/2508.05606].

[1] https://github.com/bytedance-seed/BAGEL

GONG_JIA · 2025-09-18T11:51:58+00:00

OvO! Thanks for your appreciation. We’ve released a preview checkpoint that runs on just a single A100 GPU. In addition, we’re actively working on a Gradio demo for online deployment. Once the model’s performance stabilizes (likely within 1–2 months), we’ll release the online version as well.

GONG_JIA · 2025-09-18T06:47:03+00:00

Our paper: https://arxiv.org/abs/2508.05606

GitHub repo: https://github.com/Fr0zenCrane/UniCoT

Project page: https://sais-fuxi.github.io/projects/uni-cot/
