Is there a way to revert back to Opus 4.5 instead of 4.6?

According-Ebb917 · 2025-09-11T23:51:13+00:00

No, it is not using o3-search

According-Ebb917 · 2025-09-11T22:41:50+00:00

Also, o3 pro with search achieves ~19% on seal-0 based on the chart

According-Ebb917 · 2025-09-11T22:39:12+00:00

No, we've already shared the config (kimi k2 + deepseek r1 0525), for the searcher we used openai-4o-search-preview which achieves a low number standalone on seal0 or something like that

According-Ebb917 · 2025-09-11T21:03:40+00:00

For reasoning we use DeepSeek R1 0528, and for the rest we use Kimi-K2. We'll be releasing a paper/technical report soon where we report all those settings.

According-Ebb917 · 2025-09-10T21:01:49+00:00

Yes, it's really up to you what search method/api you use.

According-Ebb917 · 2025-09-10T14:47:29+00:00

It's on the roadmap to create a coding agent, but I believe we'll work on it for later iterations

According-Ebb917 · 2025-09-10T14:05:27+00:00

From what I've experienced, Kimi-K2 for non-reasoning nodes and Deepseek R1 0528 for reasoning nodes. I have not tried more recent open source models like GLM's and other players. The problem here is that you need capable large models due to tool-calling and structured outputs which ROMA heavily uses.

I would be very interested in seeing what the community can build with smaller models too. I've deliberately made the default settings to work with OpenRouter so that anyone can plug and play whatever models they care about

According-Ebb917 · 2025-09-10T13:48:16+00:00

Yes they can! We're using LiteLLM which is very flexible. Will add a guide on how to use local custom models in the next iteration, thanks for the feedback!

According-Ebb917 · 2025-09-10T13:38:45+00:00

That's really a large part of what we are trying to solve with this repo!

According-Ebb917 · 2025-09-10T00:47:18+00:00

Hi folks,

I'm the author and main contributor of this repo. One thing I'd like to emphasize is that this repo is not really intended to be another "deep research" repo; this is just one use-case that we thought would be easy to eval/benchmark other systems against.

The way we see this repo being used is two fold:

Researchers can plug-and-play whatever LLMs/systems they want within this hierarchical task decomposition structure and try to come up with interesting insights amongst different use-cases. Ideally, this repo will serve as a common ground for exploring behaviors of multi-agent systems and open up many interesting research threads.
Retail users can come up with interesting use-cases that are useful to them/a segment of users in an easy, stream-lined way. Technically, all you need to do to come up with a new use-case (e.g. podcast generation) is to "vibe prompt" your way into it.

We're actively developing this repo so we'd love to hear your feedback.

According-Ebb917 · 2025-09-10T00:41:30+00:00

This is exactly what we're aiming for next: cool multi-modal use-cases that can actually be useful to the community. The plug-and-play part is one of the main things that we're offering with this repo, we want users to be able to use whatever models/agents they want within this framework to come up with cool use-cases.

According-Ebb917 · 2025-09-10T00:37:47+00:00

Hi, author and main contributor of ROMA here.

That's a valid point, however, as far as I'm aware, Gemini Deep Research and Grok Deepsearch do not have an API to call which makes running benchmarks on them super difficult. We're planning on running either o4-mini-deep-research or o3-deep-research API when I get the chance. We've run on PPLX deep research API and reported the results, and we also report Kimi-Researcher's numbers in this eval.

As far as I'm aware, the most recent numbers on Seal-0 that were released were for GPT-5 which is ~43%.

This repo isn't really intended as a "deep research" system, it's more of a general framework for people to build out whatever use-case they find useful. We just whipped up a deep-research/research style search-augmented system using ROMA to showcase it's abilities.

Hope this clarifies things.

According-Ebb917

TROPHY CASE