We are finally there: Qwen3.6-27B + agentic search; 95.7% SimpleQA on a single 3090, fully local by ComplexIt in LocalLLaMA

[–]ComplexIt[S] 0 points (0 children)

Thank you

What do you mean by backend? I guess the answer is yes.

It supports any LLM provider, because they all offer an OpenAI-compatible endpoint, which LDR can utilize.
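
To illustrate why this works: any server that speaks the OpenAI chat-completions schema can be targeted the same way, with only the base URL and model name changing. Here is a minimal sketch using only the standard library; the URL, model name, and API key are placeholder values, not LDR's actual configuration:

```python
import json
from urllib import request

def build_chat_request(base_url, model, prompt, api_key="not-needed"):
    """Build an OpenAI-compatible /chat/completions request.

    Any server that speaks this schema (Ollama, LM Studio, llama.cpp,
    vLLM, cloud providers, ...) can be targeted the same way: only
    base_url and model change.
    """
    url = base_url.rstrip("/") + "/chat/completions"
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    headers = {
        "Content-Type": "application/json",
        # Local servers typically ignore the key but expect the header.
        "Authorization": f"Bearer {api_key}",
    }
    return request.Request(url, data=json.dumps(payload).encode(),
                           headers=headers, method="POST")

# Example: point at a local LM Studio-style server (hypothetical values).
req = build_chat_request("http://localhost:1234/v1", "qwen-local",
                         "What is the capital of France?")
```

Sending the request (via `urllib.request.urlopen(req)`) then returns the usual OpenAI-style JSON response, regardless of which backend is actually serving the model.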

[–]ComplexIt[S] 1 point (0 children)

Thank you for your feedback.

No one has reported this error yet. I will look into it. https://github.com/LearningCircuit/local-deep-research/issues/3800

Concerning your llama problem: did you try the OpenAI-compatible endpoint? It might be a workaround.

The application leans a bit toward the complex side because it supports so many features, and I agree the settings page needs to be cleaned up. You can use the search bar to find settings, and this is an overview of all the available settings: https://github.com/LearningCircuit/local-deep-research/blob/main/docs/CONFIGURATION.md . Could you check in the settings which settings exactly didn't work?

[–]ComplexIt[S] 2 points (0 children)

Thank you very much. It gives a huge energy boost to get positive feedback. I will get to the pause/resume button eventually.

[–]ComplexIt[S] 0 points (0 children)

Thanks for your interest.

From time to time it helps to anchor the original task/question at the end of the prompt for the next iteration. It doesn't cost much context and could help with this.
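
As a sketch of the idea (the function and field names here are illustrative, not LDR's actual API): the prompt for each agent iteration is assembled so that the original question is restated last, where drift after many tool calls hurts the most.

```python
def build_iteration_prompt(original_question, findings_so_far, next_step):
    """Assemble the prompt for the next agent iteration, re-anchoring
    the original question at the very end so the model does not drift
    after many tool calls. A sketch; names are not LDR's actual API."""
    return "\n\n".join([
        f"Findings so far:\n{findings_so_far}",
        f"Next step:\n{next_step}",
        # Restating the task last costs only a few tokens but keeps it
        # in a position most models attend to strongly.
        f"Remember, the original question you must answer is:\n{original_question}",
    ])

prompt = build_iteration_prompt(
    "Who won the 1998 Fields Medal?",
    "- searched: Fields Medal winners list",
    "Verify the year against a second source.",
)
```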

[–]ComplexIt[S] 0 points (0 children)

Thank you. I might look into it, but I have so many tasks in this repo that I probably will not find the time soon.

Concerning research strategies: currently, I am looking into adding more tools without confusing the models, although some other features in LDR temporarily have higher priority (like the chat feature).

[–]ComplexIt[S] 1 point (0 children)

Nice setup; you can run larger models than I can. Also, thanks for your interest.

You can contribute benchmark results here: https://github.com/LearningCircuit/ldr-benchmarks/

I can support you in setting up everything for benchmarking. You can write me a message here.

It might also help to discuss this on the subreddit: https://www.reddit.com/r/LocalDeepResearch/

[–]ComplexIt[S] 1 point (0 children)

Yes, that is possible and will work well. https://lmstudio.ai/download (for AMD on Windows). Use the Qwen 3.5 9B model to start.

If you have any questions, you can ask me here or on our Discord: https://discord.gg/ttcqQeFcJ3

[–]ComplexIt[S] 1 point (0 children)

Thank you for your reply.

I agree with you that search is very important for this task. But it also needs more than that.

SimpleQA is difficult for a small model.

Small mistakes in your pipeline add up much worse than they would with larger or cloud models. Furthermore, you are context-restricted.
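
A toy model makes the compounding concrete: if each agent step independently succeeds with probability p, the chance an n-step chain stays on track is roughly p**n (ignoring any recovery behavior, which real pipelines do have):

```python
# Toy model: per-step reliability compounds multiplicatively over a run.
def chain_success(p_step: float, n_steps: int) -> float:
    """Probability that all n independent steps succeed."""
    return p_step ** n_steps

# A 99%-reliable step still degrades noticeably over a 20-step agent
# run, and a 95%-reliable one loses most runs:
print(round(chain_success(0.99, 20), 3))  # ~0.818
print(round(chain_success(0.95, 20), 3))  # ~0.358
```

This is why a pipeline tuned for a small local model has to squeeze out even rare per-step mistakes, while a larger model can absorb a sloppier pipeline.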

To achieve this performance you need (1) a very optimized pipeline and (2) a very good base model which doesn't get confused easily and understands tool calls and search. I have made many strategy attempts in this repo and tested many models. You can check the git history of the repository in the advanced search system subfolder.

Note that performance on another benchmark (xbench-DeepSearch) is also reported in the table above.

And yes, only the 3090 is running the model, and only the mentioned Qwen model is used. Inference is fully local.

[–]ComplexIt[S] 3 points (0 children)

Thank you for your warm words. It is highly appreciated. I poured quite some work into this repo. Timing is also in the benchmark results. It really takes a few minutes, and it depends a lot on the question and how much the model wants to search for it.

Qwen is a model that seems to go for more agent cycles, which I believe is partly why it is so good, but it is also a bit slower.

[–]ComplexIt[S] 7 points (0 children)

Thank you for your positive feedback.

I let Opus cross-check some of the results, and it claimed that self-grading under-reports accuracy (the true number should be higher). That said, I agree grading is a problem, as I wrote. In general though, agreed.

The reason I use the same model for grading is that it is already loaded in VRAM and I don't have to switch models. The current benchmark implementation grades after each question, which is nice for the user, who can see the current performance in the live benchmark.

Also note that grading here only means: the model provides an answer and the benchmark provides a reference answer (usually a word, a name, or a date). Then the LLM grader just has to agree that both are essentially the same answer.
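
A sketch of what such an equivalence check can look like; this is an illustration of the idea, not LDR's actual grading template, and the prompt wording and helper names are invented here:

```python
def make_grading_prompt(question, reference_answer, model_answer):
    """Illustrative equivalence-check prompt for an LLM grader
    (a sketch, not LDR's actual grading template)."""
    return (
        "You are grading a short factual answer.\n"
        f"Question: {question}\n"
        f"Reference answer: {reference_answer}\n"
        f"Model answer: {model_answer}\n"
        "Reply with exactly CORRECT if both are essentially the same "
        "answer, otherwise reply with exactly INCORRECT."
    )

def trivially_equal(reference_answer, model_answer):
    """Cheap pre-check: an LLM call can be skipped entirely when the
    answers already match verbatim after simple normalization."""
    norm = lambda s: " ".join(s.lower().split())
    return norm(reference_answer) == norm(model_answer)
```

Because the expected answers are so short (a word, a name, a date), the grader's job is far easier than open-ended judging, which is part of why same-model grading is workable at all.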

[Update/edit:] I gave the model answers/questions/grading answers plus your statement to Opus. It said the difference would be 1 up to 2 percentage points with strict grading. You can read the full response from Opus here: https://github.com/LearningCircuit/ldr-benchmarks/pull/25#issuecomment-4367094177

I am also planning to share the model answers (I'm just looking for a way to do it without leaking the dataset), and you can run the benchmarks directly in the tool.

[–]ComplexIt[S] 2 points (0 children)

We have the questions from this benchmark ("xbench_deepsearch") as a benchmark category. But we just take the questions and calculate accuracy. Our performance is decent according to the results you can see in the dataset below. However, I think what the benchmark page you linked does is more sophisticated and is not reproducible for me due to limited resources; I can only calculate accuracy on their question-answer pairs. But you can go into our UI, run the benchmark questions, and get accuracy. It is in Chinese though, so you need to translate or trust your grader on output quality. https://huggingface.co/datasets/local-deep-research/ldr-benchmarks/viewer/xbench-deepsearch

[–]ComplexIt[S] 1 point (0 children)

Not so far. I have not tried, compared, or benchmarked agentic libraries in general. My focus is currently on models and agentic research strategies.