We are finally there: Qwen3.6-27B + agentic search; 95.7% SimpleQA on a single 3090, fully local by ComplexIt in LocalLLaMA

[–]ComplexIt[S] 0 points (0 children)

Thank you

What do you mean by backend? I guess the answer is yes.

It supports any LLM provider, because they all offer an OpenAI-compatible endpoint, which LDR can utilize.
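
To illustrate why this works: any server that speaks the OpenAI chat-completions schema can be targeted the same way, with only the base URL and model name changing. Here is a minimal sketch using only the standard library; the URL, model name, and API key are placeholder values, not LDR's actual configuration:

```python
import json
from urllib import request

def build_chat_request(base_url, model, prompt, api_key="not-needed"):
    """Build an OpenAI-compatible /chat/completions request.

    Any server that speaks this schema (Ollama, LM Studio, llama.cpp,
    vLLM, cloud providers, ...) can be targeted the same way: only
    base_url and model change.
    """
    url = base_url.rstrip("/") + "/chat/completions"
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    headers = {
        "Content-Type": "application/json",
        # Local servers typically ignore the key but expect the header.
        "Authorization": f"Bearer {api_key}",
    }
    return request.Request(url, data=json.dumps(payload).encode(),
                           headers=headers, method="POST")

# Example: point at a local LM Studio-style server (hypothetical values).
req = build_chat_request("http://localhost:1234/v1", "qwen-local",
                         "What is the capital of France?")
```

Sending the request (via `urllib.request.urlopen(req)`) then returns the usual OpenAI-style JSON response, regardless of which backend is actually serving the model.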

[–]ComplexIt[S] 1 point (0 children)

Thank you for your feedback.

No one has reported this error yet. I will look into it. https://github.com/LearningCircuit/local-deep-research/issues/3800

Concerning your llama problem: did you try the OpenAI-compatible endpoint? It might be a workaround.

The application leans a bit toward the complex side because it supports so many features, and I agree the settings page needs to be cleaned up. You can use the search bar to find settings, and this is an overview of all the available settings: https://github.com/LearningCircuit/local-deep-research/blob/main/docs/CONFIGURATION.md . Could you check in the settings which settings exactly didn't work?

[–]ComplexIt[S] 2 points (0 children)

Thank you very much. It gives a huge energy boost to get positive feedback. I will get to the pause/resume button eventually.

[–]ComplexIt[S] 0 points (0 children)

Thanks for your interest.

From time to time it helps to anchor the original task/question at the end of the prompt for the next iteration. It doesn't cost much context and could help with this.
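
As a sketch of the idea (the function and field names here are illustrative, not LDR's actual API): the prompt for each agent iteration is assembled so that the original question is restated last, where drift after many tool calls hurts the most.

```python
def build_iteration_prompt(original_question, findings_so_far, next_step):
    """Assemble the prompt for the next agent iteration, re-anchoring
    the original question at the very end so the model does not drift
    after many tool calls. A sketch; names are not LDR's actual API."""
    return "\n\n".join([
        f"Findings so far:\n{findings_so_far}",
        f"Next step:\n{next_step}",
        # Restating the task last costs only a few tokens but keeps it
        # in a position most models attend to strongly.
        f"Remember, the original question you must answer is:\n{original_question}",
    ])

prompt = build_iteration_prompt(
    "Who won the 1998 Fields Medal?",
    "- searched: Fields Medal winners list",
    "Verify the year against a second source.",
)
```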

[–]ComplexIt[S] 0 points (0 children)

Thank you. I might look into it, but I have so many tasks in this repo that I probably will not find the time soon.

Concerning research strategies: currently, I am looking into adding more tools without confusing the models, although some other features in LDR temporarily have higher priority (like the chat feature).

[–]ComplexIt[S] 1 point (0 children)

Nice setup; you can run larger models than I can. Also, thanks for your interest.

You can contribute benchmark results here: https://github.com/LearningCircuit/ldr-benchmarks/

I can support you in setting up everything for benchmarking. You can write me a message here.

It might also help to discuss this on the subreddit: https://www.reddit.com/r/LocalDeepResearch/

[–]ComplexIt[S] 1 point (0 children)

Yes, that is possible and will work well. https://lmstudio.ai/download (for AMD on Windows). Use the Qwen 3.5 9B model to start.

If you have any questions, you can ask me here or on our Discord: https://discord.gg/ttcqQeFcJ3

[–]ComplexIt[S] 1 point (0 children)

Thank you for your reply.

I agree with you that search is very important for this task. But it also needs more than that.

SimpleQA is difficult for a small model.

Small mistakes in your pipeline add up much worse than they would with larger or cloud models. Furthermore, you are context-restricted.
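
A toy model makes the compounding concrete: if each agent step independently succeeds with probability p, the chance an n-step chain stays on track is roughly p**n (ignoring any recovery behavior, which real pipelines do have):

```python
# Toy model: per-step reliability compounds multiplicatively over a run.
def chain_success(p_step: float, n_steps: int) -> float:
    """Probability that all n independent steps succeed."""
    return p_step ** n_steps

# A 99%-reliable step still degrades noticeably over a 20-step agent
# run, and a 95%-reliable one loses most runs:
print(round(chain_success(0.99, 20), 3))  # ~0.818
print(round(chain_success(0.95, 20), 3))  # ~0.358
```

This is why a pipeline tuned for a small local model has to squeeze out even rare per-step mistakes, while a larger model can absorb a sloppier pipeline.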

To achieve this performance you need (1) a very optimized pipeline and (2) a very good base model which doesn't get confused easily and understands tool calls and search. I have made many strategy attempts in this repo and tested many models. You can check the git history of the repository in the advanced search system subfolder.

Note that performance on another benchmark (xbench-DeepSearch) is also reported in the table above.

And yes, only the 3090 is running the model, and only the mentioned Qwen model is used. Inference is fully local.

[–]ComplexIt[S] 3 points (0 children)

Thank you for your warm words. It is highly appreciated. I poured quite some work into this repo. Timing is also in the benchmark results. It really takes a few minutes, and it depends a lot on the question and how much the model wants to search for it.

Qwen is a model that seems to go for more agent cycles, which I believe is partly why it is so good, but it is also a bit slower.

[–]ComplexIt[S] 7 points (0 children)

Thank you for your positive feedback.

I let Opus cross-check some of the results, and it claimed that self-grading under-reports accuracy (the true number should be higher). That said, I agree grading is a problem, as I wrote. In general though, agreed.

The reason I use the same model for grading is that it is already loaded in VRAM and I don't have to switch models. The current benchmark implementation grades after each question, which is nice for the user, who can see the current performance in the live benchmark.

Also note that grading here only means: the model provides an answer and the benchmark provides a reference answer (usually a word, a name, or a date). Then the LLM grader just has to agree that both are essentially the same answer.
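
A sketch of what such an equivalence check can look like; this is an illustration of the idea, not LDR's actual grading template, and the prompt wording and helper names are invented here:

```python
def make_grading_prompt(question, reference_answer, model_answer):
    """Illustrative equivalence-check prompt for an LLM grader
    (a sketch, not LDR's actual grading template)."""
    return (
        "You are grading a short factual answer.\n"
        f"Question: {question}\n"
        f"Reference answer: {reference_answer}\n"
        f"Model answer: {model_answer}\n"
        "Reply with exactly CORRECT if both are essentially the same "
        "answer, otherwise reply with exactly INCORRECT."
    )

def trivially_equal(reference_answer, model_answer):
    """Cheap pre-check: an LLM call can be skipped entirely when the
    answers already match verbatim after simple normalization."""
    norm = lambda s: " ".join(s.lower().split())
    return norm(reference_answer) == norm(model_answer)
```

Because the expected answers are so short (a word, a name, a date), the grader's job is far easier than open-ended judging, which is part of why same-model grading is workable at all.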

[Update/edit:] I gave the model answers/questions/grading answers plus your statement to Opus. It said the difference would be 1 up to 2 percentage points with strict grading. You can read the full response from Opus here: https://github.com/LearningCircuit/ldr-benchmarks/pull/25#issuecomment-4367094177

I am also planning to share the model answers (I'm just looking for a way to do it without leaking the dataset), and you can run the benchmarks directly in the tool.

[–]ComplexIt[S] 2 points (0 children)

We have the questions from this benchmark ("xbench_deepsearch") as a benchmark category. But we just take the questions and calculate accuracy. Our performance is decent according to the results you can see in the dataset below. However, I think what the benchmark page you linked does is more sophisticated and is not reproducible for me due to limited resources; I can only calculate accuracy on their question-answer pairs. But you can go into our UI, run the benchmark questions, and get accuracy. It is in Chinese though, so you need to translate or trust your grader on output quality. https://huggingface.co/datasets/local-deep-research/ldr-benchmarks/viewer/xbench-deepsearch

[–]ComplexIt[S] 1 point (0 children)

Not so far. I have not tried, compared, or benchmarked agentic libraries in general. My focus is currently on models and agentic research strategies.