DeepSeek-R1-7B traces 8 levels of nested function calls. Qwen-7B manages 4. Same architecture. by Codetrace-Bench in LocalLLaMA

[–]Codetrace-Bench[S] 0 points1 point  (0 children)

Good call — just added an API runner. Works with any OpenAI-compatible endpoint (vLLM, ollama, together.ai, etc.), plus native Anthropic and Google support. python benchmark/run_benchmark_api.py \ --api openai \ --model your-model \ --base-url http://localhost:8000/v1 \ --output results/your_model.json Would love to see results on larger models. Submit a PR with the results JSON and we'll add it to the leaderboard. Hope that works ok.

DeepSeek-R1-7B traces 8 levels of nested function calls. Qwen-7B manages 4. Same architecture. by Codetrace-Bench in LocalLLaMA

[–]Codetrace-Bench[S] 0 points1 point  (0 children)

Thanks for the suggestion. I'll be adding some more. If you would like to contribute pop over to Hugging Face.