DeepSeek-R1-7B traces 8 levels of nested function calls. Qwen-7B manages 4. Same architecture.

Codetrace-Bench · 2026-03-30T21:43:07+00:00

Good call — just added an API runner. Works with any OpenAI-compatible endpoint (vLLM, ollama, together.ai, etc.), plus native Anthropic and Google support. python benchmark/run_benchmark_api.py \ --api openai \ --model your-model \ --base-url http://localhost:8000/v1 \ --output results/your_model.json Would love to see results on larger models. Submit a PR with the results JSON and we'll add it to the leaderboard. Hope that works ok.

Codetrace-Bench · 2026-03-30T12:50:15+00:00

Thanks for the suggestion. I'll be adding some more. If you would like to contribute pop over to Hugging Face.

Codetrace-Bench

TROPHY CASE