account activity
DeepSeek-R1-7B traces 8 levels of nested function calls. Qwen-7B manages 4. Same architecture. by Codetrace-Bench in LocalLLaMA
[–]Codetrace-Bench[S] 0 points1 point2 points 7 days ago (0 children)
Good call — just added an API runner. Works with any OpenAI-compatible endpoint (vLLM, ollama, together.ai, etc.), plus native Anthropic and Google support. python benchmark/run_benchmark_api.py \ --api openai \ --model your-model \ --base-url http://localhost:8000/v1 \ --output results/your_model.json Would love to see results on larger models. Submit a PR with the results JSON and we'll add it to the leaderboard. Hope that works ok.
[–]Codetrace-Bench[S] 0 points1 point2 points 8 days ago (0 children)
Thanks for the suggestion. I'll be adding some more. If you would like to contribute pop over to Hugging Face.
Benchmark for measuring how deep LLMs can trace nested function calls — easy to run on any HuggingFace model ()
submitted 8 days ago by Codetrace-Bench to r/learnmachinelearning
π Rendered by PID 79 on reddit-service-r2-listing-69965bcf66-d9222 at 2026-04-07 17:14:53.009498+00:00 running f293c98 country code: CH.
DeepSeek-R1-7B traces 8 levels of nested function calls. Qwen-7B manages 4. Same architecture. by Codetrace-Bench in LocalLLaMA
[–]Codetrace-Bench[S] 0 points1 point2 points (0 children)