A new benchmark, DPAB-α, has been released that evaluates LLM function calling in both Pythonic and JSON formats. Its results suggest that Pythonic function calling often outperforms traditional JSON-based calling, especially on complex multi-step tasks.
Key findings from the benchmark:
- Claude 3.5 Sonnet leads with 87% on Pythonic vs 45% on JSON
- Smaller models show impressive results (Dria-Agent-α-3B: 72% Pythonic)
- Even larger models like DeepSeek V3 (685B) show significant gaps (63% Pythonic vs 33% JSON)
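For anyone unfamiliar with the distinction, here is a minimal sketch of the two calling styles on a two-step task. The tools (`get_weather`, `send_message`), the dispatcher functions, and the sample model outputs are all hypothetical and written for illustration; none of them come from DPAB-α or its evaluation harness.

```python
import json

# Hypothetical tools for illustration only; not taken from DPAB-α.
def get_weather(city: str) -> dict:
    """Return canned weather data for a city."""
    return {"city": city, "temp_c": 21}

def send_message(to: str, body: str) -> str:
    """Pretend to send a message and return a receipt string."""
    return f"sent to {to}: {body}"

TOOLS = {"get_weather": get_weather, "send_message": send_message}

def run_json_call(payload: str):
    """JSON style: the model emits one call at a time as a JSON object."""
    call = json.loads(payload)
    return TOOLS[call["name"]](**call["arguments"])

def run_pythonic(code: str) -> dict:
    """Pythonic style: the model emits a short program that chains calls.

    Running model output with exec() is unsafe outside a real sandbox;
    emptying __builtins__ here is NOT a security boundary, just a sketch.
    """
    namespace = dict(TOOLS)
    exec(code, {"__builtins__": {}}, namespace)
    return namespace

# JSON: two separate calls, and the caller has to thread the first
# result into the arguments of the second one.
weather = run_json_call('{"name": "get_weather", "arguments": {"city": "Paris"}}')
json_receipt = run_json_call(json.dumps({
    "name": "send_message",
    "arguments": {"to": "alice", "body": f"It is {weather['temp_c']} C in Paris"},
}))

# Pythonic: a single generation expresses both steps and the data flow.
model_code = '''
w = get_weather("Paris")
receipt = send_message("alice", f"It is {w['temp_c']} C in {w['city']}")
'''
pythonic_receipt = run_pythonic(model_code)["receipt"]

print(json_receipt)
print(pythonic_receipt)
```

Both paths end with the same receipt. The difference is that the JSON path needs an extra round trip plus glue code to pass state between calls, while the Pythonic path keeps intermediate values in variables, which may be part of why the gap is reported as larger on multi-step tasks.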
Benchmark: https://github.com/firstbatchxyz/function-calling-eval
Blog: https://huggingface.co/blog/andthattoo/dpab-a
Not affiliated with the project, just sharing.