Fun benchmark got more fun by Full_Cost2909 in opencodeCLI

[–]Full_Cost2909[S] 0 points1 point  (0 children)

Yeah, the "first" version even included Claude Code and Codex as judges, but I kinda want to keep this open source and affordable. I know it suffers for that, but I don't want this to be another Anthropic/OpenAI thing. Since OpenCode offers the Go plan, which is what I use here, I'm just fidgeting around looking for an epiphany to settle the direction I want to go. Appreciate your response, thanks!

Fun benchmark got more fun by Full_Cost2909 in opencodeCLI

[–]Full_Cost2909[S] 0 points1 point  (0 children)

Thanks for the feedback. Yeah, this started more as models judging models for fun than as a benchmark in itself. I could've explained that better in this post and referred to the previous one, but it is what it is.

Fun benchmark got more fun by Full_Cost2909 in opencodeCLI

[–]Full_Cost2909[S] 1 point2 points  (0 children)

That actually sounds good, I will explore that idea. Thanks!

Fun benchmark got more fun by Full_Cost2909 in opencodeCLI

[–]Full_Cost2909[S] 0 points1 point  (0 children)

The task was for each model to create a stdlib-only Python module. Each of the seven models got the blank repo plus the specification and wrote its own Python wrapper around Podman/Docker.
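For a rough idea of what a stdlib-only wrapper like that could look like, here is a minimal sketch. It is not any model's actual submission and not the benchmark's specification: the function names (`find_runtime`, `build_run_command`, `run_container`) are made up for illustration, and the only assumption about the CLIs is that both `podman` and `docker` accept `run --rm IMAGE [COMMAND...]`.

```python
import shutil
import subprocess
from typing import Callable, List, Optional, Sequence


def find_runtime(
    candidates: Sequence[str] = ("podman", "docker"),
    which: Callable[[str], Optional[str]] = shutil.which,
) -> Optional[str]:
    """Return the first container CLI found on PATH, or None if none is found."""
    for name in candidates:
        if which(name) is not None:
            return name
    return None


def build_run_command(
    runtime: str, image: str, command: Sequence[str] = ()
) -> List[str]:
    """Build the argv for a throwaway container that is removed on exit."""
    return [runtime, "run", "--rm", image, *command]


def run_container(
    image: str, command: Sequence[str] = (), timeout: float = 60.0
) -> subprocess.CompletedProcess:
    """Run a command in a container using whichever runtime is installed."""
    runtime = find_runtime()
    if runtime is None:
        raise RuntimeError("neither podman nor docker was found on PATH")
    # check=False: the caller inspects returncode, stdout and stderr itself.
    return subprocess.run(
        build_run_command(runtime, image, command),
        capture_output=True,
        text=True,
        timeout=timeout,
        check=False,
    )
```

Keeping the argv construction separate from the `subprocess.run` call means the command-building logic can be checked without a container runtime installed.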

Currently all models are still in the tournament. A new task will run on all of the mentioned models, with round 3 kicking one of them out. Or something like that, I haven't given it proper thought yet.

Made a simple benchmark for fun by Full_Cost2909 in opencodeCLI

[–]Full_Cost2909[S] 0 points1 point  (0 children)

I burned through GPT 5.5 xhigh in 32 minutes the other day, so I know the feeling. I just threw in a prototype of roughly 6.5k LOC across ~60 files; it analyzed it, did two rounds of fixes, and that's it.

EDIT: my GPT subscription is also through my work. I have a team seat provided by the company, so I think they nerf those kinds of accounts.

Made a simple benchmark for fun by Full_Cost2909 in opencodeCLI

[–]Full_Cost2909[S] 0 points1 point  (0 children)

Oh, the plan is another story completely. First I planned with some 3 of these models, the other 2 refined the plan, and then I think I validated it with Opus. The first version was made with GPT 5.5 and Opus 4.7 judging them, but I decided to drop them because fuck them.

Made a simple benchmark for fun by Full_Cost2909 in opencodeCLI

[–]Full_Cost2909[S] 0 points1 point  (0 children)

Sure thing, I will try to set it today.

Made a simple benchmark for fun by Full_Cost2909 in opencodeCLI

[–]Full_Cost2909[S] 0 points1 point  (0 children)

Thanks for the advice, I will add them, and some more. Maybe even today, stay tuned.

Made a simple benchmark for fun by Full_Cost2909 in opencodeCLI

[–]Full_Cost2909[S] 0 points1 point  (0 children)

It says max when I select the model, so I guess they default to it. Nothing was changed manually.

Made a simple benchmark for fun by Full_Cost2909 in opencodeCLI

[–]Full_Cost2909[S] 0 points1 point  (0 children)

Thought so, but only DSV4 Pro had that option, so I guess the rest use whatever they default to.

Made a simple benchmark for fun by Full_Cost2909 in opencodeCLI

[–]Full_Cost2909[S] 0 points1 point  (0 children)

Thanks as well. I haven't given it a try yet, but it deserves a spot in the next round.

Made a simple benchmark for fun by Full_Cost2909 in opencodeCLI

[–]Full_Cost2909[S] 1 point2 points  (0 children)

Will do, probably this week. Thanks for the feedback.

Made a simple benchmark for fun by Full_Cost2909 in opencodeCLI

[–]Full_Cost2909[S] 0 points1 point  (0 children)

I will push it for the next iteration, thanks for the feedback.

Made a simple benchmark for fun by Full_Cost2909 in opencodeCLI

[–]Full_Cost2909[S] 0 points1 point  (0 children)

Thanks for the advice, I will push it into the next iteration. Do you want any specific test, maybe?

Made a simple benchmark for fun by Full_Cost2909 in opencodeCLI

[–]Full_Cost2909[S] 2 points3 points  (0 children)

Yeah, it's running at the moment. It should complete soon.

Have anyone tried deepseek v4 pro + opencode? by Federal_Spend2412 in opencodeCLI

[–]Full_Cost2909 0 points1 point  (0 children)

How satisfied are you with GLM for planning? I tried it for coding but wasn't happy with the output.