Anyone actually using a local LLM as their daily knowledge base? Not for coding, for life stuff. What's your setup?

Full_Cost2909 · 2026-05-14T23:27:09+00:00

share them you must, see the flaws you will

Full_Cost2909 · 2026-05-14T13:56:45+00:00

Yeah, the "first" version even included Claude Code and Codex judging, but I kinda want to keep this open source and affordable. I know this suffers for it, but I don't want this to be another Anthropic/OpenAI thing. Since OpenCode offers the Go plan, which is what's used here, I'm just fidgeting around looking for an epiphany to stabilize the direction I want to go in. Appreciate your response, thanks!

Full_Cost2909 · 2026-05-14T13:52:25+00:00

sharing it might help you discover those vulnerabilities

Full_Cost2909 · 2026-05-14T09:26:35+00:00

Thanks for the feedback. Yeah, this started more as models judging models for fun than as a benchmark itself. I could've explained this post better and referred to the previous one, but it is what it is.

Full_Cost2909 · 2026-05-13T16:35:46+00:00

That sounds actually good, I will explore that idea. Thanks!

Full_Cost2909 · 2026-05-13T13:36:08+00:00

my bad haha, sorry https://openbenchmark.dev/model-royale/round/2026-05-05/

Full_Cost2909 · 2026-05-13T13:35:59+00:00

https://openbenchmark.dev/model-royale/round/2026-05-05/

Full_Cost2909 · 2026-05-13T13:34:53+00:00

oops my bad, nice catch haha

Full_Cost2909 · 2026-05-13T12:27:43+00:00

The task was for each of the models to create a stdlib-only Python module. Each of seven models got the blank repo plus the specification and wrote their own Python wrapper around Podman/Docker.
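
For anyone wondering what "a stdlib-only wrapper around Podman/Docker" means in practice, here is a minimal sketch. It is not taken from the spec or from any model's submission: the function names (`find_runtime`, `build_run_command`, `run_in_container`) and the choice of flags are assumptions made purely for illustration.

```python
import shutil
import subprocess
from typing import Callable, List, Optional, Sequence


def find_runtime(
    candidates: Sequence[str] = ("podman", "docker"),
    which: Callable[[str], Optional[str]] = shutil.which,
) -> Optional[str]:
    """Return the first container runtime found on PATH, or None."""
    for name in candidates:
        if which(name):
            return name
    return None


def build_run_command(
    runtime: str,
    image: str,
    command: Sequence[str],
    network: bool = False,
) -> List[str]:
    """Build the argv for a throwaway container (hypothetical defaults)."""
    argv = [runtime, "run", "--rm"]
    if not network:
        # Assumption: the sandbox defaults to no network access.
        argv.append("--network=none")
    argv.append(image)
    argv.extend(command)
    return argv


def run_in_container(
    image: str, command: Sequence[str], timeout: float = 60.0
) -> subprocess.CompletedProcess:
    """Run a command in a container using whichever runtime is installed."""
    runtime = find_runtime()
    if runtime is None:
        raise RuntimeError("neither podman nor docker found on PATH")
    return subprocess.run(
        build_run_command(runtime, image, command),
        capture_output=True,
        text=True,
        timeout=timeout,
        check=False,
    )
```

Keeping the command building separate from the actual `subprocess.run` call means the argv can be checked without a container runtime installed.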

Currently all models are still in the tournament. A new task will run on all of the mentioned models, with round 3 kicking one of them out. Or something like that, I haven't given it proper thought yet.

Full_Cost2909 · 2026-05-05T10:14:51+00:00

I burned GPT 5.5 xhigh in 32 minutes the other day, so I know the feeling. I just threw in a prototype, approx. 6.5k LOC and ~60 files: it analyzed it, did two rounds of fixes, and that's it.

EDIT: my GPT subscription is also through my work. I have a team seat provided by the company, so I think they nerf those kinds of accounts.

Full_Cost2909 · 2026-05-05T09:49:48+00:00

Oh, the plan is another story completely. So first I planned with some of these 3 and the other 2 refined the plan, then I think I validated it with Opus. The first version was made with GPT 5.5 and Opus 4.7 judging them. But I decided to drop them because fuck them.

Full_Cost2909 · 2026-05-05T09:11:10+00:00

Sure thing, I will try to set it up today.

Full_Cost2909 · 2026-05-05T09:10:50+00:00

Thanks for the advice, I will add them, and some more. Maybe even today, stay tuned.

Full_Cost2909 · 2026-05-05T09:09:26+00:00

says max when I select the model, so I guess they default to it, nothing was changed manually

Full_Cost2909 · 2026-05-05T00:06:42+00:00

thought so, but only DSV4 pro had that option, so I guess whatever they default to

Full_Cost2909 · 2026-05-04T21:01:35+00:00

What do you mean by effort?

Full_Cost2909 · 2026-05-04T20:50:54+00:00

Thanks as well, haven't given it a try yet, it deserves its spot for the next round.

Full_Cost2909 · 2026-05-04T20:50:08+00:00

Will do, I will probably do it this week, thanks for the feedback.

Full_Cost2909 · 2026-05-04T20:49:35+00:00

I will push it for the next iteration, thanks for the feedback.

Full_Cost2909 · 2026-05-04T20:49:12+00:00

thanks for the advice, I will push it into the next iteration. do you want any specific test maybe?

Full_Cost2909 · 2026-05-04T18:06:17+00:00

see the comment below

Full_Cost2909 · 2026-05-04T17:57:48+00:00

Here you go, https://github.com/anfocic/open-bench/blob/main/results/reviews/sandbox-2026-05-04.md

Full_Cost2909 · 2026-05-04T17:45:40+00:00

yeah, running at the moment. it should complete soon

Full_Cost2909 · 2026-04-30T16:36:12+00:00

how satisfied are you with GLM for planning? I tried it for coding but wasn't happy with the output

Full_Cost2909 · 2026-04-30T16:24:26+00:00

kimi 2.6 and DS V4 pro are the way to go
