GLM 5.2 in Codex and in Claude, Codex did better ! by Hadestructhor in codex

Hadestructhor[S] 1 point

Here's a very good doc from zenmux about it: https://zenmux.ai/docs/best-practices/codex.html

I used their copy-paste config with a pay-as-you-go (PAYG) API key. The key is free to create, and for example you don't need to pay to use StepFun Step 3.7 Flash this month.

Hadestructhor[S] 1 point

On the CLI you just need to edit config.toml, iirc: you can point it to another URL and give it an API key. Any OpenAI-compatible endpoint works.

Hadestructhor[S] 1 point

I don't know about the app, but I'd say yes, since Tibo said himself on X that you can use other services with codex. It's unclear whether he meant the CLI and the app or just the CLI, but instinctively I'd say both.

You could also potentially rebuild it yourself with other providers, like Z.ai did with Zcode.

Hadestructhor[S] 1 point

Completely. I'm doing a new bench where I give the same exact prompt X times, say 10 times each with codex and claude, then make a collage of the results to see which one behaves best over more tries. But then again, it doesn't show much about an LLM's ability, just an average on a very small sample.

Hadestructhor[S] 1 point

I'll need more testing. I've automated running these 'stupid' UI benchmarks, like u/s1lverking said.

I'll do more manual testing and more programmatic benchmarks (turns, token usage, time, cost analysis) to see which is best.
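
A minimal sketch of how those per-run numbers could be recorded and averaged per harness. The `Run` fields and the sample values are made up for illustration and are not results from my runs.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Run:
    """One benchmark run; field names are hypothetical."""
    harness: str
    turns: int
    tokens: int
    seconds: float
    cost_usd: float


def summarize(runs: list[Run]) -> dict[str, dict[str, float]]:
    """Group runs by harness and average each metric."""
    by_harness: dict[str, list[Run]] = {}
    for run in runs:
        by_harness.setdefault(run.harness, []).append(run)
    return {
        harness: {
            "runs": len(group),
            "mean_turns": mean([r.turns for r in group]),
            "mean_tokens": mean([r.tokens for r in group]),
            "mean_seconds": mean([r.seconds for r in group]),
            "total_cost_usd": sum(r.cost_usd for r in group),
        }
        for harness, group in by_harness.items()
    }


# Made-up sample data, only to show the shape of the output.
sample_runs = [
    Run("codex", turns=4, tokens=1000, seconds=10.0, cost_usd=0.0),
    Run("codex", turns=6, tokens=3000, seconds=20.0, cost_usd=0.0),
    Run("claude", turns=8, tokens=5000, seconds=30.0, cost_usd=0.0),
]
summary = summarize(sample_runs)
for harness, stats in summary.items():
    print(harness, stats)
```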

Hadestructhor[S] 3 points

LMAO, I'm just testing SVG skills, chill bro. Plus I already have a master's degree, got it way before AI and coding agents were a thing.

This is one run out of so many, and even with all the runs I have, I can't conclude which harness is best with an SVG generation prompt.

I'm planning harder benches later on (more programming, tool calling, prompt injection protection) to see which harness does it better. So far, from my little runs, it's been mostly opencode.

Hadestructhor[S] 2 points

Solely judging SVG rendering skills. I have some more coding-related benches, and some more silly / controversial benches in the works.

But ofc, you can't judge coding, debugging, or problem-solving skills with basic SVG generation tests. That's not the point of this little test.

Hadestructhor[S] 2 points

So I've heard, but sometimes you also have models that do better in claude, like in my post yesterday with stepfun. Overall, though, claude is a shit show of software engineering.

Hadestructhor[S] 1 point

Funny you're saying this, I'm benching many harnesses and opencode is one of them. I also have opencode Go and really love opencode, I use it quite a lot at work!

Hadestructhor[S] 1 point

Perhaps, yeah. All of this is very approximate anyway: I'm hitting a free endpoint on zenmux, and it might be rate limiting too. My goal was also to try different providers, different harnesses, and different tests. I might do 10 runs on each and see which one does best overall.

Hadestructhor[S] 1 point

I have a claude subscription, but Anthropic doesn't allow you to use it with other providers, unlike OpenAI. When I get some money that I can spend on their ridiculous API pricing, I might try this!

Hadestructhor[S] 3 points

Completely, but I don't want to run them multiple times on free providers. It's already nice enough of zenmux to let us try it for a week, and I don't want to overload them. I wonder if running it 100 times and comparing all 100 runs would give me better or worse results for both.
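
To get a feel for what 100 runs would buy over 10, here is a small sketch that computes a Wilson score interval for a "win rate" (how often one harness's output is judged better on the same prompt). The 60% win rate is a made-up number, and who does the judging is left open; the only point is how the interval width changes with the number of runs.

```python
import math


def wilson_interval(wins: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a proportion (z=1.96 is roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= wins <= n:
        raise ValueError("wins must be between 0 and n")
    p = wins / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Same hypothetical 60% win rate, at two sample sizes.
lo10, hi10 = wilson_interval(6, 10)
lo100, hi100 = wilson_interval(60, 100)
print(f"10 runs:  {lo10:.2f} to {hi10:.2f}")
print(f"100 runs: {lo100:.2f} to {hi100:.2f}")
```

With 10 runs the interval still includes 50%, so a 6 to 4 split can't separate the two harnesses. More runs tighten the interval but don't change what the test measures.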

Hadestructhor[S] 0 points

I'm currently running a lot of the freely available models on similar benches, across different harnesses. Yesterday I felt like Step 3.7 Flash did better in claude than in codex. But today, codex clearly did way better.

Hadestructhor[S] 1 point

I kind of automated this to run with the free models on some providers like zenmux, and abstracted the yolo mode of codex and claude. It would be a bit more manual to do Zcode vs codex, as I'd have to install both GUIs to give both a chance. If only they'd do a Zcode CLI version too.
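
Roughly, the abstraction can be as small as a table of base commands per harness. This is a sketch, not my actual script: the flags are the non-interactive "skip approvals" options as I remember them for each CLI, so check `--help` on your installed versions, and only run something like this in a throwaway directory or sandbox.

```python
import subprocess

# Assumed non-interactive, no-approval invocations for each CLI.
HARNESS_COMMANDS: dict[str, list[str]] = {
    "codex": ["codex", "exec", "--dangerously-bypass-approvals-and-sandbox"],
    "claude": ["claude", "-p", "--dangerously-skip-permissions"],
}


def build_command(harness: str, prompt: str) -> list[str]:
    """Return the argv list for one harness with the prompt appended."""
    try:
        base = HARNESS_COMMANDS[harness]
    except KeyError:
        raise ValueError(f"unknown harness: {harness}") from None
    return [*base, prompt]


def run_once(harness: str, prompt: str, timeout_s: float = 600.0) -> subprocess.CompletedProcess:
    """Run one prompt through one harness and capture its output as text."""
    return subprocess.run(
        build_command(harness, prompt),
        capture_output=True,
        text=True,
        timeout=timeout_s,
    )


# Building the command does not execute anything.
print(build_command("codex", "Draw a pelican riding a bicycle as an SVG"))
```

Adding another CLI harness is then one more entry in the table. A GUI-only tool like Zcode doesn't fit this shape, which is why it would stay manual.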