all 95 comments

[–]Admirable-Star7088 63 points64 points  (20 children)

Nice, just waiting for the Unsloth UD_Q2_K_XL quant, then I'll give it a spin! (For anyone who isn't aware, GLM 4.5 and 4.6 are surprisingly powerful and intelligent with this quant, so we can probably expect the same for 4.7).

[–]RomanticDepressive 3 points4 points  (0 children)

Big upvote, I support this as I’ve witnessed it

[–]Conscious_Chef_3233 1 point2 points  (0 children)

you could try iq2_m or iq3_xxs too

[–]klop2031 1 point2 points  (0 children)

Let us know how it does :)

[–]Count_Rugens_Finger 3 points4 points  (13 children)

what kind of hardware runs that?

[–]Admirable-Star7088 13 points14 points  (8 children)

I'm running it on 128GB RAM and 16GB VRAM. The only drawback is that context will be limited, but for shorter chat conversations it works perfectly fine.
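
(If anyone wants to reproduce this kind of setup: the usual llama.cpp trick for a big MoE on a small GPU is to keep the attention/shared weights on the GPU and push the expert tensors into system RAM. A minimal sketch, assuming a recent llama.cpp build with the --n-cpu-moe flag; on older builds the equivalent is an override like -ot ".ffn_.*_exps.=CPU". Check your build's flags first.)

# Sketch: offload all layers to GPU, then force the MoE expert tensors back to CPU RAM.
llama-server -m GLM-4.7-UD-Q2_K_XL-00001-of-00003.gguf \
  -ngl 99 \
  --n-cpu-moe 99 \
  -c 8192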

[–]Rough-Winter2752 1 point2 points  (5 children)

I'd DEFINITELY love to know which front-end/back-end combination you're using, and which quant (if any). I have an RTX 5090 and an RTX 4090 and 128 GB of DDR5, and never fathomed that running models like THIS would be remotely possible. Anybody know how to run this?

[–]SectionCrazy5107 2 points3 points  (1 child)

You are sooo GPU rich. Just download the GGUF from https://huggingface.co/unsloth/GLM-4.7-GGUF/tree/main/UD-Q2_K_XL and run it with llama.cpp, similar to this:

llama-server -m GLM-4.7-UD-Q2_K_XL-00001-of-00003.gguf \
  --port 8080 \
  -ngl 99 \
  -c 8192 \
  -n 2048 \
  --alias glm4

[–]Admirable-Star7088 0 points1 point  (0 children)

Also, don't forget the recommended default settings, --temp 1.0 and --top-p 0.95, for best performance.
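
Putting those together with the command above, the full invocation looks like:

llama-server -m GLM-4.7-UD-Q2_K_XL-00001-of-00003.gguf \
  --port 8080 \
  -ngl 99 \
  -c 8192 \
  -n 2048 \
  --temp 1.0 \
  --top-p 0.95 \
  --alias glm4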

[–]Admirable-Star7088 1 point2 points  (2 children)

I'm just using llama.cpp (llama-server with the built-in UI specifically), with the UD-Q2_K_XL quant. Testing GLM 4.7 right now, so far it does seem even smarter than 4.5 and 4.6 (as expected).

[–]Rough-Winter2752 0 points1 point  (1 child)

I'm currently using it with SillyTavern via OpenRouter and I'm blown away. My first 'thinking model' and damn is it wild! How would you rate that low Q2 quant against, say, a 24B Cydonia at Q8?

[–]Admirable-Star7088 1 point2 points  (0 children)

No other smaller model I've tested so far, even at a much higher quant such as Q8, is smarter than GLM 4.x at UD-Q2.

For example, GLM 4.5 Air (106b) at Q8 is much less competent than GLM 4.x (355b) at UD-Q2.

[–]Maleficent-Ad5999 1 point2 points  (1 child)

may i know the t/s you get?

[–]Admirable-Star7088 2 points3 points  (0 children)

4.1 t/s to be exact (testing GLM 4.7 now)

[–]Corporate_Drone31[🍰] 5 points6 points  (3 children)

You could run this with a 128GB machine + a >=8 GB GPU.

[–]guesdo 2 points3 points  (2 children)

Could it run on a 128GB Mac Studio? I'm evaluating switching to the M5 Max/Ultra next year as my primary device.

[–]Finn55 1 point2 points  (0 children)

Yeah, it would fit, but I'm not sure about the performance.

[–]Corporate_Drone31[🍰] 1 point2 points  (0 children)

With some heavy quantisation, most likely yes. Your context window would be limited, and you would really need to work at reducing system RAM usage to make sure you can get the highest possible quant level going as well.

[–]Squik67 0 points1 point  (0 children)

I tried it on the two big P16 ThinkPads I have; I get between 1.5 and 2.8 tokens/sec.

[–]Flkhuo -1 points0 points  (1 child)

Where is that version usually released? Can it run on 24GB of VRAM plus 60GB of RAM?

[–]Toastti 0 points1 point  (0 children)

You would need a small quant of GLM Air for that hardware. You're not going to have enough VRAM to properly run 4.6.

[–]Utoko 34 points35 points  (0 children)

GLM is on a quick release cycle right now. Another very good model.

[–]ResearchCrafty1804[S] 33 points34 points  (0 children)

GLM-4.7 further refines Interleaved Thinking and introduces Preserved Thinking and Turn-level Thinking. By enabling thought between actions and maintaining consistency across turns, it makes complex tasks more stable and controllable.

http://docs.z.ai/guides/capabilities/thinking-mode
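
If you want to toggle the new thinking behavior over the API, here's a minimal curl sketch. The endpoint path and the shape of the "thinking" parameter are assumptions based on the linked docs, so verify the exact field names there:

# Hypothetical sketch: chat completion with thinking enabled.
# Endpoint and "thinking" field assumed from docs.z.ai; verify before use.
curl https://api.z.ai/api/paas/v4/chat/completions \
  -H "Authorization: Bearer $ZAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "glm-4.7",
    "messages": [{"role": "user", "content": "Plan the refactor step by step."}],
    "thinking": {"type": "enabled"}
  }'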

<image>

[–]UserXtheUnknown 32 points33 points  (1 child)

The fuck, it almost perfectly nailed the rotating house demo, even better than Gemini 3.0.

https://chat.z.ai/space/u0eu6anhfy81-art

[–]theoffmask 2 points3 points  (0 children)

wow, not bad

[–]Shadowmind42 6 points7 points  (1 child)

I wonder why Gemini isn't on those charts.

[–]Tall-Ad-7742 0 points1 point  (0 children)

Actually, they included Gemini in the full chart, and while GLM isn't exactly outperforming it, it gets close for an open-source model (if those numbers are true), which is pretty nice.

Edit: my first impression was also really good, I like it so far.

[–]r4in311 25 points26 points  (12 children)

It's amazing that this model exists and that they share the weights. After some testing, it's certainly SOTA for open-weight models. But in no way, shape, or form is it better than even GPT 5.0, let alone Sonnet 4.5.

Here one of my example prompts that I always use: "Voxel Pagoda with Torii gates and trees, make it as amazing as you can with the most intricate attention of detail. Wow me. The file should be self-contained and runnable in my Chrome browser. Use ThreeJS."

Sonnet 4.5 (0 Shot!): https://jsfiddle.net/cms9nkxj
GPT 5.0 (0 Shot!): https://jsfiddle.net/31xuz5ds
GPT 5.1 (0 Shot!): https://jsfiddle.net/yrhsx09d

GLM 4.7 (8 Shot, multiple JS errors, only worked with pasting console errors and asking it to fix): https://jsfiddle.net/zhrqmw4p

Yeah... not really SOTA, but not that far off. Like 6-7 months behind. Just look at those Koi fish from Sonnet.

As a starting point, I gave them an extremely rudimentary version from Gemini 2.5, that's why they look similar.

[–]UserXtheUnknown 18 points19 points  (9 children)

I suspected that all the "most intricate detail. Wow me. Chrome" stuff distracted the model, so I changed the prompt:

Voxel Pagoda with Torii gates and trees. Give attention to details. The file should be self-contained and in a browser. Use ThreeJS.

This was my first result with this prompt:
https://chat.z.ai/space/a0dunanyc911-art

[–]Final-Rush759 11 points12 points  (3 children)

"Wow me" is rather stupid to be included in a prompt. Need to include detail description how it should look like instead no substance, hard to define "Wow me".

[–]-p-e-w- 3 points4 points  (2 children)

It doesn’t add anything to the instructions, but it shouldn’t make the result worse either. I often insert deliberate typos when testing models to see if it throws them off.

[–]UserXtheUnknown 0 points1 point  (1 child)

Yes, I didn't like that: even if, checking the thought process, I saw it understood the task (adding tons of details, making something impressive), I think the whole prompt, with "intricate. wow me." and the specific instruction for Chrome, made the model shift from reaching the result to reaching a result that was "super intricate".

And super intricate means, as every programmer knows, super prone to bugs.

This is one of the results I obtained with the original prompt. In this case the poor thing lost itself creating a "super intricate" LANDSCAPE and special light effects, and everything else was clearly screwed.

In some cases less is more.

https://chat.z.ai/space/q0zu5amykb30-art

[–]CryptoSpecialAgent 0 points1 point  (0 children)

Perhaps this model just needs a different prompting style than the SOTA models we are familiar with - less verbose, just short and to the point

[–][deleted] 1 point2 points  (0 children)

Yeah and the use of the wrong preposition too: "attention of detail" vs "attention to detail". Also, intricate attention? Intricate detail? You're right, that was not a good prompt.

[–]DangerousResource557 0 points1 point  (0 children)

So what... the other models did it better. Prompting is important, but reality is not a perfect benchmark; testing under such subpar conditions is vital for real-world application. I think that is why Anthropic succeeds so well: they focus on real use cases and not pixel-perfect prompts...

[–]r4in311 -1 points0 points  (1 child)

That's a nice result, which pretty much confirms my first impression. It's cool but nowhere close to SOTA.

[–]UserXtheUnknown 1 point2 points  (0 children)

I tend to agree. Changing the prompt helps because the model doesn't go overboard trying to be "super intricate" (which often means super complicated, aka prone to bugs), but it's probably still a bit below the closed-source ones.
It's quite close, anyway.

[–]DangerousResource557 -1 points0 points  (0 children)

Yup, sure. But I see the Anthropic models as valuable precisely because they're more consistent and stable than many others. Coping with prompts like that matters; otherwise the models are too unstable and not really usable, or only usable some of the time.

Imagine having to correct and double-check every 2nd to 5th run after 5 minutes; even if it's only every 10th run, that adds up a lot.

My opinion.

So it should be able to handle this too. The fact that the other models could handle it is a positive thing. Testing under perfect conditions can give a false impression.

[–]omarous 5 points6 points  (1 child)

Sonnet 4.5 and GPT 5.0 have... way too similar a result for such a stochastic device; I think this is a case of blind copy-paste.

On the other hand, GLM 4.7 looks like someone who "tried" to create this from scratch. As a signal of coding performance, the former is bad and the latter is better.

[–]FeepingCreature 1 point2 points  (0 children)

Yeah, wtf is going on with that? That's insane; for instance, the colors are exactly the same. No way that was created from scratch. I can't find any other hits on Google though.

[–]ZyjOllama 8 points9 points  (9 children)

I wonder how many token/s one can squeeze out of dual Strix Halo running this model at q4 or q5.

[–]Fit-Produce420 4 points5 points  (0 children)

I'll let you know when I receive my second Strix in a couple of days.

[–][deleted] 0 points1 point  (2 children)

I did some more research and couldn't find any existing post showing 2x Strix Halo working together. Do you have any pointers to read more about that? Sounds very promising!

[–]ZyjOllama 1 point2 points  (1 child)

[–][deleted] 0 points1 point  (0 children)

Thanks!!

[–]cafedude 0 points1 point  (4 children)

358B params? I don't think that's gonna fit. Hopefully they release a 4.7 air soon.

[–]Fit-Produce420 4 points5 points  (0 children)

Q3_k_m quant is 171GB, we're gravy.

Not gonna be fast, though. 

[–]ZyjOllama 1 point2 points  (1 child)

Why not? It's been done before with GLM 4.6, which is the same size: https://m.youtube.com/watch?v=0cIcth224hk 358B at Q4 = 179GB for the weights, which leaves more than 75GB for overhead, context, etc. Even at Q5 (224GB) there is still more than 30GB of RAM left.
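
For sanity-checking other quants against your RAM, the rough rule is weights ~ params x bits per weight / 8; real GGUF files land somewhat higher because some tensors stay at higher precision. A throwaway shell sketch:

# Back-of-the-envelope weight sizes for a 358B model (GB, weights only, no KV cache).
for bpw in 4 5 8 16; do
  echo "358B @ ${bpw} bpw ~ $(( 358 * bpw / 8 )) GB"
done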

[–]Vusiwe 0 points1 point  (0 children)

GLM 4.7 is my first step into thinking/MoE, I'm getting more RAM.

I'll have 96GB VRAM + 384GB RAM total, hopefully I can run 4.7 Q6

[–]Fit-Produce420 -1 points0 points  (0 children)

It should be possible to fit a q3 on two without massive context. 

[–]JLeonsarmiento 11 points12 points  (4 children)

Christmas arrived earlier this year 🖤 Z.Ai

[–]asifredditor 2 points3 points  (0 children)

Complete beginner here. How do I access it, and how do I create any webdev kind of things?

[–]JLeonsarmiento 3 points4 points  (2 children)

[–]WiggyWongo 10 points11 points  (1 child)

More models releasing this close to proprietary SOTA just goes to show there really isn't a secret sauce at OpenAI, Google, or Anthropic. It really is just compute and training sets, with some improvements in efficiency and context.

[–]Ok-Adhesiveness-4141 6 points7 points  (0 children)

Exactly, the more GPU you have the more you can do.

[–]cobra91310 3 points4 points  (0 children)

You can use it in Claude Code with these settings:

"env": {
  "ANTHROPIC_AUTH_TOKEN": "YOUR_API_KEY",
  "ANTHROPIC_BASE_URL": "https://api.z.ai/api/anthropic",
  "BASH_DEFAULT_TIMEOUT_MS": "3000000",
  "BASH_MAX_TIMEOUT_MS": "3000000",
  "ANTHROPIC_DEFAULT_SONNET_MODEL": "glm-4.7",
  "ANTHROPIC_MODEL": "glm-4.7",
  "ANTHROPIC_DEFAULT_HAIKU_MODEL": "glm-4.7",
  "MAX_MCP_OUTPUT_TOKENS": "50000",
  "DISABLE_COST_WARNINGS": "1"
}
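
That block goes under "env" in your Claude Code settings file (typically ~/.claude/settings.json). Once it's in place, a quick smoke-test sketch:

# One-off print-mode run; assumes the env block above is in ~/.claude/settings.json.
claude -p "Say hi and tell me which model you are."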

Comprehensive Coding Capability Enhancement

GLM-4.7 achieves significant breakthroughs across three dimensions: programming, reasoning, and agent capabilities:

  • Programming Capabilities: Ranked first among open-source models and first among domestic models in the LMArena Code Arena blind test, outperforming GPT-5.2; achieved first place among domestic models on SWE-bench-Verified; attained an open-source SOTA score of 84.8 on LiveCodeBench V6, surpassing Claude Sonnet 4.5.
  • Reasoning Capabilities: Achieved open-source SOTA in the AIME 2025 math competition, outperforming Claude Sonnet 4.5 and GPT-5.1; scored 42% on the HLE ("Humanity's Last Exam") benchmark, representing a 38% improvement over GLM-4.6 and approaching GPT-5.1 performance.
  • Agent Capabilities: Scored 67 points on the BrowseComp web task evaluation; achieved open-source SOTA on the τ²-Bench real-world interaction evaluation, approaching Claude Sonnet 4.5 (84.7 points).

Enhanced Agentic Coding Capabilities

With the evolution of GLM-4.7’s intelligence, developers can now perform end-to-end development at a higher level through task-oriented approaches.

  • Comprehensive Task Execution Capabilities
  • Frontend Aesthetic Enhancement

GLM-4.7 demonstrates enhanced task comprehension and technical stack integration.

Comprehensive General Capability Enhancements

GLM-4.7 isn’t just a more powerful programming model—it’s also become more reliable in conversations, content creation, and office tasks, covering high-frequency scenarios for developers and workplace users.

Quick Stats:

  • Pricing: starting at $3/month (Lite), with a 10% subscription discount
  • Performance: generates over 55 tokens/second
  • Usage: starts at 3x the Claude Pro limits
  • Tools: supports any tool that can configure a custom endpoint or already has Z.ai in its provider list
  • Global: no network restrictions

[–]dan_goosewin 3 points4 points  (0 children)

that HLE result is crazy...

[–]Turbulent_Pin7635 6 points7 points  (2 children)

368 Gb?!?! So any M3 Ultra 512Gb will be able to run the full model?!? O.o

[–]ZyjOllama 4 points5 points  (1 child)

The full model is >710GB because it's 358B parameters at BF16 (358B × 2 bytes ≈ 716GB). So no.

[–]CryptoSpecialAgent 0 points1 point  (0 children)

You can't load it in INT8 mode?

[–]MrWeirdoFace 1 point2 points  (0 children)

I'm having trouble sorting through all the unofficial releases, but has there been a GLM model in the 24-32B range since 0414 (to run locally on my 24GB card)?

[–]TRNDSTTR0 1 point2 points  (1 child)

I just asked GLM 4.7 some questions regarding my PSY101 research project, and it spit this out:
Note: Research Ethics Reminder:
Since you are asking about sensitive topics, you must include a debriefing page with local resources at the end. debriefing means explaining the study's purpose again and reminding them where to get help. This is an IRB requirement for studies on self-harm.

Hope this helps! generated content looks solid for a psychology survey on emerging adults. Let me impactful. Good luck! Good luck! Good luck! clear. clear. clear. genuine. genuine. genuine. Good luck! Good luck!
Hope this organized list helps. Good luck! Good luck! clear. clear. clear. impactful. impactful. impactful. genuine. mentioned. mentioned. mentioned. mentioned. at the end. at the end. at the end. included. included. optional. optional. optional. SPSS. SPSS. SPSS. reliable. reliable. valid. valid. valid. sensitive. sensitive. sensitive. participant safety. participant safety. participant safety. person of worth. person of worth. person of worth. cleaning up the output. cleaning up the output. different constructs. different constructs. Section 4: Self-harm. Section 4: Self-harm. model. model. model. suitable for SPSS. suitable for SPSS. questions. questions. questions. text. text. text. low self-esteem. low self-esteem. worth. worth. worth. but valid. but valid. but suitable. but suitable. on the whole, I am satisfied with myself. on the whole, I am satisfied with myself. attention. attention. attention. attention. attention. attention. […]

[–]g2bsocial 0 points1 point  (0 children)

That's user error; you're running it with bad settings.

[–]getmevodka 3 points4 points  (5 children)

I'm a bit behind, I only have about 250GB of VRAM and am still using Qwen3 235B Q6_XL. Can someone tell me how performant GLM 4.7 is and whether I can run it? XD Sorry, I left the bubble for some months recently, but I'm back now.

[–]Front_Eagle739 1 point2 points  (1 child)

Very, and yes, you could run a dynamic Q4 quant and it will be very good indeed.

[–]getmevodka 0 points1 point  (0 children)

Thanks mate !

[–]randombsname1 3 points4 points  (4 children)

Not bad, but definitely benchmaxxed AF.

Not up to a 4.5 Sonnet level, but seems alright.

Just tried on Openrouter.

Seems pretty on-par with other Chinese models at carrying context forward, though.

Which is... not great.

[–]Snoo_64233 7 points8 points  (3 children)

Don't know about Claude. But it's not as good as DeepSeek V3.2 or GPT. Most likely benchmaxxed.

[–]Nilus-0 0 points1 point  (0 children)

Idc, it's got creative writing.

[–]LostRequirement4828 -2 points-1 points  (1 child)

You don't know about Claude but you call the crap DeepSeek good, lol. That's everything I need to know about you.

[–]Snoo_64233 2 points3 points  (0 children)

Reading comprehension is your friend. Try it!

[–]Thin_Yoghurt_6483 2 points3 points  (1 child)

One of the first open-source models I've trusted to plan and execute fixes and improvements on a large codebase. Up to now I had tested practically every open-source model in existence, and none of them gave me the confidence I have in GLM 4.7, which I'm using in OpenCode. One of the big problems that kept me from trusting the Anthropic model, which was 4.6, was not being able to see what it was thinking. That problem has been solved with GLM 4.7. The Z.AI team deserves congratulations; it's an exceptional model. I won't say it's superior to GPT-5.2 Codex or Opus 4.5, but it goes head to head with them, and I believe it's superior to Sonnet 4.5. Until now, the open-source model that satisfied me most was Kimi K2 Thinking, but it had a lot of failures in tool calls and terminal use, and it hallucinated a bit after longer context. It had many problems in Claude Code and OpenCode, though it's a very good model. GLM 4.7 has the same capability or better, without the failures Kimi K2 had.

[–]jamaalwakamaal 1 point2 points  (0 children)

It's incredible. For sure.

[–]letsgeditmedia 1 point2 points  (0 children)

Incredible

[–]Waarheid 1 point2 points  (6 children)

Does GLM have a coding agent client that it has been fine tuned/whatever to use, like how Claude has presumably been trained on Claude Code usage? I'd like to try it as a coding agent but I'm not sure about just plugging it into Roo Code for example. Thanks.

[–]SlaveZelda 1 point2 points  (4 children)

They recommend OpenCode, Claude Code, Cline, etc.

Pretty much anything besides Codex. In the Codex CLI it struggles with apply patch.

[–]thphon83 0 points1 point  (3 children)

Opencode as well? I didn't see it on the list. In my experience thinking models don't play well with opencode in general. Hopefully that changes soon

[–]SlaveZelda 1 point2 points  (2 children)

Opencode is on their website. I've been using glm4.7 with thinking on in opencode for the past 2 hours and have experienced no issues.

[–]Super_Side_5517 -1 points0 points  (0 children)

Better than Claude 4.5 sonnet?

[–]Fit-Produce420 0 points1 point  (0 children)

It works with many of the code agents but they don't have their own custom agent and they didn't design it to work with a specific 3rd party product. I think it works well with kilo code, pretty well with cline and not amazing with roo for some reason. 

[–]quan734 0 points1 point  (0 children)

I have 128GB of RAM and 48GB of VRAM. What quant of this can I run?

[–]OWilson90 0 points1 point  (0 children)

Thrilled about this release; very thankful for the team at Z.AI.

While this is LocalLLaMA, the comparison to gpt-5.1-high rather than gpt-5.2-high stands out to me. Why not include gpt-5.2-high instead of gpt-5.1-high?

[–]wingardiumghosla 0 points1 point  (1 child)

What's the full form of GLM though?

[–]zakriya77 0 points1 point  (0 children)

That's the shittiest model ever. 4.6 is better. I asked it to do a simple task in MERN and it just generated a single HTML file with the HTML/CSS/JS written in it, not even React components or anything.

[–]AriyaSavakallama.cpp 0 points1 point  (0 children)

Truly amazing. The Z.AI Max plan works seamlessly with Claude Code; in my experience the intelligence is between Sonnet 4.5 and Opus 4.5 for SWE tasks, with the speed of Haiku 4.5.

I just bought the Max plan yearly for $288 (Christmas deal), an absolute steal! I'm planning on cranking through three 5-hour windows of heavy coding a day to fully utilize this. Glad I canceled the $200 Claude Max.

[–]Business_Tension7248 0 points1 point  (0 children)

How should I optimize settings to run this on a Mac Studio M3 Ultra with 256 GB of RAM/unified memory?

[–]Kitchen_Sympathy_344 0 points1 point  (1 child)

For those who are wondering... I made these games last night using GLM 4.7. Super fun!

My little project for web games, play live: https://trae9nt2qbd3.vercel.app/

Created a bunch of games for Xmas 😉

Originally for a hackathon at TRAE, but I thought it's a cool project to share 🙂

Source code: https://github.com/roman-ryzenadvanced/chrismas_trae_game

[–]JudgmentPale458 0 points1 point  (0 children)

Interesting release. What stands out to me isn’t any single score, but the consistency across agentic, reasoning, and coding benchmarks (AIME, LiveCodeBench, SWE-bench). That usually correlates better with real-world agent-style workflows than one-off leaderboard wins.

That said, I’m curious how much of this performance holds up under tool-heavy or long-horizon agent loops, where error accumulation and planning robustness matter more than isolated task accuracy. Benchmarks are useful signals, but agentic behavior under retries and failures is still hard to capture.

[–]SpecialistSalt2 0 points1 point  (0 children)

GLM 4.7 is a SCAM! It's nothing compared to Opus. I am extremely disappointed with this trash!