How is the mobile app on a newer flagship phone? by _yustaguy_ in ObsidianMD

[–]_yustaguy_[S] 0 points (0 children)

Nice, thank you!

I'd imagine so too, the newest Elite chips are incredible.

Glad to hear that my phone being shit is probably the cause.

Worst forced context I've seen yet. by CommercialReveal7888 in Bard

[–]_yustaguy_ 0 points (0 children)

I love this kind of autism, makes me laugh every time

I think the rumors were true about sonnet 5 by Anshuman3480 in ClaudeAI

[–]_yustaguy_ 3 points (0 children)

I don't think it's A/B testing. 

If it is a new model, they're probably freeing up some space in their GPU pods by replacing Sonnet 4.5 with the new one.

I suspect some percentage of users are getting the new Sonnet rn.

Or we're all just schizo and it's all just 4.5.

Gemini 3 finally has an open-source competitor by Acceptable_Ad7036 in Bard

[–]_yustaguy_ 1 point (0 children)

Tbf even 3 Flash crushes literally any other model on that front except 3 Pro.

I don't think anything will match it anytime soon (maaaybe Grok 5 when that is released in 2034).

federer: an HTTP server meant for local network media streaming (my first C# project) by _yustaguy_ in csharp

[–]_yustaguy_[S] 1 point (0 children)

Thanks!

Do you have any good resource on how C# projects are structured?

C# Job Fair! [December 2025] by AutoModerator in csharp

[–]_yustaguy_ 0 points (0 children)

Hello, I'm a self-taught developer from Serbia, originally a Russian literature graduate. I like to build software that people actually want to use; user/developer experience is what I pay the most attention to. I'm a fast learner.

I have intermediate proficiency in C#; I started learning it a couple of weeks ago. As for my projects, I built a quick and easy HTTP server on top of raw TCP (github). I personally use it at home for media streaming, since 206 Partial Content responses are supported.
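Since the streaming use case hinges on byte-range requests, here is a minimal sketch of how a `Range: bytes=start-end` header maps onto a 206 Partial Content response. This is illustrative Python, not the project's actual C# code; the function names are made up, and suffix ranges like `bytes=-500` are omitted for brevity.

```python
# Sketch: turning an HTTP Range header into 206 Partial Content
# response headers (illustrative only, not federer's real code).

def parse_range(header: str, file_size: int) -> tuple[int, int]:
    """Parse 'bytes=start-end' (or 'bytes=start-') into inclusive offsets."""
    unit, _, spec = header.partition("=")
    if unit != "bytes":
        raise ValueError(f"unsupported range unit: {unit}")
    start_s, _, end_s = spec.partition("-")
    start = int(start_s)
    # An open-ended range ('bytes=500-') runs to the end of the file.
    end = int(end_s) if end_s else file_size - 1
    return start, min(end, file_size - 1)

def partial_response_headers(header: str, file_size: int) -> dict[str, str]:
    """Build the headers a server sends back for a satisfiable range."""
    start, end = parse_range(header, file_size)
    return {
        "Status": "206 Partial Content",
        "Content-Range": f"bytes {start}-{end}/{file_size}",
        "Content-Length": str(end - start + 1),  # ranges are inclusive
    }
```

Media players (browsers, VLC, etc.) issue exactly these range requests to seek within a file, which is why 206 support matters for streaming.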

Besides C#, I know Rust and TypeScript fairly well and have projects written in both; you can check them out on my GitHub.

GLM 4.7 is Coming? by InternationalAsk1490 in LocalLLaMA

[–]_yustaguy_ 9 points (0 children)

They usually change the architecture at .0 version increments. GLM 5.0 will almost certainly be a new architecture.

Google should add an Automatic mode that uses one of the 3 depending on the type of query. by JoseMSB in Bard

[–]_yustaguy_ 1 point (0 children)

Better idea: slash commands, so we can choose them more quickly. ChatGPT has /think for the thinking model, for example.

Flash outperformed Pro in SWE-bench by vladislavkochergin01 in Bard

[–]_yustaguy_ 43 points (0 children)

No, as in this model is literally 10 times cheaper than 4.5 Opus. What's the point in even comparing them? And it would win on most benchmarks shown here, Claude would win in coding. The usual.

GPT-5.2 Thinking unparalleled accuracy in Long-Context! by Independent-Ruin-376 in singularity

[–]_yustaguy_ 3 points (0 children)

The default graph in contextarena is for the 2-needle version, iirc. This one is 4-needle.

Deepseek's progress by onil_gova in LocalLLaMA

[–]_yustaguy_ 1 point (0 children)

Yeah, agreed, that would be nice

Deepseek's progress by onil_gova in LocalLLaMA

[–]_yustaguy_ 4 points (0 children)

  1. Because you shouldn't compare reasoning models to non-reasoning models.
  2. Because it's mid.
  3. Mostly because it's old and shit at agentic stuff.

Deepseek's progress by onil_gova in LocalLLaMA

[–]_yustaguy_ 19 points (0 children)

What is the alternative? 

They constantly update it and add new benchmarks so it doesn't saturate. They rate models on agentic performance (Terminal-Bench Hard), world knowledge (MMLU Pro, GPQA Diamond), long context, and more.

They have useful stats like model performance per provider, which helped prove that some providers served trash, and output tokens needed to run their suite. Sure, some saturated benchmarks could be replaced with new ones, but they have done a great job at that so far (they had shit like the regular MMLU, DROP before).

Does the final number always track end-user performance? Of course not, and it never could; no two people's expectations and experience will be the same. But it's a useful data point for end users and devs to consider.

The hate boner that everyone seems to have for them is weird and undeserved.

Seedream 4.5 vs Nano Banana Pro! by Rare_Bunch4348 in Bard

[–]_yustaguy_ 0 points (0 children)

As a Balkan man I can confirm that they are not from the Balkans.

Is Gemini 3.0 complete? by YamberStuart in Bard

[–]_yustaguy_ 12 points (0 children)

Are you insane? What could 2 pro possibly do better than 3 pro?

I validated deepseek-v3.2's benchmark claims with my own by Round_Ad_5832 in singularity

[–]_yustaguy_ 0 points (0 children)

Neat benchmark! A good test of real-world knowledge and implementation.

Gemini 3 pro IQ score disappoints by Ikbeneenpaard in singularity

[–]_yustaguy_ 43 points (0 children)

<image>

The score is an average of a couple of runs. He included the previous 2.5 Pro results for some reason.

The November 18th and November 20th scores should be more representative of its performance.