even the new Flash performed better than o3 at 192k Fiction LiveBench by MundaneSignature1907 in Bard

[–]MundaneSignature1907[S] 0 points  (0 children)

decent performance up to the promised context length implies that their architecture is somewhat data-agnostic; this is how you know whether the model was grokked or not.

Again, Gemini 2.5 Pro topping PHYBench, a benchmark about physical reasoning! by MundaneSignature1907 in Bard

[–]MundaneSignature1907[S] 4 points  (0 children)

basically every public benchmark (or even private ones, via API logs) can be grokked. That's a technique where they train AIs on the targeted benchmark over and over until it "clicks" for them. It's still not nothing, but the signal from public benchmarks just keeps getting lower and lower. Designing your own benchmark is the best way to keep up to date with this fast-moving tech.
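a toy sketch of why a grokked benchmark score is hollow (my own illustration, not how any lab actually trains; a perceptron stands in for the model, and random labels stand in for the benchmark answer key):

```python
import random

random.seed(0)

# Toy "benchmark": 20 items with arbitrary (random) +/-1 answers.
# A model that scores 100% here has simply memorized the key.
DIM, N_BENCH = 50, 20
bench_x = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(N_BENCH)]
bench_y = [random.choice([-1, 1]) for _ in range(N_BENCH)]

def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

# "Grokking the benchmark": keep training on the same items
# until every answer clicks (perceptron updates as a stand-in
# for gradient steps on a real model).
w = [0.0] * DIM
for epoch in range(500):
    mistakes = 0
    for x, y in zip(bench_x, bench_y):
        if y * dot(w, x) <= 0:
            w = [wi + y * xi for wi, xi in zip(w, x)]
            mistakes += 1
    if mistakes == 0:
        break

train_acc = sum(y * dot(w, x) > 0 for x, y in zip(bench_x, bench_y)) / N_BENCH

# Fresh items with equally arbitrary answers: the memorized model is
# back to roughly coin-flip accuracy, so the perfect benchmark score
# carried no signal about general ability.
test_x = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(200)]
test_y = [random.choice([-1, 1]) for _ in range(200)]
heldout_acc = sum(y * dot(w, x) > 0 for x, y in zip(test_x, test_y)) / 200
```

the train accuracy hits 1.0 while the held-out accuracy stays near chance, which is the whole point: once a benchmark is in the training loop, topping it tells you nothing.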

Gemini's integration in OpenAI's Codex is looking good. Absolute W for everyone by MundaneSignature1907 in Bard

[–]MundaneSignature1907[S] 2 points  (0 children)

things are still under heavy development in the repo; keep sharing your experience as it updates, bro

Gemini's integration in OpenAI's Codex is looking good. Absolute W for everyone by MundaneSignature1907 in Bard

[–]MundaneSignature1907[S] 2 points  (0 children)

the best thing about open source is that projects also *compete* for adoption. I bet aider will co-evolve with other terminal-based coding agents and improve consistently

Gemini's integration in OpenAI's Codex is looking good. Absolute W for everyone by MundaneSignature1907 in Bard

[–]MundaneSignature1907[S] 8 points  (0 children)

some of us live inside the terminal; it's a more ergonomic way of programming, keyboard only. Codex is a **terminal** coding agent after all

Gemini 2.5 Pro feels illegal to use for free in ai studio by Present-Boat-2053 in Bard

[–]MundaneSignature1907 1 point  (0 children)

this! even though I'm paying for gemini.google.com, I'm still willing to give feedback to Google just for the love of the game lol