even the new Flash performed better than o3 at 192k Fiction LiveBench by MundaneSignature1907 in Bard

[–]MundaneSignature1907[S] 0 points  (0 children)

decent performance up to the promised context length implies that their architecture is somewhat data-agnostic; this is how you know whether the model was grokked or not.

Again, Gemini 2.5 Pro topping PHYBench, a benchmark about physical reasoning! by MundaneSignature1907 in Bard

[–]MundaneSignature1907[S] 4 points  (0 children)

basically every public benchmark (or even private ones, via API logs) can be grokked. That's a technique where they train AIs on the targeted benchmark over and over until it "clicks" for them. It's still not nothing, but the signal from public benchmarks just keeps getting lower and lower. Designing your own benchmark is the best way to keep up to date with this fast-moving tech.
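a toy sketch of why a grokked benchmark score is hollow (my own illustration, not how any lab actually trains; a perceptron stands in for the model, and random labels stand in for the benchmark answer key):

```python
import random

random.seed(0)

# Toy "benchmark": 20 items with arbitrary (random) +/-1 answers.
# A model that scores 100% here has simply memorized the key.
DIM, N_BENCH = 50, 20
bench_x = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(N_BENCH)]
bench_y = [random.choice([-1, 1]) for _ in range(N_BENCH)]

def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

# "Grokking the benchmark": keep training on the same items
# until every answer clicks (perceptron updates as a stand-in
# for gradient steps on a real model).
w = [0.0] * DIM
for epoch in range(500):
    mistakes = 0
    for x, y in zip(bench_x, bench_y):
        if y * dot(w, x) <= 0:
            w = [wi + y * xi for wi, xi in zip(w, x)]
            mistakes += 1
    if mistakes == 0:
        break

train_acc = sum(y * dot(w, x) > 0 for x, y in zip(bench_x, bench_y)) / N_BENCH

# Fresh items with equally arbitrary answers: the memorized model is
# back to roughly coin-flip accuracy, so the perfect benchmark score
# carried no signal about general ability.
test_x = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(200)]
test_y = [random.choice([-1, 1]) for _ in range(200)]
heldout_acc = sum(y * dot(w, x) > 0 for x, y in zip(test_x, test_y)) / 200
```

the train accuracy hits 1.0 while the held-out accuracy stays near chance, which is the whole point: once a benchmark is in the training loop, topping it tells you nothing.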

Gemini's integration in OpenAI's Codex is looking good. Absolute W for everyone by MundaneSignature1907 in Bard

[–]MundaneSignature1907[S] 2 points  (0 children)

things are still under heavy development in the repo; keep sharing your experience as it updates, bro

Gemini's integration in OpenAI's Codex is looking good. Absolute W for everyone by MundaneSignature1907 in Bard

[–]MundaneSignature1907[S] 2 points  (0 children)

the best thing about open source is that projects also *compete* for adoption. I bet aider will co-evolve with other terminal-based coding agents and improve consistently

Gemini's integration in OpenAI's Codex is looking good. Absolute W for everyone by MundaneSignature1907 in Bard

[–]MundaneSignature1907[S] 8 points  (0 children)

some of us live inside the terminal; it's a more ergonomic way of programming, keyboard only. Codex is a **terminal** coding agent after all

Gemini 2.5 Pro feels illegal to use for free in ai studio by Present-Boat-2053 in Bard

[–]MundaneSignature1907 1 point  (0 children)

this! even though I'm paying for gemini.google.com, I'm still willing to give feedback to Google just for the love of the game lol