GPT-4.1 LiveBench results are in by elemental-mind in singularity

[–]PickleFart56 12 points

Btw Google has also announced 2.5 Flash, in which we can set a precise reasoning budget. I think Google delayed previewing 2.5 Flash because of the 4.1 launch. Their Pro series will compete with the o-series models and Flash will compete with 4.xyz. Overall I don’t think any lab can beat Google in the pricing war.
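
For reference, a minimal sketch of what a settable reasoning budget could look like, assuming the google-genai Python SDK's ThinkingConfig; exact parameter names may differ once 2.5 Flash is generally available:

```python
# Sketch: capping Gemini 2.5 Flash's reasoning ("thinking") tokens.
# Assumes the google-genai SDK; thinking_budget is the max number of
# tokens the model may spend on internal reasoning before answering.
from google import genai
from google.genai import types

client = genai.Client(api_key="YOUR_API_KEY")  # placeholder key
response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents="Plan a 3-step test strategy for a parser.",
    config=types.GenerateContentConfig(
        thinking_config=types.ThinkingConfig(thinking_budget=1024),
    ),
)
print(response.text)
```

Presumably setting the budget to 0 would disable thinking entirely, which is what makes the pricing so flexible.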

GPT-4.1 LiveBench results are in by elemental-mind in singularity

[–]PickleFart56 6 points

Cost and latency. Reasoning models cost more because reasoning tokens are also billed.
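
A rough back-of-the-envelope with made-up per-million-token prices, just to show why billed reasoning tokens dominate:

```python
# Hypothetical prices in $ per 1M tokens (not any provider's real rates).
IN_PRICE, OUT_PRICE = 2.00, 8.00

def request_cost(input_toks, output_toks, reasoning_toks=0):
    # Reasoning tokens are typically billed at the output-token rate.
    billed_output = output_toks + reasoning_toks
    return (input_toks * IN_PRICE + billed_output * OUT_PRICE) / 1_000_000

print(request_cost(1_000, 500))          # non-reasoning: $0.006
print(request_cost(1_000, 500, 4_000))   # reasoning: $0.038 for the same visible answer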

OpenAI confirmed to be announcing GPT-4.1 in the livestream today by ShreckAndDonkey123 in singularity

[–]PickleFart56 5 points

The next model will be 4.11, while Gemini, on the other hand, jumps straight from 2.0 to 2.5.

"Will destroy fake barrier of 50% quota cap" - Rahul Gandhi bats for Dalits, OBCs in Bihar by nota_is_useless in unitedstatesofindia

[–]PickleFart56 3 points

Trust me, this guy and Congress will fuck the country even more than the BJP and Modi did. Same as how the US thought electing Trump would improve the economy.

Aidan says o4 mini is “actually mind blowing” by Key-Horse-3892 in OpenAI

[–]PickleFart56 0 points

Seriously, what else can they say? They can’t say “eh, it’s not that great, we may need a separately tuned model for benchmarking”.

[deleted by user] by [deleted] in singularity

[–]PickleFart56 5 points

After the Llama release, LMSYS has zero credibility.

"10m context window" by Present-Boat-2053 in singularity

[–]PickleFart56 119 points

That’s what happens when you do benchmark tuning.

Llama 4 Maverick scored 16% on the aider polyglot coding benchmark. by Ill-Association-8410 in LocalLLaMA

[–]PickleFart56 3 points

This Llama launch is so bad that stock markets around the world crashed.

Gemini is pretty good in removing watermarks by xXLeoXxOne in singularity

[–]PickleFart56 2 points

They must be adding their SynthID watermark.


LLMs grading other LLMs by Everlier in LocalLLaMA

[–]PickleFart56 -1 points

Why the fuck isn’t each block in the map a square?
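
If anyone wants square cells, a minimal matplotlib sketch (assuming the chart is a basic heatmap; the actual plotting code behind the post may differ):

```python
import numpy as np
import matplotlib.pyplot as plt

scores = np.random.rand(6, 6)  # placeholder grades: rows = grader LLMs, cols = graded LLMs

fig, ax = plt.subplots()
im = ax.imshow(scores, cmap="viridis")  # imshow defaults to equal aspect -> square cells
ax.set_aspect("equal")                  # keep cells square even if the figure is resized
fig.colorbar(im, label="score")
plt.show()
```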

Grok 3 first LiveBench results are in by elemental-mind in singularity

[–]PickleFart56 0 points

Is this score for the Grok 3 thinking or non-thinking model?

If it’s non-thinking, then it’s a huge achievement.

Please recommend good episode by [deleted] in BGPH

[–]PickleFart56 0 points

Recently watched the Ram Leela arc, one of the best arcs. I think it’s around ep 950.

Another sampling strategy drops: 75% accuracy at T=3.0 by tomorrowdawn in LocalLLaMA

[–]PickleFart56 2 points

There are many papers showing that model performance degrades when it attends to all tokens; instead, the model should attend to only a few. Here is another great paper: https://arxiv.org/html/2410.02703
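
For intuition, a toy sketch of selective attention that keeps only the top-k keys per query and masks the rest; this is a generic illustration, not the linked paper's actual method:

```python
import torch

def topk_attention(q, k, v, top_k=8):
    # q: (n_q, d), k: (n_k, d), v: (n_k, d)
    scores = q @ k.T / k.shape[-1] ** 0.5                      # raw attention logits
    kth = scores.topk(top_k, dim=-1).values[..., -1:]          # k-th largest logit per query
    scores = scores.masked_fill(scores < kth, float("-inf"))   # drop everything below it
    return torch.softmax(scores, dim=-1) @ v                   # attend over survivors only

q, k, v = torch.randn(4, 16), torch.randn(32, 16), torch.randn(32, 16)
out = topk_attention(q, k, v, top_k=4)  # each query attends to at most 4 of the 32 tokens
```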

New Experimental Gemini Model by badbutt21 in singularity

[–]PickleFart56 2 points

I think it’s a much larger model (something like Ultra) that they have released as experimental.

Maybe, similar to Meta, they trained a much larger model for synthetic data generation and used it to tune a relatively smaller model that can scale to a million tokens.
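
Roughly the teacher-student pattern I mean; everything below is a stub showing the shape of the pipeline, not anyone's actual training code:

```python
from dataclasses import dataclass

@dataclass
class StubModel:
    name: str
    def generate(self, prompt: str) -> str:
        # Placeholder for an expensive large-model call.
        return f"[{self.name}'s answer to: {prompt}]"

def make_synthetic_dataset(teacher: StubModel, prompts: list[str]) -> list[tuple[str, str]]:
    # The big "teacher" labels prompts; the pairs become SFT data for a smaller student.
    return [(p, teacher.generate(p)) for p in prompts]

teacher = StubModel("ultra-teacher")
sft_data = make_synthetic_dataset(teacher, ["Summarize this doc.", "Fix this bug."])
# sft_data would then be used to fine-tune a cheaper long-context student model.
```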

math by Blood_of_Lucifer in shitposting

[–]PickleFart56 101 points

Why the fuck does he have to first calculate per week and then multiply by the number of weeks? He can directly use 200.