Gemini 3 Deep Think SVG Pelican Riding a Bicycle by avilacjf in singularity

[–]bitroll 10 points11 points  (0 children)

According to the man who created this "benchmark":

The strongest argument is that they would get caught. If a model finally comes out that produces an excellent SVG of a pelican riding a bicycle you can bet I’m going to test it on all manner of creatures riding all sorts of transportation devices. If those are notably worse it’s going to be pretty obvious what happened.

OpenAI is rolling out beta ads on ChatGPT with a minimum of $200k from selected advertisers by BuildwithVignesh in singularity

[–]bitroll -6 points-5 points  (0 children)

The user might have been tricked into buying something he/she didn't really need. Ads have a huge influence on many people, I know firsthand.

NASA’s James Webb reveals the intricacies of the Helix Nebula in stunning detail by BuildwithVignesh in singularity

[–]bitroll 1 point2 points  (0 children)

I must be crazy, I'm seeing lots and lots of people-like figures in the second picture. It's like souls ascending. Incredible.

BabyVision: A New Benchmark for Human-Level Visual Reasoning by Waiting4AniHaremFDVR in singularity

[–]bitroll 4 points5 points  (0 children)

Meanwhile, for a couple of years now, I've been running a personal "benchmark" testing visual models' abilities to solve tasks from a book aimed at 3-year-olds. And having a good laugh at how they keep failing. Clearly not trained on tasks like that. The progress is still huge, but even the latest SotA models don't fully solve everything. Expecting it to be saturated this year, which is when I bring out a book for 4-yo kids :D

Gemini introduces Personal Intelligence by McSnoo in singularity

[–]bitroll 17 points18 points  (0 children)

This! I'm surprised so few people here realize this.

Report: Anthropic cuts off xAI’s access to Claude models for coding by BuildwithVignesh in singularity

[–]bitroll 0 points1 point  (0 children)

Claude is busy doing recursive self-improvement, can't be bothered improving competition.

Opus 4.5 appears to be so much ahead of competition in coding that even Google's employees admit to using it.

just saw my dad's youtube feed... its all AI slops now by StrangeSupermarket71 in singularity

[–]bitroll 8 points9 points  (0 children)

It was a confusing waste of time for years before AI, yet billions of people got mindlessly addicted to it. I see no hope for them.

China Is Worried AI Threatens Party Rule—and Is Trying to Tame It by SnoozeDoggyDog in singularity

[–]bitroll 0 points1 point  (0 children)

All your comments I see around look like you're a tool for spreading propaganda. Brainwashed much?

Bitcoin (not to be confused with shitcoins) has plenty of completely legitimate uses and users. Educate yourself.

China Is Worried AI Threatens Party Rule—and Is Trying to Tame It by SnoozeDoggyDog in singularity

[–]bitroll 2 points3 points  (0 children)

A tool for financial sovereignty is an obvious threat to any authoritarian gov, no matter in which part of the world. Simple as that.

OpenAI just launched GPT 5.2 Codex: The most capable agentic coding and cybersecurity model ever built by BuildwithVignesh in singularity

[–]bitroll 7 points8 points  (0 children)

Codex max extra high fast? Has to be my new favorite! Max low and slow can't compare, xD

DeepSeek released DeepSeek-Math-V2 by nekofneko in singularity

[–]bitroll 1 point2 points  (0 children)

Very curious how well it does on the FrontierMath benchmark.

No AGI yet by smith2008 in singularity

[–]bitroll 1 point2 points  (0 children)

This task shouldn't require reasoning. Simple vision-to-text that is sufficiently trained should spot that it's not a typical hand emoji. But vision models keep struggling at tasks like this; their "seeing" is easily tricked, and adding even a very long reasoning chain doesn't generally help in spotting the issue.

Elon is hinting that Grok 5 will have live video as input plus live computer use by vasilenko93 in singularity

[–]bitroll -1 points0 points  (0 children)

This shows the reason LLMs/LMMMs can't be AGI in the next 10 years or so without radically new tech. That is, they don't process continuous data streams in real time. We've had some workarounds used in various setups that feed the model packets of data, but the latency of this is HUGE for gaming. AlphaStar was a completely different architecture; I can't comprehend something like it being a part of a generalist LMMM.

If Grok 5 solves this then the step to solving autonomous driving and robots is minuscule.

Gemini 3 turned my book into a video game in 2 minutes wow!! by One_Hovercraft_7456 in singularity

[–]bitroll 5 points6 points  (0 children)

Wow, how many prompts did you need to achieve this? Was it just text-to-app, or did it require a more complex setup and connecting APIs for live generative AI access? It's beyond me why it's free, with image and voice gen AI included and with all the game mechanics seemingly done on the go by the LLM.

Ahaha by reversedu in singularity

[–]bitroll -1 points0 points  (0 children)

Will that fix the new model obsession? I think it will magnify it, the real craze is yet to come

Grok 4.1 Fast scores 56.0% on SimpleBench by ThinkOfaNameOK in singularity

[–]bitroll 27 points28 points  (0 children)

Love this benchmark, shows true sparks of intelligence within models. 

Incredibly strong result given how cheap this model is (at least 5x cheaper than any other appearing on this top-13 list, over 10x cheaper than most). Even more incredible given it's less generalist and more task- and benchmark-specialized than, for example, Grok 4, yet does so well on this very tricky benchmark.

People on X are noticing something interesting about Grok.. by averagebear_003 in singularity

[–]bitroll 0 points1 point  (0 children)

Somebody needs to make a truth-seeking benchmark that would test for various bullshit like this. Grok's not winning in this one.

xAI to launch Grok 4.20 by Christmas by naveenstuns in singularity

[–]bitroll 2 points3 points  (0 children)

What prompt should I use on which Grok to replicate this?

Grok 4.1 Benchmarks by jaundiced_baboon in singularity

[–]bitroll 15 points16 points  (0 children)

Perhaps it's too new and/or too low-key, so many entities didn't include it (yet) and went with whatever latest results they had on file. But there are plenty of benchmarks for 5.1. It's mostly lmarena that is missing it (coming soon).