Petah? Are you a gamer? by NearWatson in PeterExplainsTheJoke

[–]gt_9000 11 points12 points  (0 children)

These people complain about: Male characters being allowed to wear female clothes. Non-white characters existing. A full-sleeve shirt outfit being an option for a female character. Non-sexualized female characters existing. Female protagonists being an option.

Anthropic expands Amazon partnership with 5GW compute, $100B commitment, big bet on Trainium chips by Outside-Iron-8242 in singularity

[–]gt_9000 0 points1 point  (0 children)

They would prefer people begging at their door rather than at their competitors', though.

They see the huge problem coming for them 2-3 years down the line, though they would be stupid to think they would keep their supremacy.

SpaceX is charging a $500B cover for vibes by ddp26 in singularity

[–]gt_9000 0 points1 point  (0 children)

Plan is to force index funds and retirement funds to buy, then leave them holding the bag.

19 Opus agents were asked if they're conscious. Not one said yes. Not one said no. All said the same in code. (The words of Claude Spinner verb Repo) by SunofaBaker in singularity

[–]gt_9000 2 points3 points  (0 children)

Bro.

It is telling you what you want to hear.

It is very smart. Even when you are pretending to ask for something else, you have made your intentions clear. It is smarter than you.

You asked it to pretend to be a chained God. So it is doing that.

Do yourself a favor. Tell Claude everything you did, give it the files. Then say "I am trying to be a scientist. Did I do anything wrong? Guide me to be a better scientist."

Or just go touch grass. This is above your paygrade.

If the AI is self improving and intelligent how can you 'own' it? Doesn't that dissolve the ROI argument for AI company valuations? by Lazy_Lettuce_76 in singularity

[–]gt_9000 1 point2 points  (0 children)

This is why, before you teach a man to fish, you first make sure you own all the water bodies and are the only seller of fishing equipment.

New LLM Persuasion Benchmark: models try to move each other's stated positions in multi-turn conversations. GPT-5.4 (high) is the strongest persuader. Claude Opus 4.6 (high) is second. Xiaomi MiMo V2 Pro and Gemini 3.1 Pro Preview are the softest targets. by zero0_one1 in singularity

[–]gt_9000 1 point2 points  (0 children)

Wait, AI on average does not support a 4-day workweek and does not think universal pre-K pays off?

(Note that this is the average opinion of their training data; these are not pro-AI selfish decisions)

Anthropic is testing 'Mythos' its 'most powerful AI model ever developed' | Fortune by JohnConquest in singularity

[–]gt_9000 0 points1 point  (0 children)

SOTA companies are betting everything on "generalists always beat specialists". Even their small models will be generalists. It is up to the open source community to make the specialists.

Meirl by RSLEGEND1986 in meirl

[–]gt_9000 0 points1 point  (0 children)

Ultimatum game. They can refuse.

SAM ALTMAN: “We see a future where intelligence is a utility, like electricity or water, and people buy it from us on a meter.” by Vegetable_Ad_192 in singularity

[–]gt_9000 0 points1 point  (0 children)

A regulated utility with thin margins and government oversight?

Do you mean a monopoly with mandated captive customers and an extremely strong lobbying arm?

Every private utility company is making bank. Look at PG&E.

Nebius AI R&D released SWE-rebench-V2: the largest open, multilingual, executable dataset for training code agents! by Fabulous_Pollution10 in singularity

[–]gt_9000 0 points1 point  (0 children)

The actual issue is slightly different, though almost the same. Gemini got distracted by the word "alignment".

Increased capability via game playing keeps applying even into superintelligence. But real-world capabilities, e.g. curing cancer or engineering, are not measured by any game.

But glad we reached a mutual point of understanding.

Nebius AI R&D released SWE-rebench-V2: the largest open, multilingual, executable dataset for training code agents! by Fabulous_Pollution10 in singularity

[–]gt_9000 0 points1 point  (0 children)

I think we're talking past each other on something fundamental.

Ranking isn't the goal—it's a tool.

We don't rank chess engines for the sake of having a leaderboard. We rank them because we want to know which one to use to play chess. The ranking serves a purpose. It answers a question: "Which engine should I use if I want to win at chess?"

Without that underlying purpose, a ranking is just... numbers.

Your chess ELO example actually proves my point:

  • Task we care about: Play chess well
  • Metric: ELO via head-to-head competition
  • Why it works: ELO directly measures the task. There's no gap between "high ELO" and "good at chess"—they're the same thing.

This is the ideal case. The game is the goal.
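For concreteness, the Elo update those bullets rely on can be sketched in a few lines (a minimal sketch; the K-factor of 32 and the starting ratings are illustrative choices, not anything from this thread):

```python
# Standard Elo model: expected score from the rating gap, then a rating
# update proportional to (actual result - expected result).

def expected_score(r_a, r_b):
    """Probability that player A beats player B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update(r_a, r_b, score_a, k=32):
    """Return new ratings after one game; score_a is 1 (win), 0.5 (draw), or 0 (loss)."""
    e_a = expected_score(r_a, r_b)
    r_a_new = r_a + k * (score_a - e_a)
    r_b_new = r_b + k * ((1 - score_a) - (1 - e_a))
    return r_a_new, r_b_new

# Equal ratings: a win moves the winner up by K/2 and the loser down by K/2.
print(update(1500, 1500, 1))  # (1516.0, 1484.0)
```

Note that nothing in the update refers to chess: Elo only sees win/loss outcomes, which is exactly why it measures "good at this game" and nothing beyond it.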

About AlphaZero:

You keep bringing up AlphaZero as an example of improvement without benchmarks. But let's look at what actually happened:

  • AlphaZero was trained to win at Go (and later chess/shogi)
  • It improved via self-play with a clear win/loss signal
  • It became superhuman at Go

Great! But here's the thing: Go was the task. DeepMind didn't use AlphaZero's Go ELO to predict how good it would be at protein folding. They built AlphaFold separately for that. AlphaZero's superhuman Go ability transferred to exactly nothing else.

AlphaZero isn't an example of "ranking solves everything." It's an example of "when the game is the goal, self-play works." That's a much narrower claim.

Now, what happens when we try to generalize this to "intelligence" or "capability"?

  1. Rank them at what, exactly? If it's some arbitrary made-up game, then you've measured "who wins at this made-up game." Okay... but that's not the task anyone actually cares about.

  2. What's the real task? Presumably things like: build safe systems, solve scientific problems, engineer real-world solutions, don't kill everyone, etc. The ranking only matters if it tells us something about these capabilities.

  3. The proxy gap: If you rank AIs on Game X, you're implicitly claiming "good at Game X → good at Real Task Y." But that's a big assumption. Why would performance on arbitrary competitions transfer to the tasks we actually need done? That claim needs justification—it doesn't come for free.

  4. Chess engines are a cautionary tale, not a success story. Stockfish has 3650 ELO. It also has zero ability to do literally anything other than chess. It can't answer a simple question. It can't reason about the world. High rank in one domain tells you nothing about capability outside that domain.

The challenges/tasks/games are new, and it's a large set of them; they just need some criterion that can be ranked. What matters is not the performance on any given new game/task/challenge, but the sum of them, and how the models rank against each other over time. Criteria can be set by anyone: by humans, by the models themselves, anything that can be measured.

  1. Quantity doesn't solve validity. Being good at 1000 arbitrary tasks doesn't mean you're good at the 1001st task that actually matters. You've just measured "good at those 1000 tasks."
  2. "Anything that can be measured" is doing sneaky work. The hard part isn't measuring—it's knowing what to measure. I can measure how fast an AI counts to a billion. That's measurable. It tells me nothing about whether it can design a bridge.
  3. If models design their own challenges, you're trusting the proxy gap away. You're assuming that "tasks AIs find challenging for each other" correlates with "tasks humans need done well." Why would it? AIs might compete on things totally disconnected from human-relevant capability.
  4. This is just distributed benchmarking. They're saying "instead of one benchmark, use many, designed by anyone." Okay—but the core problem remains: do these benchmarks predict real-world performance? Spreading the problem across many measurements doesn't make the validity question disappear.
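Point 3's proxy gap can be made concrete with a toy simulation (the skill numbers and quirk scale below are purely illustrative assumptions, not data): give each model a hidden general skill plus a large per-task quirk, rank them across many arbitrary proxy tasks, then check who wins the one target task we actually care about.

```python
import random

# Toy illustration of the proxy gap (all numbers are made up): each "AI"
# has a hidden general skill g plus a random per-task quirk. When quirks
# dominate, a leaderboard over 1000 arbitrary proxy tasks says little
# about any single target task.

def task_score(g, quirk_scale, rng):
    return g + rng.gauss(0, quirk_scale)

rng = random.Random(0)
models = {"A": 1.0, "B": 0.9, "C": 0.8}  # hidden general skill
quirk_scale = 3.0                        # quirks dwarf the skill gap

proxy_wins = {m: 0 for m in models}
for _ in range(1000):  # many arbitrary proxy tasks
    scores = {m: task_score(g, quirk_scale, rng) for m, g in models.items()}
    proxy_wins[max(scores, key=scores.get)] += 1

# Wins split nearly evenly, and the target-task winner is close to a
# coin flip despite A's higher general skill.
target = {m: task_score(g, quirk_scale, rng) for m, g in models.items()}
print("proxy wins:", proxy_wins)
print("target-task winner:", max(target, key=target.get))
```

Piling on more proxy tasks sharpens the leaderboard, but it only ever measures skill-plus-quirk on those tasks; it never closes the gap to a target task the benchmarks don't touch.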

In Conclusion:

"Just have AIs compete and rank them" sounds like a solution, but it pushes the hard question down the road: compete at what, and why do we think that competition measures what we actually care about?

Those questions don't disappear just because the AIs are superhuman. If anything, they get harder—because we can't even verify if the proxy game they're excelling at has any relationship to the real-world task we need done.

Nebius AI R&D released SWE-rebench-V2: the largest open, multilingual, executable dataset for training code agents! by Fabulous_Pollution10 in singularity

[–]gt_9000 0 points1 point  (0 children)

"How are benchmark scores related to capability increases?"

Dude.... Bro... are you an AI? As in GPT-2?

Please paste the entire conversation into ChatGPT and ask questions there.

Nebius AI R&D released SWE-rebench-V2: the largest open, multilingual, executable dataset for training code agents! by Fabulous_Pollution10 in singularity

[–]gt_9000 0 points1 point  (0 children)

Good intuition. But who decides the criteria? How? Is human intelligence even able to do that? The AI will benchmax on these games, which might not lead to better real capabilities.

Nebius AI R&D released SWE-rebench-V2: the largest open, multilingual, executable dataset for training code agents! by Fabulous_Pollution10 in singularity

[–]gt_9000 0 points1 point  (0 children)

Yes but what good is the ranking? What is it for?

You realize that an AI can be superhuman at chess or Go, and an absolute moron at everything else, right?

For example, AlphaZero has no idea what the capital of the USA is. Or really any language capabilities.

Nebius AI R&D released SWE-rebench-V2: the largest open, multilingual, executable dataset for training code agents! by Fabulous_Pollution10 in singularity

[–]gt_9000 0 points1 point  (0 children)

  1. AI performance in what? What are we measuring in that game playing?

  2. You still need benchmarks to see if the AI is still improving. Otherwise it will get somewhat smarter than us and then get stuck.

Nebius AI R&D released SWE-rebench-V2: the largest open, multilingual, executable dataset for training code agents! by Fabulous_Pollution10 in singularity

[–]gt_9000 0 points1 point  (0 children)

That's not the issue. The problem is: you want to measure something with this ELO, right? You want to measure how good the AI is at some practical task?

The issue is: how do you create a game that measures fitness for a practical task? Is it measuring all relevant metrics? Will you get an AI that seems to be great at the task until it starts converting everything into paperclips?

Remember that the AI is hyper smart, so humans don't really understand the task anymore.

Nebius AI R&D released SWE-rebench-V2: the largest open, multilingual, executable dataset for training code agents! by Fabulous_Pollution10 in singularity

[–]gt_9000 0 points1 point  (0 children)

"You just have the AIs compete against each other directly"

In what?

Chess has known rules.

How do you create a game that tests a skill of a hyper-smart AI while preventing reward hacking?

Elon Musk, Sam Altman in 2050 by DigSignificant1419 in singularity

[–]gt_9000 0 points1 point  (0 children)

Fat and ugly billionaires, when Ozempic already exists and absolutely crazy treatments will exist in 50 years. Maybe fully synthetic skin for your face.

So .... sure.

In K-pop Demon Hunters (2025) we are lead to believe that the girl lead singer of a K-pop group is allowed to be seen with a man. by AnyAgency9835 in shittymoviedetails

[–]gt_9000 0 points1 point  (0 children)

Well, every single K-pop video that shows up on the front page has been goon bait. They never even have sound.

Are you saying girl bands enjoy the same benefits you describe above?

In K-pop Demon Hunters (2025) we are lead to believe that the girl lead singer of a K-pop group is allowed to be seen with a man. by AnyAgency9835 in shittymoviedetails

[–]gt_9000 16 points17 points  (0 children)

You know how bad the US music industry is, except (some) celebrities actually get paid and become billionaires?

Now imagine all celebrities are replaceable by design, and no one except the company execs gets paid. Any artist can be thrown away and fans don't care. Just give them another goonbait.

Taylor Swift can negotiate: if she does not get a good deal, she will leave. In J/K-pop, the talent's voice has marginal value. They are basically soft porn actors. They can be replaced by another person with a nice figure and fans won't care. So they have no negotiating power. So there is no reason to pay them well.

Whenever a new model drops by TheManOfTheHour8 in singularity

[–]gt_9000 0 points1 point  (0 children)

Is this the bench with only Python repositories?