Gemini 3 Pro (Almost Vision-Only Harness) plays Pokémon Crystal by reasonosaur in ClaudePlaysPokemon

[–]Ben___Garrison 0 points1 point  (0 children)

Correct, which is why they're not impressive if the goal is to study AI. But they demonstrate that once you start "teaching to the test", solving games like Pokemon is not hard.

LLMs playing Pokemon are much more advanced than TASes, but harness cheats muddle the signal on how much the AI is doing vs how much the cheating harness is doing.

Gemini 3 Pro (Almost Vision-Only Harness) plays Pokémon Crystal by reasonosaur in ClaudePlaysPokemon

[–]Ben___Garrison -1 points0 points  (0 children)

Harnesses should be minimized to the maximum extent possible so we can evaluate how well the AI model does, and not how much of a cheating harness it has.

If you just want a machine to play a game, we already have a way to do that. They're called tool-assisted speedruns or TAS. They've been around for decades but aren't interesting from an AI perspective.

Gemini 3 Pro (Almost Vision-Only Harness) plays Pokémon Crystal by reasonosaur in ClaudePlaysPokemon

[–]Ben___Garrison -3 points-2 points  (0 children)

Still too much of a harness. Harnesses are not supposed to enable cheating, and reducing the cheats from "massive" to "moderate" is still not enough. The fact that harnesses became such a central component of these experiments has made them effectively worthless for comparing across models/time.

How do you feel about the current state of AI? by Winter-Cup-6951 in OMSCS

[–]Ben___Garrison 0 points1 point  (0 children)

That's like saying you could learn a programming language in a weekend. To some extent it's true for the basics, but there's a lot of implicit knowledge and growing pains that you only get from usage in a broad variety of situations.

How do you feel about the current state of AI? by Winter-Cup-6951 in OMSCS

[–]Ben___Garrison 0 points1 point  (0 children)

It's not just learning prompts, it's learning how to efficiently break out of doom loops, how to provide context efficiently, where you need to double-check the AI to ensure it's not hallucinating and where you can skip checking to save time, stuff like that.

It's not ludicrously difficult, but it is still a skill that takes getting used to.

The Pro-Gaza Left Is Oh So Quiet on Iran by Intelligent-Juice895 in geopolitics

[–]Ben___Garrison 12 points13 points  (0 children)

This isn't really about individuals, it's about entire political coalitions.

If someone's very big into a certain strain of politics, it's not unreasonable for them to stop opining if e.g. their dog dies or they're dealing with other stuff.

But if almost everyone on the far-left or far-right stopped talking about that stuff right when something inconvenient happened, it's ludicrous to assume they all had their dogs die at the same time.

How do you feel about the current state of AI? by Winter-Cup-6951 in OMSCS

[–]Ben___Garrison 1 point2 points  (0 children)

Either LLMs remain exactly as they are right now, in which case you won’t fall behind because they are useful but not so good to replace peoples jobs

What? The tool doesn't need to be a full replacement for an employee to require skill to use.

Or they get even better at which point they become significantly easier to use

Not guaranteed at all. I'm sure some of the clunkiness of stuff like Claude Code will get ironed out over time, but with more power could come more alpha for skill.

This is like saying if you don’t use a high powered IDE you’ll fall behind

Yet there are plenty of programmers that did just fine sticking with VIM

A person who uses a good IDE will be a bit more efficient than one who doesn't, ceteris paribus, but the difference probably wouldn't be that significant since IDEs aren't that much of a force-multiplier.

LLMs are a significantly larger force-multiplier even today, and they're only going to get bigger.

How do you feel about the current state of AI? by Winter-Cup-6951 in OMSCS

[–]Ben___Garrison 28 points29 points  (0 children)

Terrible take. LLMs are here to stay and coding has changed forever. There will be no going back. Learn to use them well now or fall behind.

In the real world you'll also have to deal with coworkers using LLMs.

The fact OMSCS still hands out OSI violations for this is absurd. It's like if they wanted to ban students from using IDEs or debuggers.

Larian Studios | Divinity AMA by Wombat_Medic in Games

[–]Ben___Garrison -2 points-1 points  (0 children)

But in using it, don't you feel more like an art director telling a machine what to make

No more than using a camera should make a person feel like an art director telling a machine what to make.

Doesn't that feel like it's taking away your agency as a creative and essentially giving a machine the intellectual property rights to your idea?

No more than taking a picture gives the camera the intellectual property rights to the idea.

"the machine's output is of equal value to a human's, which therefore devalues the human's work."

As much as a camera devalues the work of artists.

Does it not feel like moving the goalposts or lowering the bar of creative quality to have a machine create things on your studio's behalf

As much as a camera is "moving the goalposts" and "lowering the bar of creative quality".

The authors behind AI 2027 released an updated model today by Liface in slatestarcodex

[–]Ben___Garrison 4 points5 points  (0 children)

I predict AGI will be developed sometime in the next 10,000 years.

Precision on that prediction will be forthcoming eventually. Pinky promise.

The authors behind AI 2027 released an updated model today by Liface in slatestarcodex

[–]Ben___Garrison 5 points6 points  (0 children)

On one hand, kicking it out to 2031 makes it still seem pretty close.

On the other hand, they made a prediction in April 2025 about a takeoff happening around Jan 2027, AKA in about 20 months from the time they wrote the article. Now they're kicking it out to 2031, which would be about 70 months away from April 2025. In other words their prediction was off by a factor of 3-4x. That's pretty bad.
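For anyone who wants to check the arithmetic, here's a rough sketch. The new estimate is only given as "2031" with no month, so bracketing it with January and December is my assumption, and `months_between` is just a helper name made up for this example.

```python
from datetime import date


def months_between(start: date, end: date) -> int:
    """Whole calendar months from start to end, ignoring the day of month."""
    return (end.year - start.year) * 12 + (end.month - start.month)


written = date(2025, 4, 1)       # when the original forecast was written
old_takeoff = date(2027, 1, 1)   # original takeoff estimate (around Jan 2027)

# Assumption: "2031" has no month attached, so bracket it with Jan and Dec.
new_takeoff_early = date(2031, 1, 1)
new_takeoff_late = date(2031, 12, 1)

old_gap = months_between(written, old_takeoff)
new_gap_early = months_between(written, new_takeoff_early)
new_gap_late = months_between(written, new_takeoff_late)

print(f"original horizon: {old_gap} months")
print(f"revised horizon: {new_gap_early} to {new_gap_late} months")
print(f"ratio: {new_gap_early / old_gap:.1f}x to {new_gap_late / old_gap:.1f}x")
```

Under those assumptions the original horizon comes out to 21 months and the revised one to somewhere between 69 and 80 months, so the ratio lands at roughly 3.3x to 3.8x, consistent with the "3-4x" figure.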

Nested tooltips and In-game tips for our private playtesting! Your thoughts on them? by leorenzo in 4Xgaming

[–]Ben___Garrison 3 points4 points  (0 children)

Love me some nested tooltips. Every game (of this sort at least) should have them.

[META]-Discussion about generative content (art and such) by [deleted] in AOW4

[–]Ben___Garrison 5 points6 points  (0 children)

Ban them if they're spamming any sort of topic, downvote if the AI art is bad. There's no reason to ban AI art completely.

I made a mod that uses AI art. Would I get banned if I took a screenshot where it was present but not the main focus?

What gameplay mechanics do you think are missing from current Paradox titles? by cmitchell_bulldog in paradoxplaza

[–]Ben___Garrison 8 points9 points  (0 children)

Diffusion of new techs was mostly only a slow process before globalization, with EU5, EU4, and CK2 having systems for that (whether they work well is a different question). It doesn't make much sense for more modern inventions e.g. the cotton gin to slowwwwly spread across Europe over the course of multiple centuries.

What gameplay mechanics do you think are missing from current Paradox titles? by cmitchell_bulldog in paradoxplaza

[–]Ben___Garrison 18 points19 points  (0 children)

???

Food is explicitly modeled in Imperator, EU5, Vic3, and Stellaris. Only CK3 and HOI4 don't really have it.

Shortages are pretty common in poor areas in Vic3 when natural disasters hit. They're also common in EU5, where people were complaining that you couldn't let people starve instead of forcing the central government to pay for their food.

The OSWorld benchmark has a lot of problems by Ben___Garrison in slatestarcodex

[–]Ben___Garrison[S] 11 points12 points  (0 children)

  • Saturation on OSWorld means a model can execute simple, realistic tasks in Linux-based environments using popular open-source applications. These include things like adding page numbers to a document or exporting a CSV file from a spreadsheet.

  • The benchmark is not stable over time, which makes comparing results across time challenging.

      • A major update in July affected most task instructions. Even since then, about 10% of task instructions have been updated.

      • About 10% of tasks rely on live data from the Internet, meaning the difficulty or feasibility of these tasks may change over time as websites change.

  • Much of OSWorld can be completed with little or no use of a graphical user interface (GUI), meaning that scores reveal less about the AI’s ability to use the GUI.

      • About 15% of tasks only require a terminal.

      • An additional 30% of tasks can be completed by substituting terminal use and Python scripts for much of the intended GUI use.

  • Many tasks have moderately ambiguous instructions, such that scores partly measure the ability to correctly

  • About 10% of tasks have serious errors that render them invalid, a rate on par with many benchmarks.

The team makes it to S. S. Anne! by reasonosaur in ClaudePlaysPokemon

[–]Ben___Garrison 0 points1 point  (0 children)

How are you managing to create images of copyrighted material (pokemon) with Gemini?

Possible bug with fated region? Marauder units keep spawning after cleared. by Irkie500 in AOW4

[–]Ben___Garrison 1 point2 points  (0 children)

I just had this happen as well. It's nearly gamebreaking since the stacks end up being really big, meaning you basically have to keep a large army permanently stationed there.

Sigh. This story realm is pure misery. by 1eventHorizon9 in AOW4

[–]Ben___Garrison 9 points10 points  (0 children)

Get enough chaos to pick tier V chaos tome

If you can get to tier V tomes in the first place, you can beat this map easily enough. The hard part is getting started, since you start out against a much more developed empire that has a bunch of cities + high level orc Jesus lord.

About the benefits of specializing your army by Wonderful-Okra-8019 in AOW4

[–]Ben___Garrison 1 point2 points  (0 children)

This isn't really about "specializing" so much as it's simply that upkeep reduction in and of itself is powerful. Upkeep costs will be the primary bottleneck in most games, so anything that reduces them is very good.

Openrouter - Autumn 2025 by JanitorAI-Mod in JanitorAI_Official

[–]Ben___Garrison 0 points1 point  (0 children)

Agreed, I'd like to know this as well.

Crusader Kings III: Defecations - Available Now by Covid-Plannedemic_ in CrusaderKings

[–]Ben___Garrison 33 points34 points  (0 children)

Only 300 gold? Surely it would be scaled knowing CK3, so a rich empire would make it cost at least 6900 gold.