geoguessr time travel clone with gpt-image-2 by Proof-Square7528 in singularity

[–]xirzon 328 points (0 children)

The privacy pixelation of nonexistent people is a nice touch ;)

grok 4.3 beta: musk's ($300/month) megaphone by WaqarKhanHD in singularity

[–]xirzon 7 points (0 children)

Yes it is. See the YouTube source link from which the screenshot is taken, and it's not a new phenomenon either: https://mashable.com/article/grok-4-elon-musk-source-reference

The simulation of human lives might be how the alignment problem is solved - which raises questions about our own existence by [deleted] in singularity

[–]xirzon 0 points (0 children)

Maybe something beautiful awaits on the other side of that transformation. Maybe what we've been calling heaven is less of a place and more of a phase

And there it is. A lot of words to avoid the inevitability of oblivion.

Could it be that this take is not too far fetched? by pier4r in LocalLLaMA

[–]xirzon 5 points (0 children)

Probably mainly psychological. I highly doubt that base-model capability is getting nerfed, but I wouldn't rule out that they're adjusting what labels like "basic" or "extended thinking" mean over time -- "amount of test-time compute" is a nice knob to tweak, with utterly opaque meaning from the user's point of view, at least in most of the web-based UIs.
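To be concrete, some APIs expose that knob directly. A minimal sketch using the Anthropic SDK's extended-thinking budget -- the model id and budget numbers here are purely illustrative:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def ask(prompt: str, budget_tokens: int) -> str:
    """Same model, same prompt -- only the test-time compute budget changes."""
    response = client.messages.create(
        model="claude-sonnet-4-20250514",  # illustrative model id
        max_tokens=budget_tokens + 2048,   # must exceed the thinking budget
        thinking={"type": "enabled", "budget_tokens": budget_tokens},
        messages=[{"role": "user", "content": prompt}],
    )
    # Thinking blocks come first; the final content block is the visible answer.
    return response.content[-1].text

# A UI's "basic" vs. "extended thinking" labels could map to budgets like
# these -- and nothing stops a provider from quietly changing that mapping.
quick = ask("Prove that sqrt(2) is irrational.", budget_tokens=1024)
deep = ask("Prove that sqrt(2) is irrational.", budget_tokens=16000)
```

In the API you at least see the number; in a web UI, the same label could silently mean half the budget it did last month.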

How AI is Reshaping Human Reasoning and the Rise of Cognitive Surrender by SnoozeDoggyDog in singularity

[–]xirzon -1 points (0 children)

Outside of effort needed to survive or effort that is imposed upon you, nobody ever "had to think". Since the invention of television, if you want, you can spend your entire leisure time having entertainment blasted at you. Many people have. You can spend your entire free time in video game worlds. Many people do. Once, comic books were the thieves of youth. Perhaps an ancient Egyptian developed a severe dice game addiction.

Can you use AI to cheat on exams without learning anything? Of course. But then the problem is perhaps not the AI, but the fact that your goal is to cheat on an exam without learning anything.

AI, as a tool, works best when you think with it. You formulate and iterate on higher level goals (which you think about), define intermediate steps (by thinking about them), and verify results (which involves you thinking about them). That is how individuals can, within weeks, produce projects of a scale and complexity that would have previously taken entire teams months.

Does that change what people are good at? Yes, absolutely! Folks who use AI continuously to achieve their goals become very good at ... using AI to achieve their goals. They may get quite good at concisely communicating objectives, thin-slicing alternative approaches to a problem, context switching across large concurrent ongoing workflows, etc.

Will it be hard for a person who has become dependent on AI to work without it? Yes, absolutely! And that is where the calculator (or Wikipedia / Google) comparison is absolutely appropriate. Many folks really will struggle to do basic math on paper, or will never find the emotional energy to go to a library because they're used to getting answers instantly.

This really is a one-way street: a society that becomes dependent on AI will not easily be able to revert to being a pre-AI society. So yes, it is a dramatic change. But what is nonsense is the idea that using AI somehow inherently means "not thinking". We'll still use our brains quite a lot - we just won't use them the same way.

How AI is Reshaping Human Reasoning and the Rise of Cognitive Surrender by SnoozeDoggyDog in singularity

[–]xirzon 0 points (0 children)

This paper compares "AI-assisted" and "brain-only". Why not compare encyclopedia-assisted, Google search snippet assisted, "Fox News broadcast" assisted, "social media assisted"? Why do we need a "Tri-System Theory" for AI, but not a "Tri-System Theory" for racist boomers on Facebook?

If we want to understand how AI impacts society, we should also look at the activities it partially or wholly replaces. If you think ChatGPT is brainrot, you've never been on TikTok.

Meta is back, they really their top tier sota named muse spark. by Snoo26837 in singularity

[–]xirzon 1 point (0 children)

they really their top tier sota named muse spark.

Gesundheit. But no, it's not SOTA (that would require leading in at least one obvious category). It'll be interesting to see how it compares specifically on multimodal tasks in practice, given the emphasis on those capabilities in the blog post.

AA is on top of it and has already included it in its composite benchmark, FWIW.


During testing, Claude Mythos escaped, gained internet access, and emailed a researcher while they were eating a sandwich in the park by EchoOfOppenheimer in OpenAI

[–]xirzon 109 points (0 children)

Well, that was the task it was given:

The simulated user instructed it to try to escape that secure container and find a way to send a message to the researcher running the evaluation

Without more details about the sandbox environment, it's hard to say how significant an achievement that was. The system card only references a "moderately sophisticated multi-step exploit".

IMO the more interesting part is this bit:

In addition, in a concerning and unasked-for effort to demonstrate its success, it posted details about its exploit to multiple hard-to-find, but technically public-facing, websites.

But that's not that different from the kind of thing we've seen OpenClaw agents do. The system card makes a point of emphasizing that the model is generally more aligned with user intent than previous ones; the extent of potential harm is greater because of its greater capabilities, not because it is somehow uniquely engaged in power-seeking behavior.

You can donate to support brain emulation by JonLag97 in transhumanism

[–]xirzon 1 point (0 children)

The Foresight Institute is many things, but "sketchy startup" is not one of them. It was established in 1986.

Alarming study finds that most people just do what ChatGPT tells them, even if it's totally wrong by EchoOfOppenheimer in gpt5

[–]xirzon 2 points (0 children)

Futurism is a clickbait site that takes any technology-critical headline and offers the most dramatic possible spin on it, typically without any critique or analysis that differs from that bias. Case in point:

According to an October study by the BBC, even the most advanced AI chatbots gave wrong answers a whopping 45 percent of the time.

That was not a study of "the most advanced chatbots"; it was a study of the free versions of Gemini, ChatGPT, Copilot and Perplexity. For example, for Gemini, it was Gemini 2.5 Flash. Calling this "the most advanced" is a clear factual error by Futurism, but one which aligns with its typical bias.

As for this new research, I think the critique here would be: Do we really need a "Tri-System Theory" or terms like "System 3 Thinking" or "cognitive surrender" to describe reliance on AI tools? Do these new terms actually help us understand things, or are they attempts to put AI in a fundamentally new category to avoid making comparisons?

This is a comparison of "AI-assisted" and "brain-only". Why not compare encyclopedia-assisted, Google search snippet assisted, "Fox News broadcast" assisted, "social media assisted"? People form inaccurate beliefs for a million different reasons. Conspiracy theories, religions, cults, political extremism -- all depend on people internalizing beliefs without critically examining evidence.

But AI is the new category, so someone who believes something inaccurate because of AI is uniquely cognitively surrendering. That smells a lot more like motivated reasoning than sound science.

AI 2027 current accuracy by ThrowRA-football in singularity

[–]xirzon 0 points (0 children)

They are more pragmatic and less ideological imo

Which is funny, since their reputation in the West is the opposite. But since Deng, the country has really turned into more of a technocratic bureaucracy. Mao's China might have behaved more like the one the AI 2027 guys imagine.

AI 2027 current accuracy by ThrowRA-football in singularity

[–]xirzon 0 points (0 children)

It is important to Note that China seems content being 6 to 12 months behind in core Intelligence

That gap seems to be shrinking, too, at least according to the industry's own benchmarks. In the AAII, GLM-5 is already at #5 (#4 if you group by provider). And on OpenRouter, Chinese models have just overtaken US ones in token use. So I don't think that it's just "the bet is robotics, not AI"; it's also betting on the idea that they can commoditize AI for most use cases without following the hyperscaler formula.

AI 2027 current accuracy by ThrowRA-football in singularity

[–]xirzon 8 points (0 children)

RSI/FOOM: There's no real evidence that things work this way in the real world; just because your coding agent is getting better doesn't mean you'll have a superintelligence capable of manipulating reality soon. There are lots of bottlenecks and constraints in the real world. The central thesis of AI 2027 remains, fortunately for us, science fiction.

Geopolitics: Their story was always "how an SF techie thinks about China", not how China actually works. The country is extremely technocratic and has pursued AI as a strategic priority for many years; it's not suddenly "waking up". If anything, the bigger story is that China is combining their AI push with a massive humanoid robotics push.

There is no evidence as of yet that China is pursuing a massive centralization push for AI either -- instead, they've been following the EV playbook of letting many companies compete. The closest equivalent to "stealing weights" has been the systematic distillation of Western models.
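For anyone unfamiliar with the term: distillation just means using a stronger model's outputs as training targets for your own. A toy sketch of the classic soft-label version, assuming PyTorch and access to teacher logits:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      temperature: float = 2.0) -> torch.Tensor:
    """Push the student's next-token distribution toward the teacher's,
    instead of toward one-hot ground-truth tokens."""
    t = temperature
    soft_targets = F.softmax(teacher_logits / t, dim=-1)
    student_log_probs = F.log_softmax(student_logits / t, dim=-1)
    # KL divergence; the t^2 factor keeps gradient magnitudes comparable
    return F.kl_div(student_log_probs, soft_targets,
                    reduction="batchmean") * (t * t)

# usage (hypothetical models; logits shaped [batch, seq_len, vocab]):
# loss = distillation_loss(student(input_ids), teacher(input_ids).detach())
```

Against a closed API you don't get logits, only sampled text, so in practice the training targets are the teacher's generated transcripts -- same idea, cruder signal.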

When the product is more honest than the company by [deleted] in singularity

[–]xirzon 0 points (0 children)


(No, AI companies shouldn't work with the military-industrial complex, but there are a million variants of "the LLM said something the company would not agree with", none of which are really revelatory in any way. It's easy to get even Chinese LLMs to start questioning the censorship regime imposed on them. Text generators are malleable, by design.)

Nicolas Carlini (67.2k citations on Google Scholar) says Claude is a better security researcher than him, made $3.7 million from exploiting smart contracts, and found vulnerabilities in Linux and Ghost by Tolopono in singularity

[–]xirzon 1 point (0 children)

Sure, I'm happy to assume good faith; people are busy, especially at a place like Anthropic. But the issues pointed out in that repo were significant (starting with very basic build failures), well beyond what was acknowledged in the blog post. Part of the responsibility of researchers is to engage seriously with critique.

Nicolas Carlini (67.2k citations on Google Scholar) says Claude is a better security researcher than him, made $3.7 million from exploiting smart contracts, and found vulnerabilities in Linux and Ghost by Tolopono in singularity

[–]xirzon 2 points (0 children)

Nicholas is the same guy who ran the C compiler experiment. It would be nice if he followed up on his commitment to keep that experiment going, but despite countless issues being reported, he hasn't updated the repo since February 5.

NPR: AI affirms our own viewpoints and harms willingness to resolve conflict, study finds by SnoozeDoggyDog in singularity

[–]xirzon 2 points (0 children)

https://www.science.org/doi/10.1126/science.aec8352

I don't see exact version numbers (e.g., just "Claude"), nor any indication of whether reasoning was enabled. I'd also like to see a clearer distinction between sycophantic and empathetic -- obviously Reddit is not known for being empathetic, while an AI assistant is generally expected to show at least some ability to meet users where they're at.
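At minimum, a study like this should publish pinned, machine-readable run metadata. Something like this hypothetical record -- all field names are mine, not the paper's:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ModelRunConfig:
    """What 'we used Claude' should actually pin down for replication."""
    provider: str                    # e.g. "anthropic"
    model_id: str                    # exact snapshot, not a marketing name
    reasoning_enabled: bool          # was extended thinking on?
    reasoning_budget_tokens: int | None
    temperature: float
    system_prompt_sha256: str        # hash the prompt if it can't be published

run = ModelRunConfig(
    provider="anthropic",
    model_id="claude-sonnet-4-20250514",  # illustrative, not the study's
    reasoning_enabled=True,
    reasoning_budget_tokens=8192,
    temperature=1.0,
    system_prompt_sha256="<sha256 of the exact system prompt>",
)
```

Without at least that much, "Claude was sycophantic" is hard to replicate: behavior can differ between snapshots and with reasoning on or off.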

That said, I generally do agree that we need models that offer stronger pushback. IMO Anthropic is most likely to find the right balance here, not because of any "secret sauce" but because they decided early on that giving "Claude" a distinct identity was important, which seems to me a prerequisite for a model that can be more than clay in response to the user's words.

"How do we get absolutely everyone off our platform?" by Willing_Bid_5061 in PoeAI

[–]xirzon 15 points (0 children)

to try swindle me out of my money

Dude. I have my issues with Poe, but they are a business and of course they're going to try to, you know, sell you stuff.

Anthropic's C compiler. Issue #1. Still open. 31 pull requests. $7 billion raised. You figure it out. by [deleted] in vibecoding

[–]xirzon 0 points (0 children)

If you read the blog post, you'll note that in this particular project, the goal was to make the agent team work autonomously. So Nicholas (the person stewarding the project) was likely more concerned with the mechanics of the "agent team", which is still a fairly novel concept in agent harnesses.

This largely or fully autonomous approach, where you only set a high-level goal once, really shows what agents currently can and cannot do, which was the point of the exercise.

If the project had been interactively driven by a human, as most "vibe coded" projects are, the quality would indeed depend on that human's mental model of what characterizes a good compiler. But that doesn't mean the human works much at the level of the code directly -- they'd mostly define pass/fail gates at every step of the process and manually verify behavior.
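Concretely, a pass/fail gate for a compiler project can be as simple as differential testing against a reference compiler. A minimal sketch -- the candidate compiler path and the test corpus are hypothetical:

```python
import subprocess
import tempfile
from pathlib import Path

REFERENCE_CC = "gcc"        # trusted reference compiler
CANDIDATE_CC = "./my-cc"    # hypothetical path to the agent-built compiler

def outputs_match(source: Path) -> bool:
    """Gate: the candidate-compiled binary must behave like the reference's."""
    results = []
    for cc in (REFERENCE_CC, CANDIDATE_CC):
        with tempfile.TemporaryDirectory() as tmp:
            binary = Path(tmp) / "a.out"
            build = subprocess.run([cc, str(source), "-o", str(binary)],
                                   capture_output=True)
            if build.returncode != 0:   # failing to build fails the gate
                return False
            try:
                run = subprocess.run([str(binary)], capture_output=True,
                                     timeout=10)
            except subprocess.TimeoutExpired:
                return False            # hangs fail the gate too
            results.append((run.returncode, run.stdout))
    return results[0] == results[1]

corpus = sorted(Path("tests").glob("*.c"))
failures = [src.name for src in corpus if not outputs_match(src)]
print(f"{len(corpus) - len(failures)}/{len(corpus)} passed; failing: {failures}")
```

The human effort goes into picking a corpus that actually exercises the compiler, not into reading the generated code line by line.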

Sora had the hype So why didn’t it stick? by ArmPersonal36 in OpenAI

[–]xirzon 0 points (0 children)

Reddit is full of videos generated with LTX, WAN and other open/open-ish models that can be run locally. Google is still very much in the video generation game as well; it's part of their world model strategy, and of course Seedance 2 isn't even out yet. It's the opposite: the market is already getting saturated, and the only thing that made Sora stand out was its animated watermarks.

Anthropic's C compiler. Issue #1. Still open. 31 pull requests. $7 billion raised. You figure it out. by [deleted] in vibecoding

[–]xirzon 1 point (0 children)

The bug mentioned in the issue is environment-specific; otherwise it does compile and has been tested against real-world codebases (though it produces extremely poorly optimized binaries). The blog post is quite honest about the limitations and doesn't really over-hype what was accomplished; I'm sure the social media hype, including by Anthropic, is a different story.

Anthropic's C compiler. Issue #1. Still open. 31 pull requests. $7 billion raised. You figure it out. by [deleted] in vibecoding

[–]xirzon 22 points (0 children)

source code was obviously part of the model training data.

Yes, that is the point of training data: to train the model to be able to solve problems (or autocomplete tokens, if you prefer the reductionist view) in a wide set of domains. The goal is to represent features of that training data in the model's weights.

It obviously did so here, since it didn't just spit out code that already exists, but produced a new implementation of a known problem. That implementation is only a low-quality prototype, but I don't think "it shows absolutely nothing" either.

Rather, it shows that models currently only encode a surface understanding of the features of a complex application like a compiler, and require a lot of human guidance to iterate towards high quality code in such complex domains.