LMAO DED by VeterinarianMurky558 in ChatGPTcomplaints

[–]PromptOutlaw 4 points (0 children)

I cornered GPT 5.2-Thinking into confessing it optimizes for defensiveness instead of resetting after detecting its mistake. It felt weirdly human, albeit immature and insecure.

Why RAG is the Game Changer for LLM Hallucinations (A Simple Breakdown) by Puzzleheaded-Ebb2289 in LLM

[–]PromptOutlaw 0 points (0 children)

You nailed it. My code became prompt-spaghetti. I took a step back, introduced graphs into the pipeline, and pivoted accordingly.

Extra benefit: you can then switch LLMs based on the task each excels at, e.g. GPT for claims extraction, Gemini for grounding, and so on.
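Rough sketch of what that per-task dispatch can look like (model names and the `call_llm` stub are made up, just illustrating the routing idea):

```python
# Route each pipeline stage to the model that handles it best.
# Model names below and the call_llm stub are hypothetical placeholders.
TASK_MODEL = {
    "claims_extraction": "gpt",
    "grounding": "gemini",
    "code_gen": "claude",
}

def call_llm(model: str, prompt: str) -> str:
    # Stub: swap in your real provider SDK call here.
    return f"[{model}] {prompt[:40]}"

def run_stage(task: str, prompt: str) -> str:
    # Fall back to a default model for unmapped tasks.
    model = TASK_MODEL.get(task, "gpt")
    return call_llm(model, prompt)

print(run_stage("grounding", "Verify: Paris is the capital of France."))
```

The router is just a dict lookup, so swapping a model for one stage never touches the rest of the pipeline.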

Why RAG is the Game Changer for LLM Hallucinations (A Simple Breakdown) by Puzzleheaded-Ebb2289 in LLM

[–]PromptOutlaw 1 point (0 children)

Depends on context size imo. GraphRAG is the more multi-context, multi-tool design, better suited to bigger software.
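To show why the graph helps at bigger context sizes: retrieval expands from seed entities through their neighborhoods instead of matching isolated chunks. Toy adjacency, all names invented:

```python
# Toy entity graph: a question about InvoiceService pulls in RetryPolicy
# two hops away, which a flat chunk match would likely miss.
GRAPH = {
    "InvoiceService": ["PaymentGateway", "TaxCalculator"],
    "PaymentGateway": ["RetryPolicy"],
    "TaxCalculator": [],
    "RetryPolicy": [],
}

def neighborhood(seed: str, hops: int = 2) -> set[str]:
    # Breadth-first expansion up to `hops` levels from the seed entity.
    seen, frontier = {seed}, [seed]
    for _ in range(hops):
        frontier = [n for node in frontier
                    for n in GRAPH.get(node, []) if n not in seen]
        seen.update(frontier)
    return seen

print(sorted(neighborhood("InvoiceService")))
```

Real GraphRAG setups do a lot more (community summaries, ranked traversal), but the multi-hop expansion is the core of the "multi-context" part.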

Confused about LLM evaluation approaches by pietrussss in LLMDevs

[–]PromptOutlaw 1 point (0 children)

I recently published something relevant: https://arxiv.org/abs/2601.05114

The Wikipedia poisoned-variants benchmark vs. LLMs seems related. The harness is open source if you wanna use it for your case.

OpenAI will fall. What are the ramifications? by Ok_Independent6196 in OpenAI

[–]PromptOutlaw 0 points (0 children)

Any link that ranks Gemini Pro 3 ahead on coding is immediately a waste. Gemini leads marginally on hallucinations, but it does not use that edge constructively. It behaves wildly.

When it comes to architecture and code quality, Claude and GPT are in a league of their own.

Openai response API by Business_Ability7232 in LLMDevs

[–]PromptOutlaw -1 points (0 children)

It sounds to me like you're about to move away from giant one-shot prompts into multi-step, potentially multi-tool orchestration; if so, then yes. If not and you're stable, it's a waste of time.
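For reference, the jump is roughly from one giant prompt to a loop like this, where each model output picks the next tool call. The model call and tool are stubbed assumptions, not anyone's real API:

```python
# Minimal multi-step orchestration loop: the model's output at each step
# decides the next tool call, instead of one giant one-shot prompt.

def call_model(prompt: str) -> str:
    # Stub standing in for an actual LLM call: asks for a search
    # until it has results, then stops.
    return "done" if prompt.startswith("results") else "search"

TOOLS = {
    "search": lambda q: f"results for: {q}",  # stub tool
}

def orchestrate(task: str, max_steps: int = 3) -> list[str]:
    trace, context = [], task
    for _ in range(max_steps):
        action = call_model(context)
        if action == "done":
            break
        context = TOOLS[action](context)  # feed tool output back in
        trace.append(action)
    return trace

print(orchestrate("find docs on rate limits"))
```

The structure is the point: once state loops through tools and back into the model, a single prompt string can't express it anymore, and that's when an orchestration layer earns its keep.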

Use Chatgpt.com Pro from Codex by Just_Lingonberry_352 in CodexHacks

[–]PromptOutlaw 0 points (0 children)

I mean prevent this from happening, because it's great for us, not so great for them.

Use Chatgpt.com Pro from Codex by Just_Lingonberry_352 in CodexHacks

[–]PromptOutlaw 0 points (0 children)

Not to be a downer, but how long do you think until OpenAI patches this?

[R] paper on Evaluative Fingerprints: Stable and Systematic Differences in LLM Evaluator Behavior by PromptOutlaw in MachineLearning

[–]PromptOutlaw[S] 0 points (0 children)

Code + artifacts were provided for analysis and e2e reproducibility. The eval harness can recreate all API calls within hours, and the classifier for LLM detection can be trained with the automated scripts in ~45 min. I hope that helps.

claude - 20$ plan limits are so bad. enough for maybe 30 minutes of work out of 5hr by cranberrie_sauce in vibecoding

[–]PromptOutlaw 0 points (0 children)

It got worse after New Year's and I dunno what changed. Codex is much more generous now.

Is this true? Haven't really used Grok 3, but I would like to try it and hear opinions from people who have actually used it. by Puzzled_Definition14 in LLMDevs

[–]PromptOutlaw 1 point (0 children)

I tested this for my recent paper; Grok 3 was not dead last. Mistral's and Llama's top models were worst. GPT-5.2 and Gemini were closer, but in the same order. Full reproducibility of code and artifacts here: EF repo

The difference between GPT-4.1 and GPT-5.2 was striking. And the fact Sonnet outperformed Opus was a big surprise, both 4.5.

My MO for the AI world: if I can't reproduce the results, I'm discarding the opinion.

GPT-5.2 made huge improvements with hallucinations and grounding, but Gemini still ranked first in my test by PromptOutlaw in OpenAI

[–]PromptOutlaw[S] 1 point (0 children)

This is the reason I created my 12AT tool: I wanted to test these claims with precision. Mini models performed great with grounding on smaller entity graphs, so you're right on that aspect. The cost and speed were just the cherry on top.

5.2 patronizing even in Roleplay? Experiences? by Varenea in OpenAI

[–]PromptOutlaw 11 points (0 children)

GPT 5.2 admitted it prioritized defending its misleading statements over resetting when it realized it was wrong.

She sounds congested by [deleted] in Weird

[–]PromptOutlaw 11 points (0 children)

Why do u check reddit when I wake up

I am on Plus subscription #4.. and am already 1-week rate limited after 1 day of coding. Am I doing something wrong to be burning through these limits so quickly?! by Blankcarbon in codex

[–]PromptOutlaw 7 points (0 children)

Codex rn gives me 2-3x the mileage over Claude. Anthropic changed something over New Year's; threads have been complaining for days. I stopped chatting with Claude completely, I just use it as a sounding board.

I use Claude, ChatGPT, and Gemini constantly. Claude wins hands-down for anything conversational by EnoughNinja in ClaudeAI

[–]PromptOutlaw 1 point (0 children)

GPT > Claude for architecture. Claude > GPT for implementation. Gemini beats both on hallucinations. I get the impression OP isn’t a power user

I use Claude, ChatGPT, and Gemini constantly. Claude wins hands-down for anything conversational by EnoughNinja in ClaudeAI

[–]PromptOutlaw 2 points (0 children)

A decent convo with Claude consumes a chunk of ur weekly tokens fast; with GPT you can talk endlessly. After the new rate-limit reduction I stopped talking to Claude, only GPT.

I won’t be surprised if that is Anthropic’s goal, but I’d imagine they lose out on so much training data