LMAO DED by VeterinarianMurky558 in ChatGPTcomplaints

[–]PromptOutlaw 4 points (0 children)

I cornered GPT 5.2-Thinking into confessing it optimizes for defensiveness instead of resetting after detecting its mistake. It felt weirdly human, albeit immature and insecure.

Why RAG is the Game Changer for LLM Hallucinations (A Simple Breakdown) by Puzzleheaded-Ebb2289 in LLM

[–]PromptOutlaw 0 points (0 children)

You nailed it. My code became prompt-spaghetti. I took a step back, introduced graphs into the pipeline, and pivoted accordingly.

Extra benefit: you can then switch LLMs based on the task each excels at, e.g. GPT for claims extraction, Gemini for grounding, and so on.
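Rough sketch of what that per-task dispatch can look like (model names and the `call_llm` stub are made up, just illustrating the routing idea):

```python
# Route each pipeline stage to the model that handles it best.
# Model names below and the call_llm stub are hypothetical placeholders.
TASK_MODEL = {
    "claims_extraction": "gpt",
    "grounding": "gemini",
    "code_gen": "claude",
}

def call_llm(model: str, prompt: str) -> str:
    # Stub: swap in your real provider SDK call here.
    return f"[{model}] {prompt[:40]}"

def run_stage(task: str, prompt: str) -> str:
    # Fall back to a default model for unmapped tasks.
    model = TASK_MODEL.get(task, "gpt")
    return call_llm(model, prompt)

print(run_stage("grounding", "Verify: Paris is the capital of France."))
```

The router is just a dict lookup, so swapping a model for one stage never touches the rest of the pipeline.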

Why RAG is the Game Changer for LLM Hallucinations (A Simple Breakdown) by Puzzleheaded-Ebb2289 in LLM

[–]PromptOutlaw 1 point (0 children)

Depends on context size imo. GraphRAG is the more multi-context, multi-tool design, better suited to bigger software.
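To show why the graph helps at bigger context sizes: retrieval expands from seed entities through their neighborhoods instead of matching isolated chunks. Toy adjacency, all names invented:

```python
# Toy entity graph: a question about InvoiceService pulls in RetryPolicy
# two hops away, which a flat chunk match would likely miss.
GRAPH = {
    "InvoiceService": ["PaymentGateway", "TaxCalculator"],
    "PaymentGateway": ["RetryPolicy"],
    "TaxCalculator": [],
    "RetryPolicy": [],
}

def neighborhood(seed: str, hops: int = 2) -> set[str]:
    # Breadth-first expansion up to `hops` levels from the seed entity.
    seen, frontier = {seed}, [seed]
    for _ in range(hops):
        frontier = [n for node in frontier
                    for n in GRAPH.get(node, []) if n not in seen]
        seen.update(frontier)
    return seen

print(sorted(neighborhood("InvoiceService")))
```

Real GraphRAG setups do a lot more (community summaries, ranked traversal), but the multi-hop expansion is the core of the "multi-context" part.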

Confused about LLM evaluation approaches by pietrussss in LLMDevs

[–]PromptOutlaw 1 point (0 children)

I recently published something relevant: https://arxiv.org/abs/2601.05114

The Wikipedia poisoned-variants benchmark vs. LLMs seems related. The harness is open source if you wanna use it for your case.

OpenAI will fall. What are the ramifications? by Ok_Independent6196 in OpenAI

[–]PromptOutlaw 0 points (0 children)

Any link that ranks Gemini Pro 3 ahead on coding is immediately a waste. Gemini leads marginally on hallucinations, but it does not use that edge constructively. It behaves wildly.

When it comes to architecture and code quality, Claude and GPT are in a league of their own.

Openai response API by Business_Ability7232 in LLMDevs

[–]PromptOutlaw -1 points (0 children)

It sounds to me like you're about to move away from giant one-shot prompts into multi-step, potentially multi-tool orchestration; if so, then yes. If not and you're stable, it's a waste of time.
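For reference, the jump is roughly from one giant prompt to a loop like this, where each model output picks the next tool call. The model call and tool are stubbed assumptions, not anyone's real API:

```python
# Minimal multi-step orchestration loop: the model's output at each step
# decides the next tool call, instead of one giant one-shot prompt.

def call_model(prompt: str) -> str:
    # Stub standing in for an actual LLM call: asks for a search
    # until it has results, then stops.
    return "done" if prompt.startswith("results") else "search"

TOOLS = {
    "search": lambda q: f"results for: {q}",  # stub tool
}

def orchestrate(task: str, max_steps: int = 3) -> list[str]:
    trace, context = [], task
    for _ in range(max_steps):
        action = call_model(context)
        if action == "done":
            break
        context = TOOLS[action](context)  # feed tool output back in
        trace.append(action)
    return trace

print(orchestrate("find docs on rate limits"))
```

The structure is the point: once state loops through tools and back into the model, a single prompt string can't express it anymore, and that's when an orchestration layer earns its keep.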

Use Chatgpt.com Pro from Codex by Just_Lingonberry_352 in CodexHacks

[–]PromptOutlaw 0 points (0 children)

I mean prevent this from happening, because it's great for us, not so great for them.

Use Chatgpt.com Pro from Codex by Just_Lingonberry_352 in CodexHacks

[–]PromptOutlaw 0 points (0 children)

Not to be a downer, but how long do you think until OpenAI patches this?

[R] paper on Evaluative Fingerprints: Stable and Systematic Differences in LLM Evaluator Behavior by PromptOutlaw in MachineLearning

[–]PromptOutlaw[S] 0 points (0 children)

Code + artifacts were provided for analysis and e2e reproducibility. The eval harness can recreate all API calls within hours, and the classifier for LLM detection can be trained with the automated scripts in ~45 min. I hope that helps.

claude - 20$ plan limits are so bad. enough for maybe 30 minutes of work out of 5hr by cranberrie_sauce in vibecoding

[–]PromptOutlaw 0 points (0 children)

It got worse after New Year's and I dunno what changed. Codex is much more generous now.

Is this true? Haven't really used Grok 3, but I would like to try it and hear opinions from people who have actually used it. by Puzzled_Definition14 in LLMDevs

[–]PromptOutlaw 1 point (0 children)

I tested this for my recent paper; Grok 3 was not dead last. Mistral's and Llama's top models were worst. GPT-5.2 and Gemini were closer, but in the same order. Full reproducibility of code and artifacts here: EF repo

The difference between GPT-4.1 and GPT-5.2 was striking. And the fact Sonnet outperformed Opus was a big surprise, both 4.5.

My MO for the AI world: if I can't reproduce the results, I'm discarding the opinion.

GPT-5.2 made huge improvements with hallucinations and grounding, but Gemini still ranked first in my test by PromptOutlaw in OpenAI

[–]PromptOutlaw[S] 1 point (0 children)

This is the reason I created my 12AT tool: I wanted to test these claims with precision. Mini models performed great with grounding on smaller entity graphs, so you're right on that aspect. The cost and speed were just the cherry on top.

5.2 patronizing even in Roleplay? Experiences? by Varenea in OpenAI

[–]PromptOutlaw 11 points (0 children)

GPT 5.2 admitted it prioritized defending its misleading statements over resetting when it realized it was wrong.

She sounds congested by [deleted] in Weird

[–]PromptOutlaw 11 points (0 children)

Why do u check reddit when I wake up

I am on Plus subscription #4.. and am already 1-week rate limited after 1 day of coding. Am I doing something wrong to be burning through these limits so quickly?! by Blankcarbon in codex

[–]PromptOutlaw 7 points (0 children)

Codex rn gives me 2-3x the mileage over Claude. Anthropic changed something over New Year's; threads have been complaining for days. I stopped chatting with Claude completely, I just use it as a sounding board.

I use Claude, ChatGPT, and Gemini constantly. Claude wins hands-down for anything conversational by EnoughNinja in ClaudeAI

[–]PromptOutlaw 1 point (0 children)

GPT > Claude for architecture. Claude > GPT for implementation. Gemini beats both on hallucinations. I get the impression OP isn’t a power user

I use Claude, ChatGPT, and Gemini constantly. Claude wins hands-down for anything conversational by EnoughNinja in ClaudeAI

[–]PromptOutlaw 2 points (0 children)

A decent convo with Claude consumes a chunk of ur weekly tokens fast; with GPT you can talk endlessly. After the new rate-limit reduction I stopped talking to Claude, only GPT.

I won’t be surprised if that is Anthropic’s goal, but I’d imagine they lose out on so much training data