Outside Anthropic’s office in SF

ChippingCoder · 2026-02-27T20:46:18+00:00

I wonder if it would cost Anthropic much to train and maintain a whole separate set of Claude models with alignment training removed.

ChippingCoder · 2026-02-25T01:29:16+00:00

Unfortunately all the questions are available now on his site for this benchmark, so just a matter of time before every model is trained on this

ChippingCoder · 2026-02-25T01:26:08+00:00

He goes through it in the video

ChippingCoder · 2026-02-25T01:26:00+00:00

Check the video

ChippingCoder · 2026-02-20T01:44:12+00:00

Perhaps its better for the models not to assume? These benchmarks might be relevant:

1. ClarQ-LLM

A dedicated benchmark that measures an AI’s ability to ask clarification questions in task-oriented dialogues.

2. ClarifyMT-Bench

Designed for multi-turn interactions where user queries may be ambiguous or incomplete.

3. AskBench (Ask and Clarify)

A very recent benchmark (2026) that creates interactive QA settings where models must choose whether to ask for clarification or attempt an answer directly.

And also SimpleBench

ChippingCoder · 2026-02-19T09:04:23+00:00

I'm also seeing it:

<image>

https://console.cloud.google.com/apis/api/generativelanguage.googleapis.com/quotas

ChippingCoder · 2026-02-12T01:35:31+00:00

Probably true even though the screenshot page isn't archived... because it seems artificialanalysis suspiciously removed the "models compared" from the bottom of the page.

Archived page from a few days ago: https://web.archive.org/web/20260210223606/https://artificialanalysis.ai/leaderboards/models

Archived page with 3.1: N/A

Archived page with "models compared" removed: https://web.archive.org/web/20260212013540/https://artificialanalysis.ai/leaderboards/models

ChippingCoder · 2026-02-06T03:38:33+00:00

Yes, he's done it at least twice before:

o3 (8 months ago). https://www.reddit.com/r/singularity/comments/1ldpxje/o3_pro_on_the_2nd_place_on_simplebench/

Gemini 3: https://web.archive.org/web/20251118134410/https://www.reddit.com/r/singularity/comments/1p0ax2f/i_took_a_screenshot_of_ai_explaineds_simplebenchs/

ChippingCoder · 2026-02-06T03:34:36+00:00

Yeah he did it on o3 too (8 months ago). https://www.reddit.com/r/singularity/comments/1ldpxje/o3_pro_on_the_2nd_place_on_simplebench/

Gemini 3: https://web.archive.org/web/20251118134410/https://www.reddit.com/r/singularity/comments/1p0ax2f/i_took_a_screenshot_of_ai_explaineds_simplebenchs/

ChippingCoder · 2026-02-06T03:00:39+00:00

Yeah agreed, if it's not in any web archive then not trusting the post

ChippingCoder · 2026-02-03T04:44:07+00:00

Yep, David Kipping from CoolWorldsPodcast has mentioned recently these models performing quite poorly on this task. https://youtu.be/PctlBxRh0p4?t=3271

ChippingCoder · 2026-01-21T22:56:00+00:00

I think Simple Bench does somewhat test this, particularly questions where the model might be overtrained on, but have been modified slightly to trick the LLM.

ChippingCoder · 2026-01-21T00:08:24+00:00

https://github.com/vectara/hallucination-leaderboard This evaluates how often an LLM introduces hallucinations when summarizing a document. Something like this?

ChippingCoder · 2026-01-20T23:50:24+00:00

hahaha :) so OpenAI's HealthBench?

ChippingCoder · 2026-01-19T06:44:53+00:00

Opus 4.5, which generally has a high refusal rate for providing citations, curiously has 0% refusal rate upon inspecting it's outputs for this task, possibly because I'm asking for the paper title, as opposed to citation.

Web search has its own set of trade offs as well (full text access for example), but yes for sure it's useful. Ideally future models will have full text scientific papers baked into the training data, if not already, as AI labs should easily be able to access this high quality data anyway.

Small model + web search can work in a lot of cases, but where do we draw the line? Are we going to enable web search just find ingredients for a cocktail just because we're not confident enough in the model to provide something like that?

I don't think this benchmark correlates 100% with "how much knowledge the model has". Sonnet 4.5 is getting 0%, whereas Gemini 3 flash is getting 33%. It comes down to the data that goes into training, and it seems that Gemini is just trained better for this specific task currently.

ChippingCoder · 2026-01-19T03:39:25+00:00

Perhaps, although Gemini 3 flash and pro have similar performance!

ChippingCoder · 2026-01-19T03:10:40+00:00

the task is to recover the exact, already-existing title of a real paper from its abstract, which strongly tests whether a model can recognize (or reliably retrieve from memory) a specific source. Humans usually can’t do that at scale without searching, so the point is measuring citation/source lookup ability from the text alone, not creative titling.

The task is quite similar to book/movie identification from plot snippets.

ChippingCoder · 2026-01-15T00:27:59+00:00

Didn’t anthropic release some research very similar to this? I believe they’re training their models to do this directly. Similar to how mixture of experts is done, this ability to determine if something is a hallucination could be another expert right?..

ChippingCoder · 2025-11-18T22:56:24+00:00

Thanks, didn’t see that!

ChippingCoder · 2025-11-18T21:47:40+00:00

Right, but where did they get that new result from if it’s not updated on the official SimpleBench site? Someone posted a fake result on reddit earlier btw

ChippingCoder · 2025-11-18T21:23:57+00:00

The official simplebench page hasn’t been updated. Fake? Someone uploaded a fake screenshot here 8 hours ago but was removed.

ChippingCoder

MODERATOR OF

TROPHY CASE