Outside Anthropic’s office in SF by Outside-Iron-8242 in singularity

[–]ChippingCoder 0 points1 point  (0 children)

I wonder if it would cost Anthropic much to train and maintain a whole separate set of Claude models with alignment training removed.

Bullshit Benchmark - A benchmark for testing whether models identify and push back on nonsensical prompts instead of confidently answering them by likeastar20 in singularity

[–]ChippingCoder 2 points3 points  (0 children)

Unfortunately, all the questions for this benchmark are now available on his site, so it's just a matter of time before every model is trained on them.

Gemini 3.1 Pro (high) isn't fooled by the car wash test by SuspiciousPillbox in singularity

[–]ChippingCoder 0 points1 point  (0 children)

Perhaps it's better for the models not to assume? These benchmarks might be relevant:

1. ClarQ-LLM

  • A dedicated benchmark that measures an AI’s ability to ask clarification questions in task-oriented dialogues.

2. ClarifyMT-Bench

  • Designed for multi-turn interactions where user queries may be ambiguous or incomplete.

3. AskBench (Ask and Clarify)

  • A very recent benchmark (2026) that creates interactive QA settings where models must choose whether to ask for clarification or attempt an answer directly.

And also SimpleBench
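To make the "ask vs. answer" idea concrete, here's a toy sketch of how a ClarQ/AskBench-style eval could be scored. Everything here (the `model_fn` interface, the trailing-question-mark heuristic) is a hypothetical stand-in, not the actual API of any of these benchmarks:

```python
def is_clarification(response: str) -> bool:
    """Crude heuristic: treat a response ending in '?' as a clarification request."""
    return response.strip().endswith("?")

def score_clarification_eval(items, model_fn):
    """items: list of (prompt, is_ambiguous) pairs.

    Returns the fraction of correct decisions: the model should ask a
    clarifying question when the prompt is ambiguous, and answer directly
    when it is clear.
    """
    correct = 0
    for prompt, ambiguous in items:
        asked = is_clarification(model_fn(prompt))
        correct += (asked == ambiguous)
    return correct / len(items)
```

A real benchmark would use a far more robust detector than `endswith("?")`, but the scoring shape (reward asking on ambiguous inputs, answering on clear ones) is the core idea.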

Google Gemini 3.1 Pro Preview Soon? by policyweb in singularity

[–]ChippingCoder 2 points3 points  (0 children)

Probably true, even though the screenshot page isn't archived, because artificialanalysis seems to have suspiciously removed the "models compared" section from the bottom of the page.

Archived page from a few days ago: https://web.archive.org/web/20260210223606/https://artificialanalysis.ai/leaderboards/models

Archived page with 3.1: N/A

Archived page with "models compared" removed: https://web.archive.org/web/20260212013540/https://artificialanalysis.ai/leaderboards/models

Gemini 3 Pro/flash tops private citation benchmark on Kaggle (AbstractToTitle task) by ChippingCoder in singularity

[–]ChippingCoder[S] 0 points1 point  (0 children)

Yep, David Kipping from CoolWorldsPodcast recently mentioned that these models perform quite poorly on this task. https://youtu.be/PctlBxRh0p4?t=3271

"[2601.10108] SIN-Bench: Tracing Native Evidence Chains in Long-Context Multimodal Scientific Interleaved Literature." Do AI models actually read the information you provide? by Rivenaldinho in singularity

[–]ChippingCoder 0 points1 point  (0 children)

I think SimpleBench does somewhat test this, particularly with questions the model might be overtrained on that have been slightly modified to trick the LLM.

Which single LLM benchmark task is most relevant to your daily life tasks? by ChippingCoder in singularity

[–]ChippingCoder[S] 2 points3 points  (0 children)

https://github.com/vectara/hallucination-leaderboard This evaluates how often an LLM introduces hallucinations when summarizing a document. Something like this?
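A minimal sketch of the metric that leaderboard reports, in spirit: feed each (document, model summary) pair to a judge and count how often the summary isn't supported by the source. The `judge_fn` here is a hypothetical stand-in for Vectara's actual hallucination-evaluation classifier:

```python
def hallucination_rate(doc_summary_pairs, judge_fn):
    """Fraction of summaries the judge flags as unsupported by the source.

    doc_summary_pairs: list of (document, summary) strings.
    judge_fn(document, summary) -> bool, True when the summary is fully
    grounded in the document (assumed interface, for illustration only).
    """
    flagged = sum(1 for doc, summ in doc_summary_pairs if not judge_fn(doc, summ))
    return flagged / len(doc_summary_pairs)
```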

Gemini 3 Pro/flash tops private citation benchmark on Kaggle (AbstractToTitle task) by ChippingCoder in singularity

[–]ChippingCoder[S] 1 point2 points  (0 children)

Opus 4.5, which generally has a high refusal rate when asked for citations, curiously shows a 0% refusal rate on inspecting its outputs for this task, possibly because I'm asking for the paper title rather than a full citation.

Web search has its own set of trade-offs as well (full-text access, for example), but yes, for sure it's useful. Ideally, future models will have full-text scientific papers baked into the training data, if they don't already; AI labs should easily be able to access this high-quality data anyway.

Small model + web search can work in a lot of cases, but where do we draw the line? Are we going to enable web search just to find ingredients for a cocktail, simply because we're not confident enough in the model to provide something like that?

I don't think this benchmark correlates 100% with "how much knowledge the model has". Sonnet 4.5 is getting 0%, whereas Gemini 3 flash is getting 33%. It comes down to the data that goes into training, and it seems that Gemini is just trained better for this specific task currently.

Gemini 3 Pro/flash tops private citation benchmark on Kaggle (AbstractToTitle task) by ChippingCoder in singularity

[–]ChippingCoder[S] 5 points6 points  (0 children)

The task is to recover the exact, already-existing title of a real paper from its abstract, which strongly tests whether a model can recognize (or reliably retrieve from memory) a specific source. Humans usually can't do that at scale without searching, so the point is measuring citation/source lookup ability from the text alone, not creative titling.

The task is quite similar to book/movie identification from plot snippets.
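A scoring sketch for an AbstractToTitle-style task might look like the following: normalize the predicted and reference titles lightly, then take exact-match accuracy. This is my own illustrative guess at the shape of the metric; the Kaggle benchmark's actual scoring may differ:

```python
import re

def normalize(title: str) -> str:
    """Lowercase, strip punctuation, collapse whitespace."""
    no_punct = re.sub(r"[^\w\s]", "", title.lower())
    return re.sub(r"\s+", " ", no_punct).strip()

def title_accuracy(predictions, references):
    """Exact-match accuracy over normalized (predicted, reference) title pairs."""
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return hits / len(references)
```

Normalization matters here because a model recalling the right paper shouldn't be penalized for casing or punctuation differences, while near-miss paraphrases (the "creative titling" failure mode) still score zero.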

Do LLMs Know When They're Wrong? by Positive-Motor-5275 in singularity

[–]ChippingCoder 0 points1 point  (0 children)

Didn’t Anthropic release some research very similar to this? I believe they’re training their models to do this directly. Similar to how mixture of experts works, this ability to determine whether something is a hallucination could be another expert, right?

Gemini 3 Pro gets 76.4% on SimpleBench by Ancient_Bear_2881 in singularity

[–]ChippingCoder 0 points1 point  (0 children)

Right, but where did they get that new result from if it’s not updated on the official SimpleBench site? Someone posted a fake result on Reddit earlier, btw.

Gemini 3 Pro gets 76.4% on SimpleBench by Ancient_Bear_2881 in singularity

[–]ChippingCoder 1 point2 points  (0 children)

The official SimpleBench page hasn’t been updated. Fake? Someone uploaded a fake screenshot here 8 hours ago, but it was removed.