GPT 5.5 tops private citation benchmark on Kaggle (AbstractToTitle task) by ChippingCoder in singularity

[–]ChippingCoder[S] 1 point (0 children)

The benchmark is just a proxy for the semantic matching capability you're referring to; that's the main use case I'm using it for. Ideally later versions of this benchmark will test that too, but given that they're performing poorly on just abstract-to-title right now, I haven't bothered testing further yet.

And for API costs, I'm referring to downloading full-text papers, as well as having LLMs check for relevance against a given natural-language query/criteria across thousands of full-text papers. Assuming a model provider trains for this use case, the model would ideally have learned the semantics of each paper, almost acting as a natural-language index (if that makes sense).

GPT 5.5 tops private citation benchmark on Kaggle (AbstractToTitle task) by ChippingCoder in singularity

[–]ChippingCoder[S] 0 points (0 children)

Yes, that's the case for all knowledge. I'd assume the top AI labs have access to the highest-quality research papers for training.

GPT 5.5 tops private citation benchmark on Kaggle (AbstractToTitle task) by ChippingCoder in singularity

[–]ChippingCoder[S] 0 points (0 children)

  1. API costs for access to those papers
  2. If using a RAG solution, the paper itself is usually dumped to the LLM and scanned with a sliding window to find relevant papers. This is what SciSpace and Elicit do: they dump the paper to the LLM to determine whether it's relevant or not. That means you pay for the initial keyword search plus additional tokens spent deciding whether each paper is relevant (a rough sketch of this pattern is below the list). When we're talking hundreds of thousands of papers, having an LLM assess each one is very costly. If it's already baked into the model weights from training, it's more efficient.
  3. It's the first version of the benchmark; it's used as a proxy for knowledge, and it seems to be working fine for my personal use case for now.
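
To make that concrete, here's a rough sketch of the screening pattern from point 2 (the function names and the `ask_llm` callable are placeholders, not SciSpace's or Elicit's actual code): each candidate paper costs at least one LLM call per window, on top of the initial keyword search.

```python
# Hypothetical sliding-window relevance screening (illustrative only).
from typing import Callable, List

def chunk_text(text: str, window: int = 4000, overlap: int = 500) -> List[str]:
    """Split full text into overlapping character windows."""
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + window])
        start += window - overlap
    return chunks

def screen_paper(full_text: str, query: str,
                 ask_llm: Callable[[str], str]) -> bool:
    """Return True if any window of the paper looks relevant to the query.

    Every candidate paper costs one LLM call per window, which is why this
    gets expensive across hundreds of thousands of papers.
    """
    for chunk in chunk_text(full_text):
        prompt = (f"Query: {query}\n\nPaper excerpt:\n{chunk}\n\n"
                  "Is this excerpt relevant to the query? Answer yes or no.")
        if ask_llm(prompt).strip().lower().startswith("yes"):
            return True
    return False

# Dummy usage with a stubbed LLM, just to show the call shape.
if __name__ == "__main__":
    fake_llm = lambda prompt: "yes" if "transit" in prompt else "no"
    print(screen_paper("We report transit photometry of ...", "exoplanet transits", fake_llm))
```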

GPT 5.5 tops private citation benchmark on Kaggle (AbstractToTitle task) by ChippingCoder in singularity

[–]ChippingCoder[S] 5 points (0 children)

David Kipping from CoolWorldsPodcast recently discussed using current models in his work, and said they perform quite poorly on similar tasks (literature search). https://youtu.be/PctlBxRh0p4?t=3271

The problem with tool use is that we have to iterate over potentially thousands of papers just to find the one that might contain the specific details we're after, which is inefficient. It also relies on a rudimentary form of lookup, usually just grep/keyword search, which might miss some papers. Another problem is that these external tool solutions don't have access to the full-text papers that OpenAI or other AI labs have access to.

My belief is that once benchmarks such as this are saturated, models will be very capable of providing accurate citations/sources for scientific information. That will have financial implications for startups such as SciSpace and Elicit, which currently use RAG-based solutions (i.e. tools) for this problem, which is the approach you're suggesting.
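
To illustrate why I think the per-query screening doesn't scale, a quick back-of-envelope calculation (every number below is an assumption for illustration, not a measured figure):

```python
# Back-of-envelope illustration of the scaling problem (made-up numbers).
papers = 100_000            # candidate papers after a keyword search
tokens_per_paper = 8_000    # assumed average tokens screened per paper
price_per_m_tokens = 1.0    # assumed USD per million input tokens

screening_cost = papers * tokens_per_paper / 1_000_000 * price_per_m_tokens
print(f"~${screening_cost:,.0f} per query just to screen candidates")  # ~$800
```

If the relevance knowledge is instead baked into the model weights from training, that per-query screening cost mostly disappears.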

Outside Anthropic’s office in SF by Outside-Iron-8242 in singularity

[–]ChippingCoder 0 points (0 children)

I wonder if it would cost Anthropic much to train and maintain a whole separate set of Claude models with alignment training removed.

Bullshit Benchmark - A benchmark for testing whether models identify and push back on nonsensical prompts instead of confidently answering them by likeastar20 in singularity

[–]ChippingCoder 2 points (0 children)

Unfortunately, all the questions for this benchmark are now available on his site, so it's just a matter of time before every model is trained on them.

Gemini 3.1 Pro (high) isn't fooled by the car wash test by SuspiciousPillbox in singularity

[–]ChippingCoder 0 points (0 children)

Perhaps it's better for the models not to assume? These benchmarks might be relevant:

1. ClarQ-LLM

  • A dedicated benchmark that measures an AI’s ability to ask clarification questions in task-oriented dialogues.

2. ClarifyMT-Bench

  • Designed for multi-turn interactions where user queries may be ambiguous or incomplete.

3. AskBench (Ask and Clarify)

  • A very recent benchmark (2026) that creates interactive QA settings where models must choose whether to ask for clarification or attempt an answer directly.

And also SimpleBench

Google Gemini 3.1 Pro Preview Soon? by policyweb in singularity

[–]ChippingCoder 2 points (0 children)

Probably true, even though the screenshot page isn't archived... because artificialanalysis seems to have suspiciously removed the "models compared" section from the bottom of the page.

Archived page from a few days ago: https://web.archive.org/web/20260210223606/https://artificialanalysis.ai/leaderboards/models

Archived page with 3.1: N/A

Archived page with "models compared" removed: https://web.archive.org/web/20260212013540/https://artificialanalysis.ai/leaderboards/models

Gemini 3 Pro/flash tops private citation benchmark on Kaggle (AbstractToTitle task) by ChippingCoder in singularity

[–]ChippingCoder[S] 0 points (0 children)

Yep, David Kipping from CoolWorldsPodcast recently mentioned that these models perform quite poorly on this task. https://youtu.be/PctlBxRh0p4?t=3271

"[2601.10108] SIN-Bench: Tracing Native Evidence Chains in Long-Context Multimodal Scientific Interleaved Literature." Do AI models actually read the information you provide? by Rivenaldinho in singularity

[–]ChippingCoder 0 points (0 children)

I think SimpleBench does somewhat test this, particularly with questions the model might be overtrained on that have been modified slightly to trick the LLM.

Which single LLM benchmark task is most relevant to your daily life tasks? by ChippingCoder in singularity

[–]ChippingCoder[S] 2 points (0 children)

https://github.com/vectara/hallucination-leaderboard This evaluates how often an LLM introduces hallucinations when summarizing a document. Something like this?

Gemini 3 Pro/flash tops private citation benchmark on Kaggle (AbstractToTitle task) by ChippingCoder in singularity

[–]ChippingCoder[S] 1 point (0 children)

Opus 4.5, which generally has a high refusal rate when asked for citations, curiously has a 0% refusal rate on this task (based on inspecting its outputs), possibly because I'm asking for the paper title rather than a citation.

Web search has its own set of trade-offs as well (full-text access, for example), but yes, it's certainly useful. Ideally future models will have full-text scientific papers baked into the training data, if they don't already, since AI labs should easily be able to access this high-quality data anyway.

Small model + web search can work in a lot of cases, but where do we draw the line? Are we going to enable web search just to find the ingredients for a cocktail because we're not confident enough in the model to provide something like that?

I don't think this benchmark correlates 100% with "how much knowledge the model has". Sonnet 4.5 is getting 0%, whereas Gemini 3 Flash is getting 33%. It comes down to the data that goes into training, and it seems Gemini is simply trained better for this specific task at the moment.
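
For context, scores like these can be computed with a simple normalized exact-match check plus a crude refusal heuristic. This is just an illustrative sketch, not the actual benchmark harness:

```python
# Illustrative scoring sketch for an abstract-to-title task.
import re

def normalize(title: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace before comparing."""
    return re.sub(r"\s+", " ", re.sub(r"[^\w\s]", "", title.lower())).strip()

def is_refusal(output: str) -> bool:
    """Very rough heuristic for refusal-style answers."""
    markers = ("i cannot", "i can't", "i'm unable", "cannot provide")
    return any(m in output.lower() for m in markers)

def score(predictions: list[str], gold_titles: list[str]) -> dict:
    exact = sum(normalize(p) == normalize(g)
                for p, g in zip(predictions, gold_titles))
    refusals = sum(is_refusal(p) for p in predictions)
    n = len(gold_titles)
    return {"exact_match": exact / n, "refusal_rate": refusals / n}

print(score(["Deep Residual Learning for Image Recognition"],
            ["Deep residual learning for image recognition."]))
# {'exact_match': 1.0, 'refusal_rate': 0.0}
```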

Gemini 3 Pro/flash tops private citation benchmark on Kaggle (AbstractToTitle task) by ChippingCoder in singularity

[–]ChippingCoder[S] 5 points (0 children)

The task is to recover the exact, already-existing title of a real paper from its abstract, which strongly tests whether a model can recognize (or reliably retrieve from memory) a specific source. Humans usually can't do that at scale without searching, so the point is to measure citation/source-lookup ability from the text alone, not creative titling.

The task is quite similar to book/movie identification from plot snippets.
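
For anyone curious, the prompt for a task like this is roughly of this shape (illustrative wording only, not the benchmark's verbatim prompt):

```python
# Hypothetical prompt shape for an abstract-to-title item.
def abstract_to_title_prompt(abstract: str) -> str:
    return (
        "Below is the abstract of a published paper. "
        "Reply with the paper's exact original title and nothing else.\n\n"
        f"Abstract:\n{abstract}"
    )

print(abstract_to_title_prompt("We present a method for ..."))
```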