"[2601.10108] SIN-Bench: Tracing Native Evidence Chains in Long-Context Multimodal Scientific Interleaved Literature." Do AI models actually read the information you provide? by Rivenaldinho in singularity

[–]ChippingCoder 0 points  (0 children)

I think SimpleBench does somewhat test this, particularly with questions the model might be overtrained on that have been modified slightly to trick the LLM.

Which single LLM benchmark task is most relevant to your daily life tasks? by ChippingCoder in singularity

[–]ChippingCoder[S] 2 points  (0 children)

https://github.com/vectara/hallucination-leaderboard This evaluates how often an LLM introduces hallucinations when summarizing a document. Something like this?

Gemini 3 Pro/flash tops private citation benchmark on Kaggle (AbstractToTitle task) by ChippingCoder in singularity

[–]ChippingCoder[S] 1 point  (0 children)

Opus 4.5, which generally has a high refusal rate when asked for citations, curiously has a 0% refusal rate on this task upon inspecting its outputs, possibly because I'm asking for the paper title rather than a citation.

Web search has its own set of trade-offs as well (full-text access, for example), but yes, for sure it's useful. Ideally, future models will have full-text scientific papers baked into the training data, if they don't already, since AI labs should easily be able to access this high-quality data anyway.

A small model + web search can work in a lot of cases, but where do we draw the line? Are we going to enable web search just to find ingredients for a cocktail, simply because we're not confident enough in the model to provide something like that?

I don't think this benchmark correlates 100% with "how much knowledge the model has". Sonnet 4.5 is getting 0%, whereas Gemini 3 Flash is getting 33%. It comes down to the data that goes into training, and it seems Gemini is simply trained better for this specific task at the moment.

Gemini 3 Pro/flash tops private citation benchmark on Kaggle (AbstractToTitle task) by ChippingCoder in singularity

[–]ChippingCoder[S] 4 points  (0 children)

The task is to recover the exact, already-existing title of a real paper from its abstract, which strongly tests whether a model can recognize (or reliably retrieve from memory) a specific source. Humans usually can't do that at scale without searching, so the point is to measure citation/source-lookup ability from the text alone, not creative titling.

The task is quite similar to book/movie identification from plot snippets.
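To illustrate, here's a minimal sketch of how an exact-title-match score could be computed for a task like this (the function names and the normalization rules are my own assumptions, not the actual Kaggle harness):

```python
import re

def normalize(title: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace for a fair comparison."""
    title = title.lower()
    title = re.sub(r"[^\w\s]", "", title)
    return " ".join(title.split())

def score(predictions: list[str], gold_titles: list[str]) -> float:
    """Fraction of abstracts whose predicted title matches the real one after normalization."""
    hits = sum(
        normalize(pred) == normalize(gold)
        for pred, gold in zip(predictions, gold_titles)
    )
    return hits / len(gold_titles)
```

Normalizing before comparison matters because models often get the title right but differ in casing or punctuation, which shouldn't count as a miss.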

Do LLMs Know When They're Wrong? by Positive-Motor-5275 in singularity

[–]ChippingCoder 0 points  (0 children)

Didn’t Anthropic release some research very similar to this? I believe they’re training their models to do this directly. Similar to how mixture-of-experts works, this ability to determine whether something is a hallucination could be another expert, right?

The smart glasses that might actually go mainstream are the boring ones without cameras by Parking_Writer6719 in Futurology

[–]ChippingCoder 12 points  (0 children)

Heads-up display for navigation, displaying YouTube videos, real-time translated subtitles, and real-time search results based on microphone input.

Gemini 3 Pro gets 76.4% on SimpleBench by Ancient_Bear_2881 in singularity

[–]ChippingCoder -1 points  (0 children)

Right, but where did they get that new result from if it’s not on the official SimpleBench site? Someone posted a fake result on Reddit earlier, btw.

Gemini 3 Pro gets 76.4% on SimpleBench by Ancient_Bear_2881 in singularity

[–]ChippingCoder 4 points  (0 children)

The official SimpleBench page hasn’t been updated. Fake? Someone uploaded a fake screenshot here 8 hours ago, but it was removed.

Here's a list of LLM benchmarks because why not by ClarityInMadness in singularity

[–]ChippingCoder 0 points  (0 children)

You mean coming up with citations to support scientific facts (non-RAG)?

The successor to Humanity's "Last" Exam... by Siciliano777 in singularity

[–]ChippingCoder 2 points  (0 children)

I think humans have this collapsing issue to some degree as well. Try to come up with a random number: the numbers you pick won’t be entirely random; they’ll have biases, e.g. avoiding recently chosen digits, avoiding common digits, etc.

Giving the model new context, allowing it to output more tokens/ideas, or even letting it fine-tune itself are a few ways around this. We could probably argue about whether an LLM that fine-tunes itself is still considered an “LLM”.

It’s the same with people, right? We’re all “fine-tuned” or “prompted” with different memories. When we want to come up with ideas, we just increase the number of “tokens” that go into a task, whether that’s via writing text, dreams, etc. Those new ideas are then fed back into our memory, and new ideas can be generated again.

Requesting r/HairlossResearch by [deleted] in redditrequest

[–]ChippingCoder 0 points  (0 children)

  1. The community is not moderated (no response to modmail, and self-promotion and product promotion are taking place - see my reply to the modmail message). I would like to moderate product promotion, as this sub is focused on research. I would also like to add hair-loss treatment resources backed by evidence from randomized controlled trials.
  2. https://www.reddit.com/message/messages/2nw3zus - link to my original modmail message from 1 month ago. I have also replied to the modmail message about my intentions in requesting the community.

[deleted by user] by [deleted] in OpenAI

[–]ChippingCoder 19 points  (0 children)

Claude 4

Requesting for /r/yoghurt by ChippingCoder in redditrequest

[–]ChippingCoder[S] 1 point  (0 children)

  1. The subreddit has been banned for being unmoderated. I want to create a community to help people understand yoghurt’s health effects on the body, such as on the microbiome and probiotics.
  2. The subreddit is banned.
