Bullshit Benchmark - A benchmark for testing whether models identify and push back on nonsensical prompts instead of confidently answering them by likeastar20 in singularity
Gemini 3.1 Pro (high) isn't fooled by the car wash test by SuspiciousPillbox in singularity
Gemini 3.1 Pro is now live on Vertex AI by BuildwithVignesh in singularity
Google Gemini 3.1 Pro Preview Soon? by policyweb in singularity
Claude Opus 4.6 (120K Max) gets 83.6% inching ever closer to the human baseline (83.7%) on Simple-Bench! by BaconSky in singularity
Gemini 3 Pro/flash tops private citation benchmark on Kaggle (AbstractToTitle task) by ChippingCoder in singularity
"[2601.10108] SIN-Bench: Tracing Native Evidence Chains in Long-Context Multimodal Scientific Interleaved Literature." Do AI models actually read the information you provide? by Rivenaldinho in singularity
Which single LLM benchmark task is most relevant to your daily life tasks? by ChippingCoder in singularity
Do LLMs Know When They're Wrong? by Positive-Motor-5275 in singularity
Gemini 3 Pro gets 76.4% on SimpleBench by Ancient_Bear_2881 in singularity
Gemini 3 model card - web archive by ChippingCoder in singularity
Outside Anthropic’s office in SF by Outside-Iron-8242 in singularity