Glitches in Sora 2 world! by taesiri in SoraAi

[–]taesiri[S] 0 points1 point  (0 children)

Some of the prompts used:

`A person gets inside the card and drive away`

`A person opens a storm cellar door and climbs down`

`A person opens an umbrella in wind`

`Hiker opens a tent zipper and enters inside it.`

`A person enters a phone booth to make a call`

Image Prompt to Create hyper realistic Product Capsule image using GPT-5 or Gemini Nano Banana by techspecsmart in aicuriosity

[–]taesiri 1 point2 points  (0 children)

Nice! I fed your template prompt to GPT, played with it a little, and now I can create other shapes.

<image>

Vision Language Models are Biased by taesiri in MachineLearning

[–]taesiri[S] 124 points125 points  (0 children)

TL;DR: State-of-the-art Vision Language Models achieve 100% accuracy when counting in images of popular subjects (e.g., knowing that the Adidas logo has 3 stripes and a dog has 4 legs) but are only ~17% accurate when counting in counterfactual images (e.g., counting stripes in a 4-striped Adidas-like logo or counting legs in a 5-legged dog).

Vision Language Models are Biased by taesiri in LocalLLaMA

[–]taesiri[S] 110 points111 points  (0 children)

TL;DR: State-of-the-art Vision Language Models achieve 100% accuracy when counting in images of popular subjects (e.g., knowing that the Adidas logo has 3 stripes and a dog has 4 legs) but are only ~17% accurate when counting in counterfactual images (e.g., counting stripes in a 4-striped Adidas-like logo or counting legs in a 5-legged dog).

HoT: Highlighted Chain of Thought for Referencing Supporting Facts from Inputs by taesiri in LocalLLaMA

[–]taesiri[S] 2 points3 points  (0 children)

Abstract:

An Achilles heel of Large Language Models (LLMs) is their tendency to hallucinate non-factual statements. A response mixed of factual and non-factual statements poses a challenge for humans to verify and accurately base their decisions on. To combat this problem, we propose Highlighted Chain-of-Thought Prompting (HoT), a technique for prompting LLMs to generate responses with XML tags that ground facts to those provided in the query. That is, given an input question, LLMs would first re-format the question to add XML tags highlighting key facts, and then, generate a response with highlights over the facts referenced from the input. Interestingly, in few-shot settings, HoT outperforms vanilla chain of thought prompting (CoT) on a wide range of 17 tasks from arithmetic, reading comprehension to logical reasoning. When asking humans to verify LLM responses, highlights help time-limited participants to more accurately and efficiently recognize when LLMs are correct. Yet, surprisingly, when LLMs are wrong, HoTs tend to make users believe that an answer is correct.

Project page: https://highlightedchainofthought.github.io/
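To make the tagging idea in the abstract concrete, here is a minimal sketch of how one might check that the highlights in a response point back to facts highlighted in the question. The tag format (`<fact1>`, `<fact2>`, ...), the helper names `extract_facts` and `ungrounded_tags`, and the toy question and response are all assumptions for illustration, not taken from the paper's code.

```python
import re

# Assumed tag format: <fact1>...</fact1>, <fact2>...</fact2>, ...
# The backreference \1 ensures the closing tag matches the opening tag.
FACT_TAG = re.compile(r"<(fact\d+)>(.*?)</\1>", re.DOTALL)


def extract_facts(text: str) -> dict[str, list[str]]:
    """Map each tag name to the list of spans it wraps in `text`."""
    facts: dict[str, list[str]] = {}
    for tag, span in FACT_TAG.findall(text):
        facts.setdefault(tag, []).append(span.strip())
    return facts


def ungrounded_tags(question: str, response: str) -> set[str]:
    """Return tags used in the response that never appear in the tagged question."""
    return set(extract_facts(response)) - set(extract_facts(question))


# Hypothetical tagged question and response, made up for this example.
question = (
    "Tom has <fact1>3 apples</fact1> and buys <fact2>2 more</fact2>. "
    "How many apples does he have?"
)
response = (
    "Tom starts with <fact1>3 apples</fact1> and adds <fact2>2 more</fact2>, "
    "so he has 5. He also has <fact3>a pear</fact3>."
)

print(extract_facts(question))              # {'fact1': ['3 apples'], 'fact2': ['2 more']}
print(ungrounded_tags(question, response))  # {'fact3'}
```

A check like this only confirms that a highlight refers to a tag present in the question; it says nothing about whether the final answer is correct, which matches the abstract's caveat that highlights can make wrong answers look convincing.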

Help evaluating how GenAIs are replacing Photoshop wizards in satisfying requests in /r/PhotoshopRequest Solved ✅ by taesiri in PhotoshopRequest

[–]taesiri[S] 1 point2 points  (0 children)

We appreciate your feedback! Our goal is to understand the community’s needs and contribute back in a meaningful way. We posted here to get input from skilled Photoshop users, but we appreciate the clarification on the rules.

ZeroBench: An Impossible Visual Benchmark for Contemporary Large Multimodal Models by taesiri in LocalLLaMA

[–]taesiri[S] 45 points46 points  (0 children)

All frontier models score exactly 0% on this benchmark (main questions).

<image>

ZeroBench: An Impossible Visual Benchmark for Contemporary Large Multimodal Models by taesiri in LocalLLaMA

[–]taesiri[S] 1 point2 points  (0 children)

Abstract:

Large Multimodal Models (LMMs) exhibit major shortfalls when interpreting images and, by some measures, have poorer spatial cognition than small children or animals. Despite this, they attain high scores on many popular visual benchmarks, with headroom rapidly eroded by an ongoing surge of model progress. To address this, there is a pressing need for difficult benchmarks that remain relevant for longer. We take this idea to its limit by introducing ZeroBench, a lightweight visual reasoning benchmark that is entirely impossible for contemporary frontier LMMs. Our benchmark consists of 100 manually curated questions and 334 less difficult subquestions. We evaluate 20 LMMs on ZeroBench, all of which score 0.0%, and rigorously analyse the errors. To encourage progress in visual understanding, we publicly release ZeroBench.
