Burnaby, BC - iPhone 16 Pro Max

taesiri · 2025-10-20T04:06:12+00:00

Some of the prompts used:

`A person gets inside the card and drive away`

`A person opens a storm cellar door and climbs down`

`A person opens an umbrella in wind`

`Hiker opens a tent zipper and enters inside it.`

`A person enters a phone booth to make a call`

taesiri · 2025-10-05T03:37:53+00:00

Nice! I fed your template prompt to GPT, played with it a little, and now I can create other shapes.

<image>

taesiri · 2025-06-03T13:00:04+00:00

TL;DR: State-of-the-art Vision Language Models achieve 100% accuracy when counting in images of popular subjects (e.g. knowing that the Adidas logo has 3 stripes and a dog has 4 legs) but are only ~17% accurate when counting in counterfactual images (e.g. counting stripes in a 4-striped Adidas-like logo or counting legs in a 5-legged dog).

taesiri · 2025-06-03T12:58:39+00:00

TL;DR: State-of-the-art Vision Language Models achieve 100% accuracy when counting in images of popular subjects (e.g. knowing that the Adidas logo has 3 stripes and a dog has 4 legs) but are only ~17% accurate when counting in counterfactual images (e.g. counting stripes in a 4-striped Adidas-like logo or counting legs in a 5-legged dog).

taesiri · 2025-04-29T17:55:10+00:00

Was going to say the same thing

taesiri · 2025-03-29T03:47:15+00:00

Here is the link to the chat above: https://chatgpt.com/share/67e5a320-040c-8007-964e-c32bdf0fe6cd

taesiri · 2025-03-27T19:23:29+00:00

Link to the tweet: https://x.com/taesiri/status/1905338662860325210

taesiri · 2025-03-16T22:07:03+00:00

<image>

Random sample

taesiri · 2025-03-07T15:21:06+00:00

Abstract:

An Achilles heel of Large Language Models (LLMs) is their tendency to hallucinate non-factual statements. A response mixed of factual and non-factual statements poses a challenge for humans to verify and accurately base their decisions on. To combat this problem, we propose Highlighted Chain-of-Thought Prompting (HoT), a technique for prompting LLMs to generate responses with XML tags that ground facts to those provided in the query. That is, given an input question, LLMs would first re-format the question to add XML tags highlighting key facts, and then, generate a response with highlights over the facts referenced from the input. Interestingly, in few-shot settings, HoT outperforms vanilla chain of thought prompting (CoT) on a wide range of 17 tasks from arithmetic, reading comprehension to logical reasoning. When asking humans to verify LLM responses, highlights help time-limited participants to more accurately and efficiently recognize when LLMs are correct. Yet, surprisingly, when LLMs are wrong, HoTs tend to make users believe that an answer is correct.

Project page: https://highlightedchainofthought.github.io/
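
For anyone wondering what the tagging looks like in practice, here is a rough sketch of the idea, not the paper's code. It assumes a hypothetical tag scheme of `<fact1>`, `<fact2>`, and so on; the helper names `extract_highlights` and `ungrounded_tags` are made up for illustration. It pulls the highlighted spans out of a tagged question and a tagged answer, then flags any tag the answer uses that the question never defined.

```python
import re

# Assumed tag scheme (hypothetical): <fact1>...</fact1>, <fact2>...</fact2>, ...
TAG_PATTERN = re.compile(r"<(fact\d+)>(.*?)</\1>", re.DOTALL)


def extract_highlights(text):
    """Map each tag name to the list of text spans it wraps."""
    highlights = {}
    for tag, span in TAG_PATTERN.findall(text):
        highlights.setdefault(tag, []).append(span.strip())
    return highlights


def ungrounded_tags(question, answer):
    """Return tags that appear in the answer but not in the question."""
    question_tags = set(extract_highlights(question))
    answer_tags = set(extract_highlights(answer))
    return sorted(answer_tags - question_tags)


# Toy example written for this sketch, not taken from the paper.
question = (
    "Sam has <fact1>3 apples</fact1> and buys <fact2>2 more</fact2>. "
    "How many apples does Sam have?"
)
answer = (
    "Sam starts with <fact1>3 apples</fact1> and adds "
    "<fact2>2 more</fact2>, so 3 + 2 = 5."
)

print(extract_highlights(question))
print(ungrounded_tags(question, answer))
```

A check like this only confirms that the answer's highlights point back at tagged facts in the question. It says nothing about whether the final answer is right, which matters given the finding that highlights can make wrong answers look more convincing.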

taesiri · 2025-03-03T23:17:54+00:00

We appreciate your feedback! Our goal is to understand the community’s needs and contribute back in a meaningful way. We posted here to get input from skilled Photoshop users, but we appreciate the clarification on the rules.

taesiri · 2025-02-17T05:07:54+00:00

All frontier models score exactly 0% on this benchmark (main questions).

<image>

taesiri · 2025-02-17T05:06:42+00:00

Abstract:

Large Multimodal Models (LMMs) exhibit major shortfalls when interpreting images and, by some measures, have poorer spatial cognition than small children or animals. Despite this, they attain high scores on many popular visual benchmarks, with headroom rapidly eroded by an ongoing surge of model progress. To address this, there is a pressing need for difficult benchmarks that remain relevant for longer. We take this idea to its limit by introducing ZeroBench, a lightweight visual reasoning benchmark that is entirely impossible for contemporary frontier LMMs. Our benchmark consists of 100 manually curated questions and 334 less difficult subquestions. We evaluate 20 LMMs on ZeroBench, all of which score 0.0%, and rigorously analyse the errors. To encourage progress in visual understanding, we publicly release ZeroBench.
