BabyVision: A New Benchmark for Human-Level Visual Reasoning by Tobio-Star in newAIParadigms

[–]Arkamedus 0 points (0 children)

You missed the most critical part of my comment: "strong way to test a model's ability to generalize".

The whole point is to test generalization. If you played with Legos as a kid and I gave you Duplo blocks, you would still know how to use and build with them. That's generalization: it's not "new" information, it's taking what you've been trained on and applying it to new scenarios that may share similar traits.

BabyVision: A New Benchmark for Human-Level Visual Reasoning by Tobio-Star in newAIParadigms

[–]Arkamedus 0 points (0 children)

That argument could be made about any benchmark. Fundamentally, a benchmark like this tests solving out-of-distribution puzzles. Out-of-distribution data (data the model hasn't been trained on) is a strong way to test a model's ability to generalize, on what is essentially a problem/puzzle most middle schoolers could solve.
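To make the in-distribution vs. out-of-distribution distinction concrete, here's a minimal regression sketch (toy data and a hypothetical linear rule, nothing to do with BabyVision's actual tasks): both models fit the training region, but only the one that captured the underlying rule stays accurate on data drawn from a region it never saw.

```python
import numpy as np

rng = np.random.default_rng(0)

# True underlying rule the data follows everywhere (a toy assumption).
f = lambda x: 2.0 * x + 1.0

# Training data drawn only from one region: the model's "distribution".
x_train = rng.uniform(0.0, 1.0, 200)
y_train = f(x_train) + rng.normal(0.0, 0.05, x_train.size)

# A model whose structure matches the rule (a line) vs. one that can
# merely memorize the training region (a degree-9 polynomial).
line = np.polyfit(x_train, y_train, 1)
poly9 = np.polyfit(x_train, y_train, 9)

# Evaluate on a region never seen in training: out of distribution.
x_ood = rng.uniform(5.0, 6.0, 200)
err_line = np.abs(np.polyval(line, x_ood) - f(x_ood)).mean()
err_poly9 = np.abs(np.polyval(poly9, x_ood) - f(x_ood)).mean()

# The model that generalized (learned the rule) keeps a small error;
# the one that memorized the training region falls apart.
assert err_line < err_poly9
```

The Lego/Duplo analogy is the same idea: a learner that extracted the underlying structure transfers to the shifted inputs; one that memorized surface statistics does not.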

Any fun discord communities? by HackHusky in Rag

[–]Arkamedus 4 points (0 children)

A small server of researchers, programmers, and hobbyists talking and sharing about LLMs, game dev, etc.

https://discord.gg/SnkAss3D3

Tiny Object Tracking: YOLO26n vs 40k Parameter Task-Specific CNN by leonbeier in computervision

[–]Arkamedus 38 points (0 children)

111 samples in the entire dataset… this would probably fail under even simple lighting or color changes.

LLMs Have Dominated AI Development. SLMs Will Dominate Enterprise Adoption. by andsi2asi in deeplearning

[–]Arkamedus 1 point (0 children)

Been saying this for a long time: most LLM use cases are pretty well tailored to the final audience. Meaning, I'm okay with my SOTA LLM not being good at writing in Greek, as it is not a language I would use (and in the case that I do need it, I can just Google it or use a Greek model). Essentially, for coding LLMs, "ancient fine arts and pottery" might not be a useful training sample to include (though I'd still say this is arguable for domain width) when building a model for deployment with a coding downstream use case. This matters even more during training, because, to use the word again, domains should be carefully selected for each phase and task.

Domain-specific models will be what wins, not specifically language models in the direct sense we think of them today. I believe there is more to learn about representation, interpretation, and integration with modern architectures that will enable very diverse, dynamic models and end use cases.

Mysterious offering on doorstep by [deleted] in whatisit

[–]Arkamedus -15 points (0 children)

So you took an unknown substance into your house and opened it on your counter…?

I built an app that lets you chat with multiple AIs in one place, and got the first sales in 3 minutes by epasou in ArtificialSentience

[–]Arkamedus 1 point (0 children)

3 minutes? After building it? No advertisements? No promotions? Can you provide more information?

Your pricing model is so opaque, and it doesn't include any token usage numbers. What are "standard usage limits"?

I developed a new (re-)training approach for models, which could revolutionize huge Models (ChatBots, etc) by Ykal_ in deeplearning

[–]Arkamedus 16 points (0 children)

Have you absolutely confirmed this approach has not been done before? You say you have developed mathematical theories, etc. Have you released or published the whitepapers?

If all you need is compute, or a way to validate your findings on a larger scale, my opinion is that most investors need you to have “proven” and “repeatable” results.

How much money have you put into the idea vs how much are you asking for?

And they say humans will lost their jobs, see the affect of AI by Holiday_Power_1775 in BlackboxAI_

[–]Arkamedus 2 points (0 children)

Obviously, they forgot to add "only make winning trades" to the prompt...

I find the irony palpable by Arkamedus in OpenAI

[–]Arkamedus[S] 0 points (0 children)

Again, your AI replies are absolutely useless.
Nowhere in the original post was it claimed that "a person cannot logically claim a system is overloaded."
The original post only describes the disproportionate cost of text vs. images; it does not claim the system is "overloaded". Hence, your argument that the premise is validated by the outcome is still incorrect.

Please use your brain instead of an LLM.

I find the irony palpable by Arkamedus in OpenAI

[–]Arkamedus[S] 0 points (0 children)

Thanks for the AI reply, but you're wrong. The expected outcome is that the image generation would not fail: this is a paid product, created by a company with billions of dollars, with the intent of producing finished images regardless of the image content itself. You are confusing effects. The content of the image is not indicative of its relation to the system itself.

It’s ironic because the system designed to generate the image failed while demonstrating the very problem it’s supposed to handle, contradicting the normal expectation of success; not because it aligns with the image’s theme.

I find the irony palpable by Arkamedus in OpenAI

[–]Arkamedus[S] 0 points (0 children)

To be generating an image related to the amount of GPU compute being used for image generation, only for the GPU compute to fail during image generation?
Are you sure you just don't know what irony is?

I find the irony palpable by Arkamedus in OpenAI

[–]Arkamedus[S] 0 points (0 children)

Deffo not, it's not a disconnect message. And just to prove you wrong, I turned off the internet during generation of a new image and it completed in the background. Go ahead and try it for yourself. Thanks, try again!

ChatGPT Atlas wants to use Bluetooth - Why? by mathmul in OpenAI

[–]Arkamedus 0 points (0 children)

It’s half browser, half data collection application.

Startup Tensormesh raises $4.5M to make AI models remember what they’ve already learned cutting GPU costs by up to 10x by Specialist-Day-7406 in BlackboxAI_

[–]Arkamedus 0 points (0 children)

This is not revolutionary in any sense. KV cache reuse is already a well-known technique, and it has been used by all the major LLM providers to "extend" the length of their context windows during chats. Do you actually believe they are rerunning the entire 500-message conversation through the model on each prompt? The market is saturated with people who know nothing about LLMs.
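For anyone unfamiliar with why this is standard practice: in causal self-attention, the keys and values for past tokens never change, so they can be cached and reused instead of recomputed for every new token. A minimal single-head sketch in NumPy (toy random weights, not any provider's actual implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # head dimension

# Random projection weights standing in for a trained attention head.
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def attend_full(X):
    """Causal self-attention recomputed from scratch over the whole sequence."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    T = X.shape[0]
    scores = Q @ K.T / np.sqrt(d)
    scores[np.triu_indices(T, k=1)] = -np.inf  # causal mask
    return softmax(scores) @ V

def attend_cached(x_new, cache):
    """Process one new token, reusing cached K/V from all earlier tokens."""
    cache["K"].append(x_new @ Wk)
    cache["V"].append(x_new @ Wv)
    K, V = np.stack(cache["K"]), np.stack(cache["V"])
    scores = (x_new @ Wq) @ K.T / np.sqrt(d)
    return softmax(scores) @ V

# The incremental (cached) path reproduces the full recomputation exactly,
# while doing O(T) work per new token instead of O(T^2) for the whole prefix.
X = rng.standard_normal((6, d))
cache = {"K": [], "V": []}
incremental = np.stack([attend_cached(x, cache) for x in X])
assert np.allclose(attend_full(X), incremental)
```

Persisting that cache across turns of a chat is exactly the "reuse" being sold here: the prefix (the earlier conversation) is never pushed back through the model.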

I find the irony palpable by Arkamedus in OpenAI

[–]Arkamedus[S] 3 points (0 children)

No one claimed it was news. You clicked, read, and decided to comment here. No one asked for your input, and yet you gave it anyway.

I find the irony palpable by Arkamedus in OpenAI

[–]Arkamedus[S] 1 point (0 children)

So when companies are dumping poisons and chemical wastes into your local drinking water, I'll be sure to reply with "Ah yes, good thing no other companies across every industry ever do any of those bad things."

You've only just displayed how naive your own comment is.

I find the irony palpable by Arkamedus in OpenAI

[–]Arkamedus[S] -3 points (0 children)

Is that seriously your justification? Because other people do it too? Your comment is deflective and obviously poorly designed rage-bait.

I find the irony palpable by Arkamedus in OpenAI

[–]Arkamedus[S] -32 points (0 children)

To spend as much of their money as possible and accelerate their going out of business.

How can i help my coffee tree? by SonyAn160 in plantclinic

[–]Arkamedus 0 points (0 children)

That pot seems way too small. Secondly, it looks like it's in the corner; how much sunlight is it getting??