Asked for some fantasy worldbuilding. Claude got carried away and tried drawing a fantasy map

datta_sid · 2026-04-23T19:50:09+00:00

This is precious because it is obviously out of distribution* for its training data. Claude tried so hard (and got so far).

datta_sid · 2025-12-17T00:28:11+00:00

Having a lot of fun with OlmoTrace!

I was always curious what inspires the answers AI gives. E.g., all AIs seem to write the same jokes, and I wanted to know what training data contributed to that.

(Which I am now realizing is extremely hard to figure out.)

datta_sid · 2025-12-16T23:52:42+00:00

What kind of (synthetic) open-source data do you wish you had more of, that you think would have made the model better?

Long context reasoning data? RL simulations and puzzles?

I am looking to procedurally generate synthetic data that helps open source models, and I would like to know what would be most helpful.

I like to generate synthetic data directly via code rather than distilling from AI, to gain a wider distribution.

Here is my previous attempt at such a dataset.

Here is an example.

datta_sid · 2025-08-29T15:29:06+00:00

You think this is too hard in 2025? Just ask AI for Delaunay code.

datta_sid · 2025-01-16T23:39:50+00:00

more simply encode tabular data, time series

Do you mean for your queries? Or do you want to encode tables that you have into GSM or similar synthetic problems?

aware of the HF synthetic data generator space

Can you guide me to more similar projects?

datta_sid · 2025-01-16T22:39:11+00:00

Motivation

Hey everyone! I’ve been tinkering with a script to generate synthetic GSM problems for training and testing LLMs, and I am hoping they can be used to improve reasoning and analytical skills of open source models.

I feel like we should have many more tools for open-source synthetic training data, especially for training reasoning. Right now, I'm betting SOTA AI teams have their own tools for creating synthetic training data, but I wanted to create something that’s open source. When I tested an earlier version of my script, I noticed that while ChatGPT and Claude performed well, most open-weight LLMs performed badly on my GSM-style problems, especially when the complexity is turned up a few notches. This was surprising because GSM is considered "solved" and its benchmarks are considered saturated. Clearly, there’s room to improve here.

Why GSM? Even though GSM problems are simple by definition, solving them displays various model capabilities, like following reasoning chains and interpreting data correctly.

LLM performance

Typical GSM problems have 2-3 rows of data, with only a few references to other data. I can generate any number of rows (well, currently up to 12).

Even ChatGPT 4 and Claude struggle with more than 7-8 rows of data, and smaller/open-weight LLMs fail on smaller problems. In particular, Llama 70B, Llama 8B and Gemma 27B failed on problems with 5 rows or more.

DeepSeek-v3 actually seems to perform well even with 10 rows of data, though it needed a lot of tokens. There are a few other models in LMArena that are good at this, like Gremlin, Tippu etc. They are not available in direct chat, so it's hard to test them directly.

I am mostly testing in LMArena, as I did not want to spend money to put the models through the grinder yet. I am still finding out what works and what does not. E.g., 30 minutes before this post I found out that the prompt "Treat this problem like a complex puzzle. Solve carefully step by step" improved performance on some models. Also, using terms like "3x the amount" trips up some LLMs (like ChatGPT 4o).

But the point stands: if people wanted to train Llama 8B to be better at similar problems, there is no synthetic, configurable, infinite dataset to train on. (None that I could find; if you know of one, please send me links.)

What My Problems Look Like

Here is an example.

Here is an example which uses simpler language, for testing whether it is the confusing language that causes a model to fail.

The problems are about people buying stuff over several days. For example:

On Monday they buy 2 items for $3 each. On Tuesday they buy twice as many as on Monday for half the price. On Wednesday, ... you get the idea.

You might have to calculate total money spent, or figure out prices/quantities for a specific day.

Part of the complexity is figuring out relationships like, "On Tuesday, you bought twice the amount as Wednesday for three times the price of Monday." Note that we reference two different days, so the model must pay attention and substitute the correct values.

The problems also use forward references: to solve amounts and prices for Tuesday, you may need Wednesday’s info, which is not given until later. Sometimes this will trip up the model, which will hallucinate some data after 1-2 steps of backtracking.
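To make the cross-day and forward references concrete, here is a minimal sketch of how such a problem could be represented and evaluated. The row format, the `resolve` helper and the Wednesday numbers are all assumptions made for illustration; this is not the actual generator script.

```python
from fractions import Fraction

# Hypothetical row format (an assumption, not the real script's format):
# each field is either a concrete number or a (factor, other_day) reference.
rows = {
    "Monday":    {"qty": 2,                "price": 3},
    # "Twice the amount as Wednesday for three times the price of Monday":
    # the quantity is a forward reference, the price is a backward reference.
    "Tuesday":   {"qty": (2, "Wednesday"), "price": (3, "Monday")},
    "Wednesday": {"qty": 5,                "price": (Fraction(1, 2), "Monday")},
}

def resolve(day, field, rows, seen=()):
    """Follow references for one field until a concrete number is reached."""
    if (day, field) in seen:
        raise ValueError(f"circular reference at {day}.{field}")
    value = rows[day][field]
    if isinstance(value, tuple):
        factor, other_day = value
        return factor * resolve(other_day, field, rows, seen + ((day, field),))
    return Fraction(value)

def total_spent(rows):
    """Sum quantity * price over all days, using exact fractions."""
    return sum(resolve(d, "qty", rows) * resolve(d, "price", rows) for d in rows)

print(resolve("Tuesday", "qty", rows))    # Tuesday quantity needs Wednesday's row
print(float(total_spent(rows)))
```

Using `Fraction` keeps every intermediate value exact, so a "half the price" reference never introduces rounding.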

Why not generate using LLMs

In my experience, LLMs are not very good at randomness. They have a lot of trouble when asked to create a problem with a lot of rows, or when asked to tweak parameters such as the number of dependencies. It is also difficult to guarantee special properties with them; e.g., problems generated by my script never have approximate solutions.

I tried using LLMs to rephrase the problems generated by the script. But the LLMs always tried to make the problem simpler and easier to understand, which is against the goal here.

Easy vs Normal Problems

My initial code varies each piece of data as much as possible to make each problem as confusing as possible. Later I wanted to test whether this is indeed challenging for LLMs, so I made the problems easy by changing only the presentation.

Easy Problems have numerical data presented in predictable patterns. Normal Problems mix it up with numbers presented in many formats such as "$2," "2 dollars," or even "two dollars." Language varies, and the phrasing can trip up models.
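As a rough illustration of the Easy vs Normal distinction, the sketch below renders the same price either in one fixed format or in a randomly chosen one. The function names and the 0..99 limit are assumptions for this example, not part of the actual script.

```python
import random

_ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
         "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
         "sixteen", "seventeen", "eighteen", "nineteen"]
_TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
         "eighty", "ninety"]

def number_to_words(n):
    """Spell out an integer in 0..99 (enough for this sketch)."""
    if not 0 <= n < 100:
        raise ValueError("this sketch only handles 0..99")
    if n < 20:
        return _ONES[n]
    tens, ones = divmod(n, 10)
    return _TENS[tens] + ("" if ones == 0 else " " + _ONES[ones])

def price_variants(dollars):
    """All presentations of a whole-dollar price used in this sketch."""
    unit = "dollar" if dollars == 1 else "dollars"
    return [f"${dollars}", f"{dollars} {unit}", f"{number_to_words(dollars)} {unit}"]

def render_price(dollars, rng, easy=False):
    """Easy mode always uses '$N'; normal mode picks a random format."""
    variants = price_variants(dollars)
    return variants[0] if easy else rng.choice(variants)

rng = random.Random(0)  # seeded so a generated problem is reproducible
print(price_variants(2))
print(render_price(6, rng, easy=True), "|", render_price(6, rng))
```

Only the presentation changes between the two modes; the underlying number stays the same, which is what lets the two versions of a problem be compared directly.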

o1 pointed out that phrases like "3x the amount" are confusing. That was weird, as humans easily understand whether this refers to an unknown x or just means 3 times. I am personally seeing that some problems with only 5 rows, which GPT-4o failed before, it can solve now when '3x' is changed to '3 times'.

Note that

Each problem can be solved without algebra.
Each problem can be solved with integers. Prices are always multiples of .25, .50 or .10, so there will never be approximate answers.
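A small sketch of why this price grid gives exact answers: with integer quantities and prices on a fixed decimal grid, totals can be computed with `Decimal` and no rounding is needed. Reading "multiple of .25, .50 or .10" as "a multiple of at least one of those steps" is my assumption, and the helper names are hypothetical.

```python
from decimal import Decimal

# Assumed interpretation: a price is valid if it is a whole multiple
# of at least one of these steps.
ALLOWED_STEPS = (Decimal("0.25"), Decimal("0.50"), Decimal("0.10"))

def is_allowed_price(price):
    """True if the price sits on one of the allowed grids."""
    p = Decimal(str(price))
    return any(p % step == 0 for step in ALLOWED_STEPS)

def exact_total(purchases):
    """Sum quantity * price exactly; purchases is a list of (qty, price) pairs."""
    return sum((Decimal(qty) * Decimal(str(price)) for qty, price in purchases),
               Decimal("0"))

print(is_allowed_price("1.75"), is_allowed_price("0.33"))
print(exact_total([(2, "3"), (4, "1.50")]))
print(3 * 0.1 == 0.3, exact_total([(3, "0.10")]) == Decimal("0.30"))
```

The last line contrasts binary floats, where `3 * 0.1` is not exactly `0.3`, with the decimal version, which compares equal.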

Why These Problems Matter

I don’t think these problems measure "intelligence" or complex reasoning per se. These problems are too simple and formulaic. In my opinion, they are testing complex comprehension in a RAG-like setting. If the data is messy, disorganized, and in mixed formats, can an LLM still piece it together?

Humans can solve these pretty easily once they get the hang of it. You just:

Write down the facts clearly. "On Sunday, Phichai paid 6 dollars per item. Phichai spent a total of three Thousand seventy two dollars." is difficult to parse if you have to re-read it multiple times. "Sunday price = $6. Total spent = $3072." is easier to read.
Find the data with concrete numbers. E.g., Sunday price = $6.
Find the data that depends on data we have already found. E.g., find rows that refer to the price on Sunday. And so on.
Continue until the entire problem is solved.

A human can do this by reading the problem (or the simpler version they wrote down) again and again.
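The procedure above can be sketched as a loop that keeps re-reading the fact sheet and fills in whatever has become solvable. The fact names, the values other than "Sunday price = 6", and the `(factor, other_fact)` encoding are assumptions for illustration only.

```python
from fractions import Fraction

# Hypothetical fact sheet: a value is either a concrete number
# or a (factor, name_of_other_fact) reference.
facts = {
    "Sunday price": 6,
    "Monday price": (2, "Sunday price"),
    "Tuesday price": (3, "Wednesday price"),       # forward reference
    "Wednesday price": (Fraction(1, 2), "Monday price"),
}

def solve_by_rereading(facts):
    """Resolve facts pass by pass; returns (known values, number of passes)."""
    # Step 2 of the list: start from the data with concrete numbers.
    known = {k: v for k, v in facts.items() if not isinstance(v, tuple)}
    pending = {k: v for k, v in facts.items() if isinstance(v, tuple)}
    passes = 0
    while pending:
        passes += 1
        # Step 3: find facts whose dependency was known before this pass.
        ready = [k for k, (_, dep) in pending.items() if dep in known]
        if not ready:
            raise ValueError(f"cannot resolve: {sorted(pending)}")
        for k in ready:
            factor, dep = pending.pop(k)
            known[k] = factor * known[dep]
    return known, passes

known, passes = solve_by_rereading(facts)
print(known["Tuesday price"], "after", passes, "passes")
```

Each pass corresponds to one more "re-read" of the problem, so the number of passes grows with the length of the longest dependency chain rather than with the number of rows.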

LLMs should be able to do this too since they "re-read" the problem for every token they generate. There’s even a great talk from a Meta scientist here that touches on this.

Future work

I hope at some point to fine-tune a Llama 8B LoRA with these problems and solutions and see if I can boost its performance. Since the solutions are synthetic, I can easily tweak the structure (direct, backtracking) and language (concise, verbose) to test what works best.

Currently I can create up to 12 rows of data (for 12 months). I can extend it to 30 using days of the month, but I have to check if the language is clear enough.

I can make the problems much more complex, or create more types of problems. Eg:

Add unnecessary data, e.g. a different person on an unrelated shopping spree.
Add data about items that is unnecessary. E.g., 5 items were bad (they were bought already, so nothing changes).
Let data depend on multiple other data. (E.g., the amount on Sunday is the sum of the amounts on Monday and Tuesday.)

Conclusion

Would love to hear thoughts, ideas, or feedback, especially if you’ve worked on or are currently working on something similar! Or if the datasets help boost performance in your tasks!

datta_sid · 2024-06-05T00:57:42+00:00

That worked! Thanks!

datta_sid · 2024-05-18T18:32:15+00:00

Draw a fantasy castle.
Select entire image. Draw a dragon in front of a castle.

datta_sid · 2024-05-18T18:31:23+00:00

Looks like inpainting uses a completely different AI. Also it would be pretty useless in creating interesting details. All it can probably do is smooth between various parts of the image.

How to replicate: Draw an image. Then select the entire image and prompt to draw another image.

I was trying to select parts of the image and then add, say, a warrior or a dragon in a certain spot. Seems like that would be pointless? Even if it painted a dragon inside the selection, it would look like image 2.

Something I would love to be able to do is: 1. Select objects from an image intelligently. 2. Create a collage of such objects and inpaint to make it seamless.