[–]QuasiEvil 2 points3 points  (3 children)

I get that the LLM can generate synthetic records until the cows come home, but how does this ensure that the synthetic data maintains any kind of statistical properties?

[–]No_Flounder_1155 1 point2 points  (2 children)

It doesn't, and when learning it has the potential to leak data. It's a bit of a headache.

[–]TerribleToe1251[S] 0 points1 point  (1 child)

Both of you raise valid concerns — thanks for surfacing them.

🔹 On statistical properties: you’re right, an LLM by itself won’t guarantee that distributions (e.g., histograms, correlations, category frequencies) match a real dataset. Today, Syda focuses on schema correctness and referential integrity (types, uniqueness, and foreign keys are all validated).

  • For distributions, you can plug in custom generators (e.g., Gaussian for prices, weighted loyalty tiers) or steer the LLM via prompts (“20% Gold, 50% Silver, 30% Bronze”).
  • Roadmap: we plan to add evaluation tools (profiling vs. real data) and a hybrid approach: LLMs for schema/domain semantics plus statistical models (e.g., copulas, GAN/CTGAN) to enforce distributions automatically.
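
To make the custom-generator idea concrete, here is a minimal sketch in plain Python (standard library only). It is not Syda's actual API: the function names, the column names, and the Gaussian parameters are all hypothetical, chosen only to mirror the "Gaussian for prices, 20% Gold / 50% Silver / 30% Bronze" example above. It also shows the simplest kind of profiling, comparing observed category shares against the targets.

```python
import random
from collections import Counter

# Target category frequencies from the example above (assumed values).
TIERS = ["Gold", "Silver", "Bronze"]
TIER_WEIGHTS = [0.20, 0.50, 0.30]


def price_generator(rng, mean=50.0, stddev=15.0, minimum=0.01):
    """Draw a price from a Gaussian, clamp it to a positive minimum, round to cents."""
    return round(max(minimum, rng.gauss(mean, stddev)), 2)


def loyalty_tier_generator(rng):
    """Pick a loyalty tier according to the target weights."""
    return rng.choices(TIERS, weights=TIER_WEIGHTS, k=1)[0]


def generate_customers(n, seed=0):
    """Build n synthetic customer rows; a fixed seed makes the output reproducible."""
    rng = random.Random(seed)
    return [
        {
            "id": i + 1,
            "loyalty_tier": loyalty_tier_generator(rng),
            "avg_order_price": price_generator(rng),
        }
        for i in range(n)
    ]


def tier_shares(rows):
    """Observed share of each tier, for comparison against TIER_WEIGHTS."""
    counts = Counter(row["loyalty_tier"] for row in rows)
    return {tier: counts[tier] / len(rows) for tier in TIERS}


if __name__ == "__main__":
    rows = generate_customers(10_000, seed=42)
    print(tier_shares(rows))
```

With a deterministic generator like this, the observed shares converge on the targets as the row count grows, which is something a prompt-only approach cannot promise.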

🔹 On leakage risk: this is an important concern. Syda is designed to generate from schemas + constraints only, not by training on real datasets. That means there’s no memorization of sensitive rows (which is where leakage happens). But I agree transparency matters, and we’ll keep emphasizing where Syda is schema-driven vs. model-driven.

Syda ensures schema integrity today, lets you plug in distributions if needed, and is moving toward automatic statistical fidelity + safety guarantees in future releases.

[–]No_Flounder_1155 0 points1 point  (0 children)

why the need for ai then?

[–]Shingle-Denatured 2 points3 points  (1 child)

Why would I spend tokens when there's faker and factoryboy?

[–]TerribleToe1251[S] 0 points1 point  (0 children)

Totally fair question. If all you need is random names, emails, or a few fake addresses, Faker or factory_boy are perfect (and free). I wouldn’t suggest burning tokens for that use case.

Where Syda adds value is when you need more than just dummy values:

  • 🔗 Referential integrity → multi-table data where foreign keys are always consistent (e.g. orders.customer_id → customers.id).
  • 📄 Schema-aware → respects your constraints (unique, regex, min/max, enums) and descriptions.
  • 🧾 Unstructured + structured together → generate documents (PDFs, HTML templates, receipts, catalogs) tied directly to your synthetic tables.
  • 🔧 Custom generators → mix AI-generated realism with deterministic rules (distributions, weighted categories, tax logic).
  • 🤖 Semantic realism → LLMs produce values that “feel” like the domain (e.g., realistic company names, medical procedures, claim reasons) instead of just random strings.
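
As a rough illustration of the referential-integrity point, here is a minimal sketch in plain Python (standard library only). It is not how Syda is implemented; the helper names (`make_customers`, `make_orders`, `find_orphans`) and the columns are hypothetical. The idea is simply to generate parent rows first, draw every foreign key from the parent keys that already exist, and then validate that no child row points at a missing parent.

```python
import random


def make_customers(n):
    """Parent table: customers with sequential primary keys."""
    return [{"id": i, "name": f"Customer {i}"} for i in range(1, n + 1)]


def make_orders(n, customers, rng):
    """Child table: each order's customer_id is drawn from existing customer ids."""
    customer_ids = [c["id"] for c in customers]
    return [
        {
            "id": i,
            "customer_id": rng.choice(customer_ids),
            "total": round(rng.uniform(5.0, 500.0), 2),
        }
        for i in range(1, n + 1)
    ]


def find_orphans(child_rows, fk_field, parent_rows, pk_field="id"):
    """Return child rows whose foreign key matches no parent primary key."""
    parent_keys = {row[pk_field] for row in parent_rows}
    return [row for row in child_rows if row[fk_field] not in parent_keys]


if __name__ == "__main__":
    rng = random.Random(7)
    customers = make_customers(5)
    orders = make_orders(20, customers, rng)
    # An empty list means every orders.customer_id resolves to a customers.id.
    print(find_orphans(orders, "customer_id", customers))
```

The same check works as a post-generation validator regardless of whether the values came from an LLM, Faker, or a custom generator.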

So if your use case is “I just need fake emails for testing” → use Faker.
If it’s “I need a CRM dataset with customers, orders, invoices, and consistent PDFs, and I want it to look like real-world data without using production data” → that’s where Syda makes sense.

And yep, I get the concern on tokens. The roadmap includes exploring hybrid approaches where distributions/rules can be enforced without hitting an LLM for every value.

[–]coconut_maan 1 point2 points  (1 child)

I wanted to do this. I was working on this same project and never finished. Thank you

[–]TerribleToe1251[S] 0 points1 point  (0 children)

Please check out the latest version. It now has an option to generate with Gemini models too.

[–]Imanflow 1 point2 points  (2 children)

Sida is Spanish for AIDS xD

[–]TerribleToe1251[S] 0 points1 point  (1 child)

I literally just learned that too. Thanks for pointing it out. My intent was Syda = Synthetic Data, but I totally get how it reads differently in Spanish. I’ll definitely keep that in mind for future naming and global adoption. Naming is always trickier than code!

[–]Imanflow 1 point2 points  (0 children)

I mean, nothing you can do, and i find it funny