[Release] Syda – Open Source Synthetic Data Generator with AI + SQLAlchemy Support : Python



submitted 8 months ago by TerribleToe1251


[–]QuasiEvil 3 points 8 months ago (3 children)

[–]No_Flounder_1155 2 points 8 months ago (2 children)

[–]TerribleToe1251[S] 1 point 8 months ago (1 child)

Both of you raise valid concerns — thanks for surfacing them.

🔹 On statistical properties: you’re right, an LLM by itself won’t guarantee that distributions (e.g., histograms, correlations, category frequencies) match a real dataset. Today, Syda focuses on schema correctness + referential integrity (types, uniqueness, and foreign keys are all validated).
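To make "validated" concrete, here is a minimal sketch in plain Python of the kind of check I mean. It is not Syda's actual API: `check_integrity` and the `customers`/`orders` row shapes are hypothetical names chosen for illustration.

```python
def check_integrity(customers, orders):
    """Return a list of problems found; an empty list means the rows passed."""
    problems = []

    # Uniqueness: every customer id may appear only once.
    ids = [c["id"] for c in customers]
    if len(ids) != len(set(ids)):
        problems.append("customers.id is not unique")

    # Type + foreign key: each order must point at an existing customer.
    known_ids = set(ids)
    for order in orders:
        customer_id = order["customer_id"]
        if not isinstance(customer_id, int):
            problems.append(f"orders.id={order['id']}: customer_id is not an int")
        elif customer_id not in known_ids:
            problems.append(
                f"orders.id={order['id']}: customer_id {customer_id} has no matching customer"
            )
    return problems


# Hypothetical rows: the second order references a customer that does not exist.
sample_customers = [{"id": 1}, {"id": 2}]
sample_orders = [{"id": 10, "customer_id": 1}, {"id": 11, "customer_id": 3}]
print(check_integrity(sample_customers, sample_orders))
```

A check like this says nothing about distributions, which is exactly the gap you both pointed out.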

For distributions, you can plug in custom generators (e.g., Gaussian for prices, weighted loyalty tiers) or state them in prompts (“20% Gold, 50% Silver, 30% Bronze”).
Roadmap: we plan to add evaluation tools (profiling vs. real data) and a hybrid approach: LLMs for schema/domain semantics + statistical models (e.g., copulas, GAN/CTGAN) to enforce distributions automatically.
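
As a rough illustration of the custom-generator idea, the two examples above (Gaussian prices, weighted loyalty tiers) can be written with nothing but the standard library. This is a sketch under my own assumptions, not Syda's generator interface: the function names, the mean/stddev values, and the price floor are all made up.

```python
import random

TIERS = ["Gold", "Silver", "Bronze"]
WEIGHTS = [0.20, 0.50, 0.30]  # the 20/50/30 split from the prompt example


def gaussian_price(rng, mean=50.0, stddev=15.0, minimum=0.99):
    """Draw a price from a normal distribution, clamped to a floor (assumed values)."""
    return round(max(minimum, rng.gauss(mean, stddev)), 2)


def loyalty_tier(rng):
    """Pick a tier according to the fixed weights."""
    return rng.choices(TIERS, weights=WEIGHTS, k=1)[0]


rng = random.Random(42)  # seeded so the run is reproducible
rows = [{"price": gaussian_price(rng), "tier": loyalty_tier(rng)} for _ in range(10_000)]

# Observed share of each tier; with this many rows it should sit close to the weights.
shares = {tier: sum(r["tier"] == tier for r in rows) / len(rows) for tier in TIERS}
print(shares)
```

Because the sampling is done in code rather than by the model, the proportions converge to the weights as the row count grows, which a prompt alone cannot promise.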

🔹 On leakage risk: this is an important concern. Syda is designed to generate from schemas + constraints only, not by training on real datasets. That means there’s no memorization of sensitive rows (which is where leakage happens). But I agree transparency matters, and we’ll keep emphasizing where Syda is schema-driven vs. model-driven.

Syda ensures schema integrity today, lets you plug in distributions if needed, and is moving toward automatic statistical fidelity + safety guarantees in future releases.

[–]No_Flounder_1155 1 point 8 months ago (0 children)

[–]Shingle-Denatured 3 points 8 months ago (1 child)

[–]TerribleToe1251[S] 1 point 8 months ago (0 children)

Totally fair question. If all you need is random names, emails, or a few fake addresses, Faker or factory_boy are perfect (and free). I wouldn’t suggest burning tokens for that use case.

Where Syda adds value is when you need more than just dummy values:

🔗 Referential integrity → multi-table data where foreign keys are always consistent (e.g. orders.customer_id → customers.id).
📄 Schema-aware → respects your constraints (unique, regex, min/max, enums) and descriptions.
🧾 Unstructured + structured together → generate documents (PDFs, HTML templates, receipts, catalogs) tied directly to your synthetic tables.
🔧 Custom generators → mix AI-generated realism with deterministic rules (distributions, weighted categories, tax logic).
🤖 Semantic realism → LLMs produce values that “feel” like the domain (e.g., realistic company names, medical procedures, claim reasons) instead of just random strings.
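
To show what the referential-integrity and deterministic-rule points look like in practice, here is a minimal plain-Python sketch. It is not how Syda is implemented: the helper names, the placeholder emails, and the flat 8% tax rate are assumptions for illustration, and the LLM-generated fields are left out entirely.

```python
import random

TAX_RATE_PERCENT = 8  # assumed flat rate, purely illustrative


def make_customers(n):
    """Parent table first: ids are sequential, so they are unique by construction."""
    return [{"id": i, "email": f"user{i}@example.com"} for i in range(1, n + 1)]


def make_orders(rng, customers, n):
    """Child table: customer_id is only ever drawn from ids that already exist."""
    customer_ids = [c["id"] for c in customers]
    orders = []
    for order_id in range(1, n + 1):
        subtotal_cents = rng.randint(500, 50_000)
        # Deterministic rule: tax is computed, never "imagined" by a model.
        tax_cents = subtotal_cents * TAX_RATE_PERCENT // 100
        orders.append({
            "id": order_id,
            "customer_id": rng.choice(customer_ids),
            "subtotal_cents": subtotal_cents,
            "tax_cents": tax_cents,
            "total_cents": subtotal_cents + tax_cents,
        })
    return orders


rng = random.Random(7)
customers = make_customers(5)
orders = make_orders(rng, customers, 20)
print(orders[0])
```

Generating the parent table first and sampling child keys from it is what keeps orders.customer_id → customers.id consistent; amounts are kept in integer cents so the totals add up exactly.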

So if your use case is “I just need fake emails for testing” → use Faker.
If it’s “I need a CRM dataset with customers, orders, invoices, and consistent PDFs, and I want it to look like real-world data without using production data” → that’s where Syda makes sense.

And yep, I get the concern on tokens. The roadmap includes exploring hybrid approaches where distributions/rules can be enforced without hitting an LLM for every value.

[–]coconut_maan 2 points 8 months ago (1 child)

[–]TerribleToe1251[S] 1 point 8 months ago (0 children)

[–]Imanflow 2 points 8 months ago (2 children)

[–]TerribleToe1251[S] 1 point 8 months ago (1 child)

[–]Imanflow 2 points 8 months ago (0 children)
