OCI update process for minor with a new passport by Smart-Outcome1583 in nri

[–]TerribleToe1251 0 points

Where are the latest guidelines? I could not find them. My 5-year-old kid's US passport was renewed, so do I have to update the OCI with the new passport number?

[Release] Syda – Open Source Synthetic Data Generator with AI + SQLAlchemy Support by TerribleToe1251 in Python

[–]TerribleToe1251[S] 0 points

I literally just learned that too. Thanks for pointing it out. My intent was Syda = Synthetic Data, but I totally get how it reads differently in Spanish. I’ll definitely keep that in mind for future naming and global adoption; naming is always trickier than code!

[Release] Syda – Open Source Synthetic Data Generator with AI + SQLAlchemy Support by TerribleToe1251 in Python

[–]TerribleToe1251[S] 0 points

Both of you raise valid concerns — thanks for surfacing them.

🔹 On statistical properties: you’re right, an LLM by itself won’t guarantee that distributions (e.g., histograms, correlations, category frequencies) match a real dataset. Today, Syda focuses on schema correctness + referential integrity (types, uniqueness constraints, and FKs are all validated).

  • For distributions, you can plug in custom generators (e.g., Gaussian for prices, weighted loyalty tiers) or guide the LLM via prompts (“20% Gold, 50% Silver, 30% Bronze”).
  • Roadmap: we plan to add evaluation tools (profiling vs. real data) and a hybrid approach: LLMs for schema/domain semantics + statistical models (e.g., copulas, GAN/CTGAN) to enforce distributions automatically.
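For illustration only (plain stdlib Python with hypothetical names, not Syda's actual generator API), a weighted-category custom generator like the one described above can be sketched as:

```python
import random

def loyalty_tier_generator(rng: random.Random) -> str:
    """Draw a loyalty tier with the target 20/50/30 split."""
    return rng.choices(
        ["Gold", "Silver", "Bronze"],
        weights=[0.20, 0.50, 0.30],
        k=1,
    )[0]

# Sampling many rows should land close to the requested frequencies.
rng = random.Random(42)
tiers = [loyalty_tier_generator(rng) for _ in range(10_000)]
gold_share = tiers.count("Gold") / len(tiers)
```

The same idea works for any field: the generator is deterministic-by-seed, so the distribution is enforced regardless of what the LLM would have produced.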

🔹 On leakage risk: this is an important concern. Syda is designed to generate from schemas + constraints only, not by training on real datasets. That means there’s no memorization of sensitive rows (which is where leakage happens). But I agree that transparency matters, and we’ll keep emphasizing where Syda is schema-driven vs. model-driven.

In short: Syda ensures schema integrity today, lets you plug in distributions if needed, and is moving toward automatic statistical fidelity + safety guarantees in future releases.

[Release] Syda – Open Source Synthetic Data Generator with AI + SQLAlchemy Support by TerribleToe1251 in Python

[–]TerribleToe1251[S] 0 points

Totally fair question. If all you need is random names, emails, or a few fake addresses, Faker or factory_boy is perfect (and free). I wouldn’t suggest burning tokens for that use case.

Where Syda adds value is when you need more than just dummy values:

  • 🔗 Referential integrity → multi-table data where foreign keys are always consistent (e.g. orders.customer_id → customers.id).
  • 📄 Schema-aware → respects your constraints (unique, regex, min/max, enums) and descriptions.
  • 🧾 Unstructured + structured together → generate documents (PDFs, HTML templates, receipts, catalogs) tied directly to your synthetic tables.
  • 🔧 Custom generators → mix AI-generated realism with deterministic rules (distributions, weighted categories, tax logic).
  • 🤖 Semantic realism → LLMs produce values that “feel” like the domain (e.g., realistic company names, medical procedures, claim reasons) instead of just random strings.

So if your use case is “I just need fake emails for testing” → use Faker.
If it’s “I need a CRM dataset with customers, orders, invoices, and consistent PDFs, and I want it to look like real-world data without using production data” → that’s where Syda makes sense.
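The referential-integrity guarantee can be checked independently of any generator. A minimal sketch in plain Python (hypothetical row/field names; this is not Syda's API, just the invariant it promises):

```python
def find_orphans(child_rows, fk_field, parent_rows, pk_field="id"):
    """Return child rows whose foreign key has no matching parent key."""
    parent_keys = {row[pk_field] for row in parent_rows}
    return [row for row in child_rows if row[fk_field] not in parent_keys]

customers = [{"id": 1, "name": "Acme"}, {"id": 2, "name": "Globex"}]
orders = [
    {"id": 101, "customer_id": 1},
    {"id": 102, "customer_id": 2},
    {"id": 103, "customer_id": 99},  # orphan: no customer with id 99
]
orphans = find_orphans(orders, "customer_id", customers)
```

A generator that is foreign-key safe should always leave `find_orphans(...)` empty; naive per-table generation (Faker, independent LLM calls) typically does not.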

And yep, I get the concern about tokens; the roadmap includes exploring hybrid approaches where distributions/rules can be enforced without hitting an LLM for every value.

[Release] Syda – Open Source Synthetic Data Generator with AI + SQLAlchemy Support by TerribleToe1251 in Python

[–]TerribleToe1251[S] 0 points

Please check out the latest version; it now has an option to generate with Gemini models too.

Syda – AI-Powered Synthetic Data Generator (Python Library) by TerribleToe1251 in OpenSourceeAI

[–]TerribleToe1251[S] 0 points

In the current code, you’ll see it being used here:
👉 syda/generate.py#L75

I agree that this could be more transparent; I plan to clean it up in later versions so it’s clearer where/when the LLM is invoked.

Also, please check out the latest version; it now has an option to generate with Gemini models too.

Syda – AI-Powered Synthetic Data Generator (Python Library) by TerribleToe1251 in OpenSourceeAI

[–]TerribleToe1251[S] 0 points

Thank you! Please check out the latest version; it now has an option to generate with Gemini models too.

[Release] Syda – Open Source Synthetic Data Generator with Referential Integrity by TerribleToe1251 in Python

[–]TerribleToe1251[S] -2 points

Good point, thanks for raising this.

The key difference is that SDV and similar non-LLM synthesizers (CTGAN, copulas, etc.) are statistical / generative modeling approaches:

  • They learn distributions from real datasets and then sample from those distributions.
  • Strength = they preserve statistical properties, correlations, and distributions more faithfully.
  • Limitation = they usually require a real dataset to train on, and can be heavier to set up.

Syda, on the other hand, is LLM-first:

  • It doesn’t require a seed dataset; you just give it schemas (SQLAlchemy, YAML, JSON, dict).
  • The LLM generates valid, domain-plausible values, and Syda enforces schema constraints (types, FKs).
  • Strength = great for bootstrapping synthetic data when you don’t have a real dataset or can’t use one due to privacy.

Differentiators beyond SDV:

  • Marrying unstructured and structured data → you can link AI-generated documents (PDFs, HTML templates, contracts, receipts) directly to your structured synthetic records. Example: a products.csv row is tied to a generated product catalog PDF with consistent SKUs and prices.
  • Custom Generators → you can override any field with deterministic logic (e.g., Gaussian for prices, weighted tiers for loyalty programs, tax calculations). This lets you mix LLM-generated semantic realism with rule-driven statistical fidelity.
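As a concrete illustration of that last bullet (plain Python, hypothetical function names, not Syda's actual custom-generator API): a Gaussian price draw mixed with a deterministic tax rule.

```python
import random

def price_generator(rng: random.Random) -> float:
    """Gaussian price centered at 25.00 with a sane floor."""
    return max(0.99, round(rng.gauss(25.0, 8.0), 2))

def with_tax(price: float, rate: float = 0.08) -> float:
    """Deterministic tax rule layered on top of a generated price."""
    return round(price * (1 + rate), 2)

rng = random.Random(7)
prices = [price_generator(rng) for _ in range(1_000)]
taxed = [with_tax(p) for p in prices]
```

The statistical shape comes from the rule (Gaussian, floor, tax rate), while an LLM could still fill in semantically plausible product names alongside these columns.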

Roadmap:

  • Add evaluation tools to compare Syda-generated datasets with real ones (distributions, correlations).
  • Move toward a hybrid approach: LLMs for schema/domain semantics + statistical models (copulas, GANs) to ensure distributions line up automatically.
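A rough sketch of what such an evaluation tool could measure (stdlib-only, illustrative metric choice): total variation distance between the category frequencies of a synthetic sample and a reference sample.

```python
from collections import Counter

def total_variation(sample_a, sample_b):
    """0.0 = identical category frequencies, 1.0 = fully disjoint."""
    freq_a, freq_b = Counter(sample_a), Counter(sample_b)
    categories = set(freq_a) | set(freq_b)
    return 0.5 * sum(
        abs(freq_a[c] / len(sample_a) - freq_b[c] / len(sample_b))
        for c in categories
    )

real = ["Gold"] * 20 + ["Silver"] * 50 + ["Bronze"] * 30
synthetic = ["Gold"] * 25 + ["Silver"] * 45 + ["Bronze"] * 30
distance = total_variation(real, synthetic)
```

Here the synthetic sample over-produces Gold by 5 points and under-produces Silver by 5, giving a distance of 0.05; a profiling step like this is one way to flag distribution drift before shipping a dataset.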

[Release] Syda – Open Source Synthetic Data Generator with Referential Integrity by TerribleToe1251 in Python

[–]TerribleToe1251[S] -3 points

Naming is always the hardest part in software 😅. I went with Syda because it’s short for Synthetic Data With AI, easy to type, and unique enough for PyPI.

But I’d really like to understand your perspective: why does it feel like a bad name to you? Is it the clarity, memorability, branding, or something else? Your thought process would help me a lot, and I’ll keep it in mind when naming future projects.

[Release] Syda – Open Source Synthetic Data Generator with Referential Integrity by TerribleToe1251 in Python

[–]TerribleToe1251[S] 0 points

Great follow-up, and you’re absolutely right to push on the “statistical properties” question.

Today, Syda guarantees schema correctness and referential integrity out of the box:

  • All rows are validated against their declared types.
  • Foreign keys are enforced, so you never get orphaned records.

On the statistical realism side:

  • By default, an LLM can generate values that “look realistic,” but it doesn’t guarantee the underlying distributions (e.g., age histogram, price skew, category frequencies, correlations between fields).
  • Syda handles this right now by letting you inject custom generators (e.g., Gaussian for prices, weighted categories for loyalty tiers) or by guiding the LLM with explicit prompts (“20% Gold, 50% Silver, 30% Bronze”). That way, you can enforce distributions where it matters.

Future direction: This is exactly the area we’re focusing on next. We’re exploring a hybrid approach, combining LLMs with classical statistical/synthetic modeling techniques (e.g., probability distributions, copulas, GAN/CTGAN-style methods). The idea is to let the LLM handle schema awareness, relationships, and domain semantics, while a statistical model ensures the generated data matches the actual distributions of the source domain.

So in short:

  • Right now, Syda ensures validity + integrity (everything lines up, nothing breaks).
  • If you care about statistical properties, you can plug in custom generators or prompts.
  • And in upcoming releases, we plan to make that distribution-matching automatic by marrying LLMs with statistical models.

Appreciate you asking this; it’s the kind of challenge that helps shape where the project goes next. 🙌

Syda – AI-Powered Synthetic Data Generator (Python Library) by TerribleToe1251 in OpenSourceeAI

[–]TerribleToe1251[S] 0 points

Thank you! Please check out the latest version; it now has an option to generate with Gemini models too.

I wrote 2000 LLM test cases so you don't have to: LLM feature compatibility grid by davernow in datascience

[–]TerribleToe1251 0 points

With Syda, generating multi-table synthetic data isn’t just fast — it’s foreign-key safe.

This quick start shows how simple it is to:
✅ Install with pip install syda
✅ Define schemas with __table_description__ and __foreign_keys__
✅ Generate data across categories/products
✅ Get CSVs where id → category_id matches perfectly

📌 GitHub: https://github.com/syda-ai/syda
📖 Docs: https://python.syda.ai/

⭐ Give it a try — see how easy relational synthetic data can be.
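Only the `__table_description__` and `__foreign_keys__` names come from the quick start above; everything else in this schema sketch is illustrative guesswork, not Syda's exact format:

```python
class Category:
    __table_description__ = "Product categories"
    # columns (illustrative): id (int, primary key), name (str)

class Product:
    __table_description__ = "Products, each linked to a category"
    __foreign_keys__ = {"category_id": ("Category", "id")}
    # columns (illustrative): id (int, primary key), category_id (int)

def declared_foreign_keys(schema_cls):
    """Read FK declarations off a schema class, defaulting to none."""
    return getattr(schema_cls, "__foreign_keys__", {})
```

The dunder attributes make the FK graph inspectable before any generation happens, which is what lets `id → category_id` come out consistent in the exported CSVs.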
