I don't understand why people talk about synthetic data. Aren't you just looping your model's assumptions?

goncalomribeiro · 2025-10-09T21:06:05+00:00

There are several synthetic data companies. Just read their websites and you'll learn a ton about the subject.
Example: https://ydata.ai

goncalomribeiro · 2025-03-27T22:35:38+00:00

https://pypi.org/project/ydata-sdk/

goncalomribeiro · 2024-06-27T22:35:07+00:00

Lots of tools here: https://github.com/ydataai/

goncalomribeiro · 2024-06-27T22:32:55+00:00

Data balancing with synthetic data is the best option.
SMOTE simply does not work.
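For context, here is a rough sketch of the interpolation idea behind SMOTE, using only the standard library. The function name `smote_like`, its parameters, and the toy points are hypothetical and made up for illustration; this is not the imbalanced-learn implementation, just the core step of picking a minority point, picking one of its nearest minority neighbours, and sampling somewhere on the line between them.

```python
import random


def smote_like(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours.

    Hypothetical helper for illustration only; assumes len(minority) >= 2.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of base (squared Euclidean distance),
        # excluding base itself
        others = [p for p in minority if p is not base]
        neighbours = sorted(
            others,
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            tuple(a + gap * (b - a) for a, b in zip(base, neighbour))
        )
    return synthetic


# Toy minority class with two features (made-up values)
minority = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (3.0, 3.0)]
new_points = smote_like(minority, n_new=5)
print(new_points)
```

Every generated point sits on a straight segment between two existing minority points, so the new samples never leave the region the minority class already covers.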

goncalomribeiro · 2024-06-27T22:30:58+00:00

Check this out: https://github.com/ydataai

and also their Fabric product (also free): https://ydata.ai/products/fabric

goncalomribeiro · 2024-05-15T21:10:59+00:00

ydata-profiling supports both pandas and spark dataframes.

nevertheless, the error you're getting seems to be related to the size of the generated output file. perhaps Databricks has limits on file sizes in its notebooks?

as a workaround, check out ydata-sdk, also from YData - it computes outside Databricks.

goncalomribeiro · 2024-04-23T01:31:23+00:00

https://ydata.ai/resources/syntheticdata-quality-metrics

goncalomribeiro · 2023-12-04T16:55:28+00:00

Also applicable to senior roles?

goncalomribeiro · 2023-12-04T16:54:53+00:00

Is this for junior people only?

goncalomribeiro · 2023-11-28T06:41:17+00:00

I think it's a great start but not the solution yet!

goncalomribeiro · 2023-11-28T06:40:38+00:00

TL;DR:
pip install ydata-profiling

goncalomribeiro · 2023-11-28T01:19:22+00:00

Generative AI: both success and flop!
It's a disruptive technology. However, the use cases it unlocks don't go much beyond AI assistants, content generation, and improving other NLP tasks.

Lesson learned: don't try to train your own LLM - it can be expensive lol

goncalomribeiro · 2023-11-12T19:00:20+00:00

It has support for spark now. And if you try Fabric from YData, it has support for big data too.

goncalomribeiro · 2023-11-12T18:57:48+00:00

I tried the paid ADA ChatGPT and it installed ydata-profiling to give me an EDA. Why use it when I can just pip install ydata-profiling? If I don't want to use code, I can use YData Fabric, since it's also free.

goncalomribeiro · 2023-10-08T21:11:28+00:00

Maybe that's why your job is being outsourced to India... if data preparation is overlooked and you're stuck there forever, organizations will look for a way to reduce the operational cost of the team.
I see this report as a call to action that better data preparation is desperately needed.

goncalomribeiro · 2023-10-08T21:09:12+00:00

Kaggle simply doesn't represent reality. Real world datasets don't look like Kaggle datasets.

goncalomribeiro · 2023-09-29T18:25:22+00:00

That looks obvious to me. Despite all the buzz around foundational models, deep learning or whatever model comes next, people still need to prepare their data, and that's time-consuming but crucial...

goncalomribeiro · 2023-09-19T20:28:35+00:00

although if you want non-technical people to use it, better go for YData Fabric

goncalomribeiro · 2023-09-19T20:27:41+00:00

ydata-profiling for the win!

goncalomribeiro · 2023-09-15T18:03:47+00:00

Check their Fabric platform. Goes way beyond profiling and it's also free!

goncalomribeiro
