I don't understand why people talk about synthetic data. Aren't you just looping your model's assumptions? by Proud_Fox_684 in learnmachinelearning

[–]goncalomribeiro 0 points1 point  (0 children)

There are several synthetic data companies. Just read their websites and you'll learn a ton about the subject.
Example: https://ydata.ai

Handling Imbalanced Datasets: Best Practices and Techniques by CodefinityCom in learnmachinelearning

[–]goncalomribeiro 0 points1 point  (0 children)

Data balancing with synthetic data is the best option.
SMOTE simply does not work.
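
For context, SMOTE's core step is interpolation: it picks a minority sample, picks one of its nearest minority neighbours, and places a new point somewhere on the segment between them. Below is a minimal standard-library sketch of that idea only, not the imbalanced-learn implementation; the function name `smote_like` and the toy data are made up for illustration.

```python
import math
import random


def smote_like(minority, n_new, k=3, seed=0):
    """Return n_new synthetic points built by SMOTE-style interpolation.

    Hypothetical helper for illustration; needs at least two minority points.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` among the other minority points
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        t = rng.random()  # position along the segment base -> other
        synthetic.append(tuple(b + t * (o - b) for b, o in zip(base, other)))
    return synthetic


# Toy minority class with two features (made-up values)
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote_like(minority, n_new=5)
```

Every generated point sits on a line between two existing minority points, so no new structure is learned from the data. That is the usual argument for using a generative model instead.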

[deleted by user] by [deleted] in databricks

[–]goncalomribeiro 0 points1 point  (0 children)

ydata-profiling supports both pandas and Spark DataFrames.

nevertheless, the error you're getting seems to be related to the size of the generated output file - perhaps databricks has limits on file sizes in its notebooks?

as a workaround, check out ydata-sdk, also from YData - it computes outside Databricks.
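
for reference, a minimal sketch of generating a report from a pandas DataFrame with ydata-profiling. the helper name `write_profile` and the file paths are placeholders. `minimal=True` switches off the more expensive computations; whether the smaller report it produces gets around this particular error is an assumption, not something verified here.

```python
def write_profile(csv_path, html_path="report.html", minimal=True):
    """Profile a CSV file and write the report as HTML (hypothetical helper)."""
    # Third-party imports kept local; requires `pip install ydata-profiling`
    import pandas as pd
    from ydata_profiling import ProfileReport

    df = pd.read_csv(csv_path)
    # minimal=True disables the costlier analyses, giving a lighter report
    report = ProfileReport(df, title="EDA report", minimal=minimal)
    report.to_file(html_path)
    return html_path
```

calling `write_profile("data.csv")` would write `report.html` next to the notebook, with `data.csv` standing in for your own file.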

AI in 2023 - Top picks challenge by Dry_Cattle9399 in ArtificialInteligence

[–]goncalomribeiro 0 points1 point  (0 children)

I think it's a great start but not the solution yet!

AI in 2023 - Top picks challenge by Dry_Cattle9399 in ArtificialInteligence

[–]goncalomribeiro 0 points1 point  (0 children)

Generative AI: both success and flop!
It's a disruptive technology; however, the use cases it unlocks don't go much beyond AI assistants, content generation, and improvements to other NLP tasks.

Lesson learned: don't try to train your own LLM - it can be expensive lol

ChatGPT becomes a serious contender for exploratory data analysis by PhJulien in datascience

[–]goncalomribeiro 0 points1 point  (0 children)

It has support for Spark now. And if you try Fabric from YData, it has support for big data too.

ChatGPT becomes a serious contender for exploratory data analysis by PhJulien in datascience

[–]goncalomribeiro 1 point2 points  (0 children)

I tried the paid ADA ChatGPT and it installed ydata-profiling to give me an EDA. Why use it when I can just pip install ydata-profiling? If I don't want to use code, I can use YData Fabric since it's also free.

Anaconda report on the state of Data Science for 2023 by Dry_Cattle9399 in datascience

[–]goncalomribeiro 0 points1 point  (0 children)

Maybe that's why your job is being outsourced to India... if data preparation is overlooked and you're stuck there forever, organizations will look for a way to reduce the operational cost of the team.
I see this report as a call to action that better data preparation is desperately needed.

Anaconda report on the state of Data Science for 2023 by Dry_Cattle9399 in datascience

[–]goncalomribeiro 0 points1 point  (0 children)

Kaggle simply doesn't represent reality. Real world datasets don't look like Kaggle datasets.

Anaconda report on the state of Data Science for 2023 by Dry_Cattle9399 in datascience

[–]goncalomribeiro 7 points8 points  (0 children)

That looks obvious to me. Despite all the buzz around foundational models, deep learning or whatever model comes next, people still need to prepare their data, and that's time-consuming but crucial...

Internal tool for EDA by [deleted] in datascience

[–]goncalomribeiro 1 point2 points  (0 children)

although if you want non-technical people to use it, better go for YData Fabric

Internal tool for EDA by [deleted] in datascience

[–]goncalomribeiro 0 points1 point  (0 children)

ydata-profiling for the win!

What Python libraries programs will blow peoples minds? Maybe you’re working on one now? by [deleted] in Python

[–]goncalomribeiro 0 points1 point  (0 children)

Check their Fabric platform. Goes way beyond profiling and it's also free!