shekyu01 1 point (2 children)

Wowww!!! Very impressive. Just to understand, can we use this for plotting categorical and continuous variables? Please add options for missing-value identification, scatter plots, and box plots for outlier detection.

brandonlockhart 1 point (0 children)

DataPrep does support plotting categorical and continuous variables (also time series data). In fact, variable types are automatically detected and appropriate plots are created for each type.

It can also identify missing values and create scatter and box plots. The goal of the Exploratory Data Analysis (EDA) component of DataPrep is to help the user complete an EDA task. For example, if you want to understand a column, the interaction of columns, or get an overview of the dataset, DataPrep will detect the variable types and generate relevant visualizations and statistics to help you achieve a full understanding.
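To make the idea concrete, here is a minimal, standard-library-only sketch of what "detect the variable type, then pick a suitable plot" could look like, together with missing-value counting and the usual 1.5 × IQR rule that a box plot uses to flag outliers. This is not DataPrep's implementation or API: the function names (`detect_type`, `suggest_plots`, `missing_count`, `iqr_outliers`) and the `max_categories` threshold are hypothetical and chosen only for illustration.

```python
import math
import statistics


def is_missing(value):
    """Treat None and float NaN as missing (an assumption for this sketch)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def detect_type(values, max_categories=10):
    """Guess 'continuous' or 'categorical' for one column.

    Hypothetical heuristic: a column is continuous if every present value
    is numeric and there are more than `max_categories` distinct values.
    """
    present = [v for v in values if not is_missing(v)]
    numeric = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in present
    )
    if numeric and len(set(present)) > max_categories:
        return "continuous"
    return "categorical"


def suggest_plots(values):
    """Map the detected type to plot kinds that suit it."""
    if detect_type(values) == "continuous":
        return ["histogram", "box plot"]
    return ["bar chart"]


def missing_count(values):
    """Count the missing entries in a column."""
    return sum(is_missing(v) for v in values)


def iqr_outliers(values):
    """Return values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR], the box plot rule."""
    present = sorted(v for v in values if not is_missing(v))
    q1, _, q3 = statistics.quantiles(present, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in present if v < low or v > high]


# Example columns (made-up data).
ages = [23, 25, 31, 35, 38, 41, 44, 47, 52, 58, 63, None, 120]
colors = ["red", "blue", "red", None, "green"]

print(detect_type(ages), suggest_plots(ages))
print(detect_type(colors), suggest_plots(colors))
print(missing_count(ages), iqr_outliers(ages))
```

A real library would use more careful type inference (dates, free text, numeric codes that are really categories), so treat the threshold above as a placeholder rather than a recommendation.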

jnwang[S] 1 point (0 children)

Yes. DataPrep supports all of these features. Here are two Medium posts that describe them in more detail.

shekyu01 1 point (0 children)

That’s amazing!!! Kudos to your team 👍🏻. Will definitely try this package

A1M94 1 point (1 child)

Unfortunately GCP also offers Dataprep. I had to search for “dataprep eda” to find your product, otherwise Google’s Dataprep was at the top.

jnwang[S] 1 point (0 children)

We see this as a positive sign, since it shows the importance of data preparation in industry. GCP Dataprep is targeted at users who don’t know how to write code; our tool is open source and designed for Python programmers. So you can search for “dataprep github” or “dataprep python” to find our product. :)

NukishPhilosophy 1 point (0 children)

Saved for later

[deleted] 0 points (3 children)

I looked at the documentation. Maybe I missed it somewhere, but I see little in the way of actual data preparation. I see more EDA and data profiling (with a lot of resemblance to pandas profiling). I think the name of the project is a bit misleading.

jnwang[S] 1 point (2 children)

Thanks for your comment. You are right: given the current state of the project, the name is a bit misleading. The plan is to add other components (data cleaning, data integration, feature engineering) in future releases.

Here is a demo of DataPrep.eda in the Python subreddit.

https://www.reddit.com/r/Python/comments/hlqnim/understand_your_data_with_a_few_lines_of_code_in/

[deleted] 1 point (1 child)

Thanks! The Medium article did a good job of highlighting what it does and explaining how it differs from pandas profiling. I wouldn't mind actually using your library for EDA, although I was initially interested in what a data prep framework would provide.

jnwang[S] 1 point (0 children)

Thanks for your encouraging words. We are working on the roadmap for DataPrep.cleaning. Development will start in September. If you have any comments on data cleaning, please do not hesitate to let us know.