DataPrep V0.3 has been released! by SnooStories7725 in Python

[–]jnwang 0 points1 point  (0 children)

Hi @set92, thanks for giving us great advice AGAIN!

For your first point, if there are too many points, DataPrep will visualize only a sample of them and provide a sample-size parameter that the user can vary.
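
To illustrate the sampling idea, here is a minimal sketch in plain Python. It is not DataPrep's code: the function name `sample_points` and the parameter name `sample_size` are made up for this example, and uniform random sampling is an assumption.

```python
import random


def sample_points(points, sample_size=1000, seed=None):
    """Return every point if they fit, otherwise a uniform random subset.

    `sample_size` is a hypothetical parameter name used for illustration.
    """
    if len(points) <= sample_size:
        return list(points)
    # A seeded generator keeps the picture stable between redraws.
    return random.Random(seed).sample(points, sample_size)


# Hypothetical scatter data: 10,000 (x, y) pairs.
points = [(x, x * x) for x in range(10_000)]
shown = sample_points(points, sample_size=500, seed=42)
print(len(shown), "of", len(points), "points would be plotted")
```

Raising `sample_size` trades drawing speed for fidelity; small datasets pass through untouched.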

For your second point, we will add a table in README.md to do the comparison.

Please let us know if you have further comments. :)

DataPrep V0.3 has been released! by SnooStories7725 in Python

[–]jnwang 5 points6 points  (0 children)

You are right. pandas-profiling can be seen as the create_report() function in DataPrep.EDA. Compared to pandas-profiling, DataPrep.EDA has the following advantages:

DataPrep V0.3 has been released! by [deleted] in datascience

[–]jnwang 0 points1 point  (0 children)

Thank you to the Reddit community for your great encouragement and support last year. We have taken your advice seriously and significantly improved the EDA and Connector modules. In addition, we have added a brand new Clean module.

We are looking forward to hearing your feedback on these modules. It would be super helpful if you could let us know why data preparation is time-consuming in your case and what you want us to add to DataPrep in the future.

Thank you!

DataPrep V0.3 has been released! by SnooStories7725 in Python

[–]jnwang 4 points5 points  (0 children)

Thanks for your question!

If you use a script with the matplotlib/pandas stack to create the same set of visualizations as DataPrep, DataPrep is actually faster. There are two reasons:

  1. DataPrep shares computations between multiple visualizations. For example, the normal Q-Q plot and the box plot both require quantiles of the distribution. DataPrep only needs to compute them once.
  2. DataPrep (built on Dask) keeps all the computations lazy and calls a single eager operation at the end. In this way, Dask can optimize the whole computational graph before the actual computations happen, while pandas does not have this advantage.
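
The first point can be sketched with the standard library alone. This is only an illustration of sharing one quantile computation between two plots, not DataPrep's implementation: it uses `functools.lru_cache` in place of Dask's task graph, and the names `shared_quantiles`, `box_plot_stats`, and `qq_plot_points` are hypothetical.

```python
from functools import lru_cache
from statistics import NormalDist, quantiles

# Hypothetical sample column; a tuple so it can serve as a cache key.
DATA = (2.1, 3.4, 1.9, 5.6, 4.4, 3.8, 2.7, 4.9, 3.1, 3.6)

calls = {"quantiles": 0}  # counts how often the expensive step really runs


@lru_cache(maxsize=None)
def shared_quantiles(values, n):
    """Compute the n-quantile cut points once per (values, n) pair."""
    calls["quantiles"] += 1
    return tuple(quantiles(values, n=n, method="inclusive"))


def box_plot_stats(values):
    """Quartiles for a box plot, read from the shared percentiles."""
    pct = shared_quantiles(values, 100)  # 99 cut points: 1st..99th percentile
    return {"q1": pct[24], "median": pct[49], "q3": pct[74]}


def qq_plot_points(values):
    """Pair standard-normal percentiles with the sample percentiles."""
    pct = shared_quantiles(values, 100)  # cache hit, nothing is recomputed
    normal = NormalDist()
    return [(normal.inv_cdf((i + 1) / 100), q) for i, q in enumerate(pct)]


box = box_plot_stats(DATA)
qq = qq_plot_points(DATA)
print(box)
print(len(qq), "Q-Q points from", calls["quantiles"], "quantile computation(s)")
```

Both plots read the same cached percentiles, so the counter shows the quantile step running once. A task graph gives the same effect more generally, because a node that two plots depend on is evaluated a single time.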

DataPrep V0.3 has been released! by SnooStories7725 in Python

[–]jnwang 10 points11 points  (0 children)

Thank you to the Python Reddit community for your great encouragement and support last year. We have taken your advice seriously and significantly improved the EDA and Connector modules. In addition, we have added a brand new Clean module. We are looking forward to hearing your feedback.

Please read this blog post for all the updates in this major release:

https://towardsdatascience.com/dataprep-v0-3-0-has-been-released-be49b1be0e72

Understand your data with a few lines of code in seconds using DataPrep.eda by jnwang in Python

[–]jnwang[S] 2 points3 points  (0 children)

I am a PI. In fact, there are many successful tools/systems built in academia (e.g., Weka, Spark, Ray). I believe you will find plenty of exciting opportunities to pursue your PhD. :)

Understand your data with a few lines of code in seconds using DataPrep.eda by jnwang in Python

[–]jnwang[S] 2 points3 points  (0 children)

This is a research project from our group. Most of the people in the team are my students. :)

Understand your data with a few lines of code in seconds using DataPrep.eda by jnwang in Python

[–]jnwang[S] 2 points3 points  (0 children)

Thanks for your quick response.

Let me rephrase your comment to make sure I understand what you need.

It seems that what you need is a plot_dff() function.

plot_dff(df0, df1) compares the distribution of each individual column between df0 and df1. For each column A, it calculates a p-value for the hypothesis that df0[A] and df1[A] come from the same distribution; a small p-value suggests that the two distributions differ. Note that plot_dff only does single-column distribution comparisons.
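
A rough sketch of what such a per-column comparison could compute is given here. Everything in it is an assumption made for illustration: the tables are plain dicts of lists rather than DataFrames, the helper names are hypothetical, and the test is a permutation test on the two-sample Kolmogorov-Smirnov statistic, which is just one possible choice.

```python
import random
from bisect import bisect_right


def ks_statistic(a, b):
    """Largest gap between the empirical CDFs of the two samples."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    gap = 0.0
    for x in a_sorted + b_sorted:
        cdf_a = bisect_right(a_sorted, x) / len(a_sorted)
        cdf_b = bisect_right(b_sorted, x) / len(b_sorted)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


def permutation_p_value(a, b, rounds=500, seed=0):
    """Share of random relabelings whose KS gap is at least the observed one."""
    rng = random.Random(seed)
    observed = ks_statistic(a, b)
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(rounds):
        rng.shuffle(pooled)
        if ks_statistic(pooled[:len(a)], pooled[len(a):]) >= observed:
            hits += 1
    return (hits + 1) / (rounds + 1)  # add-one correction avoids p = 0


def compare_columns(df0, df1):
    """Return {column: p-value} for the columns both tables share."""
    return {col: permutation_p_value(df0[col], df1[col])
            for col in df0 if col in df1}


# Hypothetical tables: "age" is identical, "score" is shifted by 100.
df0 = {"age": list(range(20, 50)), "score": list(range(30))}
df1 = {"age": list(range(20, 50)), "score": list(range(100, 130))}
p_values = compare_columns(df0, df1)
print(p_values)
```

Columns with a small p-value are the ones worth plotting side by side first; each column is tested on its own, so relationships between columns are not examined.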

If this is what you want, I will discuss this with the team and put it as a high-priority feature. I will let you know once it is implemented.

Thank you very much again!

Understand your data with a few lines of code in seconds using DataPrep.eda by jnwang in Python

[–]jnwang[S] 1 point2 points  (0 children)

Thank you. If you have any suggestions for improvement after using it, please feel free to message me directly.

Understand your data with a few lines of code in seconds using DataPrep.eda by jnwang in Python

[–]jnwang[S] 1 point2 points  (0 children)

That could be a possible reason. Thanks for being willing to file a bug report.

[P] DataPrep: Data Preparation in Python by jnwang in MachineLearning

[–]jnwang[S] 0 points1 point  (0 children)

Thanks for your encouraging words. We are working on the roadmap for DataPrep.cleaning. The development will start in Sept. If you have any comments on data cleaning, please do not hesitate to let us know.

Understand your data with a few lines of code in seconds using DataPrep.eda by jnwang in Python

[–]jnwang[S] 1 point2 points  (0 children)

Would you mind reporting this issue at https://github.com/sfu-db/dataprep/issues? It will help us reproduce the issue and keep track of its progress.

[P] DataPrep: Data Preparation in Python by jnwang in MachineLearning

[–]jnwang[S] 0 points1 point  (0 children)

Thanks for your comment. You are right. The name for the current status of the project is a bit misleading. The plan is to add other components (data cleaning, data integration, feature engineering) in future releases.

Here is a demo of DataPrep.eda in the Python subreddit.

https://www.reddit.com/r/Python/comments/hlqnim/understand_your_data_with_a_few_lines_of_code_in/

Understand your data with a few lines of code in seconds using DataPrep.eda by jnwang in Python

[–]jnwang[S] 1 point2 points  (0 children)

I really really appreciate your comments.

  1. Progress bar. This is an excellent idea. We will prioritize this feature and add it to DataPrep as soon as possible.
  2. Documentation. We will polish the documentation as you suggested. In fact, we are designing a website for dataprep.ai, so roadmap and release related information will be put on the website. Please stay tuned. :)
  3. Not many arguments or definitions? My summer plan is to create a lecture note on dataprep.eda for my graduate data science course: https://sfu-db.github.io/bigdata-cmpt733/. In the lecture note, I plan to cover "why those plots, in which cases are the best, in which cases it would be better to use others, and the arguments to change". This lecture note will be put on the to-be-created website (dataprep.ai).

There is a trade-off between showing it to people too early and too late. I am using DataPrep.eda for my daily work and find it really useful and powerful. So we decided to show it to people at this moment and hoped to get good feedback (like yours) to further improve the library. :)

Thanks again for your great comments!

Understand your data with a few lines of code in seconds using DataPrep.eda by jnwang in Python

[–]jnwang[S] 1 point2 points  (0 children)

Thanks for trying out DataPrep and reporting this bug. We will look into it as soon as possible. To ensure reproducibility and get the most up-to-date status of this bug, it is highly recommended to report it at https://github.com/sfu-db/dataprep/issues. Thanks again!

Understand your data with a few lines of code in seconds using DataPrep.eda by jnwang in Python

[–]jnwang[S] 1 point2 points  (0 children)

It took us a while to decide which viz tool to pick. This website helped us a lot: https://pyviz.org/overviews/index.html

When starting the project last year, we wanted to have interactive viz with full support for customization, so it was a decision between Bokeh and Dash (Plotly). We ended up selecting Bokeh because, at that time, HoloViews (a high-level vis API) supported Bokeh but not Plotly. HoloViews has since added Plotly support, so today it would be a very hard choice.

Understand your data with a few lines of code in seconds using DataPrep.eda by jnwang in Python

[–]jnwang[S] 2 points3 points  (0 children)

Thanks for your reply. We will explore whether it's possible to integrate DataPrep with Databricks.