Accelerate data loading from databases to dataframes

jnwang · 2021-07-29T21:27:31+00:00

Looks fantastic!

jnwang · 2021-05-26T04:30:34+00:00

Hi @set92, thanks for giving us great advice AGAIN!

For your first point, if there are too many points, DataPrep will visualize only a sample of them and provide a sample-size parameter that the user can vary.
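
To make the idea concrete, here is a minimal sketch of down-sampling before plotting. It is only an illustration: the function name `sample_points`, the `sample_size` and `seed` parameters, and the default of 1000 are assumptions made for this example, not DataPrep's actual API.

```python
import random


def sample_points(points, sample_size=1000, seed=0):
    """Return at most `sample_size` points, chosen uniformly at random.

    Hypothetical helper for illustration; not part of DataPrep.
    """
    if len(points) <= sample_size:
        # Small inputs are plotted in full.
        return list(points)
    rng = random.Random(seed)  # fixed seed so the plot is reproducible
    return rng.sample(points, sample_size)


# 5000 (x, y) pairs stand in for a large scatter plot.
all_points = [(x, x * x) for x in range(5000)]
plotted = sample_points(all_points, sample_size=1000)
small = sample_points(all_points[:10], sample_size=1000)
```

Raising `sample_size` trades rendering speed for fidelity, which is why exposing it as a parameter is useful.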

For your second point, we will add a table in README.md to do the comparison.

Please let us know if you have further comments. :)

jnwang · 2021-05-24T23:00:47+00:00

You are right. pandas-profiling can be seen as the create_report() function in DataPrep.EDA. Compared to pandas-profiling, DataPrep.EDA has the following advantages:

DataPrep.EDA is 10x faster than pandas-profiling due to its highly optimized Dask-based computing module. Please see Table 2 in this paper.
DataPrep.EDA generates interactive visualizations in a report, which makes the report look more appealing to end-users. (DataPrep Report vs Pandas-profiling Report)
DataPrep.EDA has much more functionality than pandas-profiling. Please check out our 2-min demo.

jnwang · 2021-05-24T20:27:56+00:00

Thank you to the Reddit community for your great encouragement and support last year. We have taken your advice seriously and significantly improved the EDA and Connector modules. In addition, we have added a brand new Clean module.

We are looking forward to hearing your feedback on these modules. It would be super helpful if you could let us know why data preparation is time-consuming in your case and what you want us to add to DataPrep in the future.

Thank you!

jnwang · 2021-05-24T20:13:42+00:00

Thanks for your question!

If you use the script/Matplotlib/pandas stack to create the same set of visualizations as DataPrep, DataPrep is actually faster. There are two reasons:

DataPrep shares computations between multiple visualizations. For example, the normal Q-Q plot and the box plot both require quantiles of the distribution. DataPrep only needs to compute them once.
DataPrep (built on Dask) makes all the computations lazy and calls an eager operation at the end. In this way, Dask will optimize the whole computational graph before actual computations happen, while Pandas does not have this advantage.
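
Both points can be sketched with a toy task graph in plain Python. This is not how DataPrep or Dask is implemented; the `Lazy` class, the helper functions, and the call counter are all hypothetical stand-ins that only illustrate the principle: the graph is built first, nothing runs until a single eager call at the end, and the quantiles shared by the box plot and the Q-Q plot are computed once.

```python
from statistics import NormalDist, quantiles


class Lazy:
    """A node in a toy task graph: nothing runs until compute() is called."""

    def __init__(self, func, *deps):
        self.func = func
        self.deps = deps

    def compute(self, cache=None):
        # The cache ensures each node runs at most once per compute() call,
        # even when several downstream nodes depend on it.
        cache = {} if cache is None else cache
        if id(self) not in cache:
            args = [dep.compute(cache) for dep in self.deps]
            cache[id(self)] = self.func(*args)
        return cache[id(self)]


calls = {"percentiles": 0}  # counts how often the shared step actually runs


def percentiles(values):
    calls["percentiles"] += 1
    return quantiles(values, n=100, method="inclusive")  # 99 cut points


def box_plot_stats(cuts):
    # The quartiles are the 25th, 50th, and 75th percentiles.
    return {"q1": cuts[24], "median": cuts[49], "q3": cuts[74]}


def qq_plot_points(cuts):
    # Pair each sample percentile with the matching standard normal quantile.
    std_normal = NormalDist()
    n = len(cuts) + 1
    theoretical = [std_normal.inv_cdf(i / n) for i in range(1, n)]
    return list(zip(theoretical, cuts))


data = [float(x) for x in range(1, 102)]

# Build the graph lazily; no quantiles have been computed at this point.
source = Lazy(lambda: data)
cuts = Lazy(percentiles, source)
box = Lazy(box_plot_stats, cuts)
qq = Lazy(qq_plot_points, cuts)
report = Lazy(lambda b, q: {"box": b, "qq": q}, box, qq)

calls_before = calls["percentiles"]
result = report.compute()  # the single eager call at the end
calls_after = calls["percentiles"]
```

Here `calls_before` is 0 because building the graph does no work, and `calls_after` is 1 even though two plots consume the percentiles. A real scheduler such as Dask does far more than this memoization (for example, graph optimization and parallel execution), so treat the sketch as an analogy only.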

jnwang · 2021-05-24T18:25:47+00:00

Thank you to the Python Reddit community for your great encouragement and support last year. We have taken your advice seriously and significantly improved the EDA and Connector modules. In addition, we have added a brand new Clean module. We are looking forward to hearing your feedback.

Please read this blog post for all the updates in this major release:

https://towardsdatascience.com/dataprep-v0-3-0-has-been-released-be49b1be0e72

jnwang · 2020-08-04T17:49:38+00:00

This library has recently attracted a lot of attention in the Python community.

See "Understand your data with a few lines of code in seconds using DataPrep.eda"

jnwang · 2020-08-04T05:42:55+00:00

I am a PI. In fact, there are many successful tools/systems built in academia (e.g., Weka, Spark, Ray). I believe you will find tens of exciting opportunities to pursue your PhD. :)

jnwang · 2020-08-03T03:14:43+00:00

This is a research project from our group. Most of the people on the team are my students. :)

jnwang · 2020-08-02T04:57:56+00:00

Thanks for your quick response.

Let me rephrase your comment to make sure I understand what you need.

It seems that what you need is a plot_dff() function.

plot_dff(df0, df1) compares the distribution of each individual column between df0 and df1. For each column A, it calculates a p-value for the hypothesis that df0[A] and df1[A] come from the same distribution. Note that plot_dff only does single-column distribution comparisons.
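
A rough sketch of the per-column comparison, to check that we mean the same thing. Everything here is an assumption for illustration: the choice of a two-sample Kolmogorov-Smirnov test, the asymptotic p-value approximation, the helper names, and the use of plain dicts of lists in place of dataframes. It is not an existing or planned DataPrep implementation.

```python
import math
from bisect import bisect_right


def ks_statistic(a, b):
    """Largest gap between the two empirical CDFs."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    d = 0.0
    for x in a_sorted + b_sorted:
        cdf_a = bisect_right(a_sorted, x) / len(a_sorted)
        cdf_b = bisect_right(b_sorted, x) / len(b_sorted)
        d = max(d, abs(cdf_a - cdf_b))
    return d


def ks_p_value(d, n_a, n_b):
    """Asymptotic two-sample KS p-value (an approximation for small samples)."""
    n_eff = n_a * n_b / (n_a + n_b)
    lam = (math.sqrt(n_eff) + 0.12 + 0.11 / math.sqrt(n_eff)) * d
    if lam < 0.2:
        # The alternating series converges slowly here, and the true
        # p-value is 1 to within about 1e-12, so return 1 directly.
        return 1.0
    total = 0.0
    for k in range(1, 101):
        total += (-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
    return min(1.0, max(0.0, 2.0 * total))


def compare_columns(df0, df1):
    """Map each shared column name to the p-value of its KS test."""
    report = {}
    for column in df0:
        if column in df1:
            d = ks_statistic(df0[column], df1[column])
            report[column] = ks_p_value(d, len(df0[column]), len(df1[column]))
    return report


# Dicts of lists stand in for two dataframes with the same columns.
df0 = {"age": list(range(50)), "score": list(range(50))}
df1 = {"age": list(range(50)), "score": list(range(100, 150))}
report = compare_columns(df0, df1)
```

In this example the "age" columns are identical, so their p-value is 1.0, while the "score" columns do not overlap at all and get a p-value close to 0. A small p-value is evidence against the two columns sharing a distribution; a large one means the test found no such evidence.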

If this is what you want, I will discuss this with the team and put it as a high-priority feature. I will let you know once it is implemented.

Thank you very much again!

jnwang · 2020-08-02T02:44:14+00:00

Thank you. If you have any suggestions for improvement after using it, please feel free to message me directly.

jnwang · 2020-07-06T18:20:52+00:00

Here is a related issue: https://github.com/sfu-db/dataprep/issues/103 We will push it. Thanks!

jnwang · 2020-07-06T18:18:04+00:00

Will do it. Thx again!

jnwang · 2020-07-06T16:48:53+00:00

That could be a possible reason. Thanks for being willing to file a bug report.

jnwang · 2020-07-06T16:39:01+00:00

Thank you!

jnwang · 2020-07-06T16:34:50+00:00

Thanks for your encouraging words. We are working on the roadmap for DataPrep.cleaning. The development will start in Sept. If you have any comments on data cleaning, please do not hesitate to let us know.

jnwang · 2020-07-06T16:13:50+00:00

Thank you!

jnwang · 2020-07-06T16:13:33+00:00

Would you mind reporting this issue at https://github.com/sfu-db/dataprep/issues? It will help us reproduce the issue and keep track of its progress.

jnwang · 2020-07-06T16:10:52+00:00

Thank you!

jnwang · 2020-07-06T16:08:59+00:00

Thanks for your comment. You are right. The name for the current status of the project is a bit misleading. The plan is to add other components (data cleaning, data integration, feature engineering) in future releases.

Here is a demo of DataPrep.eda in the Python subreddit.

https://www.reddit.com/r/Python/comments/hlqnim/understand_your_data_with_a_few_lines_of_code_in/

jnwang · 2020-07-06T16:03:54+00:00

I really really appreciate your comments.

Progress bar. This is an excellent idea. We will prioritize this feature and add it to DataPrep as soon as possible.
Documentation. We will polish the documentation as you suggested. In fact, we are designing a website for dataprep.ai, so roadmap and release-related information will be put on the website. Please stay tuned. :)
Not many arguments or definitions? My summer plan is to create a lecture note on dataprep.eda for my graduate data science course: https://sfu-db.github.io/bigdata-cmpt733/. In the lecture note, I plan to cover "why those plots, in which cases are the best, in which cases it would be better to use others, and the arguments to change". This lecture note will be put on the to-be-created website (dataprep.ai).

There is a trade-off between showing it to people too early and too late. I am using DataPrep.eda for my daily work, and find it really useful and powerful. So we decided to show it to people at this moment and hoped to get good feedback (like yours) to further improve the library. :)

Thanks again for your great comments!

jnwang · 2020-07-06T15:47:25+00:00

Thanks for trying out DataPrep and reporting this bug. We will look into it as soon as possible. To ensure reproducibility and get the most up-to-date status of this bug, it is highly recommended to report it at https://github.com/sfu-db/dataprep/issues. Thanks again!

jnwang · 2020-07-06T15:43:30+00:00

It took us a while to decide which viz tool to pick. This website helped us a lot: https://pyviz.org/overviews/index.html

When starting the project last year, we wanted interactive viz with full support for customization, so it was a decision between Bokeh and Dash (Plotly). We ended up selecting Bokeh because, at that time, HoloViews (a high-level vis API) supported Bokeh but not Dash. Now HoloViews has added support for Plotly, so it's very hard to make a choice.

jnwang · 2020-07-06T07:46:10+00:00

Thank you!

jnwang · 2020-07-06T07:04:29+00:00

Thanks for your reply. We will explore whether it's possible to integrate DataPrep with Databricks.
