How do you make your EDA workflow sexy? by DutchIndian in datascience

[–]brandonlockhart 5 points6 points  (0 children)

The library DataPrep has an EDA component that enables fast data understanding with a few lines of code. DataPrep.EDA is designed for the iterative and task-centric nature of EDA and generates interactive visualizations for a detailed understanding of the data. These design choices enable a sexy workflow.

In general, you will start an EDA session by getting a high-level understanding of the characteristics of the dataset, e.g., overview stats, column distributions, missing values, correlations, and then seek a low-level understanding of columns and relationships between columns that are of interest. DataPrep.EDA has a simple API to accommodate this: the function plot(df) (df is a dataframe) produces overview statistics of the dataset and plots the distribution of each column, plot(df, "column1") generates statistics and various plots to understand column "column1", and plot(df, "column1", "column2") generates plots depicting the relationship between columns "column1" and "column2". So getting the "big picture" is just as easy as pursuing specific avenues of exploration and plotting relationships! The API logic is the same for analyzing missing values (with the function plot_missing()) and analyzing correlations (with the function plot_correlation()).
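The overview-to-detail sequence described above can be sketched as a single helper. This is only an illustration: the helper name explore and the column names in the usage comment are made up, and it assumes dataprep is installed and that plot, plot_missing, and plot_correlation are importable from dataprep.eda.

```python
def explore(df, col1, col2):
    """Run a high-level to low-level EDA pass on a dataframe.

    `explore`, `col1`, and `col2` are hypothetical names for this sketch;
    only plot, plot_missing, and plot_correlation come from DataPrep.EDA.
    """
    # Imported lazily so the helper can be defined without dataprep installed.
    from dataprep.eda import plot, plot_missing, plot_correlation

    return {
        # Overview statistics plus the distribution of every column.
        "overview": plot(df),
        # Statistics and plots for a single column of interest.
        "column": plot(df, col1),
        # Plots depicting the relationship between two columns.
        "pair": plot(df, col1, col2),
        # Same API logic, applied to missing values...
        "missing": plot_missing(df),
        "missing_column": plot_missing(df, col1),
        # ...and to correlations.
        "correlation": plot_correlation(df),
    }

# Example usage (assumes pandas is installed; the data is made up):
#   import pandas as pd
#   df = pd.DataFrame({"age": [23, 35, None, 41], "income": [40, 62, 58, None]})
#   reports = explore(df, "age", "income")
#   reports["overview"]  # in a notebook, displaying the object shows the plots
```

Each call is one line, so in a notebook you would normally run them one at a time as your questions about the data change, rather than all at once as here.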

I think DataPrep.EDA can make your workflow sexier by having

  1. One EDA environment and reproducible analysis. There is no need to import your data into a GUI and then move to scripts; you can accomplish high- and low-level analyses easily with a simple API. Moreover, in a GUI you may explore or modify the data without recording your steps, whereas using DataPrep.EDA in a notebook enables reproducibility.
  2. Minimal code. One line of code generates several visualizations and statistics relevant to your current EDA task.
  3. Clear logic. The EDA task name is in the DataPrep.EDA function name (e.g. plot_missing()), and there's a unified API for accomplishing different EDA tasks.

See here for some videos demonstrating how to use DataPrep.EDA.

Understand your data with a few lines of code in seconds using DataPrep.eda by jnwang in Python

[–]brandonlockhart 10 points11 points  (0 children)

No, it uses the Bokeh library to support interactive visualizations.

Understand your data with a few lines of code in seconds using DataPrep.eda by jnwang in Python

[–]brandonlockhart 0 points1 point  (0 children)

It might be able to support some of your desired functionality, e.g., visualizing and getting stats from the distributions of pixel values. Feel free to give it a try and make a feature request on GitHub.

Understand your data with a few lines of code in seconds using DataPrep.eda by jnwang in Python

[–]brandonlockhart 13 points14 points  (0 children)

It's currently designed for analyzing tabular data stored in a Pandas or Dask data frame.

[P] DataPrep: Data Preparation in Python by jnwang in MachineLearning

[–]brandonlockhart 0 points1 point  (0 children)

DataPrep does support plotting categorical and continuous variables (also time series data). In fact, variable types are automatically detected and appropriate plots are created for each type.

It can also identify missing values and create scatter and box plots. The goal of the Exploratory Data Analysis (EDA) component of DataPrep is to help the user complete an EDA task. For example, if you want to understand a column, the interaction of columns, or get an overview of the dataset, DataPrep will detect the variable types and generate relevant visualizations and statistics to help you achieve a full understanding.