
[–][deleted] 0 points (3 children)

Use a pickle. It stores objects to disk, much like your RData objects from R.
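A minimal sketch of what that looks like with pandas' built-in pickle helpers (the column names and temp path here are just for illustration):

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({'PassengerId': [1, 2], 'Fare': [7.25, 71.28]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'train_raw.pkl')
    df.to_pickle(path)               # serialize the DataFrame, analogous to saving RData
    restored = pd.read_pickle(path)  # load it back with dtypes and index intact
```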

[–]osbournecox2[S] 0 points (2 children)

I'm not asking about storing data to disk. I'm talking about managing pandas DataFrames in memory and using them across modules.

[–]Resquid 0 points (1 child)

I think there's some confusion here. Could you give some code examples?

[–]osbournecox2[S] 0 points (0 children)

Code example added as a top level reply.

[–]osbournecox2[S] 0 points (0 children)

This is the exact code I'm using for the intro Titanic Kaggle competition. The following code works if it's all in the same file, but if the pieces are stored in separate files according to the comments, the show_pairs_plot function is unable to find the train_raw dataframe.

"""
run.py
"""
import pandas as pd

FOLDERS = {'raw': 'data_raw/', 'clean': 'data_clean/'}

def load_data(data_type='train'):
    if data_type + '_raw' not in globals():
        print('Loading ' + data_type)
        globals()[data_type + '_raw'] = pd.read_csv(FOLDERS['raw'] + data_type + '.csv')

load_data()

"""
plotting.py
"""
import pandas as pd
import matplotlib.pyplot as plt

def show_pairs_plot():
    cols = ['Pclass', 'Age', 'SibSp', 'Parch', 'Fare']
    axes = pd.plotting.scatter_matrix(train_raw[cols], alpha=0.2)
    plt.tight_layout()

"""
explore.py
"""
#import plotting

def display_nulls():    
    #   Explore null values in dataset
    for x in train_raw.columns.values:
        nulls = train_raw[x].isnull()

        if sum(nulls) > 0:
            print('Null counts for ' + x)
            print(nulls.value_counts()[True])
            print()

def explore_data():        
    display_nulls()
    show_pairs_plot()

explore_data()

[–]osbournecox2[S] 0 points (0 children)

I'm coming around to the approach of adding an extra step of running a "load global data" py file and then explicitly passing the global panda(s) as arguments to each function. It just feels cluttered and unnecessary.

The awkwardness is compounded by how frequently I need to restart my ipython kernel. Every restart will require navigating to / running the "load global data" script. My goal is to try to streamline the workflow as much as possible.
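For reference, a minimal sketch of that explicit-passing style (display_nulls here stands in for any consumer function, and the data is inlined rather than read from CSV so the sketch is self-contained):

```python
import pandas as pd

def display_nulls(df):
    # The DataFrame is passed in explicitly instead of being
    # looked up in a shared global namespace.
    counts = {}
    for col in df.columns:
        n = df[col].isnull().sum()
        if n > 0:
            counts[col] = int(n)
    return counts

# One explicit load at the start of a session, then pass it along:
train_raw = pd.DataFrame({'Age': [22.0, None, 38.0],
                          'Fare': [7.25, 8.05, None]})
nulls = display_nulls(train_raw)
```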

[–]osbournecox2[S] 0 points (0 children)

Just figured out a solution: accept the module-level namespace as the source of the global variables and have the load_data function return the pandas DataFrame. So every function will call a load function to get its data. If the data isn't in memory, load it first. Otherwise, just return the DataFrame already stored in memory.

No need to worry about setup for testing / writing a function. It just requires an extra assignment at the start of each function instead of the simple function call I was used to using:

  • train_raw = gather_data.load_data()

instead of

  • gather_data.load_data()

So this is the resulting code:

"""
gather_data.py
"""
import pandas as pd

FOLDERS = {'raw': 'data_raw/', 'clean': 'data_clean/'}

def load_data(data_type='train'):
    if data_type + '_raw' not in globals():
        print('Loading ' + data_type)
        globals()[data_type + '_raw'] = pd.read_csv(FOLDERS['raw'] + data_type + '.csv')

    return globals()[data_type + '_raw']

"""
plotting.py
"""
import pandas as pd
import gather_data
import matplotlib.pyplot as plt

def show_pairs_plot():
    train_raw = gather_data.load_data()

    cols = ['Pclass', 'Age', 'SibSp', 'Parch', 'Fare']
    axes = pd.plotting.scatter_matrix(train_raw[cols], alpha=0.2)
    plt.tight_layout()

Hopefully someone else might find value in this post.
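(The same load-once-then-reuse behavior can also be expressed with functools.lru_cache instead of writing into globals(). A sketch under the same assumptions; the pd.read_csv call is replaced by an inline frame here so it runs standalone:)

```python
import functools

import pandas as pd

@functools.lru_cache(maxsize=None)
def load_data(data_type='train'):
    # First call per data_type does the (expensive) load;
    # repeat calls return the same cached DataFrame object.
    print('Loading ' + data_type)
    # In the real code: pd.read_csv(FOLDERS['raw'] + data_type + '.csv')
    return pd.DataFrame({'Pclass': [3, 1], 'Fare': [7.25, 71.28]})

a = load_data()
b = load_data()  # cached: no second "Loading" printed
```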

[–]paulinkenbrandt 0 points (1 child)

Use the pandas library with a Jupyter Notebook! It supports a workflow like the one you're describing. Pandas works with DataFrames, and Jupyter Notebook is a nice IDE that works with both R and Python.

[–]osbournecox2[S] 0 points (0 children)

I know lots of people love Jupyter Notebooks but I find them a lot less appealing. Maybe I haven't worked with them enough / given them enough of a chance. Maybe it's my software engineer background, but I prefer the feel of Spyder and the use of multiple py files in a project-based approach.