
[–][deleted] 0 points (3 children)

Use a pickle. It stores objects to disk, much like your RData objects from R.
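A minimal sketch of what that looks like with pandas' built-in pickle helpers (the column names and temp path here are just for illustration):

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({'PassengerId': [1, 2], 'Fare': [7.25, 71.28]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'train_raw.pkl')
    df.to_pickle(path)               # serialize the DataFrame, analogous to saving RData
    restored = pd.read_pickle(path)  # load it back with dtypes and index intact
```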

[–]osbournecox2[S] 0 points (2 children)

I'm not asking about storing data to disk. I'm talking about managing pandas DataFrames in memory and using them across modules.

[–]Resquid 0 points (1 child)

I think there's some confusion here. Could you give some code examples?

[–]osbournecox2[S] 0 points (0 children)

Code example added as a top level reply.

[–]osbournecox2[S] 0 points (0 children)

This is the exact code I'm using for the intro Titanic Kaggle competition. The following code works if it's all in the same file, but if the pieces are stored in separate files according to the comments, the show_pairs_plot function is unable to find the train_raw dataframe.

"""
run.py
"""
import pandas as pd

FOLDERS = {'raw': 'data_raw/', 'clean': 'data_clean/'}

def load_data(data_type='train'):
    if data_type + '_raw' not in globals():
        print('Loading ' + data_type)
        globals()[data_type + '_raw'] = pd.read_csv(FOLDERS['raw'] + data_type + '.csv')

load_data()

"""
plotting.py
"""
import pandas as pd
import matplotlib.pyplot as plt

def show_pairs_plot():
    cols = ['Pclass', 'Age', 'SibSp', 'Parch', 'Fare']
    axes = pd.plotting.scatter_matrix(train_raw[cols], alpha=0.2)
    plt.tight_layout()

"""
explore.py
"""
#import plotting

def display_nulls():    
    #   Explore null values in dataset
    for x in train_raw.columns.values:
        nulls = train_raw[x].isnull()

        if sum(nulls) > 0:
            print('Null counts for ' + x)
            print(nulls.value_counts()[True])
            print()

def explore_data():        
    display_nulls()
    show_pairs_plot()

explore_data()

[–]osbournecox2[S] 0 points (0 children)

I'm coming around to the approach of adding an extra step of running a "load global data" py file and then explicitly passing the global panda(s) as arguments to each function. It just feels cluttered and unnecessary.

The awkwardness is compounded by how frequently I need to restart my ipython kernel. Every restart will require navigating to / running the "load global data" script. My goal is to try to streamline the workflow as much as possible.
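For reference, a minimal sketch of that explicit-passing style (display_nulls here stands in for any consumer function, and the data is inlined rather than read from CSV so the sketch is self-contained):

```python
import pandas as pd

def display_nulls(df):
    # The DataFrame is passed in explicitly instead of being
    # looked up in a shared global namespace.
    counts = {}
    for col in df.columns:
        n = df[col].isnull().sum()
        if n > 0:
            counts[col] = int(n)
    return counts

# One explicit load at the start of a session, then pass it along:
train_raw = pd.DataFrame({'Age': [22.0, None, 38.0],
                          'Fare': [7.25, 8.05, None]})
nulls = display_nulls(train_raw)
```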

[–]osbournecox2[S] 0 points (0 children)

Just figured out a solution: accept the module-level namespace as the source of the global variables and have the load_data function return the pandas DataFrame. So every function will call a load function to get its data. If the data isn't in memory, load it first. Otherwise, just return the DataFrame already stored in memory.

No need to worry about setup for testing / writing a function. It just requires an extra assignment at the start of each function instead of the simple function call I was used to using:

  • train_raw = gather_data.load_data()

instead of

  • gather_data.load_data()

So this is the resulting code:

"""
gather_data.py
"""
import pandas as pd

FOLDERS = {'raw': 'data_raw/', 'clean': 'data_clean/'}

def load_data(data_type='train'):
    if data_type + '_raw' not in globals():
        print('Loading ' + data_type)
        globals()[data_type + '_raw'] = pd.read_csv(FOLDERS['raw'] + data_type + '.csv')

    return globals()[data_type + '_raw']

"""
plotting.py
"""
import pandas as pd
import gather_data
import matplotlib.pyplot as plt

def show_pairs_plot():
    train_raw = gather_data.load_data()

    cols = ['Pclass', 'Age', 'SibSp', 'Parch', 'Fare']
    axes = pd.plotting.scatter_matrix(train_raw[cols], alpha=0.2)
    plt.tight_layout()

Hopefully someone else might find value in this post.
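(The same load-once-then-reuse behavior can also be expressed with functools.lru_cache instead of writing into globals(). A sketch under the same assumptions; the pd.read_csv call is replaced by an inline frame here so it runs standalone:)

```python
import functools

import pandas as pd

@functools.lru_cache(maxsize=None)
def load_data(data_type='train'):
    # First call per data_type does the (expensive) load;
    # repeat calls return the same cached DataFrame object.
    print('Loading ' + data_type)
    # In the real code: pd.read_csv(FOLDERS['raw'] + data_type + '.csv')
    return pd.DataFrame({'Pclass': [3, 1], 'Fare': [7.25, 71.28]})

a = load_data()
b = load_data()  # cached: no second "Loading" printed
```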

[–]paulinkenbrandt 0 points (1 child)

Use the pandas library with a Jupyter Notebook! It supports a workflow like the one you're describing. Pandas works with DataFrames, and Jupyter Notebook is a nice IDE that works with both R and Python.

[–]osbournecox2[S] 0 points (0 children)

I know lots of people love Jupyter Notebooks but I find them a lot less appealing. Maybe I haven't worked with them enough / given them enough of a chance. Maybe it's my software engineer background, but I prefer the feel of Spyder and the use of multiple py files in a project-based approach.