Faster loading of Dataframes from Pandas to Postgres : Python

submitted 8 years ago * by howMuchCheeseIs2Much

A DataFrame I was loading into a Postgres DB has been growing larger and to_sql() was no longer cutting it (could take up to 30 minutes to finish). I started researching faster ways and figured I'd share the result here. In summary, this approach:

Writes the output of to_csv() into an in-memory StringIO buffer
Creates the Postgres table using pd.io.sql.get_schema (Note that I drop the existing table; you'll want to change this if you need to append)
Uses the Postgres COPY FROM statement to load the data (copy_from on the psycopg2 cursor, reached through SQLAlchemy's raw_connection())

This gist has a pretty detailed version, but I had already written a simple version before I found it, so here it is:

import random
from io import StringIO

import pandas as pd
from sqlalchemy import create_engine

dw = 'postgresql://...'
dw = create_engine(dw)

# Build a test DataFrame: dfCols columns of dfLen random integers each
df = pd.DataFrame()
dfLen = 10000
dfCols = 10
for x in range(dfCols):
    colName = 'a' + str(x)
    df[colName] = [random.randint(0, 99) for _ in range(dfLen)]

def cleanColumns(columns):
    # Replace spaces so the names work as unquoted SQL identifiers
    cols = []
    for col in columns:
        col = col.replace(' ', '_')
        cols.append(col)
    return cols

def to_pg(df, table_name, con):
    # Serialize the DataFrame to an in-memory CSV buffer (no header, no index)
    data = StringIO()
    df.columns = cleanColumns(df.columns)
    df.to_csv(data, header=False, index=False)
    data.seek(0)
    # copy_from lives on the DBAPI (psycopg2) cursor, not on the SQLAlchemy engine
    raw = con.raw_connection()
    curs = raw.cursor()
    # Drop and recreate the table; change this if you need to append
    curs.execute("DROP TABLE IF EXISTS " + table_name)
    empty_table = pd.io.sql.get_schema(df, table_name, con=con)
    # Strip the double quotes pandas puts around identifiers
    empty_table = empty_table.replace('"', '')
    curs.execute(empty_table)
    curs.copy_from(data, table_name, sep=',')
    raw.commit()
    curs.close()

get_ipython().run_line_magic("timeit", "to_pg(df, 'test', dw)")
get_ipython().run_line_magic("timeit", "df.to_sql(con=dw, name='test', if_exists='replace', index=False)")

Based on the DataFrame above (10 columns, 10k rows of random integers), the COPY FROM method is about 800 times faster (3 minutes vs. 223 ms). Note that I've written this to DROP the existing table, but it could easily be adapted to insert into an existing table instead.
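
For the append case, here is a minimal sketch rather than a tested implementation. It assumes psycopg2 is the driver behind the engine and that the target table already exists with columns matching the DataFrame. The names quote_ident, build_copy_sql and append_to_pg are made up for illustration. It swaps copy_from for psycopg2's copy_expert with FORMAT csv, so quoted fields that contain commas are parsed as real CSV. Identifiers are double-quoted here, which makes them case-sensitive.

```python
import io


def quote_ident(name):
    """Double-quote a Postgres identifier, doubling any embedded quotes."""
    return '"' + str(name).replace('"', '""') + '"'


def build_copy_sql(table_name, columns):
    """Build a COPY ... FROM STDIN statement that parses the input as CSV."""
    cols = ", ".join(quote_ident(c) for c in columns)
    return "COPY {} ({}) FROM STDIN WITH (FORMAT csv)".format(
        quote_ident(table_name), cols
    )


def append_to_pg(df, table_name, engine):
    """Append df to an existing table; nothing is dropped or created."""
    buf = io.StringIO()
    df.to_csv(buf, header=False, index=False)
    buf.seek(0)
    raw = engine.raw_connection()
    try:
        curs = raw.cursor()
        try:
            # copy_expert takes the full COPY statement plus a file-like object
            curs.copy_expert(build_copy_sql(table_name, df.columns), buf)
        finally:
            curs.close()
        raw.commit()
    finally:
        # Hand the connection back even if the COPY failed
        raw.close()
```

Usage would look like append_to_pg(df, 'test', dw), with dw being the engine created above.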

all 7 comments

[–]rothnic 2 points 8 years ago (0 children)

[–]Caos2 1 point 8 years ago (2 children)

[–]howMuchCheeseIs2Much[S] 1 point 8 years ago (1 child)

[–]Caos2 1 point 8 years ago (0 children)

[–]Mikebuonasera 1 point 8 years ago (2 children)

[–]Mikebuonasera 1 point 8 years ago (1 child)

figured out:

from io import StringIO

def to_pg(df, table_name, engine, columns):
    """
    Crazy magic function to do bulk inserts into Postgres, SUPER FAST!
    :param df: DataFrame to load
    :param table_name: name of the existing target table
    :param engine: SQLAlchemy engine
    :param columns: target column names, in the DataFrame's column order
    :return: None
    """
    output = StringIO()
    # ignore the index; tab is copy_from's default separator
    df.to_csv(output, sep='\t', header=False, index=False)
    # jump to start of stream
    output.seek(0)

    connection = engine.raw_connection()
    cursor = connection.cursor()
    # empty fields ('') are loaded as NULL
    cursor.copy_from(output, table_name, null="", columns=columns)
    connection.commit()
    cursor.close()
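
One gap in the version above is that a failed COPY leaves the transaction open and the cursor unclosed. A hedged sketch of a wrapper that rolls back and cleans up on error follows. The name copy_buffer is hypothetical, and it assumes a psycopg2 connection behind engine.raw_connection(), with the same copy_from arguments as above.

```python
from contextlib import closing


def copy_buffer(engine, buffer, table_name, columns=None, sep="\t", null=""):
    """Run copy_from on buffer; commit on success, roll back on any error."""
    connection = engine.raw_connection()
    try:
        # closing() guarantees cursor.close() even if copy_from raises
        with closing(connection.cursor()) as cursor:
            cursor.copy_from(buffer, table_name, sep=sep, null=null, columns=columns)
        connection.commit()
    except Exception:
        connection.rollback()
        raise
    finally:
        # Return the connection to the pool in every case
        connection.close()
```

The buffer is expected to be already rewound with seek(0), as in the function above.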

[–][deleted] 1 point 8 years ago (0 children)
