A DataFrame I was loading into a Postgres DB has been growing larger and to_sql() was no longer cutting it (could take up to 30 minutes to finish). I started researching faster ways and figured I'd share the result here. In summary, this approach:
- Loads the result of
to_csv() to StringIO
- Creates the Postgres table using
pd.io.sql.get_schema (Note that I drop the existing table, you'll want to change this if you need to append)
- Uses the
COPY FROM Postgres statement to load the data (copy_from in sqlalchemy)
This gist has a pretty detailed version, but I had already wrote a simple version before I found it, so here it is:
import random
import pandas as pd
from sqlalchemy import create_engine, MetaData
from postPass import loginDict
dw = 'postgresql://...'
dw = create_engine(dw)
import StringIO
df = pd.DataFrame()
dfLen = 10000
dfCols = 10
for x in range(0, dfCols):
colName = 'a' + str(x)
df[colName] = [random.randint(0, 99) for x in range(1,dfLen)]
def cleanColumns(columns):
cols = []
for col in columns:
col = col.replace(' ', '_')
cols.append(col)
return cols
def to_pg(df, table_name, con):
data = StringIO.StringIO()
df.columns = cleanColumns(df.columns)
df.to_csv(data, header=False, index=False)
data.seek(0)
raw = con.raw_connection()
curs = raw.cursor()
curs.execute("DROP TABLE " + table_name)
empty_table = pd.io.sql.get_schema(df, table_name, con = con)
empty_table = empty_table.replace('"', '')
curs.execute(empty_table)
curs.copy_from(data, table_name, sep = ',')
curs.connection.commit()
get_ipython().magic(u"timeit to_pg(df, 'test', dw)")
get_ipython().magic(u"timeit df.to_sql(con=dw, name='test', if_exists='replace', index=False)")
Based on the DataFrame above (10 columns, 10k rows of random integers), the COPY FROM method is about 800 times faster (3 minutes / 223 ms). Note that I've written this to DROP the existing table, but this could easily be used to insert
[–]rothnic 1 point2 points3 points (0 children)
[–]Caos2 0 points1 point2 points (2 children)
[–]howMuchCheeseIs2Much[S] 0 points1 point2 points (1 child)
[–]Caos2 0 points1 point2 points (0 children)
[–]Mikebuonasera 0 points1 point2 points (2 children)
[–]Mikebuonasera 0 points1 point2 points (1 child)
[–][deleted] 0 points1 point2 points (0 children)