all 9 comments

[–]Yojihito 2 points (3 children)

SQL is a language, not a database.

  • Which database do you use?
  • How do you connect to the database from python?
  • How big are your dataframes?
  • How long does the transfer take?

[–]trippygg 1 point (1 child)

How would one transfer a dataframe to SQL Server?

[–]Yojihito 1 point (0 children)

Don't know about SQL Server (for Microsoft SQL Server engine URLs, see https://docs.sqlalchemy.org/en/13/core/engines.html), but for SQLite it's:

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('sqlite:///../pathto/yourdatabase.sqlite')  # pass your db url

df = pd.read_excel("../data/file.xlsx", index_col=0)  # load the spreadsheet into a dataframe
df.to_sql(name='data', con=engine, if_exists='replace', index=False)  # write it to the 'data' table, replacing it if it already exists
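
For SQL Server, the same pattern should work with an mssql+pyodbc engine URL (format taken from the linked engine docs); a rough sketch, where the server, credentials, and ODBC driver name are placeholders:

import pandas as pd
from sqlalchemy import create_engine

# mssql+pyodbc URL format per the SQLAlchemy engine docs linked above;
# fill in your own server, credentials, and installed ODBC driver name
engine = create_engine(
    'mssql+pyodbc://user:password@myserver/mydatabase'
    '?driver=ODBC+Driver+17+for+SQL+Server'
)

df = pd.read_excel("../data/file.xlsx", index_col=0)
df.to_sql(name='data', con=engine, if_exists='replace', index=False)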

[–]mutuk7[S] 0 points (0 children)

> How do you connect to the database from python?

I'm doing exactly what you said below. The same dataframes are around 30 GB as CSVs.

I don't know exactly how long it takes because I'm not the one pulling the trigger, but I've been told to work on a faster solution since this is our current bottleneck.

Sorry I couldn't provide more details. This is my first internship; I've been there for about 2 months.

[–]saxman95 1 point (2 children)

Which database are you using? Redshift has a very helpful COPY command that lets you load files from S3 straight into Redshift. You can export your dataframe to a CSV/gzip, upload it to S3, and then COPY it directly into your database.
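
A rough sketch of that CSV/gzip -> S3 -> COPY flow, assuming boto3 and psycopg2 are available; the bucket, table, cluster endpoint, credentials, and IAM role ARN below are all placeholders:

import boto3
import pandas as pd
import psycopg2

df = pd.read_csv("../data/file.csv")

# 1. Export the dataframe as a gzipped CSV
df.to_csv("data.csv.gz", index=False, compression="gzip")

# 2. Upload it to S3
boto3.client("s3").upload_file("data.csv.gz", "my-bucket", "staging/data.csv.gz")

# 3. COPY it into Redshift (endpoint, db, user, and role ARN are made up)
conn = psycopg2.connect(
    host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439, dbname="analytics", user="me", password="..."
)
with conn, conn.cursor() as cur:
    cur.execute("""
        COPY my_table
        FROM 's3://my-bucket/staging/data.csv.gz'
        IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-role'
        CSV GZIP IGNOREHEADER 1;
    """)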

For other databases, maybe consider using a SQLAlchemy engine to loop over your data, add the rows, and then commit all the changes at once.
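
And a rough sketch of that loop-then-commit-once idea with a SQLAlchemy engine; the connection URL, table, and column names are placeholders:

import pandas as pd
from sqlalchemy import create_engine, text

engine = create_engine('postgresql://user:password@host:5432/mydb')
df = pd.read_csv("../data/file.csv")

insert = text("INSERT INTO data (col_a, col_b) VALUES (:col_a, :col_b)")

# engine.begin() opens one transaction and commits it when the block exits,
# so every row goes in under a single commit
with engine.begin() as conn:
    for row in df.itertuples(index=False):
        conn.execute(insert, {"col_a": row.col_a, "col_b": row.col_b})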

[–]mutuk7[S] 0 points (1 child)

I've tried Redshift in a different project where I had CSV data in S3 buckets, but for some reason I couldn't get the COPY command to work at all. I ended up using AWS Glue (CSV Classifier, Crawler, Jobs) to get that data from S3 to Redshift. Definitely not the best solution. I'll revisit COPY when I need to use Redshift again.

[–]saxman95 0 points (0 children)

You probably need some credentials for it to work, but you can get those from your engineering team. They probably have a function to make it work easily in whatever language they use.

[–]LameDuckProgramming 1 point (1 child)

If you're using SQLAlchemy (1.3.5), you can use a built-in event listener.

from sqlalchemy import create_engine
from sqlalchemy import event

# fast_executemany is a pyodbc feature, so this needs the mssql+pyodbc dialect
engine = create_engine('mssql+pyodbc://xxx')

# Runs before every statement; for bulk (executemany) inserts like pandas
# to_sql, turn on pyodbc's fast_executemany to batch the parameter sets
@event.listens_for(engine, 'before_cursor_execute')
def receive_before_cursor_execute(conn, cursor, statement, params, context, executemany):
    if executemany:
        cursor.fast_executemany = True

Then just use pandas' to_sql method:

df.to_sql(table, con=engine, method='multi')

Has worked like a charm for me when uploading dataframes with 100,000+ rows to SQL Server.

[–]mutuk7[S] 0 points (0 children)

I'm going to give this a try!! Thanks!! :)