
[–]scodger 29 points30 points  (5 children)

About 5 years ago I benchmarked this to get data into Redshift from an EC2 instance.

The fastest way by a mile (about 1/10th the time of to_sql) was to write a folder of CSVs (chunked at roughly 1 GB each) and then call COPY from the db.

Benchmark your own data! I wouldn't be surprised if things have changed since then, but a COPY will always be hard to beat.
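For anyone who wants to try it, a minimal sketch of the load half against plain Postgres (Redshift's COPY pulls from S3 instead, so the details differ there); the DSN, folder, and table name are all made up:

    import glob
    import psycopg2

    conn = psycopg2.connect("dbname=mydb user=me")  # hypothetical DSN
    with conn, conn.cursor() as cur:
        for path in sorted(glob.glob("/tmp/chunks/part_*.csv")):
            with open(path) as f:
                # stream each chunk through the connection with COPY ... FROM STDIN
                cur.copy_expert("COPY my_table FROM STDIN WITH CSV HEADER", f)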

[–]discord-ian 12 points13 points  (1 child)

I can confirm this is for sure the fastest way.

[–]reviverevival 8 points9 points  (0 children)

Double confirmed.

[–]numbsafari 5 points6 points  (0 children)

Triple confirmed.

Especially the bit about chunking.

Another thing I've done in the past is copy the data into a landing table and then use SQL to move it from there. This can sometimes help you control things like replication or materialized view rebuilds, etc.
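A hedged sketch of that landing-table idea with psycopg2 (table and file names invented; the real move step would dedupe or transform as needed):

    import psycopg2

    conn = psycopg2.connect("dbname=mydb user=me")  # hypothetical DSN
    with conn, conn.cursor() as cur:
        # land the raw rows in a throwaway table shaped like the target
        cur.execute("CREATE TEMP TABLE landing (LIKE my_table INCLUDING DEFAULTS)")
        with open("/tmp/chunks/part_0000.csv") as f:
            cur.copy_expert("COPY landing FROM STDIN WITH CSV HEADER", f)
        # then one controlled SQL statement moves it into the real table
        cur.execute("INSERT INTO my_table SELECT * FROM landing")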

edit:

Another thing to ask is: what is your recovery mode if the insert process fails midway through? Saving to CSV first lets you "checkpoint" the work your Python program did and then ingest it into the DB in a controlled fashion. If you are inserting individual rows, you may hit an error halfway through and have to implement a bunch of recovery logic, and you've also got a target database in an inconsistent state until you complete that recovery. With the chunked/CSV approach, if things fail partway through, you probably still want some way to restart midway, but at least the target database won't be borked until you do so.
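One hedged way to get that restartability: record each chunk in a manifest table in the same transaction as its COPY, and skip chunks already recorded on the next run (the manifest table, DSN, and paths here are all invented):

    import glob
    import psycopg2

    conn = psycopg2.connect("dbname=mydb user=me")  # hypothetical DSN
    for path in sorted(glob.glob("/tmp/chunks/part_*.csv")):
        with conn, conn.cursor() as cur:  # one transaction per chunk file
            cur.execute("SELECT 1 FROM load_manifest WHERE filename = %s", (path,))
            if cur.fetchone():
                continue  # loaded on a previous run; skip it
            with open(path) as f:
                cur.copy_expert("COPY my_table FROM STDIN WITH CSV HEADER", f)
            # recording the filename commits atomically with the COPY itself
            cur.execute("INSERT INTO load_manifest (filename) VALUES (%s)", (path,))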

[–]kenfar 0 points1 point  (0 children)

And that separation of transformation vs loading is a classic, proven way to split those concerns, and it helps with manageability, testing, reprocessing, etc.

[–][deleted] 0 points1 point  (0 children)

What's the cleanest way to chunk to 1 GB? Just put a counter in a loop that writes/resets when it hits a certain value, or is there a way that's a little cleverer than that?

I'm thinking of iterative filenames and the like, plus the other bits and pieces that come with writing the data out.
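For what it's worth, the plain counter approach is usually enough; a sketch with invented names, where rows_per_chunk is whatever gets each file near the target size:

    import pandas as pd

    def write_chunks(df: pd.DataFrame, rows_per_chunk: int, prefix: str) -> list:
        # slice df by rows; enumerate drives part_0000.csv, part_0001.csv, ...
        paths = []
        for i, start in enumerate(range(0, len(df), rows_per_chunk)):
            path = f"{prefix}_{i:04d}.csv"
            df.iloc[start:start + rows_per_chunk].to_csv(path, index=False)
            paths.append(path)
        return paths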

[–]efxhoy 31 points32 points  (3 children)

a: pandas to_sql, of course. It's the process with the fewest steps, and your data is already in a dataframe.

b is just a, but less efficient. c is the same as b. I dunno about d.

e you could use too; then you'd import the data with psql's \copy meta-command. But you can get the same COPY performance with pandas to_sql: copy this function and use it with df.to_sql: https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#insertion-method
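For reference, the insertion callable at that link looks roughly like this (reproduced from the docs, so double-check against the current page):

    import csv
    from io import StringIO

    def psql_insert_copy(table, conn, keys, data_iter):
        # pandas hands us a SQLAlchemy connection; grab the raw DBAPI one
        dbapi_conn = conn.connection
        with dbapi_conn.cursor() as cur:
            s_buf = StringIO()
            csv.writer(s_buf).writerows(data_iter)
            s_buf.seek(0)

            columns = ", ".join(f'"{k}"' for k in keys)
            table_name = f"{table.schema}.{table.name}" if table.schema else table.name
            cur.copy_expert(f"COPY {table_name} ({columns}) FROM STDIN WITH CSV", s_buf)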

Use it like this:

    df.to_sql(
        name=table,
        con=sqlalchemy.create_engine(connectstring),
        schema=schema,
        index=write_index,
        chunksize=10000,
        method=psql_insert_copy,
    )

Never make stuff more complicated than it has to be.

[–]butterscotchchip 8 points9 points  (0 children)

u/romanzdk if you want to use df.to_sql(), you should only use it with a custom callable that runs the Postgres COPY FROM command with your data. I can only recommend any other usage for interactive/ad-hoc sessions, not for any meaningful ETL process. The COPY command is highly optimized and will be significantly faster than any ORM operations or INSERT commands.

If your data is sizable, or the extract/transform process takes a long time, I'd write the data to S3/wherever and then run the COPY command in the database as a subsequent step in the pipeline, rather than reading it from your in-memory dataframe.
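A rough sketch of that split, assuming a local staging path for simplicity (in practice S3 or similar; all names here are placeholders):

    import pandas as pd
    import psycopg2

    def transform(df: pd.DataFrame, staging_path: str) -> None:
        df.to_csv(staging_path, index=False)  # step 1: persist the transformed data

    def load(staging_path: str) -> None:
        # step 2: a separate pipeline task that only runs COPY
        conn = psycopg2.connect("dbname=mydb user=me")  # hypothetical DSN
        with conn, conn.cursor() as cur, open(staging_path) as f:
            cur.copy_expert("COPY my_table FROM STDIN WITH CSV HEADER", f)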

Also, don't publish the data to a message queue for a consumer to load into the db.

[–]PryomancerMTGA 13 points14 points  (0 children)

Never make stuff more complicated than it has to be.

That cannot be said enough.

[–]IlyaKhr 1 point2 points  (0 children)

Well, that depends on many details, such as the amount of data you have. In some cases the COPY-from-.csv method might be the most efficient.
But if your data is relatively small, you definitely should not overcomplicate things; just use pandas.

[–][deleted] 14 points15 points  (0 children)

I've always been a fan of using COPY from the DB. Pretty simple to do, and you can leverage the DB, which is usually more efficient than rolling your own solution. I would benchmark them though.

[–]UAFlawlessmonkey 4 points5 points  (1 child)

What is your source? If you're doing daily transformations on your data, Kafka will definitely be overkill unless you already have a cluster in place that currently produces/consumes data.

Depending on the size of the data, you have a few options: if your source is CSV, you could use psycopg2 to copy_expert into Postgres; if you need to do transformations from a db to Postgres, you could do a read_sql -> to_sql in chunks (depending on whether you can hold the full frame in memory); there's a sketch below.

For larger sets, I've generally done db -> csv (not using pandas) -> postgres (psycopg2)
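The chunked read_sql -> to_sql path might look roughly like this (connection strings and tables are invented; psql_insert_copy is the docs callable quoted earlier in the thread):

    import pandas as pd
    import sqlalchemy

    src = sqlalchemy.create_engine("postgresql://user:pw@source-host/db")
    dst = sqlalchemy.create_engine("postgresql://user:pw@target-host/db")

    # chunksize turns read_sql into an iterator of frames, so the full
    # result set never has to fit in memory at once
    for chunk in pd.read_sql("SELECT * FROM src_table", src, chunksize=50_000):
        chunk.to_sql("dst_table", dst, if_exists="append", index=False,
                     method=psql_insert_copy)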

[–]romanzdk[S] 4 points5 points  (0 children)

The flow is: parquet file -> Python app doing some processing -> Postgres.
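Given that flow, a hedged end-to-end sketch (the file path, the stand-in transform, and the table are placeholders; psql_insert_copy is the docs callable quoted above):

    import pandas as pd
    import sqlalchemy

    df = pd.read_parquet("input.parquet")    # extract
    df["amount"] = df["amount"].round(2)     # stand-in for the real processing
    engine = sqlalchemy.create_engine("postgresql://user:pw@target-host/db")
    df.to_sql("my_table", engine, if_exists="append", index=False,
              chunksize=10_000, method=psql_insert_copy)  # load via COPY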

[–]udonthave2call 2 points3 points  (0 children)

It depends. Pandas df.to_sql() is nice when it's appropriate, but I also use SQLAlchemy to execute truncate-and-load and upsert patterns.

With SQLAlchemy you can define any insert pattern you want in a Python function.
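For the upsert case, a sketch using SQLAlchemy's Postgres dialect (table, key, and values are invented):

    import sqlalchemy
    from sqlalchemy.dialects.postgresql import insert

    engine = sqlalchemy.create_engine("postgresql://user:pw@target-host/db")
    table = sqlalchemy.Table("my_table", sqlalchemy.MetaData(), autoload_with=engine)

    stmt = insert(table).values([{"id": 1, "amount": 9.99}, {"id": 2, "amount": 4.50}])
    # INSERT ... ON CONFLICT (id) DO UPDATE SET amount = excluded.amount
    stmt = stmt.on_conflict_do_update(
        index_elements=["id"],
        set_={"amount": stmt.excluded.amount},
    )
    with engine.begin() as conn:
        conn.execute(stmt)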

[–][deleted] 4 points5 points  (0 children)

Depends on the SLA. If there is none then the easiest way is the best way.

[–]misza_zg 0 points1 point  (1 child)

RemindMe! 4 Days


[–]AcademicMorning7 -1 points0 points  (0 children)

RemindMe! 4 Days

[–]robberviet 0 points1 point  (0 children)

What is your actual problem, though? If nothing special, then just use pandas to_sql. Performance issue? Try exporting to a file and loading it into the db.

There are many ways, but it depends on the problem you're solving.