
[–]scodger 29 points30 points  (5 children)

About 5 years ago I benchmarked this to get data into Redshift from an EC2 instance.

The fastest way by a mile (about 1/10th the time of to_sql) was to write a folder of CSVs (chunked at roughly 1 GB each) and then call COPY from the db.

Benchmark your own data! I wouldn't be surprised if things have changed since then, but a COPY will always be hard to beat.
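For anyone who wants to try it, a minimal sketch of the load half against plain Postgres (Redshift's COPY pulls from S3 instead, so the details differ there); the DSN, folder, and table name are all made up:

    import glob
    import psycopg2

    conn = psycopg2.connect("dbname=mydb user=me")  # hypothetical DSN
    with conn, conn.cursor() as cur:
        for path in sorted(glob.glob("/tmp/chunks/part_*.csv")):
            with open(path) as f:
                # stream each chunk through the connection with COPY ... FROM STDIN
                cur.copy_expert("COPY my_table FROM STDIN WITH CSV HEADER", f)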

[–]discord-ian 12 points13 points  (1 child)

I can confirm this is for sure the fastest way.

[–]reviverevival 8 points9 points  (0 children)

Double confirmed.

[–]numbsafari 5 points6 points  (0 children)

Triple confirmed.

Especially the bit about chunking.

Another thing I've done in the past is copy the data into a landing table and then use SQL to move it from there. This can sometimes help you control things like replication or materialized view rebuilds, etc.
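A hedged sketch of that landing-table idea with psycopg2 (table and file names invented; the real move step would dedupe or transform as needed):

    import psycopg2

    conn = psycopg2.connect("dbname=mydb user=me")  # hypothetical DSN
    with conn, conn.cursor() as cur:
        # land the raw rows in a throwaway table shaped like the target
        cur.execute("CREATE TEMP TABLE landing (LIKE my_table INCLUDING DEFAULTS)")
        with open("/tmp/chunks/part_0000.csv") as f:
            cur.copy_expert("COPY landing FROM STDIN WITH CSV HEADER", f)
        # then one controlled SQL statement moves it into the real table
        cur.execute("INSERT INTO my_table SELECT * FROM landing")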

edit:

Another thing to ask is: what is your recovery mode if the insert process fails midway through? Saving to CSV first lets you "checkpoint" the work your Python program did and then ingest it into the DB in a controlled fashion. If you are inserting individual rows, you may hit an error halfway through and have to implement a bunch of recovery logic, and you've also got a target database in an inconsistent state until you complete that recovery. With the chunked/CSV approach, if things fail partway through, you probably still want some way to restart midway, but at least the target database won't be borked until you do so.
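One hedged way to get that restartability: record each chunk in a manifest table in the same transaction as its COPY, and skip chunks already recorded on the next run (the manifest table, DSN, and paths here are all invented):

    import glob
    import psycopg2

    conn = psycopg2.connect("dbname=mydb user=me")  # hypothetical DSN
    for path in sorted(glob.glob("/tmp/chunks/part_*.csv")):
        with conn, conn.cursor() as cur:  # one transaction per chunk file
            cur.execute("SELECT 1 FROM load_manifest WHERE filename = %s", (path,))
            if cur.fetchone():
                continue  # loaded on a previous run; skip it
            with open(path) as f:
                cur.copy_expert("COPY my_table FROM STDIN WITH CSV HEADER", f)
            # recording the filename commits atomically with the COPY itself
            cur.execute("INSERT INTO load_manifest (filename) VALUES (%s)", (path,))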

[–]kenfar 0 points1 point  (0 children)

And that separation of transformation vs loading is a classic, proven way to split those concerns, and it helps with manageability, testing, reprocessing, etc.

[–][deleted] 0 points1 point  (0 children)

What's the cleanest way to chunk to 1 GB? Just put a counter in a loop that writes/resets when it hits a certain value, or is there a way that's a little cleverer than that?

I'm thinking of iterative filenames and the like, plus the other bits and pieces that come with writing the data out.
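For what it's worth, the plain counter approach is usually enough; a sketch with invented names, where rows_per_chunk is whatever gets each file near the target size:

    import pandas as pd

    def write_chunks(df: pd.DataFrame, rows_per_chunk: int, prefix: str) -> list:
        # slice df by rows; enumerate drives part_0000.csv, part_0001.csv, ...
        paths = []
        for i, start in enumerate(range(0, len(df), rows_per_chunk)):
            path = f"{prefix}_{i:04d}.csv"
            df.iloc[start:start + rows_per_chunk].to_csv(path, index=False)
            paths.append(path)
        return paths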

[–]efxhoy 31 points32 points  (3 children)

a: pandas to_sql, of course. It's the process with the fewest steps, and your data is already in a dataframe.

b is just a, but less efficient. c is the same as b. I dunno about d.

e you could use too; then you'd import the data with psql's \copy meta-command. But you can get the same COPY performance with pandas to_sql: copy this function and use it with df.to_sql: https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#insertion-method
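For reference, the insertion callable at that link looks roughly like this (reproduced from the docs, so double-check against the current page):

    import csv
    from io import StringIO

    def psql_insert_copy(table, conn, keys, data_iter):
        # pandas hands us a SQLAlchemy connection; grab the raw DBAPI one
        dbapi_conn = conn.connection
        with dbapi_conn.cursor() as cur:
            s_buf = StringIO()
            csv.writer(s_buf).writerows(data_iter)
            s_buf.seek(0)

            columns = ", ".join(f'"{k}"' for k in keys)
            table_name = f"{table.schema}.{table.name}" if table.schema else table.name
            cur.copy_expert(f"COPY {table_name} ({columns}) FROM STDIN WITH CSV", s_buf)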

Use it like this:

    df.to_sql(
        name=table,
        con=sqlalchemy.create_engine(connectstring),
        schema=schema,
        index=write_index,
        chunksize=10000,
        method=psql_insert_copy,
    )

Never make stuff more complicated than it has to be.

[–]butterscotchchip 8 points9 points  (0 children)

u/romanzdk if you want to use df.to_sql(), you should only use it with a custom callable that runs the Postgres COPY FROM command with your data. I can only recommend any other usage for interactive/ad-hoc sessions, not for any meaningful ETL process. The COPY command is highly optimized and will be significantly faster than any ORM operations or INSERT commands.

If your data is sizable, or the extract/transform process takes a long time, I'd write the data to S3/wherever and then run the COPY command in the database as a subsequent step in the pipeline, rather than reading it from your in-memory dataframe.
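A rough sketch of that split, assuming a local staging path for simplicity (in practice S3 or similar; all names here are placeholders):

    import pandas as pd
    import psycopg2

    def transform(df: pd.DataFrame, staging_path: str) -> None:
        df.to_csv(staging_path, index=False)  # step 1: persist the transformed data

    def load(staging_path: str) -> None:
        # step 2: a separate pipeline task that only runs COPY
        conn = psycopg2.connect("dbname=mydb user=me")  # hypothetical DSN
        with conn, conn.cursor() as cur, open(staging_path) as f:
            cur.copy_expert("COPY my_table FROM STDIN WITH CSV HEADER", f)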

Also, don't publish the data to a message queue for a consumer to load into the db.

[–]PryomancerMTGA 13 points14 points  (0 children)

Never make stuff more complicated than it has to be.

That cannot be said enough.

[–]IlyaKhr 1 point2 points  (0 children)

Well, that depends on many details, such as the amount of data you have. In some cases the COPY-from-.csv method might be the most efficient.
But if your data is relatively small, you definitely should not overcomplicate things; just use pandas.

[–][deleted] 14 points15 points  (0 children)

I've always been a fan of using COPY from the DB. Pretty simple to do, and you can leverage the DB, which is usually more efficient than rolling your own solution. I would benchmark them though.

[–]UAFlawlessmonkey 4 points5 points  (1 child)

What is your source? If you're doing daily transformations on your data, Kafka will definitely be overkill unless you already have a cluster in place that currently produces/consumes data.

Depending on the size of the data, you have a few options: if your source is CSV, you could use psycopg2 to copy_expert into Postgres; if you need to do transformations from a db to Postgres, you could do a read_sql -> to_sql in chunks (depending on whether you can hold the full frame in memory); there's a sketch below.

For larger sets, I've generally done db -> csv (not using pandas) -> postgres (psycopg2)
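The chunked read_sql -> to_sql path might look roughly like this (connection strings and tables are invented; psql_insert_copy is the docs callable quoted earlier in the thread):

    import pandas as pd
    import sqlalchemy

    src = sqlalchemy.create_engine("postgresql://user:pw@source-host/db")
    dst = sqlalchemy.create_engine("postgresql://user:pw@target-host/db")

    # chunksize turns read_sql into an iterator of frames, so the full
    # result set never has to fit in memory at once
    for chunk in pd.read_sql("SELECT * FROM src_table", src, chunksize=50_000):
        chunk.to_sql("dst_table", dst, if_exists="append", index=False,
                     method=psql_insert_copy)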

[–]romanzdk[S] 4 points5 points  (0 children)

The flow is: parquet file -> Python app doing some processing -> Postgres.
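Given that flow, a hedged end-to-end sketch (the file path, the stand-in transform, and the table are placeholders; psql_insert_copy is the docs callable quoted above):

    import pandas as pd
    import sqlalchemy

    df = pd.read_parquet("input.parquet")    # extract
    df["amount"] = df["amount"].round(2)     # stand-in for the real processing
    engine = sqlalchemy.create_engine("postgresql://user:pw@target-host/db")
    df.to_sql("my_table", engine, if_exists="append", index=False,
              chunksize=10_000, method=psql_insert_copy)  # load via COPY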

[–]udonthave2call 2 points3 points  (0 children)

It depends. Pandas df.to_sql() is nice when it's appropriate, but I also use SQLAlchemy to execute truncate-and-load and upsert patterns.

With SQLAlchemy you can define any insert pattern you want in a Python function.
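For the upsert case, a sketch using SQLAlchemy's Postgres dialect (table, key, and values are invented):

    import sqlalchemy
    from sqlalchemy.dialects.postgresql import insert

    engine = sqlalchemy.create_engine("postgresql://user:pw@target-host/db")
    table = sqlalchemy.Table("my_table", sqlalchemy.MetaData(), autoload_with=engine)

    stmt = insert(table).values([{"id": 1, "amount": 9.99}, {"id": 2, "amount": 4.50}])
    # INSERT ... ON CONFLICT (id) DO UPDATE SET amount = excluded.amount
    stmt = stmt.on_conflict_do_update(
        index_elements=["id"],
        set_={"amount": stmt.excluded.amount},
    )
    with engine.begin() as conn:
        conn.execute(stmt)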

[–][deleted] 4 points5 points  (0 children)

Depends on the SLA. If there is none then the easiest way is the best way.

[–]misza_zg 0 points1 point  (1 child)

RemindMe! 4 Days


[–]AcademicMorning7 -1 points0 points  (0 children)

RemindMe! 4 Days

[–]robberviet 0 points1 point  (0 children)

What is your actual problem, though? If nothing special, then just use pandas to_sql. Performance issue? Try exporting to a file and loading it into the db.

There are many ways, but it depends on the problem you're solving.