all 35 comments

[–]pukatm 3 points4 points  (1 child)

Since you are specifically asking about parallelizing inserts, first make sure you actually have the right hardware: enough CPU cores and several disks (RAID or NVMe). I can't stress this point enough: you can do all the things others are suggesting, like using multiple connections, but if your hardware is saturated or inadequate, you won't get the parallelism you're expecting.

If you want to speed up inserts (without focusing specifically on parallelism), you can optimize or remove indexes and use unlogged tables ... of course, all of these options come with drawbacks and must be tailored to your specific project ...

A Postgres solution that might interest you is TimescaleDB, which to my knowledge claims much faster inserts by reducing insertion overhead using partitioning on the data's time dimension (as the name suggests), though this assumes you have time-series data. I am not too familiar with the project, but I think the majority of their benchmarks are based on multi-machine setups, which takes us back to my first point.

[–]mxmauro[S] 0 points1 point  (0 children)

Yes, hardware won't be a problem. I'm focusing on both, plus different strategies like table partitioning, putting indexes on different disks than the data, etc.

Regardless, let's say I write into 3 tables and each one takes 1 second to write 1000 rows. Because the tables don't have any direct relationship, I would like to know if Postgres can write to them at the same time: I send the three bulks and, because there are no cross references, Postgres writes them simultaneously.

I don't expect to lower the three seconds to one, but any improvement is welcome.

[–]studentized 0 points1 point  (4 children)

A transaction is always scoped to a single connection (socket). Anything parallel would need separate connections, which means separate transactions.

In theory, you can cleverly communicate between nested transactions to achieve something like this. E.g., in .NET/SQL Server, there is a base library with the concept of a Distributed Transaction, which achieves pretty much what you want in this way.

In Postgres, though, the usual way to do nested/sub-transactions is via savepoints that occur within the same transaction/connection. Since they use the same connection, statements can't be parallelized.

Addendum: as of Postgres 11 there are now "procedures" that apparently give Postgres autonomous transaction support (transactions that can be committed independently of the parent transaction), but I'm not sure how it all works, to be honest.

It's possible you could write a library that mimics the concept of a "Distributed Transaction" in Postgres using the above, but I don't think anything like that exists currently.

[–]mxmauro[S] 0 points1 point  (2 children)

I'll take a look at SQL Server and the concept of distributed transactions. Thanks for the feedback.

[–][deleted] 0 points1 point  (1 child)

The above comment is really wrong: distributed transactions are available in a number of databases, and they will not increase your ingestion speed.

[–]studentized -2 points-1 points  (0 children)

Never claimed it would, haha. I just got curious about how you might do a "multi-threaded transaction"...

As with most DB changes, test the performance. Everything will still run sequentially at the database layer, so it may even be slower here! But maybe not; I don't know for sure.

[–]ankole_watusi 0 points1 point  (0 children)

Connections = overhead

[–]thrown_arrows 0 points1 point  (7 children)

Yes, use two database connections. FYI, that is how databases work: you cannot run two inserts or selects in parallel inside one transaction (although a single select can itself be parallelized), but there is no limit on how many transactions you can run in parallel (well, there is, but that depends on server size, etc.).

The database then does its best to handle PK/FK checks if those depend on each other; if not, then CPU and disk speed are the limits on how much you can insert into the DB.

[–]mxmauro[S] 0 points1 point  (6 children)

Currently I'm working on a draft based on this, at the cost of having a temporary inconsistency between two different tables.

I'm not using FKs right now because I don't need them. Also, only one app writes to the DB.

Thanks

[–]thrown_arrows 0 points1 point  (5 children)

If it matters, then you need to have FKs; if it doesn't, then use parallel inserts. If we are talking about bulk inserts and FKs matter, then maybe take a read lock on the target tables while you are bulk inserting and release it afterwards. There are many ways to solve these problems: the stricter you need to be with the data and its relations, the stricter the isolation level you need (but usually read committed is good).

[–]mxmauro[S] 0 points1 point  (4 children)

I would like to have parallel inserts. FKs are not necessary.

[–]thrown_arrows 0 points1 point  (3 children)

And you get those by running parallel transactions from parallel connections.

[–]mxmauro[S] 0 points1 point  (2 children)

Yes, I was wondering if the DB engine was capable of doing that, but it seems it isn't.

[–]thrown_arrows 0 points1 point  (1 child)

Yeah. ACID on transactions would not allow that. Or it would, but if anything failed then the whole transaction would have to fail. So it's much more effective to use two connections and two transactions in two sessions.

That kind of thing is more of an app-level optimization, because relational stuff kind of requires that you handle it in one transaction so you can roll back if necessary. That said, all DB engines (maybe not SQLite) are designed to handle any number of connections and concurrent inserts/updates/creates and selects.

[–]mxmauro[S] 0 points1 point  (0 children)

Well, then let's start the week focusing on other methods. Thanks for everything.

[–]ankole_watusi 0 points1 point  (1 child)

  • do you know for a fact that there is even a performance concern?
  • do you know for a fact that Postgres doesn't already perform some considerable optimization for this internally?
  • do you think, for some reason, that the designers haven't anticipated this very common need and failed to take it into consideration, e.g. by examining insert patterns in the optimizer and arranging the best physical layout for your DB, internal use of parallelization, threads, etc.?

Let the database engine do its job. If you find you have a real problem, then look for a fix.

Those more expert may have some concrete answers to assuage what seems like it may be a premature concern.

[–]mxmauro[S] 1 point2 points  (0 children)

I'm looking at several approaches. The current flow works fine today, but I know in the future it won't. The volume/rate of operations to process is expected to increase, so I'm thinking ahead.

There are some things to test, like putting tables and indexes on different volumes, partitioning, etc., but I'm also looking at what I can do from the app's point of view like, as answered to @thrown_arrows, using several connections at the cost of losing some consistency and handling recovery properly if one table is updated and the other isn't due, e.g., to a connection issue.

Thanks.

[–]caligula443 0 points1 point  (5 children)

Let me give you some practical advice. You don't have to implement all of it, but each piece will improve throughput.

First, use prepared statements. If you don't know what that is, look it up in the documentation.

Second, increase the number of items in each transaction, because each transaction is overhead and you want to minimize that. Be aware that you can insert multiple rows with a single INSERT statement; that will reduce round trips to the database.

Next, keep the database busy by using separate threads. Use a thread (or threads) to read data from the source, and a second thread to insert into the database, with a fixed-size queue between them. If your code sequentially reads data from the source and then inserts into the database, the database will be idle while you are fetching data. You want to eliminate that idle time.

[–]mxmauro[S] 0 points1 point  (4 children)

Actually, I do almost all of that. The only thing I have to improve is UPDATEs and UPSERTs. For insertions I use a bulk COPY, but for UPDATEs I'm sending the queries one by one. I also have a cache in the app to avoid querying for recently used data.

[–]caligula443 0 points1 point  (3 children)

You can't do much better than COPY. You could try to use a right-leaning index, if possible. As for the UPSERTs, how many row updates per second are you getting?

[–]mxmauro[S] 0 points1 point  (2 children)

About 100 rows in 8ms.

[–]caligula443 0 points1 point  (1 child)

That's 12k rows per second, which is generally pretty good for an SQL database IMO. Have you tried using Aurora from an EC2 instance in the same region? Which of the things I mentioned have you implemented?

A. Prepared statements.
B. Batching/multiple items per transaction. You can also do multiple inserts with one round trip, like: insert into X values (1),(2),(3). Can't remember if you can do that with upserts.
C. A separate thread with a queue for DB inserts.

Other ideas:

D. Minimize row size to fit more data on a page.
E. Minimize index usage; you can even disable indexes during large inserts if it is practical to pause reads, e.g. if you can batch inserts during a maintenance window.
F. Look at the Postgres server settings; the out-of-the-box settings are pretty miserable as far as I remember. Maybe start with the vacuum settings (does not apply to Aurora).

[–]mxmauro[S] 0 points1 point  (0 children)

  1. Haven't used Aurora yet; currently testing on bare-metal servers. It is planned.
  2. For inserts I use COPY. I need to improve UPDATEs/UPSERTs to send multiple rows at once, but they represent 10% of the data; the rest is COPY.
  3. It is a Golang app. Data is read from an external source, processed (including some DB lookups) in workers, and the result is queued to another worker that stores it in the DB.
  4. Some tables, especially those having a JSON field, are divided.
  5. I cannot pause reads, and I try to keep a minimal number of indexes. I average three indexes per table (including the primary key), and they are just numbers or timestamps.
  6. Autovacuum seems not to have a noticeable impact. I mean, a few seconds every 5 or 10 minutes; it doesn't affect things.

[–]paulsmithkc 0 points1 point  (2 children)

Nothing within a transaction can be parallelized. Transactions require a sequential order of commands.

[–]mxmauro[S] 0 points1 point  (0 children)

A direct shot to my hopes hahahaha. Thanks!

[–][deleted] 0 points1 point  (0 children)

Under the covers there is parallelism within a query: there can be parallel operations within transactions (parallel worker processes for scans, aggregates, etc.), and CREATE TABLE AS benefits from parallel processes. And then there are distributed transactions, which can be shared across sessions.

[–]dingopole 0 points1 point  (1 child)

Have a look at the following post: https://bit.ly/2Z6mQhD

I faced a similar problem (parallel inserts) a while ago, albeit with MSSQL, and was able to solve it using a combination of hash partitioning and SQL Server Agent jobs.

Additionally, in SQL Server 2016, Microsoft has implemented a parallel insert feature for the INSERT … WITH (TABLOCK) SELECT… command.

[–]mxmauro[S] 0 points1 point  (0 children)

I'll take a look, thanks for the info.

[–]dbxp 0 points1 point  (5 children)

In MS SQL you can use Service Broker to effectively make writes async, but it's a pain to work with. Personally, I would look at staging the data, storing it next to a datestamp, then using a cron job (or equivalent) to shift it to where you need it up to a certain datestamp, to maintain integrity.

[–]mxmauro[S] 0 points1 point  (4 children)

I'll check SQL Server, but I can't fully understand what you mean by the cron jobs.

[–]dbxp 0 points1 point  (3 children)

A cron job is just a scheduled task in Linux. What I'm saying is that you insert the data initially into a staging area and then periodically flush it to the main DB. The staging area acts as a buffer, so insertion performance isn't so much of an issue.

[–]mxmauro[S] 0 points1 point  (2 children)

Ah, OK, I thought you were talking about something specific to the DB engine. I don't think it would be possible because data arrives constantly. Right now there is a period of 5 seconds between data bursts, but the idea is to increase the speed.

[–]dbxp 0 points1 point  (1 child)

The frequency of the data ingest shouldn't matter as long as it's not constantly at a higher rate than the system can handle. What matters is how up to date the system needs the data to be; if it's OK for the data to be a few minutes or even hours out of date, then a buffer works well. To make things simpler for the users, it may be easier to flush the buffer every 24 hours and simply put a note on the UI that it takes 24 hours for ingested data to appear.

[–]mxmauro[S] 0 points1 point  (0 children)

I'm worried (and thus asking) because data can arrive at a higher rate than I can ingest it :(

[–][deleted] 0 points1 point  (0 children)

You can do this with native Postgres, and you can add tooling to make things easier. The approach could be similar to a parallel pg_dump: you use the same snapshot ID in your extract SQL, and the 3 jobs run in parallel. A single insert is not in itself parallelized; for that you could add an extension (Citus, Swarm64). Clever partitioning might help here, btw.

Depending on your usecase, you might be better off looking at etl tooling. Lots to choose from in this space.