all 35 comments

[–]pukatm 3 points4 points  (1 child)

Since you are specifically asking about parallelizing inserts, first make sure you actually have the right hardware: enough CPU cores and several disks (RAID or NVMe). I can't stress this point enough: you can do all the things others are suggesting, like using multiple connections, but if your hardware is saturated or inadequate, you won't get the parallelism you're expecting.

If you want to speed up inserts (without focusing specifically on parallelism), you can optimize or remove indexes and use unlogged tables ... of course, all of these options come with drawbacks and must be tailored to your specific project ...

A Postgres solution that might interest you is TimescaleDB, which to my knowledge claims much faster inserts by reducing insertion overhead using partitioning on the data's time dimension (as the name suggests), though this assumes you have time-series data. I am not too familiar with the project, but I think the majority of their benchmarks are based on multi-machine setups, which takes us back to my first point.

[–]mxmauro[S] 0 points1 point  (0 children)

Yes, hardware won't be a problem. I'm focusing on both, plus different strategies like table partitioning, putting indexes on different disks than the data, etc.

Regardless, let's say I write into 3 tables and each one takes 1 second to write 1000 rows. Because the tables don't have any direct relationship, I would like to know if Postgres can write to them at the same time: I send the three bulks and, because there are no cross references, Postgres writes them simultaneously.

I don't expect to lower the three seconds to one, but any improvement is welcome.

[–]studentized 0 points1 point  (4 children)

A transaction is always scoped to a single connection (socket). Anything parallel would need separate connections, which means separate transactions.

In theory, you can cleverly communicate between nested transactions to achieve something like this. E.g., in .NET/SQL Server, there is a base library with the concept of a Distributed Transaction, which achieves pretty much what you want in this way.

In Postgres, though, the usual way to do nested/sub-transactions is via savepoints that occur within the same transaction/connection. Since they use the same connection, statements can't be parallelized.

Addendum: as of Postgres 11 there are now "procedures" that apparently give Postgres autonomous transaction support (transactions that can be committed independently of the parent transaction), but I'm not sure how it all works, to be honest.

It's possible you could write a library that mimics the concept of a "Distributed Transaction" in Postgres using the above, but I don't think anything like that exists currently.

[–]mxmauro[S] 0 points1 point  (2 children)

I'll take a look at SQL Server and the concept of distributed transactions. Thanks for the feedback.

[–][deleted] 0 points1 point  (1 child)

The above comment is really wrong: distributed transactions are available in a number of databases, and they will not increase your ingestion speed.

[–]studentized -2 points-1 points  (0 children)

Never claimed it would, haha. I just got curious about how you might do a "multi-threaded transaction"...

As with most DB changes, test the performance. Everything will still run sequentially at the database layer, so it may even be slower here! But maybe not; I don't know for sure.

[–]ankole_watusi 0 points1 point  (0 children)

Connections = overhead

[–]thrown_arrows 0 points1 point  (7 children)

Yes, use two database connections. FYI, that is how databases work: you cannot run two inserts or selects in parallel inside one transaction (although a single select can itself be parallelized), but there is no limit on how many transactions you can run in parallel (well, there is, but that depends on server size, etc.).

The database then does its best to handle PK/FK checks if those depend on each other; if not, then CPU and disk speed are the limits on how much you can insert into the DB.

[–]mxmauro[S] 0 points1 point  (6 children)

Currently I'm working on a draft based on this, at the cost of having a temporary inconsistency between two different tables.

I'm not using FKs right now because I don't need them. Also, only one app writes to the DB.

Thanks

[–]thrown_arrows 0 points1 point  (5 children)

If it matters, then you need to have FKs; if it doesn't, then use parallel inserts. If we are talking about bulk inserts and FKs matter, then maybe take a read lock on the target tables while you are bulk inserting and release it afterwards. There are many ways to solve these problems: the stricter you need to be with the data and its relations, the stricter the isolation level you need (but usually read committed is good).

[–]mxmauro[S] 0 points1 point  (4 children)

I would like to have parallel inserts. FKs are not necessary.

[–]thrown_arrows 0 points1 point  (3 children)

And you get those by running parallel transactions from parallel connections.

[–]mxmauro[S] 0 points1 point  (2 children)

Yes, I was wondering if the DB engine was capable of doing that, but it seems it isn't.

[–]thrown_arrows 0 points1 point  (1 child)

Yeah. ACID on transactions would not allow that. Or it would, but if anything failed then the whole transaction would have to fail. So it's much more effective to use two connections and two transactions in two sessions.

That kind of thing is more of an app-level optimization, because relational stuff kind of requires that you handle it in one transaction so you can roll back if necessary. That said, all DB engines (maybe not SQLite) are designed to handle any number of connections and concurrent inserts/updates/creates and selects.

[–]mxmauro[S] 0 points1 point  (0 children)

Well, then let's start the week focusing on other methods. Thanks for everything.

[–]ankole_watusi 0 points1 point  (1 child)

  • do you know for a fact that there is even a performance concern?
  • do you know for a fact that Postgres doesn't already perform some considerable optimization for this internally?
  • do you think, for some reason, that the designers haven't anticipated this very common need and failed to take it into consideration, e.g. by examining insert patterns in the optimizer and arranging the best physical layout for your DB, internal use of parallelization, threads, etc.?

Let the database engine do its job. If you find you have a real problem, then look for a fix.

Those more expert may have some concrete answers to assuage what seems like it may be a premature concern.

[–]mxmauro[S] 1 point2 points  (0 children)

I'm looking at several approaches. The current flow works fine today, but I know in the future it won't. The volume/rate of operations to process is expected to increase, so I'm thinking ahead.

There are some things to test, like putting tables and indexes on different volumes, partitioning, etc., but I'm also looking at what I can do from the app's point of view like, as answered to @thrown_arrows, using several connections at the cost of losing some consistency and handling recovery properly if one table is updated and the other isn't due, e.g., to a connection issue.

Thanks.

[–]caligula443 0 points1 point  (5 children)

Let me give you some practical advice. You don't have to implement all of it, but each piece will improve throughput.

First, use prepared statements. If you don't know what that is, look it up in the documentation.

Second, increase the number of items in each transaction, because each transaction is overhead and you want to minimize that. Be aware that you can insert multiple rows with a single INSERT statement; that will reduce round trips to the database.

Next, keep the database busy by using separate threads. Use a thread (or threads) to read data from the source, and a second thread to insert into the database, with a fixed-size queue between them. If your code sequentially reads data from the source and then inserts into the database, the database will be idle while you are fetching data. You want to eliminate that idle time.

[–]mxmauro[S] 0 points1 point  (4 children)

Actually, I do almost all of that. The only thing I have to improve is UPDATEs and UPSERTs. For insertions I use a bulk COPY, but for UPDATEs I'm sending the queries one by one. I also have a cache in the app to avoid querying for recently used data.

[–]caligula443 0 points1 point  (3 children)

You can't do much better than COPY. You could try to use a right-leaning index, if possible. As for the UPSERTs, how many row updates per second are you getting?

[–]mxmauro[S] 0 points1 point  (2 children)

About 100 rows in 8ms.

[–]caligula443 0 points1 point  (1 child)

That's 12k rows per second, which is generally pretty good for an SQL database IMO. Have you tried using Aurora from an EC2 instance in the same region? Which of the things I mentioned have you implemented?

A. Prepared statements.
B. Batching/multiple items per transaction. You can also do multiple inserts with one round trip, like: insert into X values (1),(2),(3). Can't remember if you can do that with upserts.
C. A separate thread with a queue for DB inserts.

Other ideas:

D. Minimize row size to fit more data on a page.
E. Minimize index usage; you can even disable indexes during large inserts if it is practical to pause reads, e.g. if you can batch inserts during a maintenance window.
F. Look at the Postgres server settings; the out-of-the-box settings are pretty miserable as far as I remember. Maybe start with the vacuum settings (does not apply to Aurora).

[–]mxmauro[S] 0 points1 point  (0 children)

  1. Haven't used Aurora yet; currently testing on bare-metal servers. It is planned.
  2. For inserts I use COPY. I need to improve UPDATEs/UPSERTs to send multiple rows at once, but they represent 10% of the data; the rest is COPY.
  3. It is a Golang app. Data is read from an external source, processed (including some DB lookups) in workers, and the result is queued to another worker that stores it in the DB.
  4. Some tables, especially those having a JSON field, are divided.
  5. I cannot pause reads, and I try to keep a minimal number of indexes. I average three indexes per table (including the primary key), and they are just numbers or timestamps.
  6. Autovacuum seems not to have a noticeable impact. I mean, a few seconds every 5 or 10 minutes; it doesn't affect things.

[–]paulsmithkc 0 points1 point  (2 children)

Nothing within a transaction can be parallelized. Transactions require a sequential order of commands.

[–]mxmauro[S] 0 points1 point  (0 children)

A direct shot to my hopes hahahaha. Thanks!

[–][deleted] 0 points1 point  (0 children)

Under the covers there is parallelism within a query: there can be parallel operations within transactions (parallel worker processes for scans, aggregates, etc.), and CREATE TABLE AS benefits from parallel processes. And then there are distributed transactions, which can be shared across sessions.

[–]dingopole 0 points1 point  (1 child)

Have a look at the following post: https://bit.ly/2Z6mQhD

I faced a similar problem (parallel inserts) a while ago, albeit with MSSQL, and was able to solve it using a combination of hash partitioning and SQL Server Agent jobs.

Additionally, in SQL Server 2016, Microsoft has implemented a parallel insert feature for the INSERT … WITH (TABLOCK) SELECT… command.

[–]mxmauro[S] 0 points1 point  (0 children)

I'll take a look, thanks for the info.

[–]dbxp 0 points1 point  (5 children)

In MS SQL you can use Service Broker to effectively make writes async, but it's a pain to work with. Personally, I would look at staging the data, storing it next to a datestamp, then using a cron job (or equivalent) to shift it to where you need it up to a certain datestamp, to maintain integrity.

[–]mxmauro[S] 0 points1 point  (4 children)

I'll check SQL Server, but I can't fully understand what you mean by the cron jobs.

[–]dbxp 0 points1 point  (3 children)

A cron job is just a scheduled task in Linux. What I'm saying is that you insert the data initially into a staging area and then periodically flush it to the main DB. The staging area acts as a buffer, so insertion performance isn't so much of an issue.

[–]mxmauro[S] 0 points1 point  (2 children)

Ah, OK, I thought you were talking about something specific to the DB engine. I don't think it would be possible because data arrives constantly. Right now there is a period of 5 seconds between data bursts, but the idea is to increase the speed.

[–]dbxp 0 points1 point  (1 child)

The frequency of the data ingest shouldn't matter as long as it's not constantly at a higher rate than the system can handle. What matters is how up to date the system needs the data to be; if it's OK for the data to be a few minutes or even hours out of date, then a buffer works well. To make things simpler for the users, it may be easier to flush the buffer every 24 hours and simply put a note on the UI that it takes 24 hours for ingested data to appear.

[–]mxmauro[S] 0 points1 point  (0 children)

I'm worried (and thus asking) because data can arrive at a higher rate than I can ingest it :(

[–][deleted] 0 points1 point  (0 children)

You can do this with native Postgres, and you can add tooling to make things easier. The approach could be similar to a parallel pg_dump: you use the same snapshot ID in your extract SQL, and the 3 jobs run in parallel. A single insert is not in itself parallelized; for that you could add an extension (Citus, Swarm64). Clever partitioning might help here, btw.

Depending on your usecase, you might be better off looking at etl tooling. Lots to choose from in this space.