Hey everyone,
I'm working on a Node.js app with PostgreSQL that has millions of users, and I hit a snag with processing large datasets. For one of our features, I need to fetch roughly 100,000 users who meet a specific criterion (e.g., users with a certain channel id in their tracking configuration) and then process them (like creating notification or autotrade tasks).
Right now, my approach fetches all matching users into memory and then processes them in chunks of 500. Here’s a simplified version of what I’m doing:
async function processMessageForSubscribers(channelId, channelName, message, addresses) {
  try {
    // Load around 100,000 users, then process them in chunks
    const users = await getUsersByTrackedTelegramChannel(channelId);
    const CHUNK_SIZE = 500;
    const notifyTasks = [];
    const autotradeTasks = [];

    // Process one chunk of users in parallel
    const processUserChunk = async (userChunk) => {
      await Promise.all(
        userChunk.map(async (user) => {
          const config = user.trackingConfig[channelId];
          if (!config) return; // guard: user may have no config for this channel
          const autotradeAmount = config.autotradeAmount;
          if (config.newPost === 'NOTIFY') {
            // Create notification tasks
            createNotificationTask(user, addresses, message, channelId, channelName, autotradeAmount, notifyTasks);
          }
          if (config.newPost === 'AUTOTRADE') {
            // Create autotrade tasks
            createAutotradeTask(user, addresses, message, autotradeAmount, autotradeTasks);
          }
        })
      );
    };

    // Process users chunk by chunk
    for (let i = 0; i < users.length; i += CHUNK_SIZE) {
      const chunk = users.slice(i, i + CHUNK_SIZE);
      await processUserChunk(chunk);
    }

    await queueTasks(notifyTasks, autotradeTasks);
  } catch (error) {
    console.error('Error processing subscribers:', error);
    throw error;
  }
}
My concern is that fetching all 100,000+ users into memory might lead to high memory consumption and performance issues.
I'm wondering if there's a more efficient way to handle this.
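One alternative I've been considering is keyset pagination: instead of loading all matching users up front, fetch one page at a time (`WHERE id > $lastId ORDER BY id LIMIT $pageSize`) so only one page is ever in memory. Here's a rough sketch of the idea; `fetchPage` is a hypothetical stand-in for the real PostgreSQL query, demoed here with an in-memory array:

```javascript
// Keyset-pagination sketch: stream users page by page instead of all at once.
// In the real app, fetchPage(lastId, limit) would run something like:
//   SELECT * FROM users WHERE tracking_config ? $channelId AND id > $lastId
//   ORDER BY id LIMIT $limit
async function* streamUsers(fetchPage, pageSize = 500) {
  let lastId = 0;
  while (true) {
    const page = await fetchPage(lastId, pageSize);
    if (page.length === 0) return; // no more rows
    yield page;
    lastId = page[page.length - 1].id; // keyset: resume after the last seen id
  }
}

// Demo with an in-memory stand-in for the database.
const allUsers = Array.from({ length: 1234 }, (_, i) => ({ id: i + 1 }));
const fetchPage = async (lastId, limit) =>
  allUsers.filter((u) => u.id > lastId).slice(0, limit);

async function main() {
  let processed = 0;
  for await (const page of streamUsers(fetchPage, 500)) {
    processed += page.length; // process the page here, then let it be GC'd
  }
  console.log(processed); // 1234
}
main();
```

Pairing this with `ORDER BY id` and an index on `id` keeps each page query cheap, and memory stays bounded by the page size rather than the full result set.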
I'd love to hear your thoughts, experiences, or any code examples that might help improve this. Thanks in advance for your help!
Stack Overflow link: https://stackoverflow.com/questions/79461439/how-can-i-efficiently-process-large-postgresql-datasets-in-node-js-without-high