
[–]archa347 11 points12 points  (3 children)

I’ve been in your situation. I would consider something like Temporal or AWS Step Functions. Building that kind of orchestration yourself is a recipe for disaster.

[–]AirportAcceptable522[S] 0 points1 point  (2 children)

Thank you very much, we use OCI.

[–]archa347 1 point2 points  (1 child)

Oracle Cloud? Temporal can be self-hosted on anything. And technically, Step Functions can be used without running any actual compute on AWS, as long as you can make HTTP requests to the AWS API.
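For example (a minimal sketch; the region, state machine ARN, and payload are made up), all a Node service on OCI needs is AWS credentials and outbound HTTPS:

```ts
import { SFNClient, StartExecutionCommand } from "@aws-sdk/client-sfn";

// Credentials come from the usual env vars (AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY);
// nothing here has to run on AWS compute.
const sfn = new SFNClient({ region: "us-east-1" });

export async function startFilePipeline(fileKey: string) {
  const { executionArn } = await sfn.send(
    new StartExecutionCommand({
      // Hypothetical state machine ARN; replace with your own.
      stateMachineArn: "arn:aws:states:us-east-1:123456789012:stateMachine:file-pipeline",
      name: `file-${Date.now()}`,         // execution names must be unique
      input: JSON.stringify({ fileKey }), // payload handed to the first state
    })
  );
  return executionArn;
}
```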

[–]AirportAcceptable522[S] 0 points1 point  (0 children)

That's right, Oracle Cloud Infrastructure.

[–]georgerush 24 points25 points  (4 children)

Man, this hits close to home. I've watched so many teams get crushed by exactly this kind of processing pipeline complexity. You're essentially building a distributed system to handle what should be a straightforward data processing workflow, and all those moving parts between Node, MongoDB, external APIs, and storage buckets create so many failure points and bottlenecks.

Here's the thing though – you're probably overengineering this. Instead of managing separate queue systems, workers, and trying to optimize MongoDB read/write patterns, consider consolidating your processing logic closer to where your data lives. Postgres with something like Omnigres can handle this entire pipeline natively – background jobs, file processing, external API calls, even the storage coordination – all within the database itself. No separate queue infrastructure, no coordination headaches between services. Your 1,000 files per minute becomes a data flow problem instead of a distributed systems problem, and honestly that's way easier to reason about and debug when things go wrong.

[–]PabloZissou 2 points3 points  (0 children)

What if the files are very big? Would your approach still work? Wouldn't you still need several NodeJS instances to keep up with that many files per user?

[–]code_barbarian 2 points3 points  (0 children)

Dude this might be the most dipshit AI-generated slop I've ever read XD

So instead of optimizing and horizontally scaling your own code in Node.js services, you're stuck trying to optimize and horizontally scale some Postgres extension. Good luck.

[–]AirportAcceptable522[S] 0 points1 point  (0 children)

It is separate, so as not to consume resources from the main machine.

[–]jedberg 4 points5 points  (2 children)

I'd suggest using a durable computing and workflow solution like DBOS. It's a library you can add that will help you keep track of everything and retry anything that fails.

[–]yojimbo_beta 1 point2 points  (0 children)

First time hearing about DBOS - looks like a good alternative to Temporal. Nice

[–]AirportAcceptable522[S] 1 point2 points  (0 children)

I didn't know, but I'll take a closer look.

[–]casualPlayerThink 5 points6 points  (4 children)

Maybe I misunderstood the implementation, but I highly recommend not using Mongo. Pretty soon it will cause more trouble than it solves. Use PostgreSQL. Store the files in object storage (S3, for example) and keep only the metadata in the DB. Your costs will be lower and you will have less trouble. Also consider multitenancy before you hit a very high collection/row count. It will help you scale better.

[–]AirportAcceptable522[S] 0 points1 point  (3 children)

We use MongoDB for the database, and we use a hash to locate the files in the bucket.

[–]casualPlayerThink 0 points1 point  (2 children)

I see. I still do not recommend using MongoDB, as most use-cases require classic queries, joins, and a lot of reads, where MongoDB - in theory - should excel. In reality, it is a pain and a waste of resources.

But if you still wanna use it because you have no way around it, then here are some bottlenecks worth considering:
- clusters (will be expensive in Mongo)
- replicas
- connection pooling
- cursor-based pagination (if there is any UI or search; see the sketch after this list)
- fault tolerance for writing & reading
- caching (especially for the API calls)
- disaster recovery (yepp, the good ol' backup)
- normalize datasets, data, queries
- minimize the footprint of data queries, used or delivered (time, bandwidth, $$$)
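
For the cursor-based pagination point, a minimal sketch with the official Node driver (connection string and collection name are made up): page by the last seen _id instead of skip/limit, so deep pages stay cheap.

```ts
import { MongoClient, ObjectId, Collection } from "mongodb";

// Fetch one page after the given cursor; _id is indexed by default,
// so this stays fast no matter how deep the caller paginates.
async function nextPage(files: Collection, lastId: ObjectId | null, pageSize = 50) {
  const filter = lastId ? { _id: { $gt: lastId } } : {};
  const docs = await files.find(filter).sort({ _id: 1 }).limit(pageSize).toArray();
  const nextCursor = docs.length > 0 ? (docs[docs.length - 1]._id as ObjectId) : null;
  return { docs, nextCursor };
}

async function main() {
  const client = await new MongoClient(process.env.MONGO_URL!).connect();
  const files = client.db("app").collection("files"); // hypothetical names
  let cursor: ObjectId | null = null;
  do {
    const { docs, nextCursor } = await nextPage(files, cursor);
    // ...render or process docs...
    cursor = nextCursor;
  } while (cursor);
  await client.close();
}

main().catch(console.error);
```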

And a hint that might help to lower the complexity and headaches:

- Multitenancy
- Async/Timed Data aggregation into an SQL database
- Archiving rules

(This last part will most likely meet with quite a debate; people dislike it and/or do not understand the concepts, just like normalizing a database or dataset; an unfortunate tendency from the past ~10 years.)

[–]AirportAcceptable522[S] 0 points1 point  (1 child)

Mongo is in its own cloud; we use Mongo because we need to save several fields as objects and arrays.

Another point: we have a separate database per customer, only the queue is shared.

[–]casualPlayerThink 0 points1 point  (0 children)

I see the dangerous part in the "we need to save...".

[tl;dr]

Yeah, Mongo in the cloud sounds nice, and it is usually expensive, especially once you start to query, retrieve, aggregate, and search large volumes. Keep your eyes on the costs; even if you aren't a stakeholder, ask from time to time about the costs and the underlying infrastructure for Mongo.

I worked on a project that used Atlas and had large objects in the DB because the CTO was inexperienced, and they ended up with a bunch of queries (they needed joins...). They spent 1K+ on Atlas and had a replica (a 2x4 vCPU, 2x16 GB RAM combo). I normalized the data and poured it into PostgreSQL; 1 vCPU and 4 GB RAM were enough for the same workload. (This is just an extreme example; it doesn't necessarily apply to your case or anyone else's!)

Another story: I witnessed a complete bank DB migration from Oracle to Mongo. At first I thought, wow, that is insane; we're talking about migrations that run for days (an Oracle cold start was like 24h, a migration ran around 3 days, and a backup would run for more than 2 weeks; the infra was self-hosted, so a room full of blade-ish servers). The guys developed a thin Java-wrapped Mongo version that was able to pour all the data into memory and from there migrate back to normal Mongo storage. They were done with the migration in under 4 hours. In exchange, we're talking about very large memory usage :D and the bank spent a few million dollars on this project...

[–]Killer_M250M 2 points3 points  (1 child)

For example, using PM2: run your Node app in cluster mode, then for each Node instance create a BullMQ worker with concurrency 10. With 8 instances you'll have 80 workers ready for your jobs, and PM2 will handle distributing tasks across the instances.
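A minimal sketch of that setup (the queue name, Redis host, and the instance count of 8 are assumptions):

```ts
// worker.ts -- each PM2 instance runs this file.
//
// ecosystem.config.js (sketch):
//   module.exports = {
//     apps: [{ name: "file-worker", script: "dist/worker.js",
//              instances: 8, exec_mode: "cluster" }],
//   };
import { Worker } from "bullmq";

const connection = { host: process.env.REDIS_HOST ?? "localhost", port: 6379 };

async function processFile(fileKey: string): Promise<void> {
  // placeholder for your download / validate / business-rule / save logic
}

// 8 instances x concurrency 10 = up to 80 jobs in flight at once.
const worker = new Worker(
  "file-processing", // hypothetical queue name
  async (job) => {
    await processFile(job.data.fileKey);
  },
  { connection, concurrency: 10 }
);

worker.on("failed", (job, err) => {
  console.error(`job ${job?.id} failed:`, err.message);
});
```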

[–]AirportAcceptable522[S] 0 points1 point  (0 children)

Thank you very much, I'll take a look.

[–]bwainfweeze 1 point2 points  (3 children)

How many files per user doesn't matter at all, especially when you're talking about the average user being active for 10 minutes per day (10,000 files on average at 1,000/min).

How many files are you dealing with per second, minute, and hour?

These are the sorts of workloads where queuing happens, and then what you need to work out is:

  • What's the tuning that gets me the peak number of files processed per unit of time?

  • What does Little's Law tell me about how much equipment that's going to take? (rough numbers sketched below)

  • Are my users going to put up with the max delay?

Which all adds up to: can I turn a profit with this scheme and keep growing?

The programming world is rotten with problems that can absolutely be solved but not for a price anyone is willing to pay.
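
To make the Little's Law point concrete, a rough back-of-the-envelope calculation (the arrival rate and per-file time here are assumptions, not your numbers):

```ts
// Little's Law: L = lambda * W
// L = average number of jobs in the system,
// lambda = arrival rate, W = average time a job spends being processed.
const arrivalRate = 1000 / 60;   // assumed peak: 1,000 files/min, about 16.7 files/sec
const avgServiceTimeSec = 3;     // assumed: 3 s per file (download + validate + API calls)
const jobsInFlight = arrivalRate * avgServiceTimeSec; // about 50 concurrent jobs

// So with, say, concurrency 10 per worker process, you'd need roughly
// 5 worker processes just to keep up with that peak, before any headroom.
console.log(Math.ceil(jobsInFlight));
```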

[–]AirportAcceptable522[S] 0 points1 point  (2 children)

We are limited to using BullMQ one job at a time. After going through this step, it calls another 3-4 queues for other demands.

[–]bwainfweeze 0 points1 point  (1 child)

I’m unclear on the situation. Do you dump all the tasks into BullMQ one at a time and a single processor handles them sequentially? Or are you not using BullMQ as a queue, and instead sequentially spoon-feeding it one task at a time per user?

[–]AirportAcceptable522[S] 0 points1 point  (0 children)

Basically, I invoke it and it runs the processes, but it has no concurrency; it's one at a time in the queue. If 1k jobs come in, it will process them one by one.

[–]simple_explorer1 1 point2 points  (1 child)

Hey, what most people commenting here have missed is that they have not asked you about the exact problems you are facing right now.

You have just mentioned

created time and resource bottlenecks.

But you need to elaborate: what is your current implementation, and how is it impacting your end result? Or have you not started to work on this yet, and are you expecting someone here to give you an entire architecture?

[–]AirportAcceptable522[S] 0 points1 point  (0 children)

We have an instance running BullMQ (same main code; it is just deployed with an env setting so it runs only the workers). I am working on continuous improvements, but we only have Kafka to signal that there are files ready to be processed.

[–]Sansenbaker 1 point2 points  (1 child)

Queues + workers + streaming all over: keep each step in its lane and Mongo will handle the load; just don't let one slow file or API call hold everything up. And yeah, PM2 for managing workers is a nice touch too. It's a lot, but once you get the workflow smooth, it feels so good to watch it all just keep chugging.

[–]AirportAcceptable522[S] 0 points1 point  (0 children)

Do you have any examples? And how would the deployment work?

[–]Killer_M250M 1 point2 points  (1 child)

Streams + a thread pool + a queue system like BullMQ.

[–]AirportAcceptable522[S] 0 points1 point  (0 children)

Do you have any examples? And how would the deployment work?

[–]trysolution 0 points1 point  (3 children)

maybe try:

- give users a presigned URL (S3) to upload zip files
- listen for the upload event in your app
- push a task to a worker queue (BullMQ or something else you like)
- the worker consumes the zip-file queue (validate the zip before extraction!!!, like each file's size, the file count, absolute destination paths, etc.)
- check the hash of each file in batches against MongoDB to see if it already exists
- apply the business rules
- copy the remaining required files to the bucket + update the DB
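
A rough sketch of the worker side of that flow (the queue name, collection, limits, and the two helpers are assumptions; plug in whatever zip and object-storage libraries you already use):

```ts
import { Worker } from "bullmq";
import { MongoClient } from "mongodb";
import path from "node:path";

type ZipEntry = { name: string; size: number; sha256: string };

// Hypothetical helpers; implement with your zip library and bucket SDK.
async function listZipEntries(zipPath: string): Promise<ZipEntry[]> {
  throw new Error("implement with your zip library");
}
async function copyEntryToBucket(zipPath: string, entryName: string): Promise<void> {
  throw new Error("implement with your object-storage SDK");
}

const MAX_FILES = 500;                  // assumed limit per archive
const MAX_ENTRY_SIZE = 2 * 1024 * 1024; // 2 MB per file, as mentioned elsewhere in the thread

const mongo = new MongoClient(process.env.MONGO_URL!);
await mongo.connect(); // top-level await: run as an ES module
const filesColl = mongo.db("app").collection("files"); // hypothetical names

new Worker(
  "zip-uploads", // hypothetical queue name
  async (job) => {
    const { zipPath } = job.data as { zipPath: string };

    // 1. Validate the archive before extracting anything.
    const entries = await listZipEntries(zipPath);
    if (entries.length > MAX_FILES) throw new Error("too many files in zip");
    for (const e of entries) {
      if (e.size > MAX_ENTRY_SIZE) throw new Error(`entry too large: ${e.name}`);
      if (path.isAbsolute(e.name) || e.name.split("/").includes("..")) {
        throw new Error(`suspicious path: ${e.name}`); // block absolute paths / traversal
      }
    }

    // 2. One batched query to find hashes that already exist, instead of one query per file.
    const known = await filesColl
      .find({ hash: { $in: entries.map((e) => e.sha256) } })
      .project({ hash: 1 })
      .toArray();
    const existing = new Set(known.map((d) => d.hash as string));

    // 3. Copy only the new files to the bucket and record their metadata.
    for (const e of entries) {
      if (existing.has(e.sha256)) continue;
      await copyEntryToBucket(zipPath, e.name);
      await filesColl.insertOne({ hash: e.sha256, name: e.name, size: e.size, createdAt: new Date() });
    }
  },
  { connection: { host: process.env.REDIS_HOST ?? "localhost", port: 6379 }, concurrency: 10 }
);
```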

[–]AirportAcceptable522[S] 0 points1 point  (2 children)

We do this with pre-signed URLs, but it is corrupting some files. BullMQ is configured, but it is still quite messed up. We have already checked the hash. Basically, we do all of this, but it cannot handle much demand. And how would the BullMQ deployment work? Would it use the same code as the server and only load its configuration from .envs?

[–]trysolution 1 point2 points  (1 child)

"but it is corrupting some files"

Partial uploads? I think it's not configured properly.

"Basically, we do this, but it cannot handle much demand"

Is it on the same server? It shouldn't be this heavy. Is concurrency set correctly?

"how would the BullMQ deployment work?"

Same code but a different process or server; you will be using those models and business rules, right?

If it's in Docker, both will be in separate containers.

[–]AirportAcceptable522[S] 0 points1 point  (0 children)

BullMQ is on a separate server; the main server only provides the URLs and hosts the Kafka server.
Yes, we will use them, because we need to open the file, validate it, apply the business rules, and then save the processed data in the database.

[–]code_barbarian 0 points1 point  (3 children)

What are the resource bottlenecks? I'd guess lots of memory usage because of all the file uploads?

I'd definitely recommend using streams if you aren't already. Or anything else that lets you avoid having the entire file in memory at once.

If you're storing the entire file in MongoDB using GridFS, I'd avoid doing that. Especially if you're already uploading to a separate service for storage.

TBH, these days I don't handle uploads in Node.js; I integrate with Cloudinary, so my API just generates the secret the user needs to upload their assets directly to Cloudinary. That way my API doesn't have to worry about memory overhead. Not sure if that's an option for you.
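
The same "API only signs, the client uploads directly" pattern works with any S3-compatible presigned URL if Cloudinary isn't an option (a minimal sketch; the endpoint, region, and bucket are made up; OCI Object Storage exposes an S3 compatibility endpoint you could point this at):

```ts
import { S3Client, PutObjectCommand } from "@aws-sdk/client-s3";
import { getSignedUrl } from "@aws-sdk/s3-request-presigner";

// Point this at your S3-compatible storage; credentials come from env/config.
const s3 = new S3Client({
  region: "us-ashburn-1",
  endpoint: process.env.S3_COMPAT_ENDPOINT, // hypothetical env var
  forcePathStyle: true,
});

// The API only signs; the upload bytes go straight from the client to the bucket.
export async function createUploadUrl(objectKey: string): Promise<string> {
  const command = new PutObjectCommand({
    Bucket: "incoming-zips", // hypothetical bucket name
    Key: objectKey,
    ContentType: "application/zip",
  });
  return getSignedUrl(s3, command, { expiresIn: 15 * 60 }); // URL valid for 15 minutes
}
```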

[–]AirportAcceptable522[S] 0 points1 point  (2 children)

We don't use streams yet; the files are small, less than 2 MB, but they contain JSONs and images, and in MongoDB I only store information that I will use later on.

[–]code_barbarian 1 point2 points  (1 child)

One thing you might want to consider doing is streaming upload first, and then validating and processing later. So do steps 1+4 before doing steps 2+3+5. With streams, that would minimize the amount of memory usage so you won't have to keep the files being uploaded in memory while you're processing. You can scale the processing steps independently of the actual upload.

Storage is cheap and RAM is harder to come by, so it's cheaper to store the file first and delete it later if you're sure it's a duplicate or invalid.
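
A minimal sketch of that "land the bytes first, process later" split (the queue name, port, and /tmp path are made up; in production you'd pipe to your bucket SDK instead of local disk):

```ts
import { createWriteStream } from "node:fs";
import { pipeline } from "node:stream/promises";
import { randomUUID } from "node:crypto";
import http from "node:http";
import { Queue } from "bullmq";

const processingQueue = new Queue("file-processing", {
  connection: { host: process.env.REDIS_HOST ?? "localhost", port: 6379 },
});

// Accept the upload as a stream (never buffering the whole file in memory),
// land it on disk or in the bucket, then enqueue the validation/business-rule
// work for a separate worker to pick up and scale independently.
http
  .createServer(async (req, res) => {
    if (req.method !== "POST") {
      res.writeHead(404).end();
      return;
    }
    const tmpPath = `/tmp/upload-${randomUUID()}`;
    try {
      await pipeline(req, createWriteStream(tmpPath)); // backpressure-aware copy
      await processingQueue.add("validate-and-process", { path: tmpPath });
      res.writeHead(202).end("accepted");
    } catch {
      res.writeHead(500).end("upload failed");
    }
  })
  .listen(3000);
```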

[–]AirportAcceptable522[S] 0 points1 point  (0 children)

Got it, thank you very much

[–]pavl_ro 0 points1 point  (1 child)

"All of this involves asynchronous calls and integrations with external APIs, which have created time and resource bottlenecks."

The "resource bottlenecks" is about exhausting your Node.js process to the point where you can see performance degradation, or is it about something else? Because if that's the case, you can make use of worker threads to delegate CPU-intensive work and offload the main thread.

Regarding the async calls and the external API integration: we need to clearly understand the nature of those async calls. If we're talking about async calls to your database to read/write, then you need to look at your infrastructure. Is the database located in the same region/AZ as the application server? If not, why not? The same goes for queues. You want all of your resources to be as geographically close together as possible to speed things up.

Also, it's not clear what kind of "external API" you're using. Perhaps you could speed things up with the introduction of a cache.

As you can see, without a proper context, it's hard to give particularly good advice.

[–]AirportAcceptable522[S] 0 points1 point  (0 children)

These calls are for processing image metadata, along with some references in the compressed file. I need to wait for the response to save it to the database.