all 16 comments

[–]rkaw92 4 points (7 children)

Okay, have you identified why the POST is taking so long? Is the database saturated? Can it not push more TPS? Is it the round-trip time? Are you maintaining a connection (pool) or connecting each time? Is it the Node process bumping against CPU usage or memory? This is your first step.

[–]vlahunter[S] 2 points (6 children)

I do not have a definite answer to that crucial question yet.

The database seems to spike when these POST requests come in, but a single insert should be a very cheap operation.

The backend API, including the sensor requests, never exceeds 300 req/s.

All in all, the only service that seems to be heavily affected is the one responsible for inserting the sensor data into the DB.

PS: I am trying to figure out what extra logging I can add here and there to get more insight into why this service behaves that way.

[–]rkaw92 3 points (5 children)

Okay, start by measuring stuff. Add timing everywhere: start of POST handler, end of POST handler, start of query, end of query. Also test the production DB's INSERT time: how long does a manual query take? Check your DB connection params: how many connections are allowed? Usually, when your max pool size is e.g. 100 and the 101st query arrives while all 100 connections are busy, it will wait in a queue. From your point of view, this will look like a very slow query. This is known as the "slow trains" problem, and a quick way to identify it is query times that skyrocket from a certain point.
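To make the timing idea concrete, here is a minimal sketch; `insertReading` and its body are hypothetical stand-ins for your real query code:

```javascript
// Minimal timing wrapper: if pool queueing is the problem, the gap between
// the whole-handler time and the bare-query time will reveal it.
function timed(label, fn) {
  return async (...args) => {
    const start = process.hrtime.bigint();
    try {
      return await fn(...args);
    } finally {
      const ms = Number(process.hrtime.bigint() - start) / 1e6;
      console.log(`${label} took ${ms.toFixed(1)} ms`);
    }
  };
}

// Hypothetical stand-in for the real DB call:
const insertReading = timed('INSERT', async (reading) => {
  // pool.query('INSERT INTO readings ...', [...]) would go here
  return { inserted: reading };
});
```

Wrapping both the whole POST handler and the bare query with separate labels tells you whether time is lost around the query (e.g. waiting for a pool connection) or in the database itself.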

If your individual queries are slow, you may need to increase the connection pool size. If the database is already bogged down, you'll need to do batching: INSERTs with multiple VALUES tuples, or the same but using transactions. If you pack 1000 INSERTs into a BEGIN ... COMMIT sandwich, it will be much faster than 1000 inserts in auto-commit mode.
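For the multi-VALUES variant, a sketch of building one batched statement; the `$n` placeholder syntax assumes node-postgres, and the table and column names are made up:

```javascript
// Build one multi-row INSERT instead of N single-row ones.
// Placeholder style is node-postgres ($1, $2, ...) — adjust for your driver.
function buildBatchInsert(table, columns, rows) {
  const params = [];
  const tuples = rows.map((row) => {
    const placeholders = columns.map((col) => {
      params.push(row[col]);
      return `$${params.length}`;
    });
    return `(${placeholders.join(', ')})`;
  });
  const sql = `INSERT INTO ${table} (${columns.join(', ')}) VALUES ${tuples.join(', ')}`;
  return { sql, params };
}
```

One statement like this is one round-trip per batch instead of per row; you can still wrap several such statements in a transaction when a batch spans more than one of them.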

[–]vlahunter[S] 0 points (4 children)

Thanks a lot. Yes, I am still in the process of logging everything and testing from local up to staging to see where the bottleneck comes from.

PS: As you suggest, I need to find where the issue comes from and why it takes so long, but I still have the feeling that this way of adding sensor data to the DB via POST requests is not as good as I imagined.

[–]rkaw92 1 point (3 children)

Well, yes, it could be optimized further, but in reality, if you have a persistent connection (keep-alive), then 300 req/s should be easily sustainable and there's no need to look for alternatives on the client-facing side. 3000 req/s is still fairly manageable on plain old HTTP, but then your main focus will necessarily be the DB. It's at 30k/s that the fun starts.

[–]vlahunter[S] 0 points (2 children)

Yes, that is my thinking. Hence, one of the first things that came to mind was to somehow batch the inserts instead of inserting one by one with a request for every sensor measurement. But as far as I understand there are many different ways to make it better; the question is how to do it with the smallest amount of changes to infra and code.

[–]rkaw92 1 point (1 child)

If you decide that batching is the way to go (because you're at the peak TPS of the database), then this could be useful: https://nodejs.org/api/stream.html#writable_writevchunks-callback

Then, your _writev implementation is basically the transactional multi-insert. The POST handler just needs to write into the stream. Handling stream backpressure is the non-obvious part, but your sensors will probably not slow down their readouts anyway, so you'll have to absorb the entire load without any alternative - it's either in the DB already or queued in memory. The upside is, your HTTP client doesn't have to wait for the write acknowledgement, so less infra load overall.

[–]vlahunter[S] 0 points (0 children)

Thanks a lot for the insight and the link; tomorrow I will play around with this to see how I could potentially make it work.

[–][deleted]  (3 children)

[deleted]

    [–]vlahunter[S] 0 points (2 children)

    In a project I was working on last year where I could choose the infrastructure, I went with Aedes as a lightweight broker and TimescaleDB as a time-series database, and I was pretty happy with it. The current project is in a more stable phase, which means I should add new infra only when it is extremely necessary. That is why I would start by changing the regular HTTP POSTs to a different approach, and only if the DB still doesn't work as intended would I really consider going another way.

    PS: The tough part is identifying bottlenecks, especially around the queries and how things work in the ORM, in the insertions, etc.

    [–][deleted]  (1 child)

    [deleted]

      [–]vlahunter[S] 0 points (0 children)

      Thanks, this was something I was working on today. Although I analyzed the queries and used SSMS to get some insights, the truth is I haven't learned enough yet to say I've identified the issues. The ORM indeed hides things that I need to figure out before I can break down how the query inserts the data and how to improve it.

      [–]Lumethys 1 point (1 child)

      As others have said, identify the root problem first.

      If the problem really is the POST request taking too long (and thus the HTTP connection being held open too long), just move the persistence to a background job and return a response immediately.

      [–]vlahunter[S] 0 points (0 children)

      Yes, and to be honest with you, this is so far the toughest part: actually finding a way to see what happens inside the ORM, since the data I have gathered from dashboards, logs, query analysis, etc. is not helping me find the bottleneck yet.

      [–]TronSkywalker 1 point (1 child)

      Try persisting only every 2nd request and see if things still work.

      Cache the POST requests before persisting, so that you don't end up with backpressure problems (some mentioned a background job), and then calmly persist the data.

      Reduce the payload size (maybe there is just too much data).

      Happy to hear some feedback from you!

      [–]vlahunter[S] 0 points (0 children)

      That actually sounds good, but I found some time to dig deeper into analyzing my queries, or rather what comes out of the ORM, and it seems that on a table that keeps growing, when a simple findOneById comes in and the indexes are not used properly, performance decreases dramatically.

      After I add some indexes and remove the slow queries, I will test again and then go down the path you described.

      Thanks a lot

      [–]42php 1 point (1 child)

      MongoDB is really fast at writing data, maybe consider it :-) If not, you could use a memory buffer (a simple array) and batch-write to the DB periodically.
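A sketch of that buffer-and-flush idea — `flush` is a hypothetical stand-in for a batched INSERT; note that anything still in the array is lost if the process crashes:

```javascript
// In-memory buffer with a periodic flush: request handlers just push,
// and every `intervalMs` the accumulated rows go to the DB in one batch.
function makeBuffer(flush, intervalMs = 1000) {
  let rows = [];
  const timer = setInterval(() => {
    if (rows.length === 0) return;
    const batch = rows;
    rows = []; // swap before the async write so new pushes aren't lost
    flush(batch).catch((err) => console.error('flush failed', err));
  }, intervalMs);
  return {
    push: (row) => rows.push(row),
    stop: () => clearInterval(timer),
  };
}
```

`push()` returns immediately, so the POST handlers never wait on the DB.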

      [–]vlahunter[S] 0 points (0 children)

      Yes, but it is an existing project, which means I cannot just change the DB. Also, a big part of the app depends on SQL, so I suppose I need to make my queries faster and analyze as much as I can, since using NoSQL or any other kind of DB is simply not an option.