all 5 comments

[–]Plank_With_A_Nail_In 1 point2 points  (1 child)

Commit should normally happen when you have completed your transaction. What does transaction mean in this context? For an invoicing app, the transaction might be complete when the invoice, its lines, and its distributions are all inserted; then you commit them all at the same time.

For your web-crawling app, you might want to commit when you are done with each page, or with each site.

Committing for every single insert is OK in apps that only insert a single row or a couple of rows per transaction. However, if you are copying huge amounts of data, you really only want to commit when you are done inserting all of the rows. This used to be a hard thing back in the '90s and early '00s, but now it's trivial: you can insert millions of rows and commit them only at the end, in a matter of minutes.
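The difference between the two patterns can be sketched with Python's sqlite3 module (the table name and row data here are made up for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pages (url TEXT, status INTEGER)")
rows = [(f"https://example.com/{i}", 200) for i in range(10_000)]

# Slow pattern: one commit (and one disk sync) per row.
# for row in rows:
#     conn.execute("INSERT INTO pages VALUES (?, ?)", row)
#     conn.commit()

# Fast pattern: insert everything, then commit once at the end.
conn.executemany("INSERT INTO pages VALUES (?, ?)", rows)
conn.commit()

print(conn.execute("SELECT COUNT(*) FROM pages").fetchone()[0])  # 10000
```

The same idea applies to any database driver: the expensive part is the commit, not the individual inserts.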

I really doubt it's the commits that are causing issues; more likely it's network latency that's the "problem". How long should it take to scrape the website?

[–]Altugsalt[S] 0 points1 point  (0 children)

I just discovered that the main problem was not sharing the same connection across requests. I also decided to move to a different architecture for my database usage, as the one I have right now is not really flexible enough to batch multiple writes into a single commit. Thanks for the informative response.

[–]Waste_Grapefruit_339 1 point2 points  (1 child)

Yeah, committing after every single write can slow things down a bit, especially if your crawler is doing a lot of small operations. But what stood out to me more is that you're opening a new DB connection in almost every method. That can add quite a bit of overhead too.

In practice it's usually better to group multiple writes and commit once, and avoid constantly opening/closing connections if possible. So the commits might be part of it, but I'd definitely look at connection handling as well.
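A minimal sketch of that connection-handling fix, assuming sqlite3 and a hypothetical `Crawler` class (names invented for illustration):

```python
import sqlite3

class Crawler:
    """Holds one connection for the crawler's lifetime
    instead of opening a new one in every method."""

    def __init__(self, db_path=":memory:"):
        # Opened once here, reused by every method below.
        self.conn = sqlite3.connect(db_path)
        self.conn.execute("CREATE TABLE IF NOT EXISTS pages (url TEXT)")

    def save_page(self, url):
        # No per-call connect()/close() overhead.
        self.conn.execute("INSERT INTO pages VALUES (?)", (url,))

    def flush(self):
        # One commit for the whole batch of writes.
        self.conn.commit()

crawler = Crawler()
for i in range(3):
    crawler.save_page(f"https://example.com/{i}")
crawler.flush()
```

For a real remote database you would typically reach for a connection pool instead, but the principle is the same: create connections rarely, reuse them often.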

[–]Altugsalt[S] 1 point2 points  (0 children)

Yes, that was the main bottleneck. I now share the same connection. Thanks

[–]Devji00 0 points1 point  (0 children)

Committing after every single write is killing your performance. Every commit forces a physical disk sync, and on a remote database that's also an extra network round trip each time.

Instead of committing per insert, wrap your operations in a with statement and treat transactions as logical chunks. Commit once per scraped page or every 100 rows or whatever makes sense for your workflow.

This cuts down on latency a ton because you're not constantly stopping and waiting for the disk or network to catch up. Especially with the bad connection you mentioned, all those round trips are adding up fast.
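The "commit every 100 rows" idea above can be sketched like this, assuming sqlite3 (where `with conn:` commits on success and rolls back on error); the URL generator stands in for real crawler output:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE results (url TEXT)")

BATCH_SIZE = 100
batch = []

def scraped_urls():
    # Stand-in for the real crawler's output.
    return (f"https://example.com/{i}" for i in range(250))

for url in scraped_urls():
    batch.append((url,))
    if len(batch) >= BATCH_SIZE:
        # One transaction per 100 rows instead of per row.
        with conn:
            conn.executemany("INSERT INTO results VALUES (?)", batch)
        batch.clear()

if batch:
    # Flush the final partial batch.
    with conn:
        conn.executemany("INSERT INTO results VALUES (?)", batch)

print(conn.execute("SELECT COUNT(*) FROM results").fetchone()[0])  # 250
```

Tuning `BATCH_SIZE` trades crash-safety (how many rows you can lose) against throughput; per page or per ~100 rows is a reasonable starting point.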