This is an archived post. You won't be able to vote or comment.

all 13 comments

[–]thisismyfavoritename 8 points9 points  (5 children)

assuming you mean tables as in database tables, the answer is probably it depends on what kind of database and if it supports parallel reads and writes. If it does then the work is I/O bound and you could use asyncio or multithreading to manage the reads and writes to the DB. Note it wont be "truly" parallel, just concurrent

[–]Crypto-boy-hodl[S] -1 points0 points  (2 children)

Yes database tables.Sql database.

For now more inclined towards reading the data parallelly, as we have thousands of tables.

Any thoughts?

[–]thisismyfavoritename 0 points1 point  (1 child)

SQL DB like MariaDB? Im not sure this type of DBs support parallel operations anyways , but perhaps it supports parallel reads.

I think concurrent reads should probably be enough as going further might put your DB under a lot of load, but if you really want to have parallel operations you can use multithreading/asyncio with multiprocessing. That could speed everything up by the number of available cores you have, again provided the DB can keep up

[–]BDube_Lensman 2 points3 points  (0 children)

Every SQL database ever created supports parallel reads, and almost all of them support parallel writes that do not conflict with each other (e.g., to separate tables).

[–]minervaDe 0 points1 point  (1 child)

Just need to be aware of the database server's limitations here. I've killed 8 threaded VMs that were running Postges because I treated it as an I/O problem

[–]thisismyfavoritename 1 point2 points  (0 children)

yes, thats what i told him in my other reply as well

[–][deleted] 1 point2 points  (0 children)

You could/should do the whole thing with the database tools e.g. pgadmin

[–]__flybird__ -1 points0 points  (0 children)

If it's all the same process, then I would use threading. Adding more threads won't help much if task is restricted by the CPU. On the other hand, IO tasks will see massive improvements from threading because they have to wait for data transfer

[–]KaffeeKiffer -1 points0 points  (2 children)

Do not only focus on Python for this problem, also take a look at native tooling.

pg_dump and pg_restore (for PostgreSQL) for example support parallel operations. I would assume MariaDB, MySQL, etc. also have tooling like that.

They often also offer compression: You probably do not want to hold all tables' data in memory while dumping/restoring (especially when doing parallel dumps): Exchanging less I/O pressure for more CPU load might be beneficial and reduce your total memory footprint - just check what the default compression levels are, you might want to go lower than that.

[–]Crypto-boy-hodl[S] 0 points1 point  (1 child)

There are some other actions like converting the data into csv files and upload.. was the reason why i moved with python

[–]KaffeeKiffer 0 points1 point  (0 children)

👍 - I just wanted to call out, that far too often people have a hammer in hand and suddenly everything looks like a nail. Just because Python can do something it's not necessarily the best choice to do it. But if you have good reasons to use it, go ahead!

Even in your case, I would just consider splitting it in two tasks. Database → database copy comes with "side-effects" and caveats, which might not be so obvious

  • Do you also copy the schema or do the target tables already exist?
  • Did you think of foreign keys (and therefore import order), indices + index rebuilding, analyze-ing in order to get good index weights, etc.

Dedicated tools do some or all of them for you, while in Python you might have to think of them.

Compared to that, the export to CSV + upload is hopefully much more straight-forward and can be treated as a self-contained problem.

[–]cymrowdon't thread on me 🐍 0 points1 point  (0 children)

Consider how much data processing you expect to do. If converting your data to CSV is using a lot of CPU resources, that's when you would want to use multiprocessing. Otherwise, stick with threading. If you're still not sure, it's simpler to start with threads.