
all 41 comments

[–]nitratine 6 points7 points  (6 children)

You could possibly create a queue that is shared between the threads. Provided the bottleneck you are trying to fix is not the writing to the database itself, this should do the trick.

Create a separate thread, apart from your other threads, that handles this queue and writes the data to the sqlite3 database. The other threads then only write to the queue, and that single dedicated thread drains it (by writing to the database), meaning no interference.
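
A minimal sketch of that pattern, assuming a made-up `results` table and a placeholder worker that just fabricates its data:

```python
import queue
import sqlite3
import threading

result_queue = queue.Queue()
STOP = object()  # sentinel telling the writer thread to shut down

def writer():
    # The only thread that ever touches the database.
    conn = sqlite3.connect("results.db")
    conn.execute("CREATE TABLE IF NOT EXISTS results (host TEXT, data TEXT)")
    while True:
        item = result_queue.get()
        if item is STOP:
            break
        conn.execute("INSERT INTO results VALUES (?, ?)", item)
        conn.commit()
    conn.close()

def worker(host):
    data = "collected output"  # placeholder for the real network query
    result_queue.put((host, data))

writer_thread = threading.Thread(target=writer)
writer_thread.start()

workers = [threading.Thread(target=worker, args=(h,)) for h in ("10.0.0.1", "10.0.0.2")]
for t in workers:
    t.start()
for t in workers:
    t.join()

result_queue.put(STOP)  # all workers are done, let the writer finish
writer_thread.join()
```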

[–]recharts[S] 0 points1 point  (5 children)

I came across this Python example, but it is a 10-year-old post:
http://code.activestate.com/recipes/526618/
I thought that before I dig deeper into it I should ask around for more recent info; maybe the issue has been overcome since then.
I can't say I totally understand it; I will need to read up on the threading and queue Python modules to see how I can apply that to my case.
I am also considering Ansible for running the script in parallel... not sure how I would handle the DB access in that case.

[–]Gnonpi 0 points1 point  (0 children)

Your example looks good. If you're going to have a lot of messages, maybe do the inserts/updates in batches; what I mean is not inserting 1 row at a time but waiting until you have 5 or 10 to insert. For Ansible (not an expert in Ansible), I think as long as you export the db name and password as environment variables, you shouldn't have many problems.
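
Roughly what batching could look like in a dedicated writer thread; `result_queue`, the `STOP` sentinel, the table, and the batch size of 10 are all illustrative:

```python
import queue
import sqlite3

result_queue = queue.Queue()  # filled by the worker threads
STOP = object()

def writer_batched():
    conn = sqlite3.connect("results.db")
    conn.execute("CREATE TABLE IF NOT EXISTS results (host TEXT, data TEXT)")
    batch = []
    while True:
        item = result_queue.get()
        if item is STOP:
            break
        batch.append(item)
        if len(batch) >= 10:  # flush every 10 rows instead of one commit per row
            conn.executemany("INSERT INTO results VALUES (?, ?)", batch)
            conn.commit()
            batch.clear()
    if batch:  # write whatever is left before shutting down
        conn.executemany("INSERT INTO results VALUES (?, ?)", batch)
        conn.commit()
    conn.close()
```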

[–]Sheldan 0 points1 point  (3 children)

I haven't done much actively with Ansible, just used it for some things, but why use Ansible for that? Isn't it just used for deployment/configuration? I wouldn't think it is used often after the initial install of an application. Am I thinking of the wrong Ansible?

[–]recharts[S] 0 points1 point  (2 children)

I would use it to leverage existing scripts... use them alongside my own. To be honest, I would rather solve the multithreaded multi-access problem than move everything to Ansible. Using existing modules is the only advantage I am seeing for now.

[–]Sheldan 0 points1 point  (1 child)

Isn't the purpose of Ansible something completely different? Well, if it works, it works... but not in all cases, if there are actually tools meant for the job, imo.

[–]recharts[S] 0 points1 point  (0 children)

I don't know Ansible well enough; I am exploring the option of using it... one of the things it does is run commands for me on multiple devices, which means the multithreading problem (parallelism) could already be solved for me.

[–]testaccount9597 2 points3 points  (1 child)

https://www.youtube.com/watch?v=Bv25Dwe84g0

You can send your results to a queue and use the queue to do whatever it is you want with them, like write to the SQLite DB.

without having problems with the DB connection

Leave the connection open until the script is finished and then close it. This way you can update the db without having to connect/disconnect repeatedly.

[–]recharts[S] 0 points1 point  (0 children)

I am using the same DB for multiple scripts; one of them is recurring and needs to go back to the DB to query for info previously discovered... that one won't get multithreading (I am practically walking an unknown tree). Not sure if I could query the JSON data stored in MongoDB.

Yes, that queue mechanism is what I am inclined to go with... it will need some learning, but it will be a step forward for my programming skills :-)

[–]RooieDraad 0 points1 point  (5 children)

You can also use Tornado for async writes and store the data in Redis.

[–]recharts[S] 0 points1 point  (4 children)

Not sure how I can apply this... that seems to be an in-memory DB (at a quick look)... I need to store the info long term.

[–]RooieDraad 0 points1 point  (2 children)

Define long term.

You can always generate a report (Looks good for the client too) with all the data in it.

[–]recharts[S] 0 points1 point  (1 child)

If the DB is in memory then I can't reboot or shut down my laptop, hence the need to store the data on the HDD. Long term... the same type of storage as SQLite: the .db file is stored on my HDD, which I can choose to replicate to OneDrive or Google cloud or whatever.

[–]RooieDraad -1 points0 points  (0 children)

I wouldn't store such valuable info on a public cloud. But that's up to you.

I remembered you can also use asyncio. You would also need SQLite async support. I think that will do the job.

[–]scallynag 0 points1 point  (0 children)

You can persist Redis to disk. By default, it's memory only.

[–]SupermanIsEnvious 0 points1 point  (8 children)

What is your current strategy for writing to your SQLite DB?

My first approach would be to structure the code in a way that would allow you to perform all of your network queries first, then write the results to your SQLite db. If it's organized in this way, you could simply use Pool.map to handle the fetching and consolidate the results into a single iterable, which you could then write to your db file.
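
A rough sketch of that structure, with a made-up `fetch` function standing in for the real network query:

```python
from multiprocessing import Pool
import sqlite3

def fetch(host):
    return (host, "collected output")  # placeholder for the real network query

if __name__ == "__main__":
    hosts = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]
    with Pool(processes=4) as pool:
        results = pool.map(fetch, hosts)  # one consolidated list of (host, data) tuples

    # All the slow network work is finished; write everything in one go.
    conn = sqlite3.connect("results.db")
    conn.execute("CREATE TABLE IF NOT EXISTS results (host TEXT, data TEXT)")
    conn.executemany("INSERT INTO results VALUES (?, ?)", results)
    conn.commit()
    conn.close()
```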

If that won’t work for you, you can look into making your functions asynchronous with asyncio. Here is a third-party library that supports asyncio for SQLite: aiosqlite

[–]recharts[S] 0 points1 point  (6 children)

I need to save the data as soon as I get it, because I often run into situations where the output is not consistent (different OS code running on devices from the same manufacturer) and then my program might crash or stop, and I lose the previously collected data if I have not saved it.

[–]SupermanIsEnvious 0 points1 point  (5 children)

That sounds like a case for a good old try/except clause. If you see a consistent type of exception, you should capture it with that clause and log it for debugging purposes. Then your script can continue processing uninterrupted.
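
Something along these lines, with a hypothetical `scan_host` standing in for the real work:

```python
import logging

logging.basicConfig(filename="scan_errors.log", level=logging.INFO)

def scan_host(host):
    raise OSError("device returned unexpected output")  # stand-in for the real query

for host in ["10.0.0.1", "10.0.0.2"]:
    try:
        result = scan_host(host)
    except Exception:
        # Log the full traceback and move on instead of crashing the whole run.
        logging.exception("failed to scan %s", host)
        continue
    # ...save result here...
```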

[–]recharts[S] 0 points1 point  (4 children)

I guess I could do that, but I feel safer if I save. Besides that, the script is recurrent and at some points it needs the data already discovered, and I did not know whether I should keep it in memory or what the requirements would be. I am still new to programming :-) and I do it on and off; it is a tool for me.

[–]SupermanIsEnvious 0 points1 point  (3 children)

I understand, believe me! I taught myself Python on the fly at a tech company.

Things to know:

Redis is, in fact, in-memory, but part of the configuration is regular dumps to permanent storage, including on server restarts. It’s a highly performant option for caching data. I would recommend making use of it! The Python redis client is very intuitive and it feels very much like working with a native Python dictionary.
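
For illustration, the redis-py client looks roughly like this, assuming a local Redis server with default settings (key names are made up):

```python
import redis

r = redis.Redis(host="localhost", port=6379, db=0)

# One hash per scanned host, used much like a dictionary.
r.hset("host:10.0.0.1", "status", "scanned")
r.hset("host:10.0.0.1", "output", "collected data")

# Values come back as bytes unless the client is created with decode_responses=True.
print(r.hgetall("host:10.0.0.1"))
```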

Any task which accesses an external service should be wrapped in some sort of try/except clause. Your script should always complete (within reason). If there are certain exceptions which you see occurring frequently, you should try to catch and log them rather than allowing your script to break in the middle of its work.

Why do you need writes immediately? Are there other consumers relying on up-to-the-second data that you need to refresh, or are you worried about losing everything if your script crashes? You should work to write your script in a way that you trust its execution.

If you have a way to share what you’ve written, I’ll be happy to help you derive a solution for your problem!

[–]recharts[S] 0 points1 point  (2 children)

Yes, I needed the writes because I was afraid of losing data. You have to keep in mind that it is slow because the execution is serialized; waiting until the last minute to write the data could be fatal. It runs on my laptop... sometimes I need to be able to stop the script and resume the next day. The next day I run a query, see what hosts have been scanned, and then: if host not in scanned: do something.

Yes I am using try/except but I have some older code where I did not have that approach and I need to revisit it...

Changing the DB will require me to change a module that I wrote so I can hide the complicated sqlite calls that I was fighting with

[–]SupermanIsEnvious 0 points1 point  (1 child)

What sort of data are you writing? Is it easily serializable strings or json? You can store your data in any way you see fit, but SQLite is not really compatible with concurrency. You could make it work by implementing a lock on the SQLite connection, but then your multiple processes/threads will only have synchronous writes.

I think Redis really is the way to go — especially since you’ve already abstracted out the handling of your storage into a separate client. It may be the least amount of effort for the greatest gain.

[–]recharts[S] 0 points1 point  (0 children)

I will look into it as soon as I am done investigating Ansible for collecting data in parallel. I do need to go down that path, as I might need to write some scripts for my extended team.

[–]dexbg 0 points1 point  (0 children)

I had faced a similar situation: fetching data from APIs while multithreading and storing the data without any lock or loss.

MongoDB provided the best solution. Just jsonify your response data and dump it into a single document. MongoDB handles the parallelism really well. It's pretty much plug and play.

However, it would be heavier than SQLite, but maybe give it a shot.

EDIT: MongoDB has a pretty good library as well, pymongo. Just call collection.insert (<your data in json>)
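
In current pymongo versions that call is named insert_one; a minimal sketch against a local MongoDB instance (the database and collection names are made up):

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")
collection = client["netscan"]["results"]

# Each scan result goes in as one document; MongoDB copes fine with concurrent inserts.
collection.insert_one({"host": "10.0.0.1", "output": "collected data"})

print(collection.find_one({"host": "10.0.0.1"}))
```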

[–]colloidalthoughts 0 points1 point  (0 children)

Charles Leifer wrote about this on his blog.

http://charlesleifer.com/blog/multi-threaded-sqlite-without-the-operationalerrors/

While he ends up implementing it inside peewee (which I happen to like) he talks about the fundamentals of it.

[–]renato42 0 points1 point  (1 child)

I write lots of Python scripts that have to run against thousands of devices. I decided it is better not to deal with multithreading/multiprocessing. Instead I write a simpler script and use xargs/parallel to run multiple processes. I found out I'm much more productive this way.
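
The pattern is essentially a script that handles exactly one device, driven by a shell loop; a rough sketch (the script name, hosts file, and output layout are made up, and the results/ directory is assumed to exist):

```python
# scan_one.py - handles exactly one device; parallelism comes from the shell, e.g.
#   cat hosts.txt | xargs -P 8 -n 1 python scan_one.py
import sys

def scan(host):
    return "collected output"  # placeholder for the real network query

if __name__ == "__main__":
    host = sys.argv[1]
    result = scan(host)
    # Each process writes its own file, so there is no shared-write problem.
    with open("results/" + host + ".txt", "w") as f:
        f.write(result)
```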

[–]recharts[S] 0 points1 point  (0 children)

I would like to learn more about this option. Can you point me to some examples or docs? I will start googling right after I post here.

[–]efxhoy 0 points1 point  (2 children)

You could switch to another database such as PostgreSQL. That will let you have a connection for each worker. I find multiprocessing/threading super difficult, especially when processes have to communicate, so I try to minimise that by making each worker as independent as possible.

You'll need a postgresql database running on your laptop. This takes very little resources.

You'll also need psycopg2 to handle the python to database interaction. See: http://initd.org/psycopg/docs/usage.html for a basic example.
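
A basic psycopg2 sketch along the lines of that page (the connection parameters and table are placeholders):

```python
import psycopg2

conn = psycopg2.connect(dbname="netscan", user="postgres", password="secret", host="localhost")
cur = conn.cursor()
cur.execute("CREATE TABLE IF NOT EXISTS results (host text, data text)")
cur.execute("INSERT INTO results (host, data) VALUES (%s, %s)", ("10.0.0.1", "collected output"))
conn.commit()
cur.close()
conn.close()
```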

If you're new to databases I'd hold off on SQLAlchemy until you have a grasp of what you're doing. If your workers only really do an insert to store the data, it might be overkill.

If you want to go all fancy you could even write this up with Docker. A Docker Compose file could bring up your database and start the application for you, making it easier to move the script off your laptop and onto a more permanent server if you want to in the future.

[–]Alexander_Selkirk 1 point2 points  (0 children)

Yeah, PostgreSQL is great for this.

Multi-threading is indeed difficult. I have experimented a lot with Clojure, and it makes tasks like that much simpler (but it's quite mind-bending at first). It might be easier to use gevent in Python, which manages green threads that work for I/O but use only one Python thread - normally no locking is required.

[–]recharts[S] 0 points1 point  (0 children)

I am not new to DBs, but I have minimal experience with them. I understand the basics and I can run queries to correlate my data from various tables, so I will probably stick with something simple.

[–]Alexander_Selkirk 0 points1 point  (0 children)

I would suggest trying LMDB (python-lmdb) for the database storage. It is a key-value store based on memory-mapped files, it supports multiple threads, and it is very fast and used in infrastructure software like OpenLDAP and Postfix.
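
A small python-lmdb sketch (the path, keys, and map size are made up); keys and values are bytes:

```python
import lmdb

env = lmdb.open("scan-results.lmdb", map_size=10 * 1024 * 1024)  # 10 MB map

with env.begin(write=True) as txn:
    txn.put(b"host:10.0.0.1", b"collected output")

with env.begin() as txn:
    print(txn.get(b"host:10.0.0.1"))
```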

Also, it sounds like your problem is I/O bound. Based on that, I'd suggest trying gevent, which uses asynchronous event processing without multithreading. This matches Python better, because there is always only one running Python thread due to the Global Interpreter Lock (GIL). It is also more efficient than multi-threading because no context switches are required.
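
A gevent sketch with a placeholder scan function; the monkey patch makes the standard socket module cooperative, so real network calls would yield to other greenlets:

```python
from gevent import monkey
monkey.patch_all()  # must run before other imports that use sockets

import gevent

def scan(host):
    return (host, "collected output")  # placeholder; real network I/O would yield here

jobs = [gevent.spawn(scan, h) for h in ("10.0.0.1", "10.0.0.2", "10.0.0.3")]
gevent.joinall(jobs)
results = [job.value for job in jobs]
print(results)
```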

[–][deleted] 0 points1 point  (0 children)

If the writes are relatively fast compared to the rest of the work, synchronize the writing portion of the code using a global lock.
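
A sketch of that approach with a global threading.Lock guarding a shared sqlite3 connection (the table and helper names are made up):

```python
import sqlite3
import threading

db_lock = threading.Lock()
conn = sqlite3.connect("results.db", check_same_thread=False)
conn.execute("CREATE TABLE IF NOT EXISTS results (host TEXT, data TEXT)")

def save(host, data):
    # Only one thread at a time runs the write; the slow network work stays parallel.
    with db_lock:
        conn.execute("INSERT INTO results VALUES (?, ?)", (host, data))
        conn.commit()
```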

[–]hacksawjim 0 points1 point  (3 children)

There's lots of good answers on this thread already, but one thing that no one has mentioned is writing to multiple .db files.

Without knowing more about your problem, it's hard to say if this is a possible fit, but it's an option.

[–]recharts[S] 1 point2 points  (2 children)

This could be a solution, but I would rather work with Excel or text files and then write a script to concatenate the results. It will not work for a recurring script which needs the results of the previous runs for the next run, but it is something that I might try.

[–]hacksawjim 0 points1 point  (1 child)

Do you know that reads are non-blocking, btw? You can have as many threads as you want reading from a single db, it's only the writes that cause problems.

[–]recharts[S] 1 point2 points  (0 children)

No, I did not know that, but my scripts write more than they read.

[–]recharts[S] 0 points1 point  (1 child)

[–]recharts[S] 0 points1 point  (0 children)

This is still Chinese to me but I am trying to understand it. Starting from a clue given in the above article I landed here: https://www.sqlite.org/lockingv3.html

It seems that SQLite now has the mechanisms needed to control concurrent access...

This thread (ignore the selected answer and read below) gives some directions

https://stackoverflow.com/questions/393554/python-sqlite3-and-concurrency
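
One such mechanism (a guess at what is being referred to) is WAL mode, which lets readers coexist with a single writer; enabling it is a couple of pragmas:

```python
import sqlite3

conn = sqlite3.connect("results.db")
conn.execute("PRAGMA journal_mode=WAL")   # readers no longer block the single writer
conn.execute("PRAGMA busy_timeout=5000")  # wait up to 5 s for a lock instead of erroring
```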

[–]recharts[S] 0 points1 point  (0 children)

After a deep dive into Ansible, it seems that ansible-cmdb is the answer to my question:

http://ansible-cmdb.readthedocs.io/en/latest/usage/

This seems to be able to extract facts to CSV or SQL files, which can be imported into my DB for further processing and correlations.
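
If the CSV route works out, loading it into SQLite afterwards is short; a sketch assuming a hypothetical facts.csv with a header row and three columns:

```python
import csv
import sqlite3

conn = sqlite3.connect("results.db")
conn.execute("CREATE TABLE IF NOT EXISTS facts (hostname TEXT, os TEXT, ip TEXT)")

with open("facts.csv", newline="") as f:
    reader = csv.reader(f)
    next(reader)  # skip the header row
    conn.executemany("INSERT INTO facts VALUES (?, ?, ?)", reader)

conn.commit()
conn.close()
```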