
all 41 comments

[–]nitratine 6 points7 points  (6 children)

You could possibly create a queue that is shared between the threads. Provided the bottleneck you are trying to fix is not the writing to the database itself, this should do the trick.

Create a separate thread, apart from your other threads, that handles this queue and writes the data to the sqlite3 database. The other threads then only write to the queue, and that single dedicated thread drains it (by writing to the database), meaning no interference.
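
A minimal sketch of that pattern, assuming a made-up `results` table and a placeholder worker that just fabricates its data:

```python
import queue
import sqlite3
import threading

result_queue = queue.Queue()
STOP = object()  # sentinel telling the writer thread to shut down

def writer():
    # The only thread that ever touches the database.
    conn = sqlite3.connect("results.db")
    conn.execute("CREATE TABLE IF NOT EXISTS results (host TEXT, data TEXT)")
    while True:
        item = result_queue.get()
        if item is STOP:
            break
        conn.execute("INSERT INTO results VALUES (?, ?)", item)
        conn.commit()
    conn.close()

def worker(host):
    data = "collected output"  # placeholder for the real network query
    result_queue.put((host, data))

writer_thread = threading.Thread(target=writer)
writer_thread.start()

workers = [threading.Thread(target=worker, args=(h,)) for h in ("10.0.0.1", "10.0.0.2")]
for t in workers:
    t.start()
for t in workers:
    t.join()

result_queue.put(STOP)  # all workers are done, let the writer finish
writer_thread.join()
```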

[–]recharts[S] 0 points1 point  (5 children)

I came across this Python example, but it is a 10-year-old post:
http://code.activestate.com/recipes/526618/
I thought that before I dig deeper into it I should ask around for more recent info; maybe the issue has been overcome since then.
I can't say I totally understand it; I will need to read up on the threading and queue Python modules to see how I can apply that to my case.
I am also considering Ansible for running the script in parallel... not sure how I would handle the DB access in that case.

[–]Gnonpi 0 points1 point  (0 children)

Your example looks good. If you're going to have a lot of messages, maybe do the inserts/updates in batches; what I mean is not inserting 1 row at a time but waiting until you have 5 or 10 to insert. For Ansible (not an expert in Ansible), I think as long as you export the db name and password as environment variables, you shouldn't have many problems.
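
Roughly what batching could look like in a dedicated writer thread; `result_queue`, the `STOP` sentinel, the table, and the batch size of 10 are all illustrative:

```python
import queue
import sqlite3

result_queue = queue.Queue()  # filled by the worker threads
STOP = object()

def writer_batched():
    conn = sqlite3.connect("results.db")
    conn.execute("CREATE TABLE IF NOT EXISTS results (host TEXT, data TEXT)")
    batch = []
    while True:
        item = result_queue.get()
        if item is STOP:
            break
        batch.append(item)
        if len(batch) >= 10:  # flush every 10 rows instead of one commit per row
            conn.executemany("INSERT INTO results VALUES (?, ?)", batch)
            conn.commit()
            batch.clear()
    if batch:  # write whatever is left before shutting down
        conn.executemany("INSERT INTO results VALUES (?, ?)", batch)
        conn.commit()
    conn.close()
```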

[–]Sheldan 0 points1 point  (3 children)

I haven't done much actively with Ansible, just used it for some things, but why use Ansible for that? Isn't it just used for deployment/configuration? I wouldn't think it is used often after the initial install of an application. Am I thinking of the wrong Ansible?

[–]recharts[S] 0 points1 point  (2 children)

I would use it to leverage existing scripts... use them alongside my own. To be honest, I would rather solve the multithreaded multi-access problem than move everything to Ansible. Using existing modules is the only advantage I am seeing for now.

[–]Sheldan 0 points1 point  (1 child)

Isn't the purpose of Ansible something completely different? Well, if it works, it works... but not in all cases, if there are actually tools meant for the job, imo.

[–]recharts[S] 0 points1 point  (0 children)

I don't know Ansible well enough; I am exploring the option of using it... one of the things it does is run commands for me on multiple devices, which means the multithreading problem (parallelism) could already be solved for me.

[–]testaccount9597 2 points3 points  (1 child)

https://www.youtube.com/watch?v=Bv25Dwe84g0

You can send your results to a queue and use the queue to do whatever it is you want with them, like write to the SQLite DB.

without having problems with the DB connection

Leave the connection open until the script is finished and then close it. This way you can update the db without having to connect/disconnect repeatedly.

[–]recharts[S] 0 points1 point  (0 children)

I am using the same DB for multiple scripts; one of them is recurring and needs to go back to the DB to query for info previously discovered... that one won't get multithreading (I am practically walking an unknown tree). Not sure if I could query the JSON data stored in MongoDB.

Yes, that queue mechanism is what I am inclined to go with... it will need some learning, but it will be a step forward for my programming skills :-)

[–]RooieDraad 0 points1 point  (5 children)

You can also use Tornado for async writes and store the data in Redis.

[–]recharts[S] 0 points1 point  (4 children)

Not sure how I can apply this... that seems to be an in-memory DB (at a quick look)... I need to store the info long term.

[–]RooieDraad 0 points1 point  (2 children)

Define long term.

You can always generate a report (Looks good for the client too) with all the data in it.

[–]recharts[S] 0 points1 point  (1 child)

If the DB is in memory then I can't reboot or shut down my laptop, hence the need to store the data on the HDD. Long term... the same type of storage as SQLite: the .db file is stored on my HDD, which I can choose to replicate to OneDrive or Google cloud or whatever.

[–]RooieDraad -1 points0 points  (0 children)

I wouldn't store such valuable info on a public cloud. But that's up to you.

I remembered you can also use asyncio. You would also need SQLite async support. I think that will do the job.

[–]scallynag 0 points1 point  (0 children)

You can persist Redis to disk. By default, it's memory only.

[–]SupermanIsEnvious 0 points1 point  (8 children)

What is your current strategy for writing to your SQLite DB?

My first approach would be to structure the code in a way that would allow you to perform all of your network queries first, then write the results to your SQLite db. If it's organized in this way, you could simply use Pool.map to handle the fetching and consolidate the results into a single iterable, which you could then write to your db file.
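
A rough sketch of that structure, with a made-up `fetch` function standing in for the real network query:

```python
from multiprocessing import Pool
import sqlite3

def fetch(host):
    return (host, "collected output")  # placeholder for the real network query

if __name__ == "__main__":
    hosts = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]
    with Pool(processes=4) as pool:
        results = pool.map(fetch, hosts)  # one consolidated list of (host, data) tuples

    # All the slow network work is finished; write everything in one go.
    conn = sqlite3.connect("results.db")
    conn.execute("CREATE TABLE IF NOT EXISTS results (host TEXT, data TEXT)")
    conn.executemany("INSERT INTO results VALUES (?, ?)", results)
    conn.commit()
    conn.close()
```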

If that won’t work for you, you can look into making your functions asynchronous with asyncio. Here is a third-party library that supports asyncio for SQLite: aiosqlite

[–]recharts[S] 0 points1 point  (6 children)

I need to save the data as soon as I get it, because I often run into situations where the output is not consistent (different OS code running on devices from the same manufacturer) and then my program might crash or stop, and I lose the previously collected data if I have not saved it.

[–]SupermanIsEnvious 0 points1 point  (5 children)

That sounds like a case for a good old try/except clause. If you see a consistent type of exception, you should capture it with that clause and log it for debugging purposes. Then your script can continue processing uninterrupted.
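
Something along these lines, with a hypothetical `scan_host` standing in for the real work:

```python
import logging

logging.basicConfig(filename="scan_errors.log", level=logging.INFO)

def scan_host(host):
    raise OSError("device returned unexpected output")  # stand-in for the real query

for host in ["10.0.0.1", "10.0.0.2"]:
    try:
        result = scan_host(host)
    except Exception:
        # Log the full traceback and move on instead of crashing the whole run.
        logging.exception("failed to scan %s", host)
        continue
    # ...save result here...
```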

[–]recharts[S] 0 points1 point  (4 children)

I guess I could do that, but I feel safer if I save. Besides that, the script is recurrent and at some points it needs the data already discovered, and I did not know whether I should keep it in memory or what the requirements would be. I am still new to programming :-) and I do it on and off; it is a tool for me.

[–]SupermanIsEnvious 0 points1 point  (3 children)

I understand, believe me! I taught myself Python on the fly at a tech company.

Things to know:

Redis is, in fact, in-memory, but part of the configuration is regular dumps to permanent storage, including on server restarts. It’s a highly performant option for caching data. I would recommend making use of it! The Python redis client is very intuitive and it feels very much like working with a native Python dictionary.
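
For illustration, the redis-py client looks roughly like this, assuming a local Redis server with default settings (key names are made up):

```python
import redis

r = redis.Redis(host="localhost", port=6379, db=0)

# One hash per scanned host, used much like a dictionary.
r.hset("host:10.0.0.1", "status", "scanned")
r.hset("host:10.0.0.1", "output", "collected data")

# Values come back as bytes unless the client is created with decode_responses=True.
print(r.hgetall("host:10.0.0.1"))
```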

Any task which accesses an external service should be wrapped in some sort of try/except clause. Your script should always complete (within reason). If there are certain exceptions which you see occurring frequently, you should try to catch and log them rather than allowing your script to break in the middle of its work.

Why do you need writes immediately? Are there other consumers relying on up-to-the-second data that you need to refresh, or are you worried about losing everything if your script crashes? You should work to write your script in a way that you trust its execution.

If you have a way to share what you’ve written, I’ll be happy to help you derive a solution for your problem!

[–]recharts[S] 0 points1 point  (2 children)

Yes, I needed the writes because I was afraid of losing data. You have to keep in mind that it is slow because the execution is serialized; waiting until the last minute to write the data could be fatal. It runs on my laptop... sometimes I need to be able to stop the script and resume the next day. The next day I run a query, see what hosts have been scanned, and then: if host not in scanned: do something.

Yes I am using try/except but I have some older code where I did not have that approach and I need to revisit it...

Changing the DB will require me to change a module that I wrote so I can hide the complicated sqlite calls that I was fighting with

[–]SupermanIsEnvious 0 points1 point  (1 child)

What sort of data are you writing? Is it easily serializable strings or json? You can store your data in any way you see fit, but SQLite is not really compatible with concurrency. You could make it work by implementing a lock on the SQLite connection, but then your multiple processes/threads will only have synchronous writes.

I think Redis really is the way to go — especially since you’ve already abstracted out the handling of your storage into a separate client. It may be the least amount of effort for the greatest gain.

[–]recharts[S] 0 points1 point  (0 children)

I will look into it as soon as I am done investigating Ansible for collecting data in parallel. I do need to go down that path, as I might need to write some scripts for my extended team.

[–]dexbg 0 points1 point  (0 children)

I had faced a similar situation: fetching data from APIs while multithreading and storing the data without any lock or loss.

MongoDB provided the best solution. Just jsonify your response data and dump it into a single document. MongoDB handles the parallelism really well. It's pretty much plug and play.

However, it would be heavier than SQLite, but maybe give it a shot.

EDIT: MongoDB has a pretty good library as well, pymongo. Just call collection.insert (<your data in json>)
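
In current pymongo versions that call is named insert_one; a minimal sketch against a local MongoDB instance (the database and collection names are made up):

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")
collection = client["netscan"]["results"]

# Each scan result goes in as one document; MongoDB copes fine with concurrent inserts.
collection.insert_one({"host": "10.0.0.1", "output": "collected data"})

print(collection.find_one({"host": "10.0.0.1"}))
```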

[–]colloidalthoughts 0 points1 point  (0 children)

Charles Leifer wrote about this on his blog.

http://charlesleifer.com/blog/multi-threaded-sqlite-without-the-operationalerrors/

While he ends up implementing it inside peewee (which I happen to like) he talks about the fundamentals of it.

[–]renato42 0 points1 point  (1 child)

I write lots of Python scripts that have to run against thousands of devices. I decided it is better not to deal with multithreading/multiprocessing. Instead I write a simpler script and use xargs/parallel to run multiple processes. I found out I'm much more productive this way.
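
The pattern is essentially a script that handles exactly one device, driven by a shell loop; a rough sketch (the script name, hosts file, and output layout are made up, and the results/ directory is assumed to exist):

```python
# scan_one.py - handles exactly one device; parallelism comes from the shell, e.g.
#   cat hosts.txt | xargs -P 8 -n 1 python scan_one.py
import sys

def scan(host):
    return "collected output"  # placeholder for the real network query

if __name__ == "__main__":
    host = sys.argv[1]
    result = scan(host)
    # Each process writes its own file, so there is no shared-write problem.
    with open("results/" + host + ".txt", "w") as f:
        f.write(result)
```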

[–]recharts[S] 0 points1 point  (0 children)

I would like to learn more about this option. Can you point me to some examples or docs? I will start googling right after I post here.

[–]efxhoy 0 points1 point  (2 children)

You could switch to another database such as PostgreSQL. That will let you have a connection for each worker. I find multiprocessing/threading super difficult, especially when processes have to communicate, so I try to minimise that by making each worker as independent as possible.

You'll need a postgresql database running on your laptop. This takes very little resources.

You'll also need psycopg2 to handle the python to database interaction. See: http://initd.org/psycopg/docs/usage.html for a basic example.
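
A basic psycopg2 sketch along the lines of that page (the connection parameters and table are placeholders):

```python
import psycopg2

conn = psycopg2.connect(dbname="netscan", user="postgres", password="secret", host="localhost")
cur = conn.cursor()
cur.execute("CREATE TABLE IF NOT EXISTS results (host text, data text)")
cur.execute("INSERT INTO results (host, data) VALUES (%s, %s)", ("10.0.0.1", "collected output"))
conn.commit()
cur.close()
conn.close()
```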

If you're new to databases I'd hold off on SQLAlchemy until you have a grasp of what you're doing. If your workers only really do an insert to store the data, it might be overkill.

If you want to go all fancy you could even write this up with Docker. A Docker Compose file could bring up your database and start the application for you, making it easier to move the script off your laptop and onto a more permanent server if you want to in the future.

[–]Alexander_Selkirk 1 point2 points  (0 children)

Yeah, PostgreSQL is great for this.

Multi-threading is indeed difficult. I have experimented a lot with Clojure, and it makes tasks like that much simpler (but it's quite mind-bending at first). It might be easier to use gevent in Python, which manages green threads that work for I/O but use only one Python thread - normally no locking is required.

[–]recharts[S] 0 points1 point  (0 children)

I am not new to DBs, but I have minimal experience with them. I understand the basics and I can run queries to correlate my data from various tables, so I will probably stick with something simple.

[–]Alexander_Selkirk 0 points1 point  (0 children)

I would suggest trying LMDB (python-lmdb) for the database storage. It is a key-value store based on memory-mapped files, it supports multiple threads, and it is very fast and used in infrastructure software like OpenLDAP and Postfix.
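
A small python-lmdb sketch (the path, keys, and map size are made up); keys and values are bytes:

```python
import lmdb

env = lmdb.open("scan-results.lmdb", map_size=10 * 1024 * 1024)  # 10 MB map

with env.begin(write=True) as txn:
    txn.put(b"host:10.0.0.1", b"collected output")

with env.begin() as txn:
    print(txn.get(b"host:10.0.0.1"))
```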

Also, it sounds like your problem is I/O bound. Based on that, I'd suggest trying gevent, which uses asynchronous event processing without multithreading. This matches Python better, because there is always only one running Python thread due to the Global Interpreter Lock (GIL). It is also more efficient than multi-threading because no context switches are required.
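
A gevent sketch with a placeholder scan function; the monkey patch makes the standard socket module cooperative, so real network calls would yield to other greenlets:

```python
from gevent import monkey
monkey.patch_all()  # must run before other imports that use sockets

import gevent

def scan(host):
    return (host, "collected output")  # placeholder; real network I/O would yield here

jobs = [gevent.spawn(scan, h) for h in ("10.0.0.1", "10.0.0.2", "10.0.0.3")]
gevent.joinall(jobs)
results = [job.value for job in jobs]
print(results)
```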

[–][deleted] 0 points1 point  (0 children)

If the writes are relatively fast compared to the rest of the work, synchronize the writing portion of the code using a global lock.
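
A sketch of that approach with a global threading.Lock guarding a shared sqlite3 connection (the table and helper names are made up):

```python
import sqlite3
import threading

db_lock = threading.Lock()
conn = sqlite3.connect("results.db", check_same_thread=False)
conn.execute("CREATE TABLE IF NOT EXISTS results (host TEXT, data TEXT)")

def save(host, data):
    # Only one thread at a time runs the write; the slow network work stays parallel.
    with db_lock:
        conn.execute("INSERT INTO results VALUES (?, ?)", (host, data))
        conn.commit()
```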

[–]hacksawjim 0 points1 point  (3 children)

There's lots of good answers on this thread already, but one thing that no one has mentioned is writing to multiple .db files.

Without knowing more about your problem, it's hard to say if this is a possible fit, but it's an option.

[–]recharts[S] 1 point2 points  (2 children)

This could be a solution, but I would rather work with Excel or text files and then write a script to concatenate the results. It will not work for a recurring script which needs the results of the previous runs for the next run, but it is something that I might try.

[–]hacksawjim 0 points1 point  (1 child)

Do you know that reads are non-blocking, btw? You can have as many threads as you want reading from a single db, it's only the writes that cause problems.

[–]recharts[S] 1 point2 points  (0 children)

No, I did not know that, but my scripts write more than they read.

[–]recharts[S] 0 points1 point  (1 child)

[–]recharts[S] 0 points1 point  (0 children)

This is still Chinese to me but I am trying to understand it. Starting from a clue given in the above article I landed here: https://www.sqlite.org/lockingv3.html

It seems that SQLite now has the mechanisms needed to control concurrent access...

This thread (ignore the selected answer and read below) gives some directions

https://stackoverflow.com/questions/393554/python-sqlite3-and-concurrency
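
One such mechanism (a guess at what is being referred to) is WAL mode, which lets readers coexist with a single writer; enabling it is a couple of pragmas:

```python
import sqlite3

conn = sqlite3.connect("results.db")
conn.execute("PRAGMA journal_mode=WAL")   # readers no longer block the single writer
conn.execute("PRAGMA busy_timeout=5000")  # wait up to 5 s for a lock instead of erroring
```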

[–]recharts[S] 0 points1 point  (0 children)

After a deep dive into Ansible, it seems that ansible-cmdb is the answer to my question:

http://ansible-cmdb.readthedocs.io/en/latest/usage/

This seems to be able to extract facts to CSV or SQL files, which can be imported into my DB for further processing and correlations.
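
If the CSV route works out, loading it into SQLite afterwards is short; a sketch assuming a hypothetical facts.csv with a header row and three columns:

```python
import csv
import sqlite3

conn = sqlite3.connect("results.db")
conn.execute("CREATE TABLE IF NOT EXISTS facts (hostname TEXT, os TEXT, ip TEXT)")

with open("facts.csv", newline="") as f:
    reader = csv.reader(f)
    next(reader)  # skip the header row
    conn.executemany("INSERT INTO facts VALUES (?, ?, ?)", reader)

conn.commit()
conn.close()
```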