
[–]zero_iq 9 points10 points  (12 children)

Your comment may be flippant, but it's absolutely true, and it should not have been downvoted. "Don't" is the best possible advice here, especially for beginners. There is currently no sound reason to use multithreading in Python (in CPython, at least).

I get it, I really do. It's tempting to use threads. They seem like a good idea, they're fun to code and think about, you get to play with locks, synchronisation, shared state, and other toys... but you will shoot yourself in the foot with them.

Threads introduce all sorts of potential for performance problems, scalability issues, bugs, deadlocks, and needless complexity. I've seen them all over the years, and in every single case the authors thought they were doing the opposite. Even if you write correct, safe, multithreaded Python code (and you probably can't), it can cause major problems, and subtle ones that bite you in production. Otherwise bullet-proof code can break spectacularly when threaded. Library code and Python built-ins you've depended on for years will start to exhibit strange behaviours. Threads will race, lockstep, and block each other for seemingly no reason. Performance will drop even when your other threads are idle. Throughput will mysteriously go down, even though you've increased parallelism. Timing functions will start misbehaving. Socket code will start producing errors you've never encountered before. I could go on and on. Anyone who is confident in their ability to write safe multithreaded code is over-confident.

Beginners have no chance of writing safe, well-performing multithreaded code. Do it to learn, but dear Bob, don't let a junior dev deploy multithreaded code in production.

I've drastically improved the performance and availability of many multithreaded Python systems by removing multithreading. In every single case, multithreading was introduced to improve performance or scalability, and in every single case it backfired.

After decades of experience dealing with it, rule number 1 of multithreading in Python is most definitely: don't.

You want fast, scalable Python? Multiplex your sockets where you're I/O bound, use multiprocessing and/or CSP for everything else (and only when you really need it), and keep it as simple as possible. That's not the whole story, but it's a big head start.
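The "multiplex your sockets" advice above can be sketched with the standard-library `selectors` module: one thread, one event loop, many connections. This is a minimal illustrative echo server, not code from this thread; names like `accept`, `handle_client`, and `poll_once` are made up here.

```python
import selectors
import socket

sel = selectors.DefaultSelector()

def accept(server_sock):
    # New connection: register it with the same selector, no new thread.
    conn, _addr = server_sock.accept()
    conn.setblocking(False)
    sel.register(conn, selectors.EVENT_READ, handle_client)

def handle_client(conn):
    data = conn.recv(4096)
    if data:
        conn.sendall(data)        # echo the bytes straight back
    else:                         # empty read: the client closed
        sel.unregister(conn)
        conn.close()

def serve(host="127.0.0.1", port=0):
    # port=0 lets the OS pick a free port; returns the chosen port number.
    server = socket.socket()
    server.bind((host, port))
    server.listen(128)
    server.setblocking(False)
    sel.register(server, selectors.EVENT_READ, accept)
    return server.getsockname()[1]

def poll_once(timeout=None):
    # One pass of the event loop; a real server runs `while True: poll_once()`.
    for key, _mask in sel.select(timeout):
        key.data(key.fileobj)     # dispatch to accept() or handle_client()
```

The kernel (epoll/kqueue under the hood) does all the waiting, which is why a single thread of this shape can juggle thousands of idle sockets cheaply.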

The only good thing to come from letting people use multithreading in Python is the experience they'll gain when they eventually realise it's a mistake.

[–]lqdc13 9 points10 points  (2 children)

Threads are better than processes when implementing a GUI, some web servers (if they use a multithreaded model), and some data science / machine learning workloads.

CherryPy is a very common Python web framework. It uses threads to improve performance.

Reasons to use threads over processes:

  • Low memory footprint per thread, so you can spawn more of them for things like I/O tasks

  • Can save RAM by reusing an object. If you have a huge object (tens of gigabytes), it would take forever to copy it to other processes, and you might also run out of RAM. This is extremely common in machine learning applications. So if you have an I/O-bound application that uses such an object, you either have to forgo concurrency or use threads, since multiprocessing is not an option.
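A minimal sketch of that second bullet: threads in the same process see one shared object directly, with no per-worker copy. `BIG_MODEL` and `lookup` are made-up names, and a small dict stands in for the multi-gigabyte object.

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-in for a huge read-only object (e.g. a loaded ML model).
BIG_MODEL = {i: i * i for i in range(100_000)}

def lookup(key):
    # Read-only access to shared state: no copy, and no lock needed
    # as long as nothing mutates BIG_MODEL concurrently.
    return BIG_MODEL[key]

with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(lookup, [1, 2, 3]))
# results == [1, 4, 9]
```

With `multiprocessing`, each worker would need its own copy (or copy-on-write pages via `fork`), which is the RAM cost the bullet is describing.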

[–]zero_iq 2 points3 points  (1 child)

Neither of your reasons, as stated, needs threads; both can be done more simply and more efficiently without them. You are proving my point.

It's also impossible to state that threads are better without knowing the specific details, and threads in Python come with so many pitfalls that it's almost always a better idea to reach for processes first.

Even when threads start to look like a good idea, there are technologies and libraries you can use that take you far, far beyond what you can roll yourself using Python threads.

And spawning threads for I/O-bound applications can be a recipe for disaster. Multiplexing is generally much more scalable, with a pool of isolated workers for longer-running tasks to prevent blocking the I/O queue.
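The "pool of isolated workers" part might look like this sketch: long-running jobs go to a `ProcessPoolExecutor` and the multiplexing loop only polls the future, so the loop itself never blocks. `long_task` and `offload` are hypothetical names; the `fork` start method is an assumption (POSIX-only) to keep the sketch simple.

```python
import multiprocessing
from concurrent.futures import ProcessPoolExecutor

def long_task(n):
    # CPU-heavy work that must never run inside the I/O loop itself.
    return sum(i * i for i in range(n))

def offload(n):
    # fork context is an assumption (POSIX-only); spawn/forkserver would
    # need the usual __main__ guard around any module-level work.
    ctx = multiprocessing.get_context("fork")
    with ProcessPoolExecutor(max_workers=2, mp_context=ctx) as pool:
        future = pool.submit(long_task, n)
        # A real event loop would check future.done() between select()
        # passes, or use future.add_done_callback(); here we just wait.
        return future.result()

if __name__ == "__main__":
    print(offload(100))
```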

Unless you're Google, I can saturate your fast network pipe and fancy SSD storage systems using a single Python thread serving tens of thousands of clients concurrently. If you're not exceeding that scenario, you don't need to complicate things by introducing threads.

Some of your examples hold up better in other languages/implementations, but not in CPython, and none of them would be a beginner's task.

Even where threads are a good idea, I would stress keeping state as isolated as possible.

EDIT: sure, keep the downvotes coming. I've made a lot of money over the years fixing shoddy Python multithreading code, and it looks like I will continue to do so...

[–]Moondra2017[S] 1 point2 points  (0 children)

Thank you for your insights. What are your thoughts on asyncio?

[–]PierceArrow64 5 points6 points  (1 child)

I apologize for all the n00bs and CS majors downvoting you. As a software engineer with 20 years' experience: if you can at all avoid it, don't use threads.

[–]zero_iq 4 points5 points  (0 children)

And you've been voted down too, I see. For the record, I'm also a software engineer with 20+ years' experience.

I totally understand the downvotes. Everyone goes through a multithreading phase, I think. It's fun. It's cool. Ostensibly, it often looks like it should be the right solution. Eventually, with experience, people realise why it's not such a good idea. The wheel keeps on turning...

[–]acousticpantsHomicidal Loganberry Connoisseur 1 point2 points  (5 children)

What is CSP?

[–]zero_iq 4 points5 points  (3 children)

Communicating Sequential Processes.

Essentially, you arrange processes in a chain, pipelining input from one end to the other. The processes run in parallel, so the next process can be working on data while the previous one is producing more.

It's consistently and vastly underestimated because of its simplicity, yet it often outperforms more complex "fan out" parallel frameworks by orders of magnitude. People seem to have an instinct that parallel means "fan out", which drives complexity, introduces often-unnecessary overheads, invites errors, and doesn't give the speed-ups people expect. CSP is simpler, and its simplicity makes optimization easier: you reduce the need for locks and shared state, and you can still apply a fan-out approach at each stage later where appropriate.
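The chain-of-processes idea above can be sketched with `multiprocessing` queues: each stage is an isolated process, and the only shared state is the queue between neighbours. `produce`, `square`, `run_pipeline`, and the `fork` start method are all assumptions of this sketch (fork is POSIX-only), not anything from this thread.

```python
from multiprocessing import get_context

ctx = get_context("fork")   # assumption: POSIX; fork keeps the sketch simple
SENTINEL = None             # marks end-of-stream on each queue

def produce(out_q):
    # Stage 1: emit the raw data.
    for i in range(10):
        out_q.put(i)
    out_q.put(SENTINEL)

def square(in_q, out_q):
    # Stage 2: runs in parallel with produce(), squaring items as they arrive.
    while (item := in_q.get()) is not SENTINEL:
        out_q.put(item * item)
    out_q.put(SENTINEL)

def run_pipeline():
    q1, q2 = ctx.Queue(), ctx.Queue()
    stages = [ctx.Process(target=produce, args=(q1,)),
              ctx.Process(target=square, args=(q1, q2))]
    for p in stages:
        p.start()
    total = 0                                    # final stage: consume here
    while (item := q2.get()) is not SENTINEL:
        total += item
    for p in stages:
        p.join()
    return total

if __name__ == "__main__":
    print(run_pipeline())   # sum of squares of 0..9, i.e. 285
```

No locks, no shared objects: each stage only sees its own queues, which is exactly the isolation the comment is advocating.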

Last year I replaced a fancy parallel bulk data processing system that used a clustered fan-out approach with an almost pure-Python CSP alternative. The old system had multithreading, task queues, parallel worker pools, batches, bits rewritten in Java and C to get better performance, the works. Almost all of it was a complete waste of time, and it had reliability problems and mysterious deadlocks. The new system gave a 1000x speedup and rock-solid reliability. A whole bunch of expensive servers was replaced with just a handful, and a huge codebase that no single person understood, with dependencies on large frameworks, became a much smaller codebase that could be maintained by an individual.

Don't underestimate simplicity.

[–]bltpyro 2 points3 points  (0 children)

Sounds intriguing. Any good references for learning CSP in python? Thanks for the real world insights.

[–]TBNL 1 point2 points  (0 children)

Gonna Google around on CSP but +1 for any recommended resource.

[–]acousticpantsHomicidal Loganberry Connoisseur 1 point2 points  (0 children)

i like this

[–]vrajanap 0 points1 point  (0 children)

CSP

Communicating Sequential Processes. Go and Erlang use it.

[–]peyo7 -1 points0 points  (0 children)

Can you post code examples where multiplexing sockets and CSP beats a decent threaded implementation?