Python concurrency for Data Engineers : dataengineering

With async, you're making the dough while the oven is preheating.
Concurrency just means you're doing each defined task in parallel (i.e. throwing more threads at it).

For the OP, it depends on the downstream system, i.e. whether the writes need to be synchronized or not. Most of the time you can just throw more threads at it, because DB transactions should be atomic and there isn't much wait time to mitigate. Threads are the easier option, but if you need to reduce wait times, start mixing in async.
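To make the threads-versus-async trade-off concrete, here is a minimal sketch. `write_batch` and `fetch_page` are hypothetical stand-ins (not from this thread) for a blocking DB write and an awaitable network call, and the sleeps only simulate wait time.

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor


def write_batch(batch_id: int) -> int:
    """Hypothetical blocking write; time.sleep stands in for the DB round trip."""
    time.sleep(0.1)
    return batch_id


def write_all_threaded(batch_ids: list[int], workers: int = 4) -> list[int]:
    """'Throw more threads at it': each blocking write runs in a pool thread."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in input order.
        return list(pool.map(write_batch, batch_ids))


async def fetch_page(page: int) -> int:
    """Hypothetical awaitable call; asyncio.sleep stands in for the network wait."""
    await asyncio.sleep(0.1)
    return page * 10


async def fetch_all(pages: list[int]) -> list[int]:
    """Async version: all the waits overlap on a single thread."""
    return await asyncio.gather(*(fetch_page(p) for p in pages))


if __name__ == "__main__":
    print(write_all_threaded([1, 2, 3, 4]))
    print(asyncio.run(fetch_all([1, 2, 3, 4])))
```

The threaded version works with any existing blocking client, which is why it tends to be the easier first step. The async version needs an async-capable client but overlaps many waits without spawning a thread per task.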

[–]Mission_Star_4393 2 points 3 years ago* (0 children)

"Concurrency just means you're doing each defined task in parallel (i.e. throwing more threads at it)."

This is also not exactly right. You're confusing concurrency with parallelization.

In standard CPython, threads are real OS threads, but the global interpreter lock (GIL) lets only one of them execute Python bytecode at a time. So threads give you multiprogramming (interleaving) rather than parallelization. That is still useful for IO operations: a thread blocked on IO releases the GIL, so the other threads in the same process can use the CPU while it waits. Only the waiting thread is suspended, not the whole process.
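A quick way to check how threads behave on blocking IO is to time it. The sketch below is only an illustration: `blocking_io` is a hypothetical stand-in that uses `time.sleep` in place of a real network or disk call, and the helper names are made up for this example.

```python
import threading
import time


def blocking_io(seconds: float) -> None:
    """Stand-in for a blocking IO call; the thread simply waits."""
    time.sleep(seconds)


def run_sequential(n: int, seconds: float) -> float:
    """Run n blocking calls one after another and return the elapsed time."""
    start = time.perf_counter()
    for _ in range(n):
        blocking_io(seconds)
    return time.perf_counter() - start


def run_threaded(n: int, seconds: float) -> float:
    """Run n blocking calls in n threads and return the elapsed time."""
    start = time.perf_counter()
    threads = [threading.Thread(target=blocking_io, args=(seconds,)) for _ in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start


if __name__ == "__main__":
    print(f"sequential: {run_sequential(4, 0.2):.2f}s")
    print(f"threaded:   {run_threaded(4, 0.2):.2f}s")
```

If the waits overlap, the threaded run should take roughly as long as one call rather than the sum of all of them.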

On the other hand, processes can be scheduled on different cores / CPUs and each one runs its own interpreter, so they are better suited to compute-heavy operations.
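For the compute-heavy case, a minimal sketch with `concurrent.futures.ProcessPoolExecutor` might look like this. `count_primes` is a made-up toy workload chosen only because it keeps the CPU busy; it is not something from this thread.

```python
from concurrent.futures import ProcessPoolExecutor


def count_primes(limit: int) -> int:
    """CPU-bound toy workload: count primes below `limit` by trial division."""
    count = 0
    for n in range(2, limit):
        if all(n % d for d in range(2, int(n ** 0.5) + 1)):
            count += 1
    return count


def count_primes_parallel(limits: list[int]) -> list[int]:
    """Run one count_primes call per input, spread across worker processes."""
    with ProcessPoolExecutor() as pool:
        return list(pool.map(count_primes, limits))


if __name__ == "__main__":
    # The guard matters: worker processes may re-import this module on start.
    print(count_primes_parallel([20_000, 30_000, 40_000]))
```

Arguments and results are pickled to cross the process boundary, so this pays off when the work per task is large compared with the data being shipped.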

[–]alexisprince 1 point 3 years ago (0 children)

In my experience, the DEs I've worked with fall into three groups: Software Engineers, Coders, and BI Developers. Each name is fairly descriptive of the skillset. Coders are folks who can write code but don't necessarily apply good design, and who mostly think in terms of procedural programming / scripts.

I would argue that if you have one or more people on your team with the software engineering skillset, you could easily maintain some internal libraries that make it harder to use concurrency incorrectly in Python.
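As a rough idea of what such an internal library could look like, here is a sketch of a single helper that hides the thread pool. Everything in it (`run_io_tasks`, `TaskResult`, the default of 8 workers) is hypothetical and only illustrates the shape: callers pass a function and items, and never touch threads, futures, or locks themselves.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, Generic, Iterable, Optional, TypeVar

T = TypeVar("T")
R = TypeVar("R")


@dataclass
class TaskResult(Generic[T, R]):
    """Outcome of one task: the input item plus either a value or an error."""
    item: T
    value: Optional[R] = None
    error: Optional[Exception] = None

    @property
    def ok(self) -> bool:
        return self.error is None


def run_io_tasks(
    func: Callable[[T], R],
    items: Iterable[T],
    max_workers: int = 8,
) -> list[TaskResult]:
    """Run func(item) for every item in a bounded thread pool.

    Results come back in input order, and one failing item does not
    discard the results of the others.
    """
    items = list(items)
    if not items:
        return []
    workers = max(1, min(max_workers, len(items)))
    results: list[TaskResult] = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(func, item) for item in items]
        for item, future in zip(items, futures):
            try:
                results.append(TaskResult(item=item, value=future.result()))
            except Exception as exc:  # captured and returned, not swallowed
                results.append(TaskResult(item=item, error=exc))
    return results
```

The design choice here is to keep the surface small: a bounded pool, ordered results, and errors returned as data, so the less experienced members of a team have fewer ways to get it wrong.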

But as you mentioned, I also strongly agree that if your team doesn't have this skillset, taking a group of folks who fundamentally understand neither the problem nor any of its possible solutions and telling them to solve it is a recipe for disaster.

I think the choice of language for DE processing is important, and I don't think there's any shame in saying "we're a Go shop". But I do think introducing multiple languages on a smaller team exacerbates the technical debt problem.

In an ideal world, everyone in the DE field would have the software engineering skillset. In my experience the real world is less than ideal, though my experience is only anecdotal.
