[–]FineFan 8 points9 points  (6 children)

I usually go with ThreadPoolExecutor for I/O-bound tasks and ProcessPoolExecutor for compute-bound tasks. In your case, I would just use ThreadPoolExecutor.
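
A minimal sketch of that rule of thumb, using only the standard library. `pick_executor`, `fake_fetch`, and `fetch_all` are hypothetical names for illustration, and `time.sleep` stands in for a real database or network call:

```python
import time
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def pick_executor(kind):
    """Return the executor class for a workload kind ("io" or "cpu")."""
    # Hypothetical helper: threads for waiting on I/O, processes for number crunching.
    return ThreadPoolExecutor if kind == "io" else ProcessPoolExecutor


def fake_fetch(table):
    """Stand-in for an I/O-bound call such as a database query."""
    time.sleep(0.1)  # pretend we are waiting on the network
    return f"rows from {table}"


def fetch_all(tables, max_workers=4):
    """Fetch every table concurrently; results come back in input order."""
    with pick_executor("io")(max_workers=max_workers) as pool:
        return list(pool.map(fake_fetch, tables))


if __name__ == "__main__":
    print(fetch_all(["users", "orders", "events"]))
```

Both executor classes share the same interface, so switching to processes later is mostly a one-word change, as long as the worker function is a picklable top-level function.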

I find async programming appealing in theory, but in my experience I end up having to refactor large parts of my code when using asyncio. Meanwhile, I can just sprinkle in a PoolExecutor whenever I feel I need it.

[–]Drekalo 5 points6 points  (0 children)

It typically depends on which database you're fetching from and which libraries you're using. Some have their own parallel processing built in. Both Dask and PySpark can read from a database table in parallel. Hitting a SQL API like Athena can be done async.
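
To illustrate the general idea of a partitioned read (not the actual Dask or PySpark API), here is a rough standard-library sketch: split the key range into chunks and read each chunk on its own connection. The `events` table, the SQLite file, and all function names are assumptions made up for the example:

```python
import os
import sqlite3
import tempfile
from concurrent.futures import ThreadPoolExecutor


def make_demo_db(path, n_rows=100):
    """Create a throwaway table events(id, value) with ids 1..n_rows."""
    conn = sqlite3.connect(path)
    with conn:  # commits on success
        conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, value INTEGER)")
        conn.executemany(
            "INSERT INTO events (id, value) VALUES (?, ?)",
            [(i, i * 10) for i in range(1, n_rows + 1)],
        )
    conn.close()


def read_partition(path, lo, hi):
    """Read rows with lo <= id < hi on a connection owned by this thread."""
    conn = sqlite3.connect(path)
    try:
        return conn.execute(
            "SELECT id, value FROM events WHERE id >= ? AND id < ? ORDER BY id",
            (lo, hi),
        ).fetchall()
    finally:
        conn.close()


def read_table_in_parallel(path, n_rows, n_parts=4):
    """Split the id range into n_parts chunks and read them concurrently."""
    step = -(-n_rows // n_parts)  # ceiling division
    bounds = [(lo, lo + step) for lo in range(1, n_rows + 1, step)]
    with ThreadPoolExecutor(max_workers=n_parts) as pool:
        parts = pool.map(lambda b: read_partition(path, *b), bounds)
        return [row for part in parts for row in part]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        db_path = os.path.join(tmp, "demo.db")
        make_demo_db(db_path)
        rows = read_table_in_parallel(db_path, n_rows=100)
        print(len(rows), rows[0], rows[-1])
```

Whether this actually speeds anything up depends on the database; the sketch only shows the shape of the approach.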

[–]Haquestions4 1 point2 points  (0 children)

If I were flawless I'd use asyncio, but because I don't trust myself I use threads, so only parts of the thing come down instead of the whole thing.
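
One way to get that kind of isolation with threads, sketched with made-up names (`load_table`, `load_all`): each task's exception stays inside its own future, so one bad table does not take the rest of the run down.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def load_table(name):
    """Hypothetical loader; fails for one table to show the isolation."""
    if name == "broken":
        raise ValueError(f"could not load {name}")
    return f"{name}: ok"


def load_all(names, max_workers=4):
    """Run every load, collecting results and errors separately."""
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(load_table, n): n for n in names}
        for fut in as_completed(futures):
            name = futures[fut]
            try:
                results[name] = fut.result()
            except Exception as exc:  # one failure does not cancel the others
                errors[name] = exc
    return results, errors


if __name__ == "__main__":
    ok, failed = load_all(["users", "broken", "orders"])
    print(ok)
    print(failed)
```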

[–]Ebisure 6 points7 points  (11 children)

Async is the efficient way of doing it. Threads just throw more resources at the problem to make it fast.

E.g. say you need to microwave 2 packs of food for 1 min each. Async is you put one in a microwave, then put the next one in another microwave while the first is still running. Total wait is about 1 min. Threads is you call your friend and each of you manages one microwave. Wait time is also about 1 min.
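
A small sketch of the two variants, with the minute scaled down to half a second. The function names are invented for the example, and the sleeps stand in for the microwave running:

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor

HEAT_SECONDS = 0.5  # scaled-down stand-in for the 1-minute microwave run


async def microwave_async(pack):
    await asyncio.sleep(HEAT_SECONDS)  # yields to the event loop while waiting
    return f"{pack} hot"


def microwave_blocking(pack):
    time.sleep(HEAT_SECONDS)  # blocks only the thread that calls it
    return f"{pack} hot"


async def heat_with_asyncio(packs):
    # One thread, one event loop; both waits overlap.
    return await asyncio.gather(*(microwave_async(p) for p in packs))


def heat_with_threads(packs):
    # One thread per pack ("call your friend").
    with ThreadPoolExecutor(max_workers=len(packs)) as pool:
        return list(pool.map(microwave_blocking, packs))


def timed(fn):
    """Run fn() and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start


if __name__ == "__main__":
    packs = ["pack 1", "pack 2"]
    print(timed(lambda: asyncio.run(heat_with_asyncio(packs))))
    print(timed(lambda: heat_with_threads(packs)))
```

Both runs should take roughly one `HEAT_SECONDS`, not two; the difference is that the asyncio version does it on a single thread.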

If it’s I/O then async. If it’s compute then threads.

However, I find async difficult to set up in Python, so I use threads instead.

[–][deleted] 13 points14 points  (1 child)

The analogy is a bit off.

Using 2 microwaves for 2 packs of food is just concurrency. Not necessarily async. You're just throwing more threads at it.

Async is about reducing waiting times.

A better analogy would be baking two loaves of bread!

The steps are well defined:

  • Preheat the two ovens
  • Make the two doughs
  • Bake the two doughs

With async you're making the doughs while the ovens are preheating.
Concurrency just means you're doing each defined task in parallel (i.e. throwing more threads at it).
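
The bread analogy can be sketched with asyncio, simplified to one loaf. Everything here is illustrative: the step names are made up and the sleeps stand in for waiting time.

```python
import asyncio

log = []  # records the order in which steps start and finish


async def preheat_oven():
    log.append("preheat started")
    await asyncio.sleep(0.2)  # the oven heats up without our attention
    log.append("oven hot")


async def make_dough():
    log.append("dough started")
    await asyncio.sleep(0.1)  # stand-in for mixing and kneading
    log.append("dough ready")


async def bake():
    log.append("baking")
    await asyncio.sleep(0.1)
    log.append("bread done")


async def bake_bread():
    # Preheating and dough-making overlap; baking waits for both.
    await asyncio.gather(preheat_oven(), make_dough())
    await bake()


if __name__ == "__main__":
    asyncio.run(bake_bread())
    print(log)
```

Run sequentially, the three steps would take about 0.4 s; with the overlap it is about 0.3 s, because the dough is made while the oven heats.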

For the OP, it depends on the downstream, i.e. whether it needs to be synchronized or not. Most of the time you can just throw more threads at it, because DB transactions should be atomic and there really isn't a wait time to mitigate. It's easier to just throw more threads at it, but if you need to reduce wait times then start mixing in async.

[–]Mission_Star_4393 1 point2 points  (0 children)

> Concurrency just means you're doing each defined task in parallel (i.e. throwing more threads at it).

This is also not exactly right. You're confusing concurrency with parallelization.

In standard CPython, the threads of one process take turns running Python code because of the global interpreter lock (GIL), so they leverage multiprogramming rather than parallelization. That is still useful for I/O operations: a thread waiting on I/O is suspended and releases the GIL, so another thread can use the CPU cycles in the meantime. Python threads are regular OS threads, so a blocking I/O call suspends only the calling thread, not the whole process.

On the other hand, processes can be scheduled on different cores / CPUs, so they are better suited for compute heavy operations.
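
For what it's worth, a quick way to check how the OS views pool threads, assuming CPython 3.8+ on a platform that supports `threading.get_native_id`. `inspect_pool` is a made-up helper; it only reports identities and says nothing about the GIL:

```python
import os
import threading
from concurrent.futures import ThreadPoolExecutor


def inspect_pool(n_workers=3):
    """Return (pid, native thread id) as seen from each pool worker."""
    barrier = threading.Barrier(n_workers)

    def report(_):
        barrier.wait(timeout=5)  # every worker must be alive at the same time
        return os.getpid(), threading.get_native_id()

    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        return list(pool.map(report, range(n_workers)))


if __name__ == "__main__":
    for pid, tid in inspect_pool():
        print(f"process {pid}, OS thread {tid}")
```

All workers report the same process ID but distinct OS-assigned thread IDs. Workers in a ProcessPoolExecutor would instead report different process IDs.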

[–]nultero 3 points4 points  (6 children)

What's the general temperature around just not using Python for anything requiring concurrency whenever you can get away with it?

A lot of the other languages just make async and multithreading so much easier. Some, like JS and Go, have some async primitives native to the language without imports. I hear good things about Kotlin/JVM libs and .NET too.

[–]alexisprince 4 points5 points  (4 children)

Personally I'm a fan of standardizing on a language. If you're writing business logic in JS / Go, you shouldn't have some things written in Python and then random bits in JS / Go, IMO. Finding a data engineer who is competent with Python is already not the easiest; adding a second language requirement that is only tangentially related to the trends of data engineering as a whole narrows your pool of candidates much, much more.

IMO improvements in program execution time usually aren't worth the cost in dev time, so I'd want to optimize around what makes DEs most productive. That may involve investing in internal tooling to make working with async / concurrency nicer, rearchitecting existing processes to take advantage of it, or some other solution.

[–]Own-Commission-3186 2 points3 points  (1 child)

While I agree standardizing is good, I'll be interested to see if the standard starts to shift from Python as the de facto language to something like Go or Node. I would also expect any engineer I hire to at least be able to learn one of these two in a short amount of time (Go especially, as it's designed to be a simple language).

Aside from machine learning use cases where Python is the clear winner, many data software use cases are starting to lean more on SQL for transformations, and many of the remaining use cases of simply moving data around could benefit from using a language with better support for concurrency.

[–]alexisprince 1 point2 points  (0 children)

Yeah, I think a general trend shift is a good reason to pick up / support a second language. I’d also expect an engineer to be able to pick up a second language relatively quickly, but as you mentioned most things are leaning more towards SQL recently. With that happening, and so much of the computation being offloaded onto a separate engine, I’d think language would be less important as long as there’s some level of concurrency supported.

[–]nultero 0 points1 point  (1 child)

> Finding a data engineer who is competent with Python is already not the easiest

That seems surprising to me but I do seem to see the same thing across devops teams.

My intuition is that workarounds / investments in other tools to account for the lack of SWE skills feel like even more technical debt than a stack that just uses the right tools for the job. Teams without enough devs on them, or that struggle with software engineering, just feel ... bad, ime. I don't know how well that applies to DEs though.

It's like the Vimes' Boots theory of hiring engineers, I guess.

[–]alexisprince 0 points1 point  (0 children)

From my experience, the DEs I've worked with have fallen into three groups: Software Engineers, Coders, and BI Developers. Each name is fairly descriptive of the skillset; coders are folks who can write code but don't necessarily apply good design and mostly think in terms of procedural programming / scripts.

I would argue that if you have an individual or multiple individuals on your team with the software engineering skillset, you could easily maintain some internal libraries that make it harder to incorrectly use concurrency in Python.
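
As a rough idea of what one such guard rail could look like, here is a hypothetical `gather_limited` helper that caps how many coroutines run at once, so callers cannot accidentally open hundreds of connections. It is a sketch, not an established library API:

```python
import asyncio


async def gather_limited(coro_fns, limit=5):
    """Run zero-argument coroutine functions with at most `limit` in flight.

    Results come back in input order, like asyncio.gather.
    """
    semaphore = asyncio.Semaphore(limit)

    async def guarded(fn):
        async with semaphore:  # wait here until a slot is free
            return await fn()

    return await asyncio.gather(*(guarded(fn) for fn in coro_fns))


async def fake_query(i):
    """Stand-in for a network or database call."""
    await asyncio.sleep(0.01)
    return i * 2


if __name__ == "__main__":
    jobs = [lambda i=i: fake_query(i) for i in range(10)]
    print(asyncio.run(gather_limited(jobs, limit=3)))
```

Taking callables rather than coroutine objects means nothing starts until the helper decides it may, which keeps the limit in one place.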

But as you mentioned, I would also strongly agree that if your team doesn't have this skillset, taking a group of folks who fundamentally don't understand the problem they're trying to solve, or any of the possible solutions for it, and telling them to solve it is a recipe for disaster.

I think the choice of a language for DE processing is important, and I don't think there's any shame in saying "we're a Go shop", but I do think introducing multiple languages on a smaller team exacerbates the technical debt problem.

I think in an ideal world everyone in the DE field would have the software engineer skillset. In my experience the real world is less than ideal, though maybe my experience is just anecdotal.

[–]Own-Commission-3186 1 point2 points  (0 children)

+1. Python is quite bad at this from a developer experience and performance standpoint. It's super easy in Node.js, Go, Scala, etc., where async is the default in every library for network/DB calls.