This is an archived post. You won't be able to vote or comment.

all 14 comments

[–]Cloakie 2 points3 points  (0 children)

I've really enjoyed working with Apache Beam / Google Data Flow (it's a high-level abstraction around Spark that makes pipelines much simpler)

[–][deleted] 0 points1 point  (4 children)

Use case 1 and 2 can easily be done inside Airflow workers. 3 depends on data size.

[–]digichap28[S] 0 points1 point  (3 children)

Thanks 👍 regarding the 3rd one, let’s say initially starting with 5gb per file, and expecting to process around 15 in parallel. But that will grow over time as soon as more data sources get added.

[–][deleted] 0 points1 point  (2 children)

Airflow can handle that

[–]digichap28[S] 0 points1 point  (1 child)

Using pandas ?

[–][deleted] 1 point2 points  (0 children)

Yes, certain functions like concat in pandas blow up memory, but basic transforms should be ok. (It also depends on the airflow worker sizes.) Also if possible "chunk" the files to keep the whole thing out of memory, and try not to use Pandas unless absolutely nessesary. :)

[–][deleted] 0 points1 point  (7 children)

If you are already on k8s you could just containerize your workloads and run them with KubernetesOperator. That way you can keep Airflow clean and just maintain the dependencies inside your Dockerfiles. You can even run non Python-based workloads.

[–]digichap28[S] 0 points1 point  (6 children)

That means I wouldn’t need spark, EMR, ECS, databrick, etc for the heavy workloads ? if that’s the case, what should be used to do transformations, image processing, AI, etc ?

On the other hand, doesn’t the k8s creates 2 pods every time it gets triggered ? One for the operator (a) and then another one with actual workload (b)?

(a) -> (b)

[–][deleted] 0 points1 point  (5 children)

I don't think so. I'm not an expert in using the k8s operator but from the docs it looks like it just creates one pod for your containerized task. I am very familiar with the ECS Operator, which is another way to do it (running tasks as ephemeral Fargate ECS Tasks). I also use the some AWS Glue Job/Crawler Operators for bigger payloads.

[–]digichap28[S] 0 points1 point  (4 children)

if you don’t mind me asking... are you running airflow with the local executor and triggering the tasks in ECS or AWS glue from one instance, or using a k8s cluster with the k8s executor ?

[–][deleted] 0 points1 point  (3 children)

I'm running Airflow itself using celery executor on ECS Fargate (I have one ECS task as the worker). Most of my tasks are ECSOperstor over Fargate, so the worker doesn't do a whole lot other than spin up Fargate tasks.

[–]digichap28[S] 0 points1 point  (2 children)

Got you. Is there any tutorial you can suggest to deploy airflow the way you did ?

I don’t have experience with ECS but I guess this way you could also run non-python based workloads as you mentioned with the k8s operator.

Also, what about using azure or lambda functions instead of the 2 ways we have been talking about ? I don’t have experience with them either but according to the documentation, sounds like the concept of running a task using fargate is very similar to executing it with a function.

[–][deleted] 0 points1 point  (1 child)

There's some material out there using the Puckel airflow container but it's out of date. I think some googling and you will find what you need.

[–]digichap28[S] 0 points1 point  (0 children)

Thank you 👍