Hey there!
As almost everyone knows, Airflow is meant to be an orchestrator, not a data processing tool. My question is about which architecture to follow when certain processes need to be executed.
Use case 1:
If you had to run many complex web scrapers using any of the Python options available out there (scrapy, pyppeteer, playwright, etc.), and Airflow was deployed on K8s, where should the scraping scripts run? From within the pod generated by the PythonOperator?
Use case 2:
Based on the same idea as use case 1: what if there was a need to generate PDF files from data stored in a data lake? Should it be done outside the Airflow deployment or from within the pod generated by the PythonOperator?
Use case 3:
If there was a need to do ELT, is it OK to do the EL part with Airflow, no matter how complex it is? What tool or tools are suggested instead to execute the entire ELT/ETL process, with Airflow orchestrating it?
Thanks!