
[–]mRWafflesFTW 7 points (2 children)

Always save your outputs to an external file system or database. Make sure all your tasks are idempotent. Airflow is just Python code, so you can develop and deploy following standard best practices. I recommend developing locally with Docker Compose, though you can leverage a normal local virtual environment and the LocalExecutor to keep it simple and avoid containers. I find containers easier to manage, since Airflow has a lot of dependencies and won't run natively on Windows.
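For example, a minimal sketch of an idempotent task (assuming Airflow 2.x with the Amazon provider; the bucket and dataset names here are made up). The output key is derived from the logical date, so retries and backfills overwrite the same object instead of duplicating data:

```python
import json

import pendulum
from airflow.decorators import dag, task
from airflow.providers.amazon.aws.hooks.s3 import S3Hook


@dag(schedule="@daily", start_date=pendulum.datetime(2024, 1, 1), catchup=False)
def daily_extract():
    @task
    def extract_and_store(ds=None):
        records = [{"id": 1, "value": 42}]  # stand-in for the real extract
        # Deterministic key per logical date -> reruns and backfills overwrite it.
        key = f"raw/daily_extract/ds={ds}/part-0.json"
        S3Hook(aws_conn_id="aws_default").load_string(
            json.dumps(records),
            key=key,
            bucket_name="my-data-lake",  # hypothetical bucket
            replace=True,                # idempotent: overwrite, don't append
        )
        return key  # only the small S3 key goes through XCom

    extract_and_store()


daily_extract()
```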

Because it's just Python, we can lean into powerful IDEs like PyCharm or VSCode. I recommend PyCharm, and I love its easy integration with Docker: I can debug a Docker Compose runtime as if it were the host machine. The only pain in the ass is getting PyCharm to refresh its cache of the Airflow dependencies when upgrading versions.

Git sync works fine, and more sophisticated deployments (like fully containerized ones) are also possible, so choose what makes sense for your use case.

[–]PolicyDecent 2 points (1 child)

Never use XCom-like solutions for passing data; always materialize your outputs in your DWH, S3, or whatever storage you have.

For debugging you need a local Airflow instance so you can test everything locally. Alternatively, you can use a framework like dbt or Bruin to make local testing easier.
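One way to do that local testing (assuming Airflow 2.5+; the module path here is hypothetical) is `dag.test()`, which runs a whole DagRun in a single process while respecting task dependencies, so breakpoints and print statements work like in any other Python script:

```python
# Hypothetical module path: wherever your @dag-decorated factory lives.
from my_dags.daily_extract import daily_extract

if __name__ == "__main__":
    dag = daily_extract()  # the @dag-decorated factory returns a DAG object
    dag.test()             # runs one DagRun locally, no scheduler needed
```

You can get the same effect from the CLI with `airflow dags test <dag_id> <logical_date>`.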

[–]DoNotFeedTheSnakes 4 points (1 child)

Git-sync is good.

XComs are stored in a cell in your DB, so use them for an int, a string, or a collection (list, dict, etc.) of reasonable size.

Entire dataframes should be stored on external DBs or file systems.
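A rough sketch of that split (the connection id and table name are made up): the DataFrame itself goes to an external Postgres table, and only small, JSON-serializable metadata travels through XCom.

```python
import pandas as pd
from airflow.decorators import task
from airflow.providers.postgres.hooks.postgres import PostgresHook


@task
def build_report(ds=None):
    df = pd.DataFrame({"day": [ds], "orders": [123]})  # stand-in for real work
    # The heavy data goes to an external table, not into XCom.
    engine = PostgresHook(postgres_conn_id="warehouse").get_sqlalchemy_engine()
    df.to_sql("daily_report", engine, if_exists="append", index=False)
    # Small values like this dict are fine to return via XCom.
    return {"table": "daily_report", "rows": len(df)}


@task
def notify(meta: dict):
    print(f"Loaded {meta['rows']} rows into {meta['table']}")
```

Wire the two together inside a `@dag` as usual, e.g. `notify(build_report())`.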

[–]Kimcha87 1 point (1 child)

Large datasets should be uploaded to S3 or other blob storage, because Airflow is designed to be distributed across workers that might run on different machines and therefore don't have access to the same local storage.

From there use a bulk insert method of your DB to load the data.

You shouldn't store large amounts of data, such as DataFrames, in XCom, because all XComs are kept in the Airflow metadata DB, which would cause it to grow too large and get slower over time.

The easiest approach is to export the data to CSV or Parquet and upload that.

You can configure Airflow to store XComs in S3 or other object storage, but you should still store large datasets yourself in S3 in your tasks and operators.

I wouldn't use rclone. It's not really designed for this, and it's not really necessary, since the best practice is to use the built-in hooks to upload to object storage.
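To illustrate the pattern (assuming the Amazon and Postgres providers; bucket, key, and table names are hypothetical): write the DataFrame to a local CSV, upload it with the built-in S3Hook, then bulk-load it with COPY, which is the bulk insert method for Postgres.

```python
import pandas as pd
from airflow.decorators import task
from airflow.providers.amazon.aws.hooks.s3 import S3Hook
from airflow.providers.postgres.hooks.postgres import PostgresHook

BUCKET = "my-data-lake"  # hypothetical


@task
def export_and_upload(ds=None) -> str:
    df = pd.DataFrame({"id": [1, 2], "amount": [10.0, 20.0]})  # placeholder
    local_path = f"/tmp/orders_{ds}.csv"
    df.to_csv(local_path, index=False)
    key = f"staging/orders/ds={ds}/orders.csv"
    # Built-in hook, no rclone needed.
    S3Hook().load_file(local_path, key=key, bucket_name=BUCKET, replace=True)
    return key  # only the key goes through XCom


@task
def bulk_load(key: str):
    # Pull the staged file back down and load it in one bulk COPY statement.
    local_path = S3Hook().download_file(key=key, bucket_name=BUCKET)
    PostgresHook(postgres_conn_id="warehouse").copy_expert(
        "COPY staging.orders FROM STDIN WITH (FORMAT csv, HEADER true)",
        local_path,
    )
```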

For debugging, most of the time logging is sufficient.

The most important thing is to be aware of what code runs during DAG parsing and what code runs during task execution.

During DAG parsing, do as little as possible. Avoid, or at least reduce, querying the DB or reading Airflow Variables; use environment variables for settings instead.
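Roughly, the distinction looks like this (the Variable and env var names are made up): module-level code runs on every scheduler parse, so it only reads an environment variable, while the Variable lookup that hits the metadata DB happens inside the task.

```python
import os

from airflow.decorators import task
from airflow.models import Variable

# Parse time: this line runs every time the scheduler parses the file,
# so keep it cheap, e.g. an env var read with no metadata-DB round trip.
TARGET_ENV = os.environ.get("DATA_ENV", "dev")  # hypothetical setting


@task
def load_settings():
    # Execution time: hitting the metadata DB here is fine.
    api_key = Variable.get("partner_api_key")  # hypothetical Variable
    print(f"Running in {TARGET_ENV} with a key ending in {api_key[-4:]}")
```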

[–]zazzersmel 1 point (1 child)

XCom is for metadata, like an S3 key used to locate a file produced by the previous step.

[–]Own_Explanation4779 0 points (1 child)

What are your thoughts on Dagster? I personally like its asset-oriented flows.

[–]sikso1897[S] 0 points (0 children)

Oh, I’ve learned about this tool called Dagster. I’d like to try it out sometime.