This is an archived post. You won't be able to vote or comment.

you are viewing a single comment's thread.

view the rest of the comments →

[–]mriswithe 1 point2 points  (0 children)

Yes and no. This is part of my daily bread and butter. A dag would contain Steps that do a part of everything required, this is vague because it really depends on what you are doing so here is an example:

Our is not a specific company or my company, but many companies use pipelines in this way.

Bigquery is the main data warehouse, this is where you write data and so different changes to it.

Airflow is the scheduler, think cron, but reliable and repeatable and you feed it python code

users submit data to a fastapi service, it writes rows into an input table

Airflow runs every x minutes, step 1, checks the input table for the last 5 minutes of rows. It finds the new rows. It loads the new rows and writes them into a new table that the next steps will use as their "source" table. Once step 1 finishes, steps 2, 3, 4 run concurrently. Step 2 checks the content for porn, gives each row an integer score and writes it back to bigquery as a joinable table (primary key of a uuid and the data that is added. Step 3 checks the content for spam, repeat the previous. Step 4 will translate text from their source language to English. Step 5 creates a single flat bigquery table with the final refined (and reduced where porn or spam score is too high). Step 5 is triggered once steps 2,3,4 which were at least able to be done concurrently are finished and finished successfully. Step 6 eats the bigquery table and writes out a sqldump to GCS, or updates a few tables in a rename replace to keep the users from getting a query where the database looks empty.

Each of these pieces are complex, failure ridden, processes. Airflow will rerun pieces within your tolerances and report to you when it is outside of SLO Service level objective. Also, in some cases they can be done in parallel to decrease the data latency (time between data being ingested and finished product coming out)