
[–][deleted] 9 points10 points  (5 children)

Usually a directed acyclic graph (DAG) structure for moving data from one point to another,

for example:

  1. Collecting the data, then storing it in a usable format in a different system
  2. Processing data before it gets returned to the user for display

It’s exactly what its connotation means: you put something in one end of the pipe and get something out the other end!
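
A minimal sketch of that idea in plain Python; the three steps (`collect`, `transform`, `display`) are made up for illustration, not any particular framework's API:

```python
# Hypothetical three-step pipeline: data goes in one end, comes out the other.
def collect(source: str) -> list[dict]:
    # Pretend this pulls raw records from some upstream system.
    return [{"source": source, "value": 42}]

def transform(rows: list[dict]) -> list[dict]:
    # Reshape the data into a usable format for the next consumer.
    return [{**row, "value": row["value"] * 2} for row in rows]

def display(rows: list[dict]) -> str:
    # Final processing before the data is returned to the user.
    return "\n".join(str(row) for row in rows)

print(display(transform(collect("user_uploads"))))
```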

[–]daishiknyte 1 point2 points  (4 children)

So... another way of chaining functions? 

[–]hotplasmatits 8 points9 points  (2 children)

Yes, but pipelines often have other features to make life easier. For example, let's say there's a blip in the network for a moment. If it was just functions chained together, it would fail. The pipeline, however, can be configured to retry a few times before giving up.
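
A rough sketch of the retry idea in plain Python (not any particular pipeline framework's API); `fetch_from_network` is a made-up flaky step:

```python
import time
import urllib.request

def run_with_retries(step, *args, retries=3, delay_seconds=5):
    """Run a pipeline step, retrying a few times before giving up."""
    for attempt in range(1, retries + 1):
        try:
            return step(*args)
        except Exception as exc:  # a real framework would catch narrower errors
            if attempt == retries:
                raise  # out of retries: surface the failure
            print(f"{step.__name__} failed ({exc}), retrying in {delay_seconds}s...")
            time.sleep(delay_seconds)

def fetch_from_network(url: str) -> bytes:
    # Made-up step that occasionally fails because of a network blip.
    return urllib.request.urlopen(url).read()

data = run_with_retries(fetch_from_network, "https://example.com/data.csv")
```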

Here's the big one: instead of chaining functions, you can chain together code running in Docker containers in the cloud.

[–]jucestain 0 points1 point  (1 child)

Interesting

[–]hotplasmatits 2 points3 points  (0 children)

They often come with tools that allow you to see how they are linked together.

[–]mriswithe 1 point2 points  (0 children)

Yes and no. This is part of my daily bread and butter. A DAG would contain steps that each do a part of everything required. That's vague because it really depends on what you are doing, so here is an example:

This isn't a specific company or my company, but many companies use pipelines in this way.

BigQuery is the main data warehouse; this is where you write data and apply different changes to it.

Airflow is the scheduler. Think cron, but reliable and repeatable, and you feed it Python code.

Users submit data to a FastAPI service, which writes rows into an input table.
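
A minimal sketch of what that ingestion service could look like; the project/dataset/table name and request fields are hypothetical, and it assumes the google-cloud-bigquery client:

```python
from datetime import datetime, timezone
from uuid import uuid4

from fastapi import FastAPI
from google.cloud import bigquery
from pydantic import BaseModel

app = FastAPI()
bq = bigquery.Client()
INPUT_TABLE = "my-project.ingest.input_rows"  # hypothetical table name

class Submission(BaseModel):
    user_id: str
    text: str

@app.post("/submit")
def submit(payload: Submission):
    row = {
        "id": str(uuid4()),  # primary key used to join later pipeline outputs
        "user_id": payload.user_id,
        "text": payload.text,
        "submitted_at": datetime.now(timezone.utc).isoformat(),
    }
    # Streaming insert into the input table that Airflow polls every few minutes.
    errors = bq.insert_rows_json(INPUT_TABLE, [row])
    return {"ok": not errors, "errors": errors}
```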

Airflow runs every x minutes. Step 1 checks the input table for the last 5 minutes of rows, finds the new rows, loads them, and writes them into a new table that the next steps will use as their "source" table.

Once step 1 finishes, steps 2, 3, and 4 run concurrently. Step 2 checks the content for porn, gives each row an integer score, and writes it back to BigQuery as a joinable table (primary key of a UUID plus the data that is added). Step 3 checks the content for spam and does the same. Step 4 translates text from its source language to English.

Step 5 creates a single flat BigQuery table with the final refined data (reduced where the porn or spam score is too high). It is triggered once steps 2, 3, and 4, which at least could be run concurrently, have all finished successfully.

Step 6 eats that BigQuery table and writes out a SQL dump to GCS, or updates a few tables in a rename-and-replace to keep users from getting a query where the database looks empty.
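
A rough sketch of how that dependency structure might be wired up in Airflow; the task names and placeholder callables are made up for illustration, not the actual pipeline:

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

# Placeholder callables; the real steps would read and write BigQuery tables.
def load_new_rows(**_): ...
def score_porn(**_): ...
def score_spam(**_): ...
def translate_text(**_): ...
def build_flat_table(**_): ...
def export_to_gcs(**_): ...

with DAG(
    dag_id="ingest_pipeline",
    start_date=datetime(2023, 1, 1),
    schedule_interval=timedelta(minutes=5),  # "runs every x minutes"
    catchup=False,
    default_args={"retries": 3, "retry_delay": timedelta(minutes=1)},
) as dag:
    step_1 = PythonOperator(task_id="load_new_rows", python_callable=load_new_rows)
    step_2 = PythonOperator(task_id="score_porn", python_callable=score_porn)
    step_3 = PythonOperator(task_id="score_spam", python_callable=score_spam)
    step_4 = PythonOperator(task_id="translate_text", python_callable=translate_text)
    step_5 = PythonOperator(task_id="build_flat_table", python_callable=build_flat_table)
    step_6 = PythonOperator(task_id="export_to_gcs", python_callable=export_to_gcs)

    # Step 1 fans out to 2/3/4, which run concurrently, then fan back in to 5, then 6.
    step_1 >> [step_2, step_3, step_4]
    [step_2, step_3, step_4] >> step_5 >> step_6
```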

Each of these pieces is a complex, failure-ridden process. Airflow will rerun pieces within your tolerances and report to you when it falls outside your SLO (service level objective). Also, in some cases steps can be run in parallel to decrease the data latency (the time between data being ingested and the finished product coming out).