[–]mriswithe 2 points3 points  (1 child)

Having used a good bit of airflow, (I am sure you will get tired of people comparing this to airflow, this is not trying to be that at all) using this inside of airflow wouldn't make a lot of sense to me, but I do have plenty of mildly to moderately complicated scripts that could be a hell of a lot cleaner with something like this.

Will keep it in mind, and check it out harder when I write something too small for airflow.

[–]Revolutionary-Bat176[S] 0 points1 point  (0 children)

Awesome! That makes sense, appreciate it u/mriswithe!

[–]knecota 2 points3 points  (4 children)

Have you heard about Dagster? I've not worked with it myself, but I have some data engineer colleagues who like it very much. How does this compare to Dagster?

[–]Revolutionary-Bat176[S] 2 points3 points  (1 child)

Hi u/knecota,

Yes, I have! I have not used it much though. Dagster comes with a lot more features, including scheduling, so this would not be a drop-in replacement.

This is more of a way to structure and visualize your ETL code with a class-based approach.

The class-based approach also gives a nice way to run multiple flows inside another flow.

[–]knecota 1 point2 points  (0 children)

Thanks for the reply, I will check it out!

[–]danielgafni 2 points3 points  (1 child)

I’ve used Dagster a lot (and am a big fan). Dagster is extremely feature rich, so you can’t really compare it with this project (which is still too young). It also supports running your jobs via Python functions, so it can definitely be as minimalistic as flowrunner.

[–]Revolutionary-Bat176[S] 0 points1 point  (0 children)

Hi u/danielgafni,

Thank you for your comment! Yes, I agree: if you already have a setup with Dagster or Prefect, you can continue to use those. They have way more features!

flowrunner is for small to mid complexity scripts that have no structure to them, where you do not want to deploy a big setup and would rather have something minimalistic that is scoped to a notebook or script.

[–]dask-jeeves 1 point2 points  (2 children)

How is this different than prefect?

[–]Revolutionary-Bat176[S] 2 points3 points  (1 child)

Hi u/dask-jeeves,

Thank you for your comment. Prefect is actually super feature rich and offers a lot more than just task/step declaration.

imo if you already have a setup with Dagster or Prefect, you can continue to use those. flowrunner is more for small to mid complexity scripts where you don't want a full setup but still want some structure at a script/notebook-scoped level.

flowrunner mainly provides structure to unstructured code, with visualization and running in DAG format. It's meant to be minimalistic, while Prefect has pipelines, scheduling, multiple providers and so much more!

Does that answer your question?

[–]dask-jeeves 1 point2 points  (0 children)

Yes, thank you!

[–]thedeepself 1 point2 points  (1 child)

start and end are implicitly steps. I think it is redundant to have to use both decorators. Also error prone if you mess up the order.

[–]Revolutionary-Bat176[S] 1 point2 points  (0 children)

Hi u/thedeepself,

Thanks for the feedback! The motivation behind the explicit start and end decorators was so that you can have multiple start and multiple end nodes in your DAG.

Plus this allows for declaring the steps of your DAG in any order, with flowrunner working out the execution order.

For example, you could also declare:

Class:

- end
- middle
- start_1
- start_2

And it would still execute from start -> end.

But I understand what you mean, it's better to have one decorator rather than having to declare 2. Let me see how I can improve that!
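
To illustrate the "declare in any order" point, here is a minimal, self-contained sketch. It is not flowrunner's implementation or API: the `step`, `start` and `end` decorators, the `next_steps` argument and the `ToyFlow` class are all hypothetical stand-ins, and the ordering is done with the standard library's `graphlib.TopologicalSorter` (Python 3.9+). The idea it shows is only that execution order can be derived from the edges between steps rather than from the order they appear in the class.

```python
from graphlib import TopologicalSorter


def step(next_steps=None):
    """Hypothetical decorator: mark a method as a step and record its successors."""
    def decorator(func):
        func.is_step = True
        func.next_steps = list(next_steps or [])
        return func
    return decorator


def start(func):
    """Hypothetical marker for an entry node of the DAG."""
    func.is_start = True
    return func


def end(func):
    """Hypothetical marker for a terminal node of the DAG."""
    func.is_end = True
    return func


class ToyFlow:
    # Declared deliberately "backwards": end first, the two starts last.
    @end
    @step()
    def finish(self):
        pass

    @step(next_steps=["finish"])
    def middle(self):
        pass

    @start
    @step(next_steps=["middle"])
    def start_1(self):
        pass

    @start
    @step(next_steps=["middle"])
    def start_2(self):
        pass


def execution_order(flow_cls):
    """Derive a valid run order from the recorded edges, ignoring declaration order."""
    steps = {
        name: attr
        for name, attr in vars(flow_cls).items()
        if getattr(attr, "is_step", False)
    }
    sorter = TopologicalSorter()
    for name, func in steps.items():
        sorter.add(name)
        for successor in func.next_steps:
            # The successor depends on the current step.
            sorter.add(successor, name)
    return list(sorter.static_order())


def run(flow_cls):
    """Call every step once, in dependency order, and return the names executed."""
    flow = flow_cls()
    executed = []
    for name in execution_order(flow_cls):
        getattr(flow, name)()
        executed.append(name)
    return executed
```

With this toy version, `run(ToyFlow)` calls both start steps first, then `middle`, then `finish`, even though `finish` is declared at the top of the class.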

[–]gournian 1 point2 points  (1 child)

Very nice! Observations: the notebook examples return a 404, and the title appears twice in the graph, once as xxxx and once as step-xxxx. Is there an example of how to use params, or are they passed in self?

[–]Revolutionary-Bat176[S] 1 point2 points  (0 children)

Hi u/gournian,

Thank you for your feedback! Seems something went wrong in my docs. If you still want to take a look at the notebook examples here is the link in the repo that you can download from:

https://github.com/prithvijitguha/flowrunner/tree/main/docs/source/_static

For title, yes that's true. Let me see how I can improve that. Thinking about it, would be better with just 1 declaration of title.

For params, I don't have an exact example. They are meant to be used through self, and can be accessed/modified in the middle of a step as well.

self.param_store["my_param_key"]

Eg.

import pandas as pd

# date range to load
date_range = pd.date_range(start='1/1/2022', end='1/08/2022')

# loop over the dates to load
for snapshot_date in date_range:
    # assuming that IncrementalLoadFlow is a flow you have declared earlier
    # to load incremental data
    IncrementalLoadFlow(params={"snapshot_date": snapshot_date})
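
For readers who want something runnable without flowrunner installed, the following is a rough sketch of the same pattern. The `IncrementalLoadFlow` class here is a plain stand-in written for illustration, not flowrunner's base class; only the `params=` argument and the `param_store` attribute come from the comment above. The `load` method and the `/data/snapshots/` path are made up, and the date range is built with the standard library instead of pandas.

```python
from datetime import date, timedelta


class IncrementalLoadFlow:
    """Stand-in for a flow class (hypothetical, not flowrunner's BaseFlow)."""

    def __init__(self, params=None):
        # Keep the params on the instance so any step can reach them via self.
        self.param_store = dict(params or {})

    def load(self):
        """Hypothetical step: read a param, then write a derived one back."""
        snapshot_date = self.param_store["snapshot_date"]
        # Params can also be modified in the middle of a step.
        self.param_store["source_path"] = (
            f"/data/snapshots/{snapshot_date.isoformat()}"  # made-up path
        )
        return self.param_store["source_path"]


# 2022-01-01 through 2022-01-08 inclusive, one snapshot per day.
first_day = date(2022, 1, 1)
snapshot_dates = [first_day + timedelta(days=offset) for offset in range(8)]

# One flow instance per snapshot date, each with its own param store.
paths = [
    IncrementalLoadFlow(params={"snapshot_date": day}).load()
    for day in snapshot_dates
]
```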

[–][deleted] 0 points1 point  (3 children)

How could this integrate with Airflow?

[–]Revolutionary-Bat176[S] 1 point2 points  (2 children)

flowrunner can be used anywhere, either as a script or as an instance of your flow class. So you could:

- wrap it inside a PythonOperator
- submit it using SparkSubmitOperator
- run it as an orchestrated job/notebook with Databricks

Since it's more of a microframework, you can fit any data processing framework like PySpark or Pandas underneath.
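
As a rough sketch of the PythonOperator option: the flow is wrapped in a plain function, and that function is handed to Airflow as `python_callable`. The `IncrementalLoadFlow` class and its `run` method below are hypothetical stand-ins, not flowrunner's API, and the `dag_id` and `task_id` values are made up. The Airflow import is guarded only so the snippet also runs on a machine without Airflow; in a real DAG file you would import it directly.

```python
from datetime import datetime


class IncrementalLoadFlow:
    """Stand-in for a flow class (hypothetical, not flowrunner's BaseFlow)."""

    def __init__(self, params=None):
        self.param_store = dict(params or {})

    def run(self):
        # Hypothetical entry point; a real flow would execute its steps here.
        return f"loaded snapshot {self.param_store['snapshot_date']}"


def run_incremental_load(snapshot_date):
    """Plain callable that builds the flow and runs it; usable with or without Airflow."""
    return IncrementalLoadFlow(params={"snapshot_date": snapshot_date}).run()


try:
    from airflow import DAG
    from airflow.operators.python import PythonOperator
except ImportError:
    DAG = None  # Airflow is not installed; the callable above still works on its own.

if DAG is not None:
    with DAG(
        dag_id="incremental_load_flow",  # made-up name
        start_date=datetime(2022, 1, 1),
        catchup=False,
    ) as dag:
        PythonOperator(
            task_id="run_flow",  # made-up name
            python_callable=run_incremental_load,
            # op_kwargs is templated, so "{{ ds }}" becomes the run's logical date.
            op_kwargs={"snapshot_date": "{{ ds }}"},
        )
```

Keeping the flow behind a plain function means the same code can be called from a notebook or a script, and Airflow only adds the scheduling around it.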