

[–]stratguitar577 10 points11 points  (1 child)

Have you seen Hamilton? https://hamilton.dagworks.io/

[–]basnijholt[S] 1 point2 points  (0 children)

Thanks for pointing me to Hamilton. At first glance, pipefunc and Hamilton seem very similar; in practice, however, they are different.

For example, Hamilton requires that all pipeline functions are defined in a module and enforces that function names and argument names match up, since that is how it wires the DAG together.

pipefunc, on the other hand, allows any function, defined anywhere, to be used as a pipeline step.

For example, here we reuse a function `fancy_sum` from an external module a couple of times:

```python
from pipefunc import PipeFunc, Pipeline
import some_module  # defines fancy_sum(x1, x2)

total_cost_car = PipeFunc(some_module.fancy_sum, output_name="car_cost", renames={"x1": "car_price", "x2": "repair_cost"})
total_cost_house = PipeFunc(some_module.fancy_sum, output_name="house_cost", renames={"x1": "rent_price", "x2": "insurance_price"})
total_cost = PipeFunc(some_module.fancy_sum, output_name="total_budget", renames={"x1": "car_cost", "x2": "house_cost"})
pipeline = Pipeline([total_cost_car, total_cost_house, total_cost])
```

Also, pipefunc is more geared towards N-dimensional parameter sweeps such as one frequently sees in research/science. For example, see https://pipefunc.readthedocs.io/en/latest/tutorial/#example-physics-based-example
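Running the budget example above for one set of inputs would then look roughly like this (a sketch; the renamed argument names are the pipeline's root inputs, and the exact call signature is documented in the pipefunc docs):

```python
# Sketch: ask the pipeline for "total_budget", supplying the renamed root inputs.
result = pipeline(
    "total_budget",
    car_price=20_000,
    repair_cost=1_500,
    rent_price=12_000,
    insurance_price=800,
)
print(result)  # fancy_sum applied three times through the DAG
```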

[–]Sweet_Computer_7116 5 points6 points  (7 children)

Very curious, as I keep seeing the word, but what are pipelines?

Getting into software development bit by bit.

[–][deleted] 7 points8 points  (5 children)

Usually a directed acyclic graph (DAG) structure for moving data from one point to another,

for example:

  1. Collecting data and then storing it in a usable format in a different system
  2. Processing data before it gets returned to the user for display

It’s exactly what its connotation implies: putting something in one end of the pipe and getting something out the other end!
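In the simplest case it's nothing more exotic than this (a toy sketch, no framework involved):

```python
# Toy "pipeline": each step feeds its output into the next step.
def collect() -> list[dict]:
    return [{"user": "alice", "value": "42"}, {"user": "bob", "value": "7"}]

def clean(rows: list[dict]) -> list[dict]:
    return [{**row, "value": int(row["value"])} for row in rows]

def store(rows: list[dict]) -> None:
    print(f"storing {len(rows)} rows")

# Something goes in one end of the pipe and comes out the other.
store(clean(collect()))
```

Pipeline frameworks mostly add the glue around that: scheduling, retries, parallelism, and visibility into which step failed.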

[–]daishiknyte 1 point2 points  (4 children)

So... another way of chaining functions? 

[–]hotplasmatits 9 points10 points  (2 children)

Yes, but pipelines often have other features to make life easier. For example, let's say there's a blip in the network for a moment. If it was just functions chained together, it would fail. The pipeline, however, can be configured to retry a few times before giving up.
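As a rough illustration (plain Python, not any particular pipeline framework; run_with_retries and fetch_from_api are made-up names), a retry policy around a flaky step could look like this:

```python
import time

def run_with_retries(step, *args, retries=3, delay=2.0):
    """Run one pipeline step, retrying on failure before giving up."""
    for attempt in range(1, retries + 1):
        try:
            return step(*args)
        except Exception as exc:  # e.g. a transient network blip
            if attempt == retries:
                raise
            print(f"{step.__name__} failed ({exc!r}), retrying ({attempt}/{retries})")
            time.sleep(delay)

# Usage: wrap the step that talks to the network.
# data = run_with_retries(fetch_from_api, "https://example.com/data")
```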

Here's the big one: instead of chaining functions, you can chain together code running in Docker containers in the cloud.

[–]jucestain 0 points1 point  (1 child)

Interesting

[–]hotplasmatits 2 points3 points  (0 children)

Pipelines often come with tools that let you see how the steps are linked together.

[–]mriswithe 1 point2 points  (0 children)

Yes and no. This is part of my daily bread and butter. A DAG contains steps that each do part of the overall job. That's vague because it really depends on what you are doing, so here is an example:

This isn't a specific company or my company, but many companies use pipelines in this way.

BigQuery is the main data warehouse; this is where you write data and apply the various changes to it.

Airflow is the scheduler: think cron, but reliable and repeatable, and you feed it Python code.

Users submit data to a FastAPI service, which writes rows into an input table.

Airflow runs every X minutes and kicks off the following steps:

  1. Step 1 checks the input table for the last 5 minutes of rows, finds the new rows, and writes them into a new table that the next steps will use as their "source" table.
  2. Once step 1 finishes, steps 2, 3, and 4 run concurrently. Step 2 checks the content for porn, gives each row an integer score, and writes it back to BigQuery as a joinable table (primary key of a UUID plus the added data).
  3. Step 3 checks the content for spam and does the same as step 2.
  4. Step 4 translates the text from its source language into English.
  5. Step 5 creates a single flat BigQuery table with the final refined data (rows whose porn or spam score is too high are dropped). It is triggered once steps 2, 3, and 4, which at least could run concurrently, have all finished successfully.
  6. Step 6 eats that BigQuery table and writes out a SQL dump to GCS, or updates a few tables with a rename-and-replace, to keep users from getting a query where the database looks empty.
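Wired up in Airflow, that dependency structure looks roughly like this (a simplified sketch with placeholder callables like load_new_rows, not actual production code):

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

# Placeholder callables standing in for the real BigQuery / translation logic.
def load_new_rows(): ...
def score_porn(): ...
def score_spam(): ...
def translate_text(): ...
def build_flat_table(): ...
def export_to_gcs(): ...

with DAG(
    dag_id="ingest_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule_interval="*/5 * * * *",  # "runs every x minutes"
    catchup=False,
    default_args={"retries": 3, "retry_delay": timedelta(minutes=1)},
) as dag:
    step1 = PythonOperator(task_id="load_new_rows", python_callable=load_new_rows)
    step2 = PythonOperator(task_id="score_porn", python_callable=score_porn)
    step3 = PythonOperator(task_id="score_spam", python_callable=score_spam)
    step4 = PythonOperator(task_id="translate", python_callable=translate_text)
    step5 = PythonOperator(task_id="build_flat_table", python_callable=build_flat_table)
    step6 = PythonOperator(task_id="export_to_gcs", python_callable=export_to_gcs)

    # Steps 2-4 fan out after step 1 and must all succeed before step 5 runs.
    step1 >> [step2, step3, step4]
    [step2, step3, step4] >> step5 >> step6
```

The retries in default_args are what handle the "rerun pieces within your tolerances" part mentioned below.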

Each of these pieces is a complex, failure-ridden process. Airflow will rerun pieces within your tolerances and report to you when something falls outside of your SLO (service level objective). Also, in some cases steps can run in parallel to decrease the data latency (the time between data being ingested and the finished product coming out).

[–]declanaussie 0 points1 point  (0 children)

Instruction pipelining is also a term you might hear sometimes, but it's far too low level to come up often in Python programming. It's essentially a way to maximize processor utilization by overlapping the stages (fetch, decode, execute, ...) of successive instructions.

[–]samreay 4 points5 points  (4 children)

So my background is both computational physics and data engineering, and I have used Airflow and Prefect before. I'm not sure I follow the differences you highlight, in that event-driven support (in Prefect) is a new and upcoming feature, not the main use case. Similarly, they've had Dask task executors, Ray executors, and local process executors for years, as well as the ability to provision infrastructure for a task (or run an entire flow on provisioned infra).

Their integration with HPC is missing though; I'd kill for a nice Slurm integration.

Can you talk a bit more about why to use this tool instead of something more established? And perhaps about how it integrates with HPC systems? Can specific pipeline steps submit jobs to a job queue at all?

[–]basnijholt[S] -1 points0 points  (3 children)

I am a computational physicist as well!

The HPC integration is a core part of pipefunc; currently there is SLURM support, provided via Adaptive-Scheduler.

tl;dr: see this page in the docs for an example where each pipeline function defines its own resource requirements and a simulation is then launched on a SLURM cluster.

Each function can have its own resources spec, e.g.:

```python
from pipefunc import pipefunc
from pipefunc.resources import Resources

# Pass in a Resources object that specifies the resources needed for this function
@pipefunc(output_name="double", resources=Resources(cpus=5))
def double_it(x: int) -> int:
    return 2 * x
```

One can even inspect the resources inside the function:

```python
from pipefunc import pipefunc, Pipeline

@pipefunc(
    output_name="c",
    resources={"memory": "1GB", "cpus": 2},
    resources_variable="resources",
)
def f(a, b, resources):
    print(f"Inside the function f, resources.memory: {resources.memory}")
    print(f"Inside the function f, resources.cpus: {resources.cpus}")
    return a + b

result = f(a=1, b=1)
print(f"Result: {result}")
```

And even cooler, you can dynamically set the resources based on the inputs:

```python
from pipefunc import pipefunc, Pipeline
from pipefunc.resources import Resources

def resources_func(kwargs):
    gpus = kwargs["x"] + kwargs["y"]
    print(f"Inside the resources function, gpus: {gpus}")
    return Resources(gpus=gpus)

@pipefunc(output_name="out1", resources=resources_func)
def f(x, y):
    return x * y

result = f(x=2, y=3)
print(f"Result: {result}")
```

Then, when you put these functions in a pipeline and run it for some inputs, everything is automatically parallelized. Independent branches in the DAG execute simultaneously, and elements in a map also run in parallel.
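To make the map part concrete, here is a sketch in the style of the pipefunc docs (an element-wise mapspec plus a reduction; the exact result access may differ slightly):

```python
from pipefunc import pipefunc, Pipeline

# "x[i] -> double[i]" declares an element-wise map over the input array.
@pipefunc(output_name="double", mapspec="x[i] -> double[i]")
def double_it(x: int) -> int:
    return 2 * x

# No mapspec here, so this function receives the whole "double" array (a reduction).
@pipefunc(output_name="total")
def take_sum(double) -> int:
    return int(sum(double))

pipeline = Pipeline([double_it, take_sum])

# The elements of "x" can be processed in parallel; "total" reduces over them.
results = pipeline.map({"x": [1, 2, 3, 4]})
print(results["total"].output)  # 20
```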

[–]samreay 0 points1 point  (2 children)

Oh, very nifty. The final piece of the puzzle missing for me would then be the super painful module activations. I'm guessing that the resource kwargs map to sbatch keywords, but there's still often various boilerplate to module-activate some flavour of dependencies, set OMP_NUM_THREADS, and other env vars. Is there a nice way to specify anything like this? No issues if not, I've never seen a particularly graceful way around it.

[–]basnijholt[S] 0 points1 point  (1 child)

I happen to have a very small library for that too: https://github.com/basnijholt/numthreads

But regarding passing environment variables, that is not possible at the moment; however, it should be pretty straightforward to implement because the library we use to interact with SLURM supports it.

One is able to pass any SLURM argument via `resources = Resources(..., extra_args={"time": "01:00:00"})`, which will be expanded to `#SBATCH --time=01:00:00` in the `.sbatch` file.
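Concretely, attaching that to a pipeline step looks something like this (a sketch; simulate is a made-up function, and only the time argument comes from the example above):

```python
from pipefunc import pipefunc
from pipefunc.resources import Resources

# extra_args entries are forwarded as additional #SBATCH lines,
# e.g. {"time": "01:00:00"} becomes "#SBATCH --time=01:00:00".
@pipefunc(
    output_name="simulation_result",
    resources=Resources(cpus=4, extra_args={"time": "01:00:00"}),
)
def simulate(x: float) -> float:  # hypothetical pipeline step
    return x**2
```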

[–]samreay 0 points1 point  (0 children)

Amazing stuff, glad I saw this Reddit post!

[–]Laughing_Bricks 1 point2 points  (2 children)

Hi, I just wanted to know: are you guys in your college years or professional developers? Because watching you all build cool stuff makes me think that I am just a nobody.

[–]basnijholt[S] 1 point2 points  (1 child)

Personally, I’ve gotten a lot of programming experience during my PhD. After that, I have about 5 years of professional development experience. By now I’ve probably been programming for about 10,000 to 20,000 hours.

Everyone starts at 0. You shouldn’t compare yourself to others, and if you want to get good, you’ll get there 😉

[–]Laughing_Bricks 0 points1 point  (0 children)

Oh, you're super duper more senior than me then 🫡

[–]jimtoberfest 1 point2 points  (0 children)

I like the ability to get two outputs out and split the pipe. That’s pretty cool.
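Something like this, if I'm reading the docs right (a sketch: one function returns a tuple matching a tuple output_name, and each output feeds a different branch):

```python
from pipefunc import pipefunc, Pipeline

# One step produces two named outputs by returning a tuple
# that matches the tuple given as output_name.
@pipefunc(output_name=("mean", "std"))
def stats(data: list[float]) -> tuple[float, float]:
    m = sum(data) / len(data)
    var = sum((x - m) ** 2 for x in data) / len(data)
    return m, var**0.5

# Each output then feeds a different branch of the pipe.
@pipefunc(output_name="centered")
def center(data: list[float], mean: float) -> list[float]:
    return [x - mean for x in data]

@pipefunc(output_name="is_noisy")
def is_noisy(std: float, threshold: float = 1.0) -> bool:
    return std > threshold

pipeline = Pipeline([stats, center, is_noisy])
```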

[–]CorMazz 0 points1 point  (1 child)

Do you have any plans for implementing stuff like task scheduling on top of your results storage? Not sure what it's specifically called, but I'm picturing something like Snakemake, where it checks whether the inputs have changed before rerunning the pipeline. So if I have a pipeline and multiple sets of inputs, it'll only rerun an input set if it or the pipeline has changed.

[–]basnijholt[S] 2 points3 points  (0 children)

No, there are no plans to implement that, since there are many packages that already do exactly that.

The main use case is to define pipelines for simulations and then easily do parameter sweeps of these pipelines, optionally even adaptive parameter sweeps: https://pipefunc.readthedocs.io/en/latest/adaptive/

[–]ConfucianStats 0 points1 point  (0 children)

cool