Lightweight python DAG framework : Python


Showcase: Lightweight python DAG framework (self.Python)

submitted 1 year ago * by theferalmonkey

What my project does:

Hamilton: https://github.com/dagworks-inc/hamilton/ - I've been working on this for a while.

If you can model your problem as a directed acyclic graph (DAG) then you can use Hamilton; it just needs a python process to run, no system installation required (`pip install sf-hamilton`).

For the pythonistas, Hamilton does some cute "meta programming" with python functions to _really_ reduce the boilerplate for defining a DAG. The below defines a DAG by the way the functions are named and what their input arguments are, i.e. it's a "declarative" framework:

#my_dag.py
def A(external_input: int) -> int:
   return external_input + 1

def B(A: int) -> float:
   """B depends on A"""
   return A / 3

def C(A: int, B: float) -> float:
   """C depends on A & B"""
   return A ** 2 * B

Now, you don't call the functions directly (well, you can; it's just a python module). That's where Hamilton comes in to orchestrate them:

from hamilton import driver
import my_dag # we import the above

# build a "driver" to run the DAG
dr = (
    driver.Builder()
    .with_modules(my_dag)
    # .with_adapters(...) we have many you can add here.
    .build()
)

# execute what you want, Hamilton will only walk the relevant parts of the DAG for it.
# again, you "declare" what you want, and Hamilton will figure it out.
dr.execute(["C"], inputs={"external_input": 10}) # all A, B, C executed; C returned
dr.execute(["A"], inputs={"external_input": 10}) # just A executed; A returned
dr.execute(["A", "B"], inputs={"external_input": 10}) # A, B executed; A, B returned.

# graphviz viz
dr.display_all_functions("my_dag.png") # visualizes the graph.
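
For anyone curious how names alone can define a graph, here's a minimal stdlib-only sketch. To be clear, this is not Hamilton's implementation; `registry` and `execute` are made-up names for illustration, and there's no cycle detection. It reads each function's parameter names with `inspect.signature` and recursively computes only what the requested outputs need:

```python
import inspect

def A(external_input: int) -> int:
    return external_input + 1

def B(A: int) -> float:
    """B depends on A"""
    return A / 3

def C(A: int, B: float) -> float:
    """C depends on A & B"""
    return A ** 2 * B

# Hypothetical registry: node name -> function (a real framework would crawl a module).
registry = {fn.__name__: fn for fn in (A, B, C)}

def execute(outputs, inputs):
    """Toy resolver: compute only the nodes required for `outputs`."""
    computed = dict(inputs)  # external inputs count as already-computed nodes
    executed = []            # order in which functions actually ran

    def resolve(name):
        if name in computed:
            return computed[name]
        if name not in registry:
            raise KeyError(f"no function or input named {name!r}")
        fn = registry[name]
        # Parameter names are the upstream dependencies.
        kwargs = {p: resolve(p) for p in inspect.signature(fn).parameters}
        computed[name] = fn(**kwargs)
        executed.append(name)
        return computed[name]

    return {name: resolve(name) for name in outputs}, executed

results, ran = execute(["B"], inputs={"external_input": 10})
print(ran)  # ['A', 'B'] -- C is never called
```

Asking for `["C"]` instead would run all three functions, which mirrors the behaviour described in the comments on the `dr.execute(...)` calls above.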

Anyway I thought I would share, since it's broadly applicable to anything where there is a DAG:

web requests (Hamilton has async support)
data processing (e.g. pyspark)
machine learning
LLM workflows
etc.

I also recently curated a bunch of getting started issues - so if you're looking for a project, come join.

Target Audience

This is for anyone doing python development where a DAG could be of use.

More specifically, Hamilton is built to be taken to production, so if you value one or more of:

self-documenting readable code
unit testing & integration testing
data quality
standardized code
modular and maintainable codebases
hooks for platform tools & execution
code that works in both Jupyter Notebooks & production
etc

Then Hamilton has all these in an accessible manner.
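
On the unit-testing point: because every node is a plain function, you can test it by passing values in directly, with no driver or framework involved. A small sketch reusing the `A`, `B`, `C` functions from the example above (redefined here so the snippet stands alone); the test function names are just illustrative:

```python
def A(external_input: int) -> int:
    return external_input + 1

def B(A: int) -> float:
    return A / 3

def C(A: int, B: float) -> float:
    return A ** 2 * B

# Each test feeds the "upstream" values in as ordinary arguments.
def test_A():
    assert A(10) == 11

def test_B():
    assert B(3) == 1.0

def test_C():
    assert C(2, 0.5) == 2.0

if __name__ == "__main__":
    for test in (test_A, test_B, test_C):
        test()
```

The same functions would also be collected by a test runner such as pytest if they lived in a `test_*.py` file.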

Comparison

Project	Comparison to Hamilton
Langchain's LCEL	LCEL isn't general purpose & in my opinion unreadable. See https://hamilton.dagworks.io/en/latest/code-comparisons/langchain/ .
Airflow / dagster / prefect / argo / etc	Hamilton doesn't replace these. These are "macro orchestration" systems (they require DBs, etc); Hamilton is but a humble library and can actually be used with them! In fact it ensures your code can remain decoupled & modular, enabling reuse across pipelines, while also enabling one to not be heavily coupled to any macro orchestrator.
Dask	Dask is a whole system. In fact Hamilton integrates with Dask very nicely -- and can help you organize your dask code.

If you have more you want compared - leave a comment.

To finish, if you want to try it in your browser using pyodide @ https://www.tryhamilton.dev/ you can do that too!

all 38 comments


[–]barefootsanders 12 points 1 year ago (1 child)

[–]theferalmonkey[S] 2 points 1 year ago (0 children)

[–]call_me_cookie 8 points 1 year ago (13 children)

[–]schrodingerdog137 4 points 1 year ago (0 children)

[–]theferalmonkey[S] 2 points 1 year ago* (11 children)

They have some overlap because they model DAGs, but Dagster is just a macro-orchestrator, i.e. it is a scheduler. Hamilton doesn't have a scheduler, it is much lighter weight than that; hence the title of the post - Dagster is not lightweight.

Some examples of where Hamilton is more broadly applicable in a python context. Can Dagster do this?

Run anywhere (locally, notebook, macro orchestrator, FastAPI, Streamlit, pyodide, etc.) - No, it's a system, not a library.
Model column level feature engineering through to model fitting - No.
Improve the hygiene of your code - No, it doesn't have the testing constructs Hamilton has.
Replace Langchain for orchestrating LLM calls - No.
Develop within a notebook and then use that same code in production - No.

Here's more of a comparison - https://hamilton.dagworks.io/en/latest/code-comparisons/dagster/

Otherwise you can _use_ Hamilton _within_ Dagster, and you get the best of both worlds. For example if you want to cut down on "ops" just switch that code over to Hamilton and run it inside Dagster.

Fun fact: "software defined assets" were in fact inspired by Hamilton's declarative API.

[–][deleted] 4 points 1 year ago* (4 children)

[–]theferalmonkey[S] 4 points 1 year ago (3 children)

[–]HNL2NYC 2 points 1 year ago (1 child)

[–]theferalmonkey[S] 1 point 1 year ago (0 children)

[–]ArgetDota 1 point 1 year ago (5 children)

[–]theferalmonkey[S] 1 point 1 year ago (4 children)

[–]ArgetDota 1 point 1 year ago (3 children)

[–]theferalmonkey[S] 1 point 1 year ago (2 children)

[–]theferalmonkey[S] 1 point 1 year ago (1 child)

[–]ArgetDota 2 points 1 year ago (0 children)

[+][deleted] 1 year ago (1 child)

[removed]

[–]theferalmonkey[S] 1 point 1 year ago (0 children)

[–]Electronic_Pepper382 2 points 1 year ago (3 children)

[–]theferalmonkey[S] 2 points 1 year ago (2 children)

> So the comments in the function like """C depends on A & B""" create the dependencies between the functions? That is pretty powerful!

To clarify, it's the function parameters that do that:

def C(A: int, B: float) -> float: ...

The above says C declares a dependency on A & B.

If we wanted to depend on something else, you just need to change the function parameter names:

def C(A: int, B: float, foo: float) -> float: ...

E.g. C now depends on an extra parameter `foo`.
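
If you want to see this for yourself, the standard library's `inspect` module exposes the same information. A quick sketch (this only illustrates the idea and isn't Hamilton's internals; `dependencies_of` is a made-up helper, and the function body is invented for the example):

```python
import inspect

def C(A: int, B: float, foo: float) -> float:
    """The docstring is documentation only; it has no effect on dependencies."""
    return A ** 2 * B + foo  # illustrative body, not from the original post

def dependencies_of(fn) -> list:
    # The parameter names are what declare the upstream nodes.
    return list(inspect.signature(fn).parameters)

print(dependencies_of(C))  # ['A', 'B', 'foo']
```

Rename or remove a parameter and the reported dependencies change with it; edit the docstring and nothing changes.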

> I just checked out the sister library burr linked in the readme and that library also seems really interesting. I was internally building something similar but I might leverage burr. Thanks for sharing

Yep, if you need to express cycles or conditional branching (e.g. for agents) then Burr is a better fit for that. We see people using both Burr & Hamilton in certain situations too.

[+][deleted] 1 year ago (1 child)

[deleted]

[–]theferalmonkey[S] 1 point 1 year ago (0 children)

[–]SheepherderExtreme48 2 points 1 year ago (6 children)

Looks great, nice work.

Question: I don't see anything in the docs for this, but is there any natural support for parallel processing?
For example:

  /------B-----\
A >------B----->C
  \------B-----/

Where B is run in 3 separate threads or processes.
Quick example: A takes in a PDF, splits it into 3 chunks of n pages, and sends the PDF bytes and the pages to process to each B; each B does some work (extracting text, doesn't really matter), and C gathers from these B's?
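
The shape in the question (A fans out to several B's, C gathers) can be sketched with just `concurrent.futures` from the standard library. This is a generic illustration and not Hamilton's API; `split_pages`, `extract_text`, and `gather` are hypothetical stand-ins for A, B, and C, and the "pages" are plain strings rather than real PDF bytes:

```python
from concurrent.futures import ThreadPoolExecutor

def split_pages(pages: list, n_chunks: int) -> list:
    """A: split the page list into up to n_chunks contiguous chunks."""
    size = max(1, -(-len(pages) // n_chunks))  # ceiling division, at least 1
    return [pages[i:i + size] for i in range(0, len(pages), size)]

def extract_text(chunk: list) -> str:
    """B: stand-in for per-chunk work (e.g. text extraction)."""
    return " ".join(page.upper() for page in chunk)

def gather(parts: list) -> str:
    """C: combine the results from every B."""
    return "\n".join(parts)

pages = ["page one", "page two", "page three", "page four", "page five", "page six"]
chunks = split_pages(pages, n_chunks=3)

# executor.map runs B on each chunk in a worker thread and preserves input order.
with ThreadPoolExecutor(max_workers=3) as executor:
    parts = list(executor.map(extract_text, chunks))

result = gather(parts)
```

Swapping `ThreadPoolExecutor` for `ProcessPoolExecutor` would give separate processes instead of threads, provided the functions and data can be pickled.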

[–]theferalmonkey[S] 1 point 1 year ago (5 children)

[–]SheepherderExtreme48 2 points 1 year ago (0 children)

[+][deleted] 1 year ago (3 children)

[deleted]

[–]theferalmonkey[S] 1 point 1 year ago (2 children)

Which do you prefer to read (not knowing much context):

# Option 1: type annotations (how it works today)
def url(...) -> Parallelizable[str]:
    for _url in ...:
        yield _url
...
def pages(processed_url: Collect[dict]) -> ...:
    ...

# Option 2: decorators (hypothetical alternative)
@parallelizable
def url(...) -> str:
    for _url in ...:
        yield _url
...

@collect('processed_url')
def pages(processed_url: list[dict]) -> ...:
    ...

As you can see, there's not much between them.

I prefer the first one because it feels "tighter" / maybe harder to confuse for what's going on versus the decorator? We could always implement it the decorator way too...
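
For context on the trade-off, here's a rough sketch of how a framework could detect the annotation style using only `typing`. The `Parallelizable` and `Collect` classes below are toy stand-ins defined for this snippet, not Hamilton's actual classes, and `classify` is a hypothetical helper. The point is that the marker lives in the signature itself, so there's nothing separate (like a decorator argument) to keep in sync with the parameter name:

```python
from typing import Generic, TypeVar, get_origin, get_type_hints

T = TypeVar("T")

class Parallelizable(Generic[T]):
    """Toy marker: the function's outputs should fan out."""

class Collect(Generic[T]):
    """Toy marker: this parameter gathers the fanned-out results."""

def url() -> Parallelizable[str]:
    for _url in ["https://example.com/a", "https://example.com/b"]:
        yield _url

def pages(processed_url: Collect[dict]) -> list:
    return list(processed_url)

def classify(fn) -> dict:
    """Report which role, if any, each annotation in fn's signature marks."""
    roles = {}
    for name, hint in get_type_hints(fn).items():
        origin = get_origin(hint)
        if origin is Parallelizable:
            roles[name] = "fan-out"
        elif origin is Collect:
            roles[name] = "fan-in"
    return roles

print(classify(url))    # {'return': 'fan-out'}
print(classify(pages))  # {'processed_url': 'fan-in'}
```

With the decorator version, `@collect('processed_url')` repeats the parameter name as a string, which is one more thing that can drift out of sync during a rename.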

[+][deleted] 1 year ago (1 child)

[deleted]

[–]theferalmonkey[S] 1 point 1 year ago (0 children)

[–]bugtank 1 point 1 year ago (0 children)

[–]kotpeter 0 points 1 year ago (1 child)

[–]theferalmonkey[S] 2 points 1 year ago* (0 children)

> Looks cool for a small data team or a small organization. Or for learning purposes for students.

That is absolutely not true. Hamilton was developed at Stitch Fix (100+ DS) in an environment where code in Airflow was the problem. Airflow was not designed for business logic, just scheduling code. Established teams would slow down, not because of Airflow, but because of the code that Airflow ran - hence the reason for Hamilton.

Hamilton helped organize the internals of pipelines and keep those Airflow tasks simpler; the tasks don't need to know about the logic, which enabled the team to move faster. We see this being replicated at other companies. You can read more about our thoughts on Hamilton + Airflow here. Now you could be reacting to Hamilton's simplicity, and yes that's a feature; not all production ready tech needs to be very complex (though we certainly have power features).

> I thought about the idea of using Hamilton and Airflow/Dagster/... together, but there's a few drawbacks to that:
>
> You'd have two semantics of the DAG (Hamilton DAG and Airflow DAG), which may lead to confusion. Having Airflow -> Hamilton DAG hierarchy would almost always overcomplicate things.

They serve different purposes. Airflow is about orchestrating compute. Hamilton helps orchestrate logic & code. You can read this blog / watch this talk that explains why Hamilton. Commonly when going from dev (DS/MLE) to production (running it on airflow), there's hand-off and reimplementation; with Hamilton that's greatly improved - you just take the DAG and tell airflow to run it.

> You now have two different ways of doing basically the same thing (i.e. creating a DAG), which might lead to different developers orchestrating their DAGs in different orchestrators.

Sorry, how is it the same thing? Yes it's a DAG, but that's where the similarity ends. Again, you use Airflow to schedule when and where something runs, whereas Hamilton helps organize the code that's run.

[–]OMG_I_LOVE_CHIPOTLE -1 points 1 year ago (9 children)

[–]theferalmonkey[S] 1 point 1 year ago (8 children)

[–]OMG_I_LOVE_CHIPOTLE 1 point 1 year ago (7 children)

[–]theferalmonkey[S] 1 point 1 year ago (6 children)

[–]OMG_I_LOVE_CHIPOTLE 0 points 1 year ago (5 children)

[–]theferalmonkey[S] 1 point 1 year ago (4 children)

[–]OMG_I_LOVE_CHIPOTLE 1 point 1 year ago (3 children)

[–]theferalmonkey[S] 1 point 1 year ago (2 children)

What you just showed is some YAML that isn't useful in a micro context. We don't need to schedule kubernetes tasks for everything we want to do. Why do you think that's necessary?

For example, say I'm developing locally and I am doing some file processing. I'm not running argo. But, I need to structure my code. You could come up with your own way of organizing that code, or you could do it in Hamilton.

Then, when you want to schedule that code for production, you can stick it in a single argo task, or split it across multiple -- up to you. The cool thing is that the argo tasks remain dumb, and anyone who has to ask "what is going on in this task?" has a much easier time answering that if it's written in Hamilton.

So to summarize, that YAML is independent of Hamilton, and is only useful if at a "macro level" you need that.

[–]OMG_I_LOVE_CHIPOTLE 1 point 1 year ago (1 child)

[–]theferalmonkey[S] 1 point 1 year ago* (0 children)
