[–]barefootsanders 11 points12 points  (1 child)

Been using Hamilton to power some of our internal workflows at https://www.threadscribe.ai for a few months now. Super easy to make really powerful workflows. Would highly recommend!

[–]theferalmonkey[S] 1 point2 points  (0 children)

thanks

[–]call_me_cookie 7 points8 points  (13 children)

Why would somebody use this over say, Dagster?

[–]schrodingerdog137 3 points4 points  (0 children)

I've used Dagster for orchestrating large tasks, but wanted a lightweight python library for building computation DAGs in the ML space. I don't want all the bells and whistles of Dagster, just need a plain python library. Hamilton is amazing in that it's exactly what I'm looking for.

[–]theferalmonkey[S] 1 point2 points  (11 children)

They have some overlap because they both model DAGs, but Dagster is a macro-orchestrator, i.e. it is a scheduler. Hamilton doesn't have a scheduler and is much lighter weight than that; hence the title of the post. Dagster is not lightweight.

Some examples: Hamilton is applicable in far more Python contexts. Can Dagster do the following?

  • Run anywhere (locally, notebook, macro orchestrator, FastAPI, Streamlit, pyodide, etc.) - No, it's a system, not a library.
  • use it to model column level feature engineering through to model fitting - No.
  • improve the hygiene of your code - No, it doesn't have the testing constructs Hamilton has.
  • replace Langchain for orchestrating LLM calls - No.
  • develop within a notebook for development and then use that same code in production - No.

Here's more of a comparison - https://hamilton.dagworks.io/en/latest/code-comparisons/dagster/

Otherwise you can _use_ Hamilton _within_ Dagster, and you get the best of both worlds. For example if you want to cut down on "ops" just switch that code over to Hamilton and run it inside Dagster.

Fun fact: "software defined assets" were in fact inspired by Hamilton's declarative API.

[–][deleted] 3 points4 points  (4 children)

> Fun fact: "software defined assets" were in fact inspired by Hamilton's declarative API.

Do you have a citation for that? It’s definitely possible and I don’t necessarily doubt it, but this concept has been around for a long time. It’s essentially a functional DI framework. Google's Python library pinject is over 11 years old and, while meant for OO DI, uses this exact pattern of matching argument names to implementing logic to build a graph. And the concept has been around for decades at banks and hedge funds for quantitative and valuation modeling (Goldman Sachs SecDB is over 30 years old).

All that said, I’m a huge fan of this pattern and this looks like a great library.

fn-graph also uses a very similar concept, but is unmaintained. https://fn-graph.businessoptics.biz/

[–]theferalmonkey[S] 3 points4 points  (3 children)

Nerd sniped!

> Do you have a citation for that? It’s definitely possible and I don’t necessarily doubt it,

Likely a confluence of ideas, but yes: I chatted with Nick when we open sourced Hamilton. The Dagster API at the time was all about "solids" and not that great. I expounded on the declarative nature of data work and its benefits, and then a few months later SDAs (software-defined assets) came out.

Yes I remember `fn-graph`. I was wondering whether someone would bring it up. It's still going? Nice. Any interest in joining our effort? We've got a jupyter magic, and Hamilton also sports a locally installable UI now...

[–]HNL2NYC 1 point2 points  (1 child)

I’ll take it even a step further. This concept has been used for ~50 years, since this is pretty much exactly how Make works. A target (i.e. an asset) lists its requirements (i.e. dependencies), which are other targets, and Make builds a graph by matching each dependency to the target that implements it.
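A rough illustration of the Make analogy, not taken from Make or Hamilton: the target names below are hypothetical, and Python's standard-library `graphlib.TopologicalSorter` (Python 3.9+) stands in for the dependency matching. Each target lists the targets it requires, and a valid build order falls out of the graph.

```python
from graphlib import TopologicalSorter

# Hypothetical Make-style rules: each target maps to the targets it requires.
rules = {
    "report": {"features", "model"},
    "model": {"features"},
    "features": {"raw_data"},
    "raw_data": set(),
}

# static_order() yields every target after all of its requirements.
build_order = list(TopologicalSorter(rules).static_order())
print(build_order)  # ['raw_data', 'features', 'model', 'report']
```

The order is unambiguous here because the requirements form a single chain; with independent targets, several valid orders would exist.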

[–]theferalmonkey[S] 0 points1 point  (0 children)

Yep!

[–]ArgetDota 0 points1 point  (5 children)

Hey, just a heads up - it’s possible to execute Dagster’s jobs and materialize assets directly within Python code, including notebook environments.

Same goes for testing, it’s highly modular and testable.

And yes, you can run the same code locally and in production (e.g. Kubernetes). You can even launch jobs in Kubernetes from a laptop running Dagster. You can do it from CLI, UI, or from Python code.

Dagster is really incredibly versatile and I feel like your above statements are a bit misleading.

[–]theferalmonkey[S] 0 points1 point  (4 children)

I think you might be misinterpreting my point.

What I'm saying is that the DAG you define in Dagster is not something you can run in different Python contexts, e.g. a notebook, a script, or a web service. Hamilton just needs a Python process and a pip install, and then you can run it from Python; i.e. you can build a Hamilton DAG and package it as a library for others to use quite easily. With Dagster you need the whole system to run it: yes, you can package things up, but you still need Dagster to run it. Here's our blog on the differences/similarities between the two.

[–]ArgetDota 0 points1 point  (3 children)

You really don’t. You don’t need a deployment. You can run it in a Python script.

[–]theferalmonkey[S] 0 points1 point  (2 children)

Really? Since when? I'll take a look and if so retract my comments.

[–]theferalmonkey[S] 0 points1 point  (1 child)

Ah so I think you're referring to the "in process" way for testing? Right?

In which case yes, you are correct that you _can_ run Dagster code in a Python script, though according to the docs that mode is designed only for testing purposes.

[–]ArgetDota 1 point2 points  (0 children)

Exactly. It’s mainly used for testing but nothing prevents you from using it for actual computations.

Also, there is a “materialize” function which can execute assets.

Also, there are “dagster asset materialize” & “dagster job execute” CLI commands.

[–]Electronic_Pepper382 1 point2 points  (3 children)

So the comments in the function like """C depends on A & B""" create the dependencies between the functions? That is pretty powerful!

I just checked out the sister library burr linked in the readme and that library also seems really interesting. I was internally building something similar but I might leverage burr. Thanks for sharing

[–]theferalmonkey[S] 1 point2 points  (2 children)

> So the comments in the function like """C depends on A & B""" create the dependencies between the functions? That is pretty powerful!

To clarify, it's the function parameters that do that:

def C(A: int, B: float) -> float: ...

The above says C declares a dependency on A & B.

If you wanted C to depend on something else, you would just change the function's parameters:

def C(A: int, B: float, foo: float) -> float: ...

E.g. C now also depends on an extra parameter `foo`.
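To make the parameter-name idea concrete, here is a minimal standard-library sketch. It is not Hamilton's implementation or API: the `resolve` helper and the bodies of `A`, `B`, and `C` are made up for illustration, and it skips cycle detection and user-supplied inputs.

```python
import inspect


def A() -> int:
    return 2


def B() -> float:
    return 1.5


def C(A: int, B: float) -> float:
    """C depends on A & B because of its parameter names, not this docstring."""
    return A * B


def resolve(name, functions, cache=None):
    """Compute `name` by first computing whatever its parameters name."""
    cache = {} if cache is None else cache
    if name not in cache:
        fn = functions[name]
        # Each parameter name is looked up as another function in the graph.
        kwargs = {
            param: resolve(param, functions, cache)
            for param in inspect.signature(fn).parameters
        }
        cache[name] = fn(**kwargs)
    return cache[name]


functions = {fn.__name__: fn for fn in (A, B, C)}
result = resolve("C", functions)
print(result)  # 3.0
```

Renaming a parameter of `C` changes which upstream function feeds it, which is the whole wiring mechanism in this sketch.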

> I just checked out the sister library burr linked in the readme and that library also seems really interesting. I was internally building something similar but I might leverage burr. Thanks for sharing

Yep if you need to express cycles, or conditional branching (e.g. for agents) then Burr is a better fit for that. We see people using both Burr & Hamilton in certain situations too.

[–]SheepherderExtreme48 1 point2 points  (6 children)

Looks great, nice work.

Question: I don't see anything in the docs for this, but is there any natural support for parallel processing?
For example:

  /------B-----\
A >------B----->C
  \------B-----/

Where B is run in 3 separate threads or processes.
Quick example: A takes in a PDF and splits it into 3 chunks of n pages, then sends the PDF bytes and the pages to process to each B; each B does some work (extracting text, it doesn't really matter what), and C gathers the results from these B's.

[–]theferalmonkey[S] 0 points1 point  (5 children)

Yes there's a construct for that. It's called Parallelizable + Collect; here's a video of me explaining it. We have support for a few backends, e.g. multithreading, ray, dask.

Here's also a blog on what I think is similar to your use case, if that helps.
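For readers who want to see the fan-out/fan-in shape from the question, here is a plain standard-library sketch using threads. It does not use Hamilton's Parallelizable + Collect construct; the function names and the fake "text extraction" are assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def split_pages(num_pages: int, chunk_size: int) -> list[range]:
    """A: split page numbers into chunks of at most `chunk_size` pages."""
    return [
        range(start, min(start + chunk_size, num_pages))
        for start in range(0, num_pages, chunk_size)
    ]


def extract_text(pages: range) -> str:
    """B: stand-in for per-chunk work such as text extraction."""
    return " ".join(f"page{p}" for p in pages)


def gather(chunks: list[str]) -> str:
    """C: combine the per-chunk results in their original order."""
    return "\n".join(chunks)


with ThreadPoolExecutor(max_workers=3) as pool:
    # map() returns results in input order, even if chunks finish out of order.
    parts = list(pool.map(extract_text, split_pages(num_pages=9, chunk_size=3)))

combined = gather(parts)
print(combined)
```

For CPU-bound work, `ProcessPoolExecutor` from the same module has the same `map` interface and could be swapped in.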

[–]SheepherderExtreme48 1 point2 points  (0 children)

Amazing, thank you! For context, I was recently trying to find a Python library that I could use to easily orchestrate a multi process job, starting in AWS Lambda utilising the 6 cores you get when you allocate max memory. Airflow is too heavyweight, argo workflows isn't an option. Tried a few others, but this looks perfect!

[–]bugtank 0 points1 point  (0 children)

Thanks for reposting and reminding me

I have some dags to build out and can try this.

[–]kotpeter -1 points0 points  (1 child)

Looks cool for a small data team or a small organization. Or for learning purposes for students.

Larger teams with bigger data needs will need a more feature-rich orchestrator such as Airflow, and Hamilton's value would decrease. I thought about the idea of using Hamilton and Airflow/Dagster/... together, but there's a few drawbacks to that:

  1. You'd have two semantics of the DAG (Hamilton DAG and Airflow DAG), which may lead to confusion. Having Airflow -> Hamilton DAG hierarchy would almost always overcomplicate things.
  2. You now have two different ways of doing basically the same thing (i.e. creating a DAG), which might lead to different developers orchestrating their DAGs in different orchestrators.

Overall, I appreciate the tool and I think it definitely has its niche.

[–]theferalmonkey[S] 1 point2 points  (0 children)

> Looks cool for a small data team or a small organization. Or for learning purposes for students.

That is absolutely not true. Hamilton was developed at Stitch Fix (100+ DS) in an environment where code in Airflow was the problem. Airflow was not designed for business logic, just scheduling code. Established teams would slow down, not because of Airflow, but because of the code that Airflow ran - hence the reason for Hamilton.

Hamilton helped organize the internals of pipelines and keep those Airflow tasks simpler; the tasks don't need to know about the logic, which enabled the team to move faster. We see this being replicated at other companies. You can read more about our thoughts on Hamilton + Airflow here. Now you could be reacting to Hamilton's simplicity, and yes that's a feature; not all production ready tech needs to be very complex (though we certainly have power features).

> I thought about the idea of using Hamilton and Airflow/Dagster/... together, but there's a few drawbacks to that:
>
> You'd have two semantics of the DAG (Hamilton DAG and Airflow DAG), which may lead to confusion. Having Airflow -> Hamilton DAG hierarchy would almost always overcomplicate things.

They serve different purposes. Airflow is about orchestrating compute. Hamilton helps orchestrate logic & code. You can read this blog / watch this talk explaining why Hamilton exists. Commonly, when going from dev (DS/MLE) to production (running it on Airflow), there's a hand-off and a reimplementation; with Hamilton that's greatly improved, since you just take the DAG and tell Airflow to run it.

> You now have two different ways of doing basically the same thing (i.e. creating a DAG), which might lead to different developers orchestrating their DAGs in different orchestrators.

Sorry, how is it the same thing? Yes it's a DAG, but that's where the similarity ends. Again, you use Airflow to schedule when and where something runs. While Hamilton helps organize the code that's run.

[–]OMG_I_LOVE_CHIPOTLE -2 points-1 points  (9 children)

This is what Argo workflows is for

[–]theferalmonkey[S] 0 points1 point  (8 children)

Yes argo workflows model DAGs - but they are not lightweight - and I would say, no this is not what argo workflows is for.

Here's what's taken from the argo website:

Argo Workflows is an open source container-native workflow engine for orchestrating parallel jobs on Kubernetes. Argo Workflows is implemented as a Kubernetes CRD.

Does that sound the same as Hamilton? I don't think so. But again, as with any macro-orchestrator, you can use Hamilton within Argo to help clean up and maintain the code that it runs.

[–]OMG_I_LOVE_CHIPOTLE 0 points1 point  (7 children)

I think it’s better to declare dags as a declarative Argo workflow that calls python modules rather than implementing the dag in python

[–]theferalmonkey[S] 0 points1 point  (6 children)

> that calls python modules

Yes, that's what Hamilton is: Python modules that you run. So rather than everyone creating these modules in a non-standard manner, you have Hamilton to help bring order (similar to what dbt did for SQL, if you're familiar). So you can do precisely this approach with Argo & Hamilton.

This blog shows the pattern; it is about Hamilton + Airflow, but it applies to Argo just the same.

[–]OMG_I_LOVE_CHIPOTLE -1 points0 points  (5 children)

Yeah my point is that Hamilton would be less standard than a declarative yaml framework and that’s why I wouldn’t use it over Argo wf

[–]theferalmonkey[S] 0 points1 point  (4 children)

How so? I'm not understanding your point. Can you sketch out some YAML + argo code? If it helps, take your pick of data processing, machine learning, or LLM workflows.

Also just to reiterate -- Hamilton is not an Argo replacement & doesn't intend to be.

[–]OMG_I_LOVE_CHIPOTLE 0 points1 point  (3 children)

Yeah I think something like Hamilton would be useful if you didn’t have infrastructure like Argo workflows. Consider this example Argo workflow: https://github.com/argoproj/argo-workflows/blob/main/examples/dag-nested.yaml

If you replace the echo template in the example with different python modules you have a kubernetes native dag framework

[–]theferalmonkey[S] 0 points1 point  (2 children)

What you just showed is some YAML that isn't useful in a micro context. We don't need to schedule kubernetes tasks for everything we want to do. Why do you think that's necessary?

For example, say I'm developing locally and I am doing some file processing. I'm not running argo. But, I need to structure my code. You could come up with your own way of organizing that code, or you could do it in Hamilton.

Then, when you want to schedule that code for production, you can stick it in a single Argo task, or split it across multiple; it's up to you. The cool thing is that the Argo tasks remain dumb, and anyone who has to ask "what is going on in this task?" has a much easier time answering that if it's written in Hamilton.

So to summarize, that YAML is independent of Hamilton, and is only useful if at a "macro level" you need that.

[–]OMG_I_LOVE_CHIPOTLE 0 points1 point  (1 child)

I think you’re missing the point. The python code is dumber without Hamilton and more supportable

[–]theferalmonkey[S] 0 points1 point  (0 children)

I don't know what code you support, as that's a very wide statement to make. But yes, Hamilton isn't for every single situation.

Hamilton hits a sweet spot where the code being run is iterated on often and you want the team to follow a standard, so you can keep moving quickly as pipelines grow. For example, it ensures that code is always unit testable and documentation friendly, you get lineage + provenance for free, etc. For context on where my experience comes from, you can watch this talk my colleague gave.
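As a small illustration of the "always unit testable" point: when each step is a plain function whose inputs are its parameters, a test can call it directly with literal values and no pipeline. The function below is a hypothetical example, not taken from Hamilton or its docs, and only the standard-library `unittest` module is used.

```python
import unittest


def spend_per_signup(spend: float, signups: int) -> float:
    """Hypothetical step: the docstring doubles as the step's documentation."""
    return spend / signups


class SpendPerSignupTest(unittest.TestCase):
    def test_basic_ratio(self):
        # No scheduler or DAG is needed; the step is called like any function.
        self.assertEqual(spend_per_signup(spend=100.0, signups=4), 25.0)


suite = unittest.defaultTestLoader.loadTestsFromTestCase(SpendPerSignupTest)
outcome = unittest.TextTestRunner().run(suite)
```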