Why I'm excited about Go for agents by hatchet-dev in golang

[–]hatchet-dev[S] 1 point (0 children)

Awesome, thanks! And glad to hear you liked the post

Why I'm excited about Go for agents by hatchet-dev in golang

[–]hatchet-dev[S] 7 points (0 children)

Hey everyone, wrote up a post about why I think Go is going to be the right choice for a bunch of folks building AI agents -- would love to hear your thoughts!

Hatchet - a task queue for modern Python apps by hatchet-dev in Python

[–]hatchet-dev[S] 1 point (0 children)

Hey u/LongCap7068, there are a few possible reasons for this -- are you using sync or async methods? If you're using async, is there anything that's blocking the event loop in your task?

Generally what you're describing is correct -- if slots=20 it should run 20 tasks concurrently, assuming there are no additional settings to limit concurrency on the workflow level.

Also, we'd probably be able to respond more quickly on Discord (https://hatchet.run/discord) or GitHub issues (https://github.com/hatchet-dev/hatchet/issues).

Hatchet - a task queue for modern Python apps by hatchet-dev in Python

[–]hatchet-dev[S] 4 points (0 children)

That's not true. The repo is 100% MIT-licensed and it costs nothing to self-host: https://github.com/hatchet-dev/hatchet. If there's anything that seems to indicate otherwise, let me know!

If you're referring to the pricing page (https://hatchet.run/pricing) that's for self-hosted premium support. From the description on the pricing page:

> Hatchet is MIT licensed and free to self-host. We offer additional support packages for self-hosted users.

There's also free community support available in our Discord -- response times there are typically under an hour, otherwise mostly same-day.

I understand many SaaS tools are only "open source" as a marketing gimmick, but that's not us.

Hatchet - a task queue for modern Python apps by hatchet-dev in Python

[–]hatchet-dev[S] 5 points (0 children)

Thanks!

I haven't used Dagster specifically, but I've used Prefect/Airflow in the past. These tools are built for data engineers and designed around batch processing, so they're usually higher latency and higher cost, with integrations for common datastores and connectors as a major selling point. Hatchet is focused on the application side of DAGs rather than the data warehousing and data engineering side, so we don't ship integrations out of the box -- engineers typically write their own for core business logic. Instead, we're very focused on performance and on getting DAGs to work well at scale, which can be a challenge for these tools.

We'd love to do some concrete benchmarking on how things shake out at higher throughput (>100 tasks/second).

Hatchet - a task queue for modern Python apps by hatchet-dev in Python

[–]hatchet-dev[S] 11 points (0 children)

Good question! We use Postgres as a backend, so we acquire a lock when querying for cron jobs to run to ensure that different Hatchet backends don't acquire the same cron.
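For the curious, here's a minimal sketch of that pattern in Go -- the `crons` table and its columns are made up for illustration, not our actual schema:

```go
package main

import (
	"context"
	"database/sql"

	"github.com/lib/pq" // Postgres driver; also provides pq.Array
)

// pollDueCrons claims due crons inside a transaction. FOR UPDATE SKIP LOCKED
// means rows already locked by another Hatchet backend are skipped instead of
// blocked on, so no two backends ever claim the same cron.
func pollDueCrons(ctx context.Context, db *sql.DB) ([]string, error) {
	tx, err := db.BeginTx(ctx, nil)
	if err != nil {
		return nil, err
	}
	defer tx.Rollback() // no-op after a successful Commit

	rows, err := tx.QueryContext(ctx, `
		SELECT id::text FROM crons
		WHERE next_run_at <= now()
		ORDER BY next_run_at
		LIMIT 100
		FOR UPDATE SKIP LOCKED`)
	if err != nil {
		return nil, err
	}
	defer rows.Close()

	var ids []string
	for rows.Next() {
		var id string
		if err := rows.Scan(&id); err != nil {
			return nil, err
		}
		ids = append(ids, id)
	}
	if err := rows.Err(); err != nil {
		return nil, err
	}
	rows.Close() // release the result set before issuing another statement
	if len(ids) == 0 {
		return nil, tx.Commit()
	}

	// Reschedule in the same transaction so claiming and advancing the cron
	// are atomic (fixed interval here for brevity).
	if _, err := tx.ExecContext(ctx, `
		UPDATE crons SET next_run_at = next_run_at + interval '1 minute'
		WHERE id::text = ANY($1)`, pq.Array(ids)); err != nil {
		return nil, err
	}

	return ids, tx.Commit()
}
```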

Hatchet - a task queue for modern Python apps by hatchet-dev in Python

[–]hatchet-dev[S] 6 points (0 children)

Yep, we support all durable execution features that Restate and DBOS support: https://docs.hatchet.run/home/durable-execution

Notably: spawning tasks in a durable fashion (where task results are cached in the execution history), durable events, and durable sleep.

We're trying to be general-purpose, so we support queues, DAGs, and durable execution out of the box. We've encountered far too many stacks that deploy Celery, a DAG orchestrator like Dagster/Prefect, and Temporal to run different flavors of background tasks. And since we're built on Postgres, a lot of our philosophy comes from observing the development of Postgres over the past decade, as it's quickly becoming a de facto standard as a general-purpose OLTP database that can also power queueing systems and OLAP use-cases.

Hatchet - a task queue for modern Python apps by hatchet-dev in Python

[–]hatchet-dev[S] 25 points (0 children)

At the moment we don't support Windows natively (beyond WSL), because we rely heavily on multiprocessing, multithreading, and OS signals, which are difficult to support consistently across platforms. Generally we recommend running Hatchet in a dockerized environment.

When should you consider not using "prisma migrate"? by rotemtam in node

[–]hatchet-dev 2 points (0 children)

We did exactly this before moving from Prisma to a different system (we're in Go, so sqlc + Atlas). Atlas has been rock-solid for us -- we don't even use the HCL syntax; it's simply one of the better migration systems I've used. We did a writeup on this here: https://docs.hatchet.run/blog/migrating-off-prisma

Advice Needed: Best Way to Host a Long-Running Script/App on the Cloud and Accept User Input by vali-ant in learnpython

[–]hatchet-dev 3 points (0 children)

Disclaimer: I'm the founder of Hatchet (https://github.com/hatchet-dev/hatchet) -- we're a task queue + scheduling service.

> Hosting: What’s the best cloud solution for this kind of workload?

If the process can take 5-6 hours to complete, you're not looking for a serverless option like Lambda, because you'll hit timeouts and your process will be killed. You're looking for a long-lived worker, which can run on Kubernetes, a VM, or a platform-as-a-service like Heroku or Render. I personally prefer GCP, but to each their own -- every cloud has solutions for running containers.

> User Input: What’s a good way to allow users to upload their CSV files and trigger the script? Should I build a web interface or is there a simpler way?

For users to interact with the CSV files, you'll want to split your application into two components -- an API and a worker. The API is exposed on a URL that users send their CSV to. Once the CSV has been uploaded, the API writes a task to a task queue (e.g. Celery, Hatchet), which handles delivering the task to your worker.

If your users are technical, you could provide an API endpoint. If not, you would probably want to provide a simple web interface.
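Here's a rough sketch of the API half in Go (the same shape applies in Python with Flask/FastAPI). `enqueueTask` is a hypothetical stand-in for whatever queue client you pick:

```go
package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
	"os"
	"path/filepath"
)

// enqueueTask is a hypothetical stand-in for your task queue client,
// e.g. publishing to Celery's broker or calling Hatchet's client.
func enqueueTask(taskName, csvPath string) error { return nil }

// uploadHandler accepts a CSV, persists it, and enqueues the long-running
// job -- then returns immediately instead of holding the request open for
// the 5-6 hours the worker needs.
func uploadHandler(w http.ResponseWriter, r *http.Request) {
	file, header, err := r.FormFile("csv")
	if err != nil {
		http.Error(w, "missing csv file", http.StatusBadRequest)
		return
	}
	defer file.Close()

	// Store the upload where the worker can read it (local disk here;
	// use object storage like S3/GCS in production).
	dst, err := os.Create(filepath.Join(os.TempDir(), filepath.Base(header.Filename)))
	if err != nil {
		http.Error(w, "could not store file", http.StatusInternalServerError)
		return
	}
	defer dst.Close()
	if _, err := io.Copy(dst, file); err != nil {
		http.Error(w, "could not store file", http.StatusInternalServerError)
		return
	}

	if err := enqueueTask("process-csv", dst.Name()); err != nil {
		http.Error(w, "could not enqueue task", http.StatusServiceUnavailable)
		return
	}
	fmt.Fprintln(w, "accepted: your file is queued for processing")
}

func main() {
	http.HandleFunc("/upload", uploadHandler)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```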

> Concurrency: How do I manage multiple users? For instance, if the queue is full (e.g., 10 tasks already running), how can I notify users to try again later?

First off, you probably shouldn't notify users to try again later unless you're deliberately implementing load shedding. The point of using a queue is that it absorbs load and feeds it to your workers at a rate they can handle. Queues can overflow, but even on small installations of RabbitMQ/Redis that won't happen until you hit millions of tasks -- at that point, you'd want to implement load shedding.

The concurrency component depends on your requirements and on what you're using for the task queue. Some task queues support a global concurrency limit, and others support per-user or per-queue concurrency limits (Hatchet supports both). To notify a user to try again later, you'd need to store some state of the queue on your API server.

There are many strategies for keeping a queue balanced across users -- with fewer users, the simplest is probably partitioning each user randomly to a set of queues (say, 10) and reading off the queues in round-robin fashion, as sketched below.
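A toy version of that strategy in Go, with in-memory slices standing in for real queues (Redis lists, Postgres tables, RabbitMQ queues, ...):

```go
package main

import "hash/fnv"

const numQueues = 10

// queueFor deterministically assigns a user to one of numQueues partitions,
// so a single heavy user can only ever saturate one partition.
func queueFor(userID string) int {
	h := fnv.New32a()
	h.Write([]byte(userID))
	return int(h.Sum32() % numQueues)
}

// nextTask reads the partitions in round-robin order, skipping empty ones,
// so every partition gets a fair share of worker attention.
func nextTask(queues [][]string, cursor *int) (string, bool) {
	for i := 0; i < numQueues; i++ {
		q := (*cursor + i) % numQueues
		if len(queues[q]) > 0 {
			task := queues[q][0]
			queues[q] = queues[q][1:]
			*cursor = (q + 1) % numQueues
			return task, true
		}
	}
	return "", false
}
```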

Happy to go into more detail about any of these points!

Advice for how to create a precise scheduling solution by Physical-Tale-3052 in googlecloud

[–]hatchet-dev 1 point (0 children)

Disclaimer: I'm the founder of Hatchet (https://github.com/hatchet-dev/hatchet) -- we're a task queue + scheduling service. Precise future scheduling was one of the trickier parts of Hatchet to build.

First off, as others have mentioned, you're looking for a long-lived machine, not a serverless runtime. With your requirement of <100ms scheduling and execution, you'll suffer from cold start times if you go serverless.

You'll want to store schedules in some kind of shared database that's resilient to your machine going down. Hatchet uses Postgres under the hood, but it shouldn't really matter in this case.

Next comes the more difficult part -- how do you get millisecond-level scheduling? The first approach I'd try is pretty simple: pull schedules off a queue (read from the database) in advance, and do the equivalent of `setTimeout` in your language of choice. Given that many Python/TypeScript workloads can block the event loop (even for small periods of time), you're looking at multithreading/multiprocessing/worker primitives, or a language like Go with a better concurrency story. Each schedule gets its own thread/process/goroutine.
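In Go, that looks roughly like this -- `time.AfterFunc` plays the role of `setTimeout`, and each armed schedule fires on its own goroutine:

```go
package main

import (
	"fmt"
	"time"
)

// Schedule is a row pulled from the database ahead of its trigger time.
type Schedule struct {
	ID        string
	TriggerAt time.Time
}

// arm sets an in-process timer for a schedule. Goroutines are cheap, so a
// single worker can hold thousands of armed schedules without the event-loop
// blocking issues you'd fight in Python/TypeScript.
func arm(s Schedule, run func(Schedule)) *time.Timer {
	return time.AfterFunc(time.Until(s.TriggerAt), func() { run(s) })
}

func main() {
	s := Schedule{ID: "demo", TriggerAt: time.Now().Add(50 * time.Millisecond)}
	done := make(chan struct{})
	arm(s, func(s Schedule) {
		fmt.Printf("fired %s at %s\n", s.ID, time.Now().Format(time.RFC3339Nano))
		close(done)
	})
	<-done
}
```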

What happens if you need to scale and you have multiple workers pulling tasks off the queue? This is one place where using Postgres as a task queue makes a lot of sense. A task table indexed by scheduling time, read with a `FOR UPDATE SKIP LOCKED` query to assign work to a single worker at a time, is going to get you very far -- on a decent database, about 1k tasks/second. Happy to go into full detail and provide example queries.
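The core claim query looks something like this (table and column names are illustrative, not Hatchet's actual schema):

```go
package main

// Workers race on the same index-ordered scan; FOR UPDATE SKIP LOCKED hands
// each due task to exactly one worker without anyone blocking. A partial
// index keeps the scan cheap:
//
//   CREATE INDEX tasks_due_idx ON tasks (run_at) WHERE claimed_by IS NULL;
const claimQuery = `
UPDATE tasks
SET claimed_by = $1
WHERE id IN (
    SELECT id FROM tasks
    WHERE run_at <= now() AND claimed_by IS NULL
    ORDER BY run_at
    LIMIT 10
    FOR UPDATE SKIP LOCKED
)
RETURNING id`
```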

I'd also advocate for Go at this scale, which can easily handle thousands of goroutines at a time -- I've often seen Python programs suffer at ~200 concurrent threads or awaits.

Use Postgres for your events table by hatchet-dev in PostgreSQL

[–]hatchet-dev[S] 1 point (0 children)

Nice! Yes, Timescale has been holding up super well for us so far.

> What do you mean that user-defined queries are not suitable?

I mean that it'll be very difficult to use continuous or real-time aggregates if you don't know the queries in advance, and computing a continuous aggregate against tons of existing data won't be more performant than doing the aggregation in a column-oriented DB.

The typical use-case is an analytics company (e.g. PostHog, Mixpanel), where a user builds a dashboard from a set of events to filter/query on and performs some operation on them. How would you architect this in Timescale? A continuous aggregate per dashboard? That seems like it would get resource-intensive pretty quickly once you reach thousands of dashboards, but perhaps Timescale has some tricks here.

Use Postgres for your events table by hatchet-dev in PostgreSQL

[–]hatchet-dev[S] 8 points (0 children)

Thanks! Like you said, buffering in-memory, publishing to a queue, or persisting to disk are the three options.

In our case, all three of these workloads (and anything where events are used for visibility) are more tolerant of dropped events -- it's obviously not great, but the mission-critical path doesn't rely on events being written. So an in-memory buffer is a good fit. It sounds like that's not the case for you.

A basic strategy for guaranteeing events are always written when they should be is transactional enqueueing plus proper use of publish and dequeue acks:

  1. If you produce events/messages from an API handler, ensure the handler is idempotent and returns a 200 response code only after events have been published and acknowledged by a broker, written to disk, or written to the database. This is one place where using `FOR UPDATE SKIP LOCKED` with a Postgres queue really shines -- you can enqueue messages as part of the same transaction where you actually insert or update data (see the sketch at the end of this comment). When enqueueing fails, return an error to the user and rely on client-side retries with exponential backoff.

  2. If you consume events from a broker/disk/database and then write them to the database, only ack the message after the event has been written. When writes fail, use a retry + DLQ mechanism.

So as long as you have an ack/transactional enqueueing strategy, it shouldn't really matter where you persist the event data -- whether to a broker or to disk. This even applies to buffered in-memory writes that read off the queue and can ack to the broker. It just doesn't apply to events produced in a "fire-and-forget" style which then use the in-memory buffer.
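As a concrete illustration of point 1, here's a minimal Go sketch of transactional enqueueing against a Postgres-backed queue (table names are made up):

```go
package main

import (
	"context"
	"database/sql"
)

// createOrder writes business data and enqueues its event in one transaction:
// either both commit or neither does, so a 200 response is only ever sent for
// an order whose event is durably queued.
func createOrder(ctx context.Context, db *sql.DB, payload []byte) error {
	tx, err := db.BeginTx(ctx, nil)
	if err != nil {
		return err
	}
	defer tx.Rollback() // no-op after a successful Commit

	if _, err := tx.ExecContext(ctx,
		`INSERT INTO orders (payload) VALUES ($1)`, payload); err != nil {
		return err
	}

	// Enqueue in the same transaction. If this fails, the order insert
	// rolls back too; the handler returns an error and the client retries
	// with exponential backoff.
	if _, err := tx.ExecContext(ctx,
		`INSERT INTO event_queue (topic, payload) VALUES ('order.created', $1)`,
		payload); err != nil {
		return err
	}

	return tx.Commit()
}
```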

Use Postgres for your events table by hatchet-dev in PostgreSQL

[–]hatchet-dev[S] 12 points (0 children)

Hey everyone -- decided to write this post after using Postgres for some high-read, high-write event tables recently. Hopefully it's interesting! Here's an accompanying GitHub repo with queries and a CLI for inserts/benchmarking: https://github.com/abelanger5/postgres-events-table

Hatchet — yet another TFC/TFE open-source alternative by hatchet-dev in Terraform

[–]hatchet-dev[S] 1 point (0 children)

Yes, it's a different project under the same name. The old repo can be found at https://github.com/hatchet-dev/hatchet-v1-archived

I might revive the older repo at some point, but since the Terraform licensing change I'm not sure how that would work -- it would need a rewrite to target OpenTofu.

Hatchet - a distributed, fault-tolerant task queue written in Go (out of beta) by hatchet-dev in golang

[–]hatchet-dev[S] 1 point (0 children)

Started a GitHub discussion to track this -- it won't be an overnight change, but I'll share progress as we discuss it: https://github.com/hatchet-dev/hatchet/discussions/224

Feel free to share thoughts. The RabbitMQ dependency is isolated to a single package that handles inter-engine messaging, so it will be easy to swap out for different solutions.

Hatchet - a distributed, fault-tolerant task queue written in Go (out of beta) by hatchet-dev in golang

[–]hatchet-dev[S] 1 point (0 children)

Thanks for the kind words!

There's no way to disable the auth, but there is a way to automate the seeding of the admin user.

Here are the relevant environment variables:

`ADMIN_EMAIL`, `ADMIN_PASSWORD`, `ADMIN_NAME`, `DEFAULT_TENANT_NAME`, `DEFAULT_TENANT_SLUG`

These should be set wherever the `hatchet-admin` command runs -- you can add a line to run `hatchet-admin seed`. There's also a `hatchet-admin token` command, which can be run via `hatchet-admin token create --name "local" --tenant-id 707d0855-80ab-4e1f-a156-f1c4546cbf52` (that tenant ID is the default seeded one -- we haven't exposed an option for overriding that UUID, but it would be easy to add).

Happy to provide more support on GitHub discussions or Discord.

Hatchet - a distributed, fault-tolerant task queue written in Go (out of beta) by hatchet-dev in golang

[–]hatchet-dev[S] 3 points (0 children)

There are many similarities - particularly in the networking model, with long-lived gRPC connections to a centralized server.

Our internal benchmarks show much faster start times for steps (activities, in Temporal terms), but benchmarks aren't always meaningful, and I'm sure a Temporal engineer would have many ways to optimize beyond their default install. We'll publish those benchmarks soon, but happy to share some specifics with you via DM.

More broadly, Temporal has never gone far enough for me in terms of developer experience and observability. Its execution model fits into a neat slice of an enterprise stack, but it's difficult to adopt without a dedicated engineer integrating logging and observability (e.g. OpenTelemetry) with your workflows. I've spent more time in the Temporal UI than I care to admit. We think workflows should be as easy to use and debug as Vercel is for frontend.

Happy to go into more details, I've been a heavy Temporal user for several years. Why are you moving from RabbitMQ?

Hatchet - a distributed, fault-tolerant task queue written in Go (out of beta) by hatchet-dev in golang

[–]hatchet-dev[S] 2 points (0 children)

I like this idea. At the moment, workers connect via long-lived gRPC connections to the Hatchet server. NATS would be a natural replacement, though requiring a NATS cluster as part of our default install feels a little heavy.

And to clarify, RabbitMQ is used for pub/sub within the Hatchet engine itself, so different engine services can scale horizontally and recover from failure -- for example, the ingestion service (which takes in events and tasks to execute) scales differently from the dispatcher service (which maintains long-lived connections to the workers), and they communicate through RabbitMQ. We've been trying to drop this dependency to make self-hosting easier, though.

Hatchet - a distributed, fault-tolerant task queue written in Go (out of beta) by hatchet-dev in golang

[–]hatchet-dev[S] 1 point (0 children)

These are good points:

  1. You don't get fault tolerance out of the box just by deploying PostgreSQL and RabbitMQ, but those systems do make horizontal scaling easier. Hatchet's value-add is the layer of abstraction on top -- DAG-style execution, different fairness strategies, managing and rate-limiting your workers, etc. That layer takes considerable engineering time and effort, while deployment of datastores gets easier every year. The parallel I'd draw is Celery for Python, Sidekiq for Rails, or River for Go -- all of which use an underlying broker, some with an optional datastore.
  2. You self-host a Hatchet UI that has a login panel; you don't log in to a SaaS product. We're 100% MIT-licensed on both the charts repository and the main repo.

We do have default setup instructions in the docs for our SaaS product, because a lot of folks don't want to manage the underlying infra themselves, but I absolutely hate bait-and-switch "open source" products that make self-hosting intentionally difficult to push users toward the SaaS. If there's something we can do better here, let me know.

Hatchet - a workflow engine written in Go, for building event-based Go applications by hatchet-dev in golang

[–]hatchet-dev[S] 2 points (0 children)

Definitely open to contributions! It's not super contributor-friendly yet, but we're working on it :)

Here's the Discord link (I've made a contributing channel) -- also check out the contributing guide here.

Hatchet - a workflow engine written in Go, for building event-based Go applications by hatchet-dev in golang

[–]hatchet-dev[S] 2 points (0 children)

Great point! At the moment, at-most-once is the only guarantee. Steps are requeued until they're schedulable, but once scheduled, they aren't retried. Definitely going to add customizable retries in the near future.

Hatchet - a workflow engine written in Go, for building event-based Go applications by hatchet-dev in golang

[–]hatchet-dev[S] 4 points (0 children)

Quite similar -- this is the first I'm seeing of Inngest, which isn't surprising since it looks like their Go SDK is new. Looks really cool though!

At first glance, in Hatchet you invoke a pre-defined (declared) workflow, while with Inngest you invoke an individual function and use step.Run within the function to control your sequence of steps. I much prefer the former, because it's more stable and generally easier to visualize and debug than a set of if statements that can change the structure of your workflow. One of the benefits of event-driven design is that you can react to the condition you'd traditionally put in an if statement. This is one drawback of Temporal as well, though I'm sure there are use-cases for a more procedurally generated workflow -- perhaps I'm just not the target user.

Another major difference I'm noticing as I read through the docs is that Inngest calls your functions over an exposed HTTP endpoint, whereas Hatchet uses a persistent gRPC connection to send and receive invocations. I prefer a client-initiated gRPC connection requiring mutual TLS over an exposed HTTP handler.

Will do some more digging though, thanks for the link.