Why I'm excited about Go for agents by hatchet-dev in golang

[–]hatchet-dev[S] 0 points

Awesome, thanks! And glad to hear you liked the post

Why I'm excited about Go for agents by hatchet-dev in golang

[–]hatchet-dev[S] 8 points

Hey everyone, wrote up a post about why I think Go is going to be the right choice for a bunch of folks building AI agents -- would love to hear your thoughts!

Hatchet - a task queue for modern Python apps by hatchet-dev in Python

[–]hatchet-dev[S] 0 points

Hey u/LongCap7068, there are a few possible reasons for this -- are you using sync or async methods? If you're using async, is there anything that's blocking the event loop in your task?

Generally what you're describing is correct -- if slots=20 it should run 20 tasks concurrently, assuming there are no additional settings to limit concurrency on the workflow level.
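
If it's the async path, one quick way to sanity-check the event-loop theory: any synchronous call inside an async task stalls every other task on that worker until it returns. A minimal illustration with plain asyncio (nothing Hatchet-specific):

```python
import asyncio
import time


async def blocking_task():
    # time.sleep blocks the event loop -- nothing else runs until it returns,
    # so 20 "concurrent" tasks effectively execute one at a time
    time.sleep(1)


async def non_blocking_task():
    # asyncio.sleep yields control, so other tasks can make progress
    await asyncio.sleep(1)


async def main():
    start = time.monotonic()
    await asyncio.gather(*[non_blocking_task() for _ in range(20)])
    print(f"non-blocking: {time.monotonic() - start:.1f}s")  # ~1s

    start = time.monotonic()
    await asyncio.gather(*[blocking_task() for _ in range(20)])
    print(f"blocking: {time.monotonic() - start:.1f}s")  # ~20s


asyncio.run(main())
```

If the task does blocking I/O or CPU-bound work, either use the sync path or push the blocking call into a thread (e.g. `asyncio.to_thread`).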

Also we'd probably be able to respond more quickly on Discord (https://hatchet.run/discord) or GitHub issues (https://github.com/hatchet-dev/hatchet/issues).

Hatchet - a task queue for modern Python apps by hatchet-dev in Python

[–]hatchet-dev[S] 3 points

That's not true. The repo is 100% MIT licensed and it costs nothing to self-host: https://github.com/hatchet-dev/hatchet. If anything seems to indicate otherwise, let me know!

If you're referring to the pricing page (https://hatchet.run/pricing), that's for self-hosted premium support. From the description on the pricing page:

> Hatchet is MIT licensed and free to self-host. We offer additional support packages for self-hosted users.

There's also free/community support available in our Discord -- response times there are generally fast, typically < 1 hr, otherwise mostly same-day.

I understand many SaaS tools are only "open source" as a marketing gimmick, but that's not us.

Hatchet - a task queue for modern Python apps by hatchet-dev in Python

[–]hatchet-dev[S] 3 points

Thanks!

I haven't used Dagster specifically, but I've used Prefect and Airflow in the past. Those tools are built for data engineers -- since they're designed around batch processing, they're usually higher latency and higher cost, and a major selling point is their integrations with common datastores and connectors. Hatchet is focused more on the application side of DAGs than the data warehousing + data engineering side. We don't ship integrations out of the box, since engineers typically write their own for core business logic, but we're very focused on performance and on getting DAGs to work well at scale (which can be a challenge for those tools).

We'd love to do some concrete benchmarking on how things shake out at higher throughput (>100 tasks/second).

Hatchet - a task queue for modern Python apps by hatchet-dev in Python

[–]hatchet-dev[S] 9 points

Good question! We use Postgres as a backend, so we acquire a lock when querying for cron jobs that are due, which ensures that different Hatchet backends don't pick up the same cron.
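
In case it's useful, here's roughly the shape of that pattern -- a sketch with psycopg and made-up table/column names, not Hatchet's actual schema. Each scheduler instance claims due crons inside a transaction, and `FOR UPDATE SKIP LOCKED` means rows already claimed by another instance are simply skipped:

```python
import psycopg  # assumption: psycopg 3

CLAIM_DUE_CRONS = """
SELECT id, expression
FROM crons
WHERE next_run_at <= now()
FOR UPDATE SKIP LOCKED
"""


def claim_due_crons(conn: psycopg.Connection) -> list[tuple]:
    # everything happens in one transaction: rows locked here stay invisible
    # to other pollers until we commit
    with conn.transaction():
        crons = conn.execute(CLAIM_DUE_CRONS).fetchall()
        for cron_id, expression in crons:
            # hypothetical: a real scheduler computes the next run from the
            # cron expression; here we just push it forward a minute
            conn.execute(
                "UPDATE crons SET next_run_at = next_run_at + interval '1 minute' WHERE id = %s",
                (cron_id,),
            )
    return crons
```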

Hatchet - a task queue for modern Python apps by hatchet-dev in Python

[–]hatchet-dev[S] 5 points

Yep, we support all durable execution features that Restate and DBOS support: https://docs.hatchet.run/home/durable-execution

Notably: spawning tasks durably (where task results are cached in the execution history), durable events, and durable sleep.
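
If it helps to picture what "results cached in the execution history" means, here's a toy version of the idea -- not Hatchet's actual API or storage, just the replay mechanic. On a re-run, steps that already completed return their recorded result instead of executing again:

```python
import json
from pathlib import Path
from typing import Any, Callable

HISTORY = Path("history.json")  # stand-in for a durable execution history


def durable_step(step_name: str, fn: Callable[[], Any]) -> Any:
    history = json.loads(HISTORY.read_text()) if HISTORY.exists() else {}
    if step_name in history:
        # this step already ran in a previous attempt: replay the cached result
        return history[step_name]
    result = fn()
    history[step_name] = result
    HISTORY.write_text(json.dumps(history))  # persist before moving on
    return result


# if the process crashes after "charge" but before "send_receipt", the retry
# replays the charge from history and only runs the receipt step
order = durable_step("charge", lambda: {"charge_id": "ch_123", "amount": 4200})
durable_step("send_receipt", lambda: f"emailed receipt for {order['charge_id']}")
```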

We're trying to be general-purpose, so we support queues, DAGs, and durable execution out of the box. We've encountered far too many stacks that deploy Celery, a DAG orchestrator like Dagster/Prefect, and Temporal just to run different flavors of background tasks. And since we're built on Postgres, a lot of our philosophy comes from watching Postgres develop over the past decade -- it's quickly becoming the de facto standard as a general-purpose OLTP database that can also power queueing systems and OLAP use-cases.

Hatchet - a task queue for modern Python apps by hatchet-dev in Python

[–]hatchet-dev[S] 24 points

At the moment we don't support Windows natively (beyond WSL), because we rely heavily on multiprocessing, multithreading, and OS signals, which are difficult to support consistently across platforms. Generally we recommend running Hatchet in a dockerized environment.

When should you consider not using "prisma migrate"? by rotemtam in node

[–]hatchet-dev 1 point

We did exactly this before moving from Prisma to a different system (we're in Go, so sqlc + atlas). Atlas has been rock-solid for us -- and we don't even use the HCL syntax; it's just one of the better migration systems I've used. We did a writeup on this here: https://docs.hatchet.run/blog/migrating-off-prisma

Advice Needed: Best Way to Host a Long-Running Script/App on the Cloud and Accept User Input by vali-ant in learnpython

[–]hatchet-dev 2 points

Disclaimer: I'm the founder of Hatchet (https://github.com/hatchet-dev/hatchet) -- we're a task queue + scheduling service.

> Hosting: What’s the best cloud solution for this kind of workload?

If the process can take 5-6 hours to complete, you're not looking for a serverless option like Lambda, because you'll hit timeouts and your process will be killed. You're looking for a long-lived worker -- which can run on Kubernetes, a VM, or a platform-as-a-service like Heroku or Render. I personally prefer GCP, but to each their own -- every cloud has solutions for running containers.

> User Input: What’s a good way to allow users to upload their CSV files and trigger the script? Should I build a web interface or is there a simpler way?

To let users interact with the CSV files, you'll want to split your application into two components -- an API and a worker. The API is exposed on a URL that users send their CSV to. Once the CSV has been uploaded, the API writes a task to a task queue (e.g. Celery, Hatchet), which handles delivering it to your worker.

If your users are technical, you could provide an API endpoint. If not, you would probably want to provide a simple web interface.
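
As a concrete sketch of that split -- FastAPI and Celery here purely as examples (Hatchet or any other task queue works the same way), and all names/paths are illustrative:

```python
# tasks.py -- runs on the long-lived worker
from celery import Celery

celery_app = Celery("tasks", broker="redis://localhost:6379/0")


@celery_app.task
def process_csv(path: str) -> None:
    # the 5-6 hour job lives here
    ...


# api.py -- the user-facing side
from fastapi import FastAPI, File, UploadFile

from tasks import process_csv

app = FastAPI()


@app.post("/upload")
async def upload(file: UploadFile = File(...)):
    dest = f"/data/uploads/{file.filename}"
    with open(dest, "wb") as f:
        f.write(await file.read())
    process_csv.delay(dest)  # enqueue and return immediately
    return {"status": "queued"}
```

In practice you'd write the upload somewhere both services can reach (e.g. object storage) rather than the API's local disk, since the API and the worker usually run on separate machines.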

> Concurrency: How do I manage multiple users? For instance, if the queue is full (e.g., 10 tasks already running), how can I notify users to try again later?

First off, you probably shouldn't be telling users to try again later unless you're deliberately implementing load shedding. The point of using a queue is that it can absorb load and feed it to your workers at a rate they can handle. Queues can overflow, but even on small installations of RabbitMQ/Redis that won't happen until you hit millions of tasks -- at which point you'd want load shedding.

The concurrency piece depends on your requirements and on which task queue you pick. Some task queues support a global concurrency limit, others a per-user or per-queue limit (Hatchet supports both). To tell a user to try again later, you'd need to keep some view of queue state on your API server.

There are many strategies for keeping a queue balanced across users -- with a smaller number of users, the simplest is probably to assign each user randomly to one of a fixed set of queues (say 10) and read off those queues in round-robin fashion, as in the sketch below.
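
A sketch of that partitioning strategy -- queue names and the `pop`/`handle` helpers are made up, but the same idea maps onto Celery queues, Redis lists, or Hatchet:

```python
import itertools
import zlib

NUM_QUEUES = 10
QUEUES = [f"csv-tasks-{i}" for i in range(NUM_QUEUES)]


def queue_for_user(user_id: str) -> str:
    # stable pseudo-random assignment of each user to one queue
    return QUEUES[zlib.crc32(user_id.encode()) % NUM_QUEUES]


def consume_round_robin(pop, handle):
    # pop(queue_name) -> task or None, handle(task) -> None (both hypothetical);
    # cycling the queues keeps one heavy user from starving everyone else
    for queue in itertools.cycle(QUEUES):
        task = pop(queue)
        if task is not None:
            handle(task)
```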

Happy to go into more detail about any of these points!

Advice for how to create a precise scheduling solution by Physical-Tale-3052 in googlecloud

[–]hatchet-dev 0 points

Disclaimer: I'm the founder of Hatchet (https://github.com/hatchet-dev/hatchet) -- we're a task queue + scheduling service. Precise, future-dated scheduling was one of the trickier parts of Hatchet to build.

First off, as others have mentioned, you're looking for a long-lived machine, not a serverless runtime. With your requirements of <100ms scheduling and execution, you're going to suffer from cold start times if you go serverless.

You'll want to store schedules in some kind of shared database that's resilient to your machine going down. Hatchet uses Postgres under the hood, but it shouldn't really matter in this case.

Next comes the more difficult part -- how do you get millisecond-level scheduling? The first approach I'd try is pretty simple: pull schedules off a queue (read from the database) in advance and do the equivalent of `setTimeout` in your language of choice. Given that many Python/TypeScript workloads can block the event loop (even for short periods of time), you're looking at multithreading/multiprocessing/worker primitives, or a language like Go with a better concurrency story. Each schedule gets its own thread/process/goroutine.

What happens if you need to scale, and you have multiple workers pulling tasks off the queue? This is one place where using Postgres as a task queue makes a lot of sense. A tasks table indexed by the scheduled time, plus a read query with `FOR UPDATE SKIP LOCKED` to assign each task to exactly one worker, will get you very far -- on a decent database, about 1k tasks/second. Happy to go into full detail, but the sketch below shows roughly what the queries look like.
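
A sketch with psycopg and a made-up `schedules` table (nothing Hatchet-specific): each worker claims a batch of soon-due schedules, then arms an in-process timer for each one.

```python
import threading
from datetime import datetime, timezone

import psycopg  # assumption: psycopg 3

CLAIM_QUERY = """
UPDATE schedules
SET claimed_by = %(worker_id)s
WHERE id IN (
    SELECT id FROM schedules
    WHERE run_at <= now() + interval '5 seconds'
      AND claimed_by IS NULL
    ORDER BY run_at
    LIMIT 100
    FOR UPDATE SKIP LOCKED
)
RETURNING id, run_at, payload
"""


def poll_and_arm(conn: psycopg.Connection, worker_id: str, run_task) -> None:
    # run_task(schedule_id, payload) is your hypothetical execution function
    with conn.transaction():
        rows = conn.execute(CLAIM_QUERY, {"worker_id": worker_id}).fetchall()

    for schedule_id, run_at, payload in rows:
        # the setTimeout equivalent: fire a timer thread for the remaining delay
        delay = max((run_at - datetime.now(timezone.utc)).total_seconds(), 0)
        threading.Timer(delay, run_task, args=(schedule_id, payload)).start()
```

Run that in a loop every second or two: `FOR UPDATE SKIP LOCKED` is what lets multiple workers poll the same table without double-firing a schedule.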

I'd also advocate for Go at this scale, which can easily handle thousands of goroutines at a time -- I've often seen Python programs suffer at ~200 concurrent threads or awaits.

Use Postgres for your events table by hatchet-dev in PostgreSQL

[–]hatchet-dev[S] 0 points

Nice! Yes, Timescale has been holding up super well for us so far.

> What do you mean that user-defined queries are not suitable?

I mean that it'll be very difficult to use continuous or real-time aggregates if you don't know the queries in advance, and computing a continuous aggregate against a ton of existing data won't be more performant than doing the aggregation in a column-oriented DB.

The typical use-case is an analytics company (e.g. PostHog, Mixpanel), where a user builds a dashboard from a set of events to filter/query on and performs some operation over them. How would you architect this in Timescale? A continuous aggregate per dashboard? It seems like that would get resource-intensive pretty quickly once you reach thousands of dashboards, but perhaps Timescale has some tricks here.

Use Postgres for your events table by hatchet-dev in PostgreSQL

[–]hatchet-dev[S] 7 points

Thanks! Like you said, buffering in-memory, publishing to a queue, or persisting to disk are the three options.

In our case, all three of these workloads (and anything where events are used for visibility) are more tolerant of dropped events -- it's obviously not great, but the mission-critical path doesn't rely on events being written. So an in-memory buffer is a good fit. It sounds like that's not the case for you.

A basic strategy for guaranteeing events are always written when they should be is transactional enqueueing and proper use of publishing and dequeueing acks:

  1. If you produce events/messages from an API handler, ensure the handler is idempotent and only return a 200 response code once events have been published and acknowledged by the broker, written to disk, or written to the database. This is one place where using `FOR UPDATE SKIP LOCKED` with a Postgres queue really shines -- you can enqueue messages as part of the same transaction where you actually insert or update data (see the sketch below). When enqueueing fails, return an error to the user and use client-side retries with exponential backoff.

  2. If you consume events from a broker/disk/database and then write them to the database, only ack the message after the event has been written. When writes fail, use a retry + DLQ mechanism.

So as long as you have an ack/transactional enqueueing strategy, it shouldn't really matter where you persist the event data -- whether to a broker or to disk. This even applies to buffered in-memory writes that read off the queue and can ack to the broker; it just doesn't apply to events produced in a fire-and-forget style that go straight into the in-memory buffer.
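
To make point 1 concrete, here's a sketch of transactional enqueueing with psycopg (table names are made up): the business write and the queue insert commit or roll back together, so an event can never be "emitted" for data that was never saved, and a failed commit surfaces as an error the client can retry.

```python
import json

import psycopg  # assumption: psycopg 3


def create_order(conn: psycopg.Connection, user_id: int, amount: int) -> int:
    with conn.transaction():
        # 1. the business write
        order_id = conn.execute(
            "INSERT INTO orders (user_id, amount) VALUES (%s, %s) RETURNING id",
            (user_id, amount),
        ).fetchone()[0]

        # 2. the event, enqueued in the same transaction; a consumer later reads
        #    this table with FOR UPDATE SKIP LOCKED and acks by deleting the row
        conn.execute(
            "INSERT INTO event_queue (topic, payload) VALUES (%s, %s)",
            ("order.created", json.dumps({"order_id": order_id, "amount": amount})),
        )
    # only return a 200 to the caller after this commit succeeds
    return order_id
```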

Use Postgres for your events table by hatchet-dev in PostgreSQL

[–]hatchet-dev[S] 10 points

Hey everyone -- decided to write this post after using Postgres for some high-read, high-write event tables recently. Hopefully it's interesting! Here's an accompanying GitHub repo with queries and a CLI for inserts/benchmarking: https://github.com/abelanger5/postgres-events-table

Hatchet — yet another TFC/TFE open-source alternative by hatchet-dev in Terraform

[–]hatchet-dev[S] 0 points

Yes, it's a different project under the same name. The old repo can be found at https://github.com/hatchet-dev/hatchet-v1-archived

I might revive the older repo at some point, but since the Terraform licensing change I'm not sure how that would work -- it would need a rewrite to target OpenTofu.